Part VI
Implementation Aspects
About This Presentation
This presentation is intended to support the use of the textbook
Introduction to Parallel Processing: Algorithms and Architectures
(Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by
the author in connection with teaching the graduate-level course
ECE 254B: Advanced Computer Architecture: Parallel Processing,
at the University of California, Santa Barbara. Instructors can use
these slides in classroom teaching and for other educational
purposes. Any other use is strictly prohibited. © Behrooz Parhami
Edition: First (released Spring 2005)
VI Implementation Aspects
Study real parallel machines in MIMD and SIMD classes:
• Examine parallel computers of historical significance
• Learn from modern systems and implementation ideas
• Bracket our knowledge with history and forecasts
Topics in This Part
Chapter 21 Shared-Memory MIMD Machines
Chapter 22 Message-Passing MIMD Machines
Chapter 23 Data-Parallel SIMD Machines
Chapter 24 Past, Present, and Future
21 Shared-Memory MIMD Machines
Learn about practical shared-variable parallel architectures:
• Contrast centralized and distributed shared memories
• Examine research prototypes and production machines
Topics in This Chapter
21.1 Variations in Shared Memory
21.2 MIN-Based BBN Butterfly
21.3 Vector-Parallel Cray Y-MP
21.4 Latency-Tolerant Tera MTA
21.5 CC-NUMA Stanford DASH
21.6 SCI-Based Sequent NUMA-Q
21.1 Variations in Shared Memory
Fig. 21.1 Classification of shared-memory hardware architectures and example systems that will be studied in the rest of this chapter:
• Central main memory, single copy of modifiable data: UMA (BBN Butterfly, Cray Y-MP)
• Central main memory, multiple copies of modifiable data: CC-UMA
• Distributed main memory, single copy of modifiable data: NUMA (Tera MTA)
• Distributed main memory, multiple copies of modifiable data: COMA, CC-NUMA (Stanford DASH, Sequent NUMA-Q)
C.mmp: A Multiprocessor of Historical Significance
Sixteen processors, each with a cache and an address map, connect through a 16 × 16 crossbar to 16 memory modules; bus interfaces (BI), a central clock, and bus control complete the organization.
Fig. 21.2 Organization of the C.mmp multiprocessor.
Shared Memory Consistency Models
Varying latencies make each processor’s view of memory different
Sequential consistency (strictest and most intuitive): mandates that the
interleaving of reads and writes be the same from the viewpoint of all
processors. This provides the illusion of a FCFS single-port memory.
Processor consistency (laxer): only mandates that writes be observed
in the same order by all processors. This allows reads to overtake writes,
providing better performance due to out-of-order execution.
Weak consistency separates ordinary memory accesses from synch
accesses that require memory to become consistent. Ordinary read and
write accesses can proceed as long as there is no pending synch
access, but the latter must wait for all preceding accesses to complete.
Release consistency is similar to weak consistency, but recognizes two
synch accesses, called “acquire” and “release”, that sandwich protected
shared accesses. Ordinary read / write accesses can proceed only when
there is no pending acquire access from the same processor and a
release access must wait for all reads and writes to be completed.
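
To make the acquire/release pairing concrete, here is a minimal sketch using C++11 atomics (illustrative only, not from the textbook; the flag and payload names are made up):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                      // ordinary shared data (hypothetical)
std::atomic<bool> ready{false};       // synchronization flag

void producer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // "release": prior writes become visible
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // "acquire": subsequent reads see
        ;                                          //  everything before the release
    assert(payload == 42);                         // guaranteed by the acquire/release pair
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}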
21.2 MIN-Based BBN Butterfly
Fig. 21.3 Structure of a processing node in the BBN Butterfly: an MC 68000 processor, a processor node controller, a memory manager, 1 MB of memory (with a daughter board connection for 3 MB of memory expansion), EPROM, connections to I/O boards, and a switch interface.
Fig. 21.4 A small 16-node version of the multistage interconnection network of the BBN Butterfly, built from 4 × 4 switches.
21.3 Vector-Parallel Cray Y-MP
Fig. 21.5 Key elements of the Cray Y-MP processor: eight 64-bit vector registers (V0–V7) feed vector integer units (add, shift, logic, weight/parity) and shared floating-point units (add, multiply, reciprocal approximation); the CPU also contains scalar integer units, S and T registers (8 × 64 bits each), interprocessor communication, and paths to central memory and input/output. Address registers, address function units, instruction buffers, and control are not shown.
Cray Y-MP’s Interconnection Network
Fig. 21.6 The processor-to-memory interconnection network of Cray Y-MP: eight processors (P0–P7) connect through 4 × 4 and 8 × 8 crossbar switches, organized into sections and subsections, to 256 interleaved memory banks.
21.4 Latency-Tolerant Tera MTA
Fig. 21.7 The instruction execution pipelines of Tera MTA.
21.5 CC-NUMA Stanford DASH
Fig. 21.8 The architecture of Stanford DASH: processor clusters, each containing processors with I- and D-caches and level-2 caches, main memory, and a directory & network interface, are connected by separate request and reply meshes built from wormhole routers.
21.6 SCI-Based Sequent NUMA-Q
Fig. 21.9 The physical placement of Sequent’s quad components on a rackmount baseboard (not to scale): four Pentium Pro (P6) processors and memory share a 533 MB/s local bus (64 bits, 66 MHz), with an IQ-Link to the 1 GB/s interquad SCI link and Orion PCI bridges (OPB) to 4 × 33 MB/s PCI buses for 100 MB/s Fibre Channel and LAN connections.
Fig. 21.10 The architecture of Sequent NUMA-Q 2000: multiple quads joined by the IQ-Plus SCI ring, with peripheral bays reached through SCSI bridges, a console, a private Ethernet, and a public LAN.
Details of the IQ-Link in Sequent NUMA-Q
Fig. 21.11 Block diagram of the IQ-Link board: a bus interface controller with P6 bus snooping tags, local and remote directory tags on both the bus side and the network side, 32 MB of remote cache data, a directory controller, and an interconnect controller between the SCI in and SCI out links.
Fig. 21.12 Block diagram of IQ-Link’s interconnect controller: incoming SCI traffic passes through a stripper into request and response receive queues, outgoing traffic is assembled from request and response send queues, and an elastic buffer and a bypass FIFO sit on the SCI output path.
22 Message-Passing MIMD Machines
Learn about practical message-passing parallel architectures:
• Study mechanisms that support message passing
• Examine research prototypes and production machines
Topics in This Chapter
22.1 Mechanisms for Message Passing
22.2 Reliable Bus-Based Tandem NonStop
22.3 Hypercube-Based nCUBE3
22.4 Fat-Tree-Based Connection Machine 5
22.5 Omega-Network-Based IBM SP2
22.6 Commodity-Driven Berkeley NOW
22.1 Mechanisms for Message Passing
1. Shared-Medium networks (buses, LANs; e.g., Ethernet, token-based)
2. Router-based networks (aka direct networks; e.g., Cray T3E, Chaos)
3. Switch-based networks (crossbars, MINs; e.g., Myrinet, IBM SP2)
Fig. 22.2 Example 4 × 4 (crosspoint) and 2 × 2 switches used as building blocks for larger networks; the 2 × 2 switch settings are through (straight), crossed (exchange), lower broadcast, and upper broadcast.
Router-Based Networks
Fig. 22.1 The structure of a generic router: link controllers (LC) and message queues (Q) on the input and output channels, injection and ejection channels to and from the local node, and a central switch governed by routing/arbitration logic.
Case Studies of Message-Passing Machines
Fig. 22.3 Classification of message-passing hardware architectures (by shared-medium, router-based, or switch-based network, and by coarse, medium, or fine grain) and example systems that will be studied in this chapter: Tandem NonStop (bus), Berkeley NOW (LAN), nCUBE3, TMC CM-5, and IBM SP2.
22.2 Reliable Bus-Based Tandem NonStop
Fig. 22.4 One section of the Tandem NonStop Cyclone system: four processor-and-memory modules on the Dynabus, each with I/O connections to I/O controllers.
High-Level Structure of the Tandem NonStop
Fig. 22.5 Four four-processor sections interconnected by Dynabus+.
22.3 Hypercube-Based nCUBE3
Fig. 22.6 An eight-node nCUBE architecture, with host and I/O connections.
22.4 Fat-Tree-Based Connection Machine 5
Fig. 22.7 The overall structure of CM-5: up to 16K processing nodes (processor plus memory) and one or more control processors connect to the data, control, and diagnostic networks, along with I/O nodes serving a data vault, HIPPI or VME interfaces, and graphics output.
Processing Nodes in CM-5
Fig. 22.8 The components of a processing node in CM-5: a SPARC processor, four vector units (each with a pipelined ALU and a 64 × 64 register file) with their memory banks, memory control, instruction decode, a 64-bit bus with bus interface, and a network interface to the data and control networks.
The Interconnection Network in CM-5
Fig. 22.9 The fat-tree (hyper-tree) data network of CM-5: data routers (4 × 2 switches) above processors and I/O nodes 0 through 63.
22.5 Omega-Network-Based IBM SP2
Fig. 22.10 The architecture of the IBM SP series of systems: compute nodes, input/output nodes, host nodes, and gateways to other systems connect through the high-performance switch; each node contains a processor, memory, a network interface, an Ethernet adapter, and a Micro Channel controller (with disk and console attachments), and the nodes are also linked by Ethernet.
The IBM SP2 Network Interface
Fig. 22.11 The network interface controller of IBM SP2: a Micro Channel interface with left and right DMA engines and 2 × 2 KB FIFO buffers, a 40 MHz i860 processor with 8 MB of DRAM on a 160 MB/s i860 bus, and a memory and switch management unit with input and output FIFOs.
The Interconnection Network of IBM SP2
Fig. 22.12 A section of the high-performance switch network of IBM SP2, built from 4 × 4 crossbars connecting nodes 0 through 63 (only 1/4 of the links at the upper level are shown).
22.6 Commodity-Driven Berkeley NOW
NOW aims to build a distributed supercomputer from commercial
workstations (high-end PCs) linked via switch-based networks
Components and challenges for successful deployment of NOWs
(aka clusters of workstations, or COWs)
Network interface hardware: System-area network (SAN), middle ground
between LANs and specialized interconnections
Fast communication protocols: Must provide competitive values for the
parameters of the LogP model (see Section 4.5 and the sketch after this list)
Distributed file systems: Data files must be distributed among nodes,
with no central warehouse or arbitration authority
Global resource management: Berkeley uses GLUnix (global-layer Unix);
other systems of comparable functionalities include Argonne Globus,
NSCP metacomputing system, Princeton SHRIMP, Rice TreadMarks,
Syracuse WWVM, Virginia Legion, Wisconsin Wind Tunnel
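
As a reminder of what the LogP parameters capture, here is a minimal sketch that estimates message delivery time under the LogP model; the parameter values are made up for illustration:

#include <algorithm>
#include <cstdio>

// LogP parameters (illustrative values only, in microseconds):
// L = network latency, o = send/receive overhead, g = gap between message injections.
struct LogP { double L, o, g; };

// Time for one node to send k small messages to another node and have them all arrive:
// the sender pays o per message and can inject one every max(g, o); the last message
// still needs L to cross the network plus o at the receiver.
double deliveryTime(const LogP& m, int k) {
    return m.o + (k - 1) * std::max(m.g, m.o) + m.L + m.o;
}

int main() {
    LogP san{5.0, 2.0, 4.0};     // hypothetical system-area network
    LogP lan{100.0, 30.0, 40.0}; // hypothetical LAN, for comparison
    std::printf("10 messages: SAN %.1f us, LAN %.1f us\n",
                deliveryTime(san, 10), deliveryTime(lan, 10));
    return 0;
}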
23 Data-Parallel SIMD Machines
Learn about practical implementation of SIMD architectures:
• Assess the successes, failures, and potentials of SIMD
• Examine various SIMD parallel computers, old and new
Topics in This Chapter
23.1 Where Have All the SIMDs Gone?
23.2 The First Supercomputer: ILLIAC IV
23.3 Massively Parallel Goodyear MPP
23.4 Distributed Array Processor (DAP)
23.5 Hypercubic Connection Machine 2
23.6 Multiconnected MasPar MP-2
23.1 Where Have All the SIMDs Gone?
Associative or
content-addressable
memories/processors
constituted early
forms of SIMD
parallel processing
Fig. 23.1 Functional view of an associative memory/processor: the control unit broadcasts the comparand, mask, data, and commands to cells 0 through m – 1, each with a tag bit (t0 through tm–1) in the response store; read lines return results, and a global tag operations unit supports global operations, control, and response.
Search Functions in Associative Devices
Exact match: Locating data based on partial knowledge of contents
Inexact match: Finding numerically or logically proximate values
Membership: Identifying all members of a specified set
Relational: Determining values that are less than, less than or equal, etc.
Interval: Marking items that are between or outside given limits
Extrema: Finding the maximum, minimum, next higher, or next lower
Rank-based: Selecting kth or k largest/smallest elements
Ordered retrieval: Repeated max- or min-finding with elimination (sorting)
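
To make a few of these search functions concrete, here is a minimal sequential sketch (illustrative only; a real associative processor evaluates all cells in parallel, one bit slice at a time):

#include <cstdint>
#include <cstdio>
#include <vector>

// Exact match under a mask: a cell responds if its word agrees with the comparand
// on every bit position selected by the mask (the core associative operation).
std::vector<size_t> maskedMatch(const std::vector<uint32_t>& cells,
                                uint32_t comparand, uint32_t mask) {
    std::vector<size_t> responders;
    for (size_t i = 0; i < cells.size(); ++i)
        if ((cells[i] & mask) == (comparand & mask))
            responders.push_back(i);          // set the cell's tag bit
    return responders;
}

// Extrema search: find a maximum by scanning bits from most significant to least,
// eliminating cells that have a 0 where some surviving cell has a 1 (bit-serial style).
size_t maxSearch(const std::vector<uint32_t>& cells) {
    std::vector<size_t> survivors(cells.size());
    for (size_t i = 0; i < cells.size(); ++i) survivors[i] = i;
    for (int b = 31; b >= 0; --b) {
        std::vector<size_t> next;
        for (size_t i : survivors)
            if (cells[i] & (1u << b)) next.push_back(i);
        if (!next.empty()) survivors = next;  // keep only cells with a 1 in this bit
    }
    return survivors.front();                 // index of (one) maximum element
}

int main() {
    std::vector<uint32_t> mem{0x12u, 0x7Fu, 0x12u, 0x55u};
    auto hits = maskedMatch(mem, 0x10u, 0xF0u);   // cells whose upper nibble equals 1
    std::printf("%zu responders; max at cell %zu\n", hits.size(), maxSearch(mem));
    return 0;
}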
Mixed-Mode SIMD / MIMD Parallelism
Fig. 23.2 The architecture of Purdue PASM: a system control unit with control storage and a memory management system, mid-level controllers, a parallel computation unit of processors 0 through p – 1, an interconnection network, and a memory storage system of paired (A/B) memory modules.
23.2 The First Supercomputer: ILLIAC IV
Fig. 23.3 The ILLIAC IV computer (the inter-processor routing network is only partially shown): a B6700 host and a control unit, which fetches data or instructions, drive 64 processors, each with its own memory module; the routing connections wrap around from processor 63 to processor 0. Main memory is provided by synchronized disks organized into 52 bands of 300 sectors each.
23.3 Massively Parallel Goodyear MPP
Fig. 23.4 The architecture of Goodyear MPP: a VAX 11/780 host processor (with tape, disk, printer, terminal, and network connections) and a program and data management unit control the 128 × 128-processor array unit; staging memories and switches feed the array through 128-bit input and output interfaces, under array control and status signals.
Processing Elements in the Goodyear MPP
Fig. 23.5 The single-bit processor of MPP: registers (A, B, C, G, P, S), a full adder (carry and sum), logic, a comparator, a mask, and a variable-length shift register, connected to local memory via a data bus and address lines, to the NEWS neighbors, to the left and right processors, and to a global OR tree.
23.4 Distributed Array Processor (DAP)
Fig. 23.6 The bit-serial processor of DAP: A, C, D, Q, and S registers around a full adder (carry and sum) and multiplexers, with inputs from the four (N, E, W, S) neighbors and the control unit, condition and row/column response outputs, a local memory, and a shift path from the south neighbor to the north neighbor.
DAP’s High-Level Structure
Fig. 23.7 The high-level architecture of the DAP system: a master control unit with program memory and a host interface unit to the host workstation drive the processor array; register planes (A, C, D, Q) and fast I/O sit above the array memory of at least 32K planes, with each processor ij holding register Q and a column of local memory.
23.5 Hypercubic Connection Machine 2
Fig. 23.8 The architecture of CM-2: front-end computers (0–3) attach through bus interfaces to the nexus, which connects them to sequencers (0–3), each controlling a section of 16K processors; the I/O system serves data vaults and a graphic display.
The Simple Processing Elements of CM-2
Fig. 23.9 The bit-serial ALU of CM-2: two 8-to-1 multiplexers, selected by inputs a, b (from memory) and c (from the flags) and programmed by the f and g opcodes, produce f(a, b, c), written back to memory, and g(a, b, c), written to the flags.
23.6 Multiconnected MasPar MP-2
Fig. 23.10 The architecture of MasPar MP-2: a front end (on Ethernet, with a disk array) and an array control unit drive the X-net-connected processor array, which is also served by a global router and an I/O channel controller & memory.
The Interconnection Network of MasPar MP-2
Fig. 23.11 The physical packaging of processor clusters and the 3-stage global router in MasPar MP-2.
The Processing Nodes of MasPar MP-2
Fig. 23.12 Processor architecture in MasPar MP-2: communication input and output units for the X-net and for router stages 1 and 3; a 32-bit word bus and a bit bus linking the ALU, logic, flags, barrel shifter, significand and exponent units, register file, and control (with instruction broadcast and reduction); and memory address and memory data/ECC units to external memory.
24 Past, Present, and Future
Put the state of the art in context of history and future trends:
• Quest for higher performance, and where it is leading us
• Interplay between technology progress and architecture
Topics in This Chapter
24.1 Milestones in Parallel Processing
24.2 Current Status, Issues, and Debates
24.3 TFLOPS, PFLOPS, and Beyond
24.4 Processor and Memory Technologies
24.5 Interconnection Technologies
24.6 The Future of Parallel Processing
24.1 Milestones in Parallel Processing
1840s Desirability of computing multiple values at the same time was
noted in connection with Babbage’s Analytical Engine
1950s von Neumann’s and Holland’s cellular computers were proposed
1960s The ILLIAC IV SIMD machine was designed at U Illinois based
on the earlier SOLOMON computer
1970s Vector machines, with deeply pipelined function units, emerged
and quickly dominated the supercomputer market; early experiments
with shared memory multiprocessors, message-passing multicomputers,
and gracefully degrading systems paved the way for further progress
1980s Fairly low-cost and massively parallel supercomputers, using
hypercube and related interconnection schemes, became available
1990s Massively parallel computers with mesh or torus interconnection,
and using wormhole routing, became dominant
2000s Commodity and network-based parallel computing took hold,
offering speed, flexibility, and reliability for server farms and other areas
24.2 Current Status, Issues, and Debates
Design choices and key debates, with related user concerns:
Architecture: General- or special-purpose system? SIMD, MIMD, or hybrid? (Cost per MIPS; scalability, longevity)
Interconnection: Shared-medium, direct, or multistage? Custom or commodity? (Cost per MB/s; expandability, reliability)
Routing: Oblivious or adaptive? Wormhole, packet, or virtual cut-through? (Speed, efficiency; overhead, saturation limit)
Programming: Shared memory or message passing? New languages or libraries? (Ease of programming; software portability)
24.3 TFLOPS, PFLOPS, and Beyond
Fig. 24.1 Milestones in the Accelerated Strategic Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level: ASCI Red (1+ TFLOPS, 0.5 TB), ASCI Blue (3+ TFLOPS, 1.5 TB), ASCI White (10+ TFLOPS, 5 TB), ASCI Q (30+ TFLOPS, 10 TB), and ASCI Purple (100+ TFLOPS, 20 TB), plotted against calendar years 1995–2010 with plan, develop, and use phases.
24.4 Processor and Memory Technologies
Fig. 24.2 Key parts of the CPU in the Intel Pentium Pro microprocessor: a reservation station dispatches to five ports, with a reorder buffer and retirement register file completing the out-of-order engine; ports 2, 3, and 4 are dedicated to memory access (address generation units, etc.), port 0 serves integer execution unit 0 along with shift, floating-point multiply, divide, and add, and integer divide units, and port 1 serves integer execution unit 1 and the jump execution unit.
24.5 Interconnection Technologies
Fig. 24.3 Changes in the ratio of a 1-cm wire delay to device switching time as the feature size is reduced from 0.5 to 0.13 μm: the ratio grows with continued use of Al/oxide, motivating a changeover to lower-resistivity wires and low-dielectric-constant insulators (e.g., Cu/polyimide).
Intermodule and Intersystem Connection Options
Fig. 24.4 Various types of intermodule and intersystem connections, arranged by bandwidth (roughly 10^3 to 10^12 b/s) and latency (nanoseconds to hours): processor buses, I/O networks, and system-area networks (SANs) within the same geographic location, and local-, metro-, and wide-area networks (LANs, MANs, WANs) for geographically distributed systems.
System Interconnection Media
Fig. 24.5 The three commonly used media for computer and network connections: twisted pair, coaxial cable (copper core, plastic insulator, outer conductor), and optical fiber (light source and silica fiber guiding light by reflection).
24.6 The Future of Parallel Processing
In the early 1990s, it was unclear whether the TFLOPS performance
milestone would be reached with a thousand or so GFLOPS nodes or
with a million or so simpler MFLOPS processors
Now, a similar question is in front of us for PFLOPS performance
Asynchronous design for its speed and greater energy economy:
Because pure asynchronous circuits create some difficulties, designs
that are locally synchronous and globally asynchronous are flourishing
Intelligent memories to help remove the memory wall:
Memory chips have high internal bandwidths, but are limited by pins in
how much data they can handle per clock cycle
Reconfigurable systems for adapting hardware to application needs:
Raw hardware that can be configured via “programming” to adapt to
application needs may offer the benefits of special-purpose processing
Network and grid computing to allow flexible, ubiquitous access:
Resources of many small and large computers can be pooled together
via a network, offering supercomputing capability to anyone
Parallel Processing for Low-Power Systems
Figure 25.1 of Parhami’s computer architecture textbook: performance, from kIPS to TIPS, versus calendar year (1980–2010), comparing absolute processor performance with DSP and general-purpose (GP) processor performance per watt.
Peak Performance of Supercomputers
Peak performance from GFLOPS to PFLOPS versus calendar year (1980–2010), improving roughly × 10 every 5 years, with machines such as Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, and Earth Simulator.
Dongarra, J., “Trends in High Performance Computing,” Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]
Evolution of Computer Performance/Cost
From: “Robots After All,” by H. Moravec, CACM, pp. 90-97, October 2003 (mental power in four scales).
ABCs of Parallel Processing in One Slide
A
Amdahl’s Law (Speedup Formula)
Bad news – Sequential overhead will kill you, because:
Speedup = T1/Tp ≤ 1/[f + (1 – f)/p] ≤ min(1/f, p)
Moral: For f = 0.1, speedup is at best 10, regardless of peak OPS.
B
Brent’s Scheduling Theorem
Good news – Optimal scheduling is very difficult, but even a naive
scheduling algorithm can ensure:
T1/p ≤ Tp ≤ T1/p + T∞ = (T1/p)[1 + p/(T1/T∞)]
Result: For a reasonably parallel task (large T1/T∞), or for a suitably
small p (say, p ≤ T1/T∞), good speedup and efficiency are possible.
C
Cost-Effectiveness Adage
Real news – The most cost-effective parallel solution may not be
the one with highest peak OPS (communication?), greatest speed-up
(at what cost?), or best utilization (hardware busy doing what?).
Analogy: Mass transit might be more cost-effective than private cars
even if it is slower and leads to many empty seats.
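
A minimal numerical sketch of the A and B formulas above (illustrative only; the values of f, T1, T∞, and p are made up):

#include <algorithm>
#include <cstdio>

// Amdahl bound: speedup <= 1 / [f + (1 - f)/p] <= min(1/f, p),
// where f is the inherently sequential fraction of the work.
double amdahlBound(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

// Brent bound: with total work T1 and critical-path time Tinf, even a naive
// greedy schedule on p processors finishes within T1/p + Tinf steps.
double brentUpperBound(double T1, double Tinf, int p) {
    return T1 / p + Tinf;
}

int main() {
    double f = 0.1, T1 = 1.0e6, Tinf = 1.0e3;   // assumed example values
    int ps[] = {10, 100, 1000};
    for (int p : ps) {
        double cap = std::min(1.0 / f, static_cast<double>(p));
        double tp  = brentUpperBound(T1, Tinf, p);
        std::printf("p=%4d  Amdahl speedup <= %.2f (cap %.0f)  Brent: Tp <= %.0f, speedup >= %.1f\n",
                    p, amdahlBound(f, p), cap, tp, T1 / tp);
    }
    return 0;
}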
Gordon Bell Prize for High-Performance Computing
Established in 1988 by Gordon Bell, one of the pioneers in the field
Awards are given annually in several categories (there are small cash
awards, but the prestige of winning is the main motivator)
The two key categories are as follows; others have varied over time
Peak performance: The entrant must convince the judges that the
submitted program runs faster than any other comparable application
Price/Performance: The entrant must show that performance of the
application divided by the list price of the smallest system needed to
achieve the reported performance is better than that of any other entry
Different hardware/software combinations have won the competition
Supercomputer performance improvement has outpaced Moore’s law;
however, progress shows signs of slowing down due to lack of funding
In the past few years, improvements have been due to faster uniprocessors
(2004 NRC report Getting Up to Speed: The Future of Supercomputing)