Part VII
Advanced Architectures
About This Presentation
This presentation is intended to support the use of the textbook
Computer Architecture: From Microprocessors to Supercomputers,
Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated
regularly by the author as part of his teaching of the upper-division
course ECE 154, Introduction to Computer Architecture, at the
University of California, Santa Barbara. Instructors can use these
slides freely in classroom teaching and for other educational
purposes. Any other use is strictly prohibited. © Behrooz Parhami
First edition: released July 2003; revised July 2004, July 2005, and Mar. 2007.
VII Advanced Architectures
Performance enhancement beyond what we have seen:
• What else can we do at the instruction execution level?
• Data parallelism: vector and array processing
• Control parallelism: parallel and distributed processing
Topics in This Part
Chapter 25 Road to Higher Performance
Chapter 26 Vector and Array Processing
Chapter 27 Shared-Memory Multiprocessing
Chapter 28 Distributed Multicomputing
25 Road to Higher Performance
Review past, current, and future architectural trends:
• General-purpose and special-purpose acceleration
• Introduction to data and control parallelism
Topics in This Chapter
25.1 Past and Current Performance Trends
25.2 Performance-Driven ISA Extensions
25.3 Instruction-Level Parallelism
25.4 Speculation and Value Prediction
25.5 Special-Purpose Hardware Accelerators
25.6 Vector, Array, and Parallel Processing
25.1 Past and Current Performance Trends

Intel 4004, the first microprocessor (1971): 0.06 MIPS (4-bit processor)
Intel Pentium 4, circa 2005: 10,000 MIPS (32-bit processor)

(Figure: Intel processor evolution from the 4-bit 4004 through the 8-bit 8008, 8080, and 8085, the 16-bit 8086, 8088, 80186, 80188, and 80286, and the 32-bit 80386, 80486, Pentium/MMX, Pentium Pro/II, Pentium III/M, and Celeron.)
Architectural Innovations for Improved Performance

Architectural method | Improvement factor
Established methods (previously discussed):
1. Pipelining (and superpipelining) | 3-8 √
2. Cache memory, 2-3 levels | 2-5 √
3. RISC and related ideas | 2-3 √
4. Multiple instruction issue (superscalar) | 2-3 √
5. ISA extensions (e.g., for multimedia) | 1-3 √
Newer methods (covered in Part VII):
6. Multithreading (super-, hyper-) | 2-5 ?
7. Speculation and value prediction | 2-3 ?
8. Hardware acceleration | 2-10 ?
9. Vector and array processing | 2-10 ?
10. Parallel/distributed computing | 2-1000s ?

Available computing power ca. 2000: GFLOPS on desktop, TFLOPS in supercomputer center, PFLOPS on drawing board.

Computer performance grew by a factor of about 10,000 between 1980 and 2000: a factor of 100 due to faster technology and a factor of 100 due to better architecture.
Peak Performance of Supercomputers

(Figure: peak supercomputer performance from GFLOPS to PFLOPS over 1980-2010, growing by roughly a factor of 10 every 5 years; machines shown include the Cray X-MP, Cray 2, and TMC CM-2 in the GFLOPS range, the TMC CM-5, Cray T3D, ASCI Red, and ASCI White Pacific in the TFLOPS range, and the Earth Simulator heading toward PFLOPS.)

Dongarra, J., "Trends in High Performance Computing," Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]
Energy Consumption is Getting out of Hand

Figure 25.1 Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs. (Performance, from kIPS to TIPS, plotted against calendar year 1980-2010 for three curves: absolute processor performance, GP processor performance per watt, and DSP performance per watt.)
25.2 Performance-Driven ISA Extensions

Adding instructions that do more work per cycle
Shift-add: replace two instructions with one (e.g., multiply by 5)
Multiply-add: replace two instructions with one (x := c + a × b)
Multiply-accumulate: reduce round-off error (s := s + a × b)
Conditional copy: to avoid some branches (e.g., in if-then-else)
Subword parallelism (for multimedia applications)
Intel MMX: multimedia extension
64-bit registers can hold multiple integer operands
Intel SSE: Streaming SIMD extension
128-bit registers can hold several floating-point operands
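To make subword parallelism concrete, here is a minimal sketch (the function name and masking trick are illustrative assumptions, not from the text) that emulates an MMX-style parallel add of four 16-bit elements packed in one 64-bit word; the masking keeps carries from crossing element boundaries, which the hardware instruction does in a single operation:

#include <stdint.h>
#include <stdio.h>

/* Add four packed 16-bit elements without letting carries cross boundaries. */
static uint64_t parallel_add16(uint64_t a, uint64_t b) {
    const uint64_t high = 0x8000800080008000ULL;   /* top bit of each element */
    uint64_t sum = (a & ~high) + (b & ~high);      /* add the low 15 bits of each element */
    return sum ^ ((a ^ b) & high);                 /* restore top bits, no inter-element carry */
}

int main(void) {
    uint64_t a = 0x0001000200030004ULL;            /* elements 1, 2, 3, 4 */
    uint64_t b = 0x000A000B000C000DULL;            /* elements 10, 11, 12, 13 */
    printf("%016llx\n", (unsigned long long)parallel_add16(a, b));  /* prints 000b000d000f0011 */
    return 0;
}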
Table 25.1 Intel MMX ISA extension. The vector column gives the number of elements operated on in each 64-bit register.

Class | Instruction | Vector | Op type | Function or results
Copy | Register copy | | 32 bits | Integer register ↔ MMX register
Copy | Parallel pack | 4, 2 | Saturate | Convert to narrower elements
Copy | Parallel unpack low | 8, 4, 2 | | Merge lower halves of 2 vectors
Copy | Parallel unpack high | 8, 4, 2 | | Merge upper halves of 2 vectors
Arithmetic | Parallel add | 8, 4, 2 | Wrap/Saturate# | Add; inhibit carry at boundaries
Arithmetic | Parallel subtract | 8, 4, 2 | Wrap/Saturate# | Subtract with carry inhibition
Arithmetic | Parallel multiply low | 4 | | Multiply, keep the 4 low halves
Arithmetic | Parallel multiply high | 4 | | Multiply, keep the 4 high halves
Arithmetic | Parallel multiply-add | 4 | | Multiply, add adjacent products*
Arithmetic | Parallel compare equal | 8, 4, 2 | | All 1s where equal, else all 0s
Arithmetic | Parallel compare greater | 8, 4, 2 | | All 1s where greater, else all 0s
Shift | Parallel left shift logical | 4, 2, 1 | | Shift left, respect boundaries
Shift | Parallel right shift logical | 4, 2, 1 | | Shift right, respect boundaries
Shift | Parallel right shift arith | 4, 2 | | Arith shift within each (half)word
Logic | Parallel AND | 1 | Bitwise | dest ← (src1) ∧ (src2)
Logic | Parallel ANDNOT | 1 | Bitwise | dest ← (src1)′ ∧ (src2)
Logic | Parallel OR | 1 | Bitwise | dest ← (src1) ∨ (src2)
Logic | Parallel XOR | 1 | Bitwise | dest ← (src1) ⊕ (src2)
Memory access | Parallel load MMX reg | | 32 or 64 bits | Address given in integer register
Memory access | Parallel store MMX reg | | 32 or 64 bits | Address given in integer register
Control | Empty FP tag bits | | | Required for compatibility$
MMX Multiplication and Multiply-Add

Figure 25.2 Parallel multiplication and multiply-add in MMX.
(a) Parallel multiply low: the four 16-bit elements of one operand are multiplied by the corresponding elements of the other, and the low half of each 32-bit product is kept.
(b) Parallel multiply-add: the four 16-bit products are formed in full, and adjacent pairs of 32-bit products are added to yield two 32-bit results.
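A small sketch of what the parallel multiply-add of Figure 25.2b computes (a software emulation under an assumed packing, not the actual MMX instruction sequence): four signed 16-bit element pairs are multiplied in full and adjacent 32-bit products are summed into two 32-bit results.

#include <stdint.h>
#include <stdio.h>

static uint64_t parallel_multiply_add(uint64_t a, uint64_t b) {
    int32_t prod[4];
    for (int i = 0; i < 4; i++) {
        int16_t ai = (int16_t)(a >> (16 * i));   /* extract element i of a */
        int16_t bi = (int16_t)(b >> (16 * i));   /* extract element i of b */
        prod[i] = (int32_t)ai * (int32_t)bi;     /* full 32-bit product */
    }
    uint32_t lo = (uint32_t)(prod[0] + prod[1]); /* adjacent products summed */
    uint32_t hi = (uint32_t)(prod[2] + prod[3]);
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint64_t a = 0x0004000300020001ULL;          /* elements 1, 2, 3, 4 (low to high) */
    uint64_t b = 0x0008000700060005ULL;          /* elements 5, 6, 7, 8 */
    /* low result = 1*5 + 2*6 = 17, high result = 3*7 + 4*8 = 53 */
    printf("%016llx\n", (unsigned long long)parallel_multiply_add(a, b));
    return 0;
}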
MMX Parallel Comparisons

Figure 25.3 Parallel comparisons in MMX.
(a) Parallel compare equal, on 16-bit elements: comparing (14, 3, 58, 66) with (79, 1, 58, 65), only the third elements match, so that field becomes 65,535 (all 1s) and the rest become 0.
(b) Parallel compare greater, on 8-bit elements: comparing (5, 3, 12, 32, 5, 6, 12, 9) with (3, 12, 22, 17, 5, 12, 90, 8), the first operand is greater in three positions, so those bytes become 255 (all 1s) and the rest become 0.
25.3 Instruction-Level Parallelism

Figure 25.4 Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91].
(a) Fraction of cycles (0-30%) versus the number of issuable instructions per cycle (0-8): most cycles offer only a few issuable instructions.
(b) Speedup attained (topping out near 3) versus instruction issue width (2-8).
Instruction-Level Parallelism

Figure 25.5 A computation with inherent instruction-level parallelism.
VLIW and EPIC Architectures

VLIW: Very long instruction word architecture
EPIC: Explicitly parallel instruction computing

Figure 25.6 Hardware organization for IA-64: memory feeds several execution units, which operate on 128 general registers, 128 floating-point registers, and 64 predicate registers. General and floating-point registers are 64 bits wide; predicates are single-bit registers.
25.4 Speculation and Value Prediction

Figure 25.7 Examples of software speculation in IA-64.
(a) Control speculation: a load that originally appeared later in the code is hoisted upward as a speculative load (spec load), and a check load remains at the original position to validate the speculated value.
(b) Data speculation: a load that originally followed a store is hoisted above the store as a speculative load, and a check load at the original position detects any conflict with the intervening store.
Value Prediction

Figure 25.8 Value prediction for multiplication or division via a memo table: the inputs of a multiply/divide operation index a memo table; on a hit, the remembered result is steered through a mux to the output ("done") as soon as the inputs are ready, while on a miss the multiply/divide unit computes the result, with a control circuit signaling when the output is ready.
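A minimal software sketch of the same idea, assuming a small direct-mapped memo table (the table size and hash are illustrative): a hit supplies the remembered result immediately, while a miss falls through to the slow operation and records its result for reuse.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MEMO_SLOTS 64

typedef struct { bool valid; uint32_t a, b, result; } MemoEntry;
static MemoEntry memo[MEMO_SLOTS];

static uint32_t divide_with_memo(uint32_t a, uint32_t b, bool *hit) {
    unsigned slot = (a ^ (b * 2654435761u)) % MEMO_SLOTS;  /* simple hash index */
    MemoEntry *e = &memo[slot];
    if (e->valid && e->a == a && e->b == b) {               /* memo table hit */
        *hit = true;
        return e->result;
    }
    *hit = false;
    uint32_t r = a / b;                                     /* slow functional unit */
    *e = (MemoEntry){ true, a, b, r };                      /* remember for next time */
    return r;
}

int main(void) {
    bool hit;
    printf("%u hit=%d\n", divide_with_memo(1000, 40, &hit), hit);  /* miss */
    printf("%u hit=%d\n", divide_with_memo(1000, 40, &hit), hit);  /* hit  */
    return 0;
}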
25.5 Special-Purpose Hardware Accelerators

Figure 25.9 General structure of a processor with configurable hardware accelerators: a CPU with data and program memory is coupled to an FPGA-like unit on which accelerators (Accel. 1, 2, 3, plus unused resources) can be formed via loading of configuration registers from a configuration memory.
Graphic Processors, Network Processors, etc.

Figure 25.10 Simplified block diagram of Toaster2, Cisco Systems' network processor: packets pass from an input buffer through a 4 × 4 array of processing elements (PE 0 through PE 15), each column of PEs backed by a column memory, to an output buffer; a feedback path lets packets make another pass through the array.
25.6 Vector, Array, and Parallel Processing

Figure 25.11 The Flynn-Johnson classification of computer systems.

Flynn's categories (single/multiple instruction streams × single/multiple data streams):
SISD: uniprocessors
SIMD: array or vector processors
MISD: rarely used
MIMD: multiprocessors or multicomputers

Johnson's expansion of MIMD (global/distributed memory × shared variables/message passing):
GMSV: shared-memory multiprocessors
GMMP: rarely used
DMSV: distributed shared memory
DMMP: distributed-memory multicomputers
SIMD Architectures
Data parallelism: executing one operation on multiple data streams
Concurrency in time – vector processing
Concurrency in space – array processing
Example to provide context
Multiplying a coefficient vector by a data vector (e.g., in filtering)
y[i] := c[i] × x[i], 0 ≤ i < n
Sources of performance improvement in vector processing
(details in the first half of Chapter 26)
One instruction is fetched and decoded for the entire operation
The multiplications are known to be independent (no checking)
Pipelining/concurrency in memory access as well as in arithmetic
Array processing is similar (details in the second half of Chapter 26)
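For context, a sketch of the same computation in C, with the vector strip-mined into chunks of an assumed register length of 64 elements; each call to the stand-in vector_multiply corresponds to what a vector processor would do with one fetched and decoded vector instruction (the names VECLEN and vector_multiply are illustrative, not from the text):

#include <stddef.h>

#define VECLEN 64   /* assumed vector register length */

/* Stand-in for a single vector instruction operating on len <= VECLEN elements. */
static void vector_multiply(float *y, const float *c, const float *x, size_t len) {
    for (size_t i = 0; i < len; i++)
        y[i] = c[i] * x[i];
}

void filter(float *y, const float *c, const float *x, size_t n) {
    /* Strip-mining: process the vector in chunks of at most VECLEN elements. */
    for (size_t i = 0; i < n; i += VECLEN) {
        size_t len = (n - i < VECLEN) ? (n - i) : VECLEN;
        vector_multiply(&y[i], &c[i], &x[i], len);
    }
}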
MISD Architecture Example

Figure 25.12 Multiple instruction streams operating on a single data stream (MISD): a single data stream enters a cascade of processing stages, each controlled by its own instruction stream (instruction streams 1-5), and the transformed data emerges at the other end.
MIMD Architectures
Control parallelism: executing several instruction streams in parallel
GMSV: Shared global memory – symmetric multiprocessors
DMSV: Shared distributed memory – asymmetric multiprocessors
DMMP: Message passing – multicomputers
Figure 27.1 Centralized shared memory: p processors reach m memory modules through a processor-to-memory network, communicate through a processor-to-processor network, and share parallel input/output.

Figure 28.1 Distributed memory: each computing node combines memories and processors with a router, and the routers connect through an interconnection network with parallel I/O.
Amdahl’s Law Revisited
With f = the sequential fraction and p = the speedup of the rest with p processors,

s = 1 / (f + (1 – f)/p) ≤ min(p, 1/f)

(Plot: speedup s from 0 to 50 versus enhancement factor p from 0 to 50, for f = 0, 0.01, 0.02, 0.05, and 0.1; each curve levels off near 1/f.)
Figure 4.4 Amdahl’s law: speedup achieved if a fraction f of a task
is unaffected and the remaining 1 – f part runs p times as fast.
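A quick numeric check of the formula (the parameter values are illustrative): for a sequential fraction f = 0.05, the speedup saturates below 1/f = 20 no matter how many processors are used.

#include <stdio.h>

static double amdahl_speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);       /* Amdahl's law */
}

int main(void) {
    double f = 0.05;                        /* 5% of the task is sequential */
    int procs[] = { 10, 50, 100, 1000 };
    for (int i = 0; i < 4; i++)
        printf("p = %4d  ->  s = %.2f\n", procs[i], amdahl_speedup(f, procs[i]));
    return 0;                               /* upper bound: 1/f = 20 */
}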
26 Vector and Array Processing
Single instruction stream operating on multiple data streams
• Data parallelism in time = vector processing
• Data parallelism in space = array processing
Topics in This Chapter
26.1 Operations on Vectors
26.2 Vector Processor Implementation
26.3 Vector Processor Performance
26.4 Shared-Control Systems
26.5 Array Processor Implementation
26.6 Array Processor Performance
26.1 Operations on Vectors

Sequential processor:
for i = 0 to 63 do
  P[i] := W[i] × D[i]
endfor

Vector processor:
load W
load D
P := W × D
store P

Unparallelizable (each iteration uses the result of the previous one):
for i = 0 to 63 do
  X[i+1] := X[i] + Z[i]
  Y[i+1] := X[i+1] + Y[i]
endfor
26.2 Vector Processor Implementation

Figure 26.1 Simplified generic structure of a vector processor: a vector register file, filled by two load units (A and B) and drained by a store unit to and from the memory unit, feeds several function-unit pipelines (function units 1-3); forwarding muxes and inputs from scalar registers complete the datapath.
Conflict-Free Memory Access

Figure 26.2 Skewed storage of the elements of a 64 × 64 matrix for conflict-free memory access in a 64-way interleaved memory. Elements of column 0 are highlighted in both diagrams.
(a) Conventional row-major order: element (i, j) resides in bank j, so all 64 elements of a column fall in the same bank and a column access suffers repeated bank conflicts.
(b) Skewed row-major order: row i is rotated by i positions, so element (i, j) resides in bank (i + j) mod 64 and the 64 elements of any row or any column lie in 64 distinct banks.
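A minimal sketch of the skewed mapping in part (b), assuming the bank of element (i, j) is (i + j) mod 64 as in the diagram: any row or column of the 64 × 64 matrix then touches 64 distinct banks, so stride-1 and stride-64 vector accesses proceed without conflicts.

#include <stdio.h>

#define BANKS 64

static int bank_of(int i, int j) {
    return (i + j) % BANKS;            /* skewed row-major: row i rotated by i */
}

int main(void) {
    /* Column 0 touches banks 0, 1, 2, ..., 63: one element per bank. */
    for (int i = 0; i < 4; i++)
        printf("element (%d,0) -> bank %d\n", i, bank_of(i, 0));
    return 0;
}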
Overlapped Memory Access and Computation

Figure 26.3 Vector processing via segmented load/store of vectors in registers in a double-buffering scheme: two load units (Load X, Load Y) and a store unit (Store Z) move data between the memory unit and six vector registers while a pipelined adder computes; solid (dashed) lines show data flow in the current (next) segment.
26.3 Vector Processor Performance

Figure 26.4 Total latency of the vector computation S := X × Y + Z, without and with pipeline chaining: without chaining, the addition (after its own start-up time) cannot begin until the multiplication has finished; with chaining, each product is forwarded to the adder as it emerges, so only the multiplication start-up and addition start-up precede the overlapped streaming of elements.
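A back-of-the-envelope timing sketch of the two cases (the start-up values and the additive model are assumptions for illustration, not the book's exact figures): without chaining the add stream starts only after the multiply stream drains; with chaining the two start-ups are paid once and the element stream is shared.

#include <stdio.h>

int main(void) {
    int n = 64, m = 7, a = 6;                      /* assumed: 64 elements, multiplier/adder start-ups */
    int without_chaining = (m + n) + (a + n);      /* finish the multiply, then do the add */
    int with_chaining    = m + a + n;              /* add starts as products emerge */
    printf("without chaining: %d cycles\n", without_chaining);
    printf("with chaining:    %d cycles\n", with_chaining);
    return 0;
}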
Performance as a Function of Vector Length

Figure 26.5 The per-element execution time in a vector processor as a function of the vector length: clock cycles per vector element (on a scale of 0-5) fall as the vector length grows toward a few hundred elements, because start-up overhead is amortized over more elements.
26.4 Shared-Control Systems

Figure 26.6 From completely shared control to totally separate controls.
(a) Shared-control array processor, SIMD: one control unit drives all processing elements.
(b) Multiple shared controls, MSIMD: several control units each drive a subset of the processing elements.
(c) Separate controls, MIMD: every processing element has its own control unit.
Example Array Processor

Figure 26.7 Array processor with 2D torus interprocessor communication network: a control unit broadcasts instructions to the processor array, and switches at the array boundary connect the torus links to parallel I/O.
26.5 Array Processor Implementation

Figure 26.8 Handling of interprocessor communication via a mechanism similar to data forwarding: each processing element combines a register file, data memory, and ALU with a communication buffer tied to its north, east, west, and south (NEWS) neighbors; a CommunEn enable, a CommunDir direction setting, and a PE state flip-flop (feeding the array state register) determine whether results go to the register file and data memory or to a neighbor.
Configuration Switches

(Figure 26.7, repeated: control unit, processor array, boundary switches, control broadcast, and parallel I/O.)

Figure 26.9 I/O switch states in the array processor of Figure 26.7: (a) torus operation, with the boundary switches closing the wraparound links; (b) clockwise I/O and (c) counterclockwise I/O, with the switches routing the boundary links to the In and Out ports.
26.6 Array Processor Performance
Array processors perform well for the same class of problems that
are suitable for vector processors
For embarrassingly (pleasantly) parallel problems, array processors
can be faster and more energy-efficient than vector processors
A criticism of array processing:
For conditional computations, a significant part of the array remains
idle while the “then” part is performed; subsequently, idle and busy
processors reverse roles during the “else” part
However:
Considering array processors inefficient due to idle processors
is like criticizing mass transportation because many seats are
unoccupied most of the time
It’s the total cost of computation that counts, not hardware utilization!
27 Shared-Memory Multiprocessing
Multiple processors sharing a memory unit seems naïve
• Didn’t we conclude that memory is the bottleneck?
• How then does it make sense to share the memory?
Topics in This Chapter
27.1 Centralized Shared Memory
27.2 Multiple Caches and Cache Coherence
27.3 Implementing Symmetric Multiprocessors
27.4 Distributed Shared Memory
27.5 Directories to Guide Data Access
27.6 Implementing Asymmetric Multiprocessors
Parallel Processing
as a Topic of Study
An important area of study
that allows us to overcome
fundamental speed limits
Our treatment of the topic is
quite brief (Chapters 26-27)
Graduate course ECE 254B:
Adv. Computer Architecture –
Parallel Processing
27.1 Centralized Shared Memory

Figure 27.1 Structure of a multiprocessor with centralized shared memory: processors 0 through p-1 reach memory modules 0 through m-1 through a processor-to-memory network, communicate through a processor-to-processor network, and share parallel I/O.
Processor-to-Memory Interconnection Network

Figure 27.2 Butterfly and the related Beneš network as examples of processor-to-memory interconnection networks in a multiprocessor.
(a) Butterfly network: 16 processors connect to 16 memories through rows 0-7 of small switches arranged in stages.
(b) Beneš network: 8 processors connect to 8 memories through back-to-back butterfly stages, which provide multiple paths between each processor-memory pair.
Processor-to-Memory Interconnection Network

Figure 27.3 Interconnection of eight processors to 256 memory banks in the Cray Y-MP, a supercomputer with multiple vector processors: each processor fans out through 1 × 8 switches to sections, and through 8 × 8 and 4 × 4 switches to subsections and banks; the banks are interleaved so that one section serves banks 0, 4, 8, ..., 252, the next serves banks 1, 5, 9, ..., 253, then 2, 6, 10, ..., 254, and finally 3, 7, 11, ..., 255.
Shared-Memory Programming: Broadcasting
Copy B[0] into all B[i] so that multiple processors
can read its value without memory access conflicts
for k = 0 to log2 p – 1 processor j, 0  j < p, do
B[j + 2k] := B[j]
endfor
(Diagram: recursive doubling over an array B with positions 0-11; the number of entries holding the broadcast value doubles in each step.)
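A sequential simulation of the broadcast (the array size and value are illustrative): at stride 2^k, only the entries that already hold the value copy it 2^k positions ahead, so the number of copies doubles each step.

#include <stdio.h>

#define P 12                       /* number of processors / array entries */

int main(void) {
    int B[P] = { 42 };             /* B[0] holds the value to broadcast */
    for (int stride = 1; stride < P; stride *= 2)          /* stride = 2^k */
        for (int j = 0; j < stride && j + stride < P; j++) /* only filled entries copy */
            B[j + stride] = B[j];
    for (int j = 0; j < P; j++)
        printf("%d ", B[j]);       /* prints 42 twelve times */
    printf("\n");
    return 0;
}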
Shared-Memory Programming: Summation
Sum reduction of vector X
processor j, 0 ≤ j < p, do Z[j] := X[j]
s := 1
while s < p processor j, 0 ≤ j < p – s, do
  Z[j + s] := Z[j] + Z[j + s]
  s := 2 × s
endwhile
(Diagram: recursive doubling for p = 10, where i:j denotes the sum X[i] + ... + X[j].
Initially:    0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8 9:9
After s = 1:  0:0 0:1 1:2 2:3 3:4 4:5 5:6 6:7 7:8 8:9
After s = 2:  0:0 0:1 0:2 0:3 1:4 2:5 3:6 4:7 5:8 6:9
After s = 4:  0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 1:8 2:9
After s = 8:  0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9
After the final step, Z[j] holds the prefix sum X[0] + ... + X[j].)
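A sequential simulation of the summation (the input values are illustrative); the temporary copy stands in for the synchronous parallel step in which all processors read before any writes take effect.

#include <stdio.h>
#include <string.h>

#define P 10

int main(void) {
    int X[P] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    int Z[P], old[P];
    memcpy(Z, X, sizeof Z);                       /* Z[j] := X[j] */
    for (int s = 1; s < P; s *= 2) {
        memcpy(old, Z, sizeof Z);                 /* snapshot of the previous step */
        for (int j = 0; j + s < P; j++)
            Z[j + s] = old[j] + old[j + s];
    }
    for (int j = 0; j < P; j++)
        printf("%d ", Z[j]);                      /* prints 1 3 6 10 15 21 28 36 45 55 */
    printf("\n");
    return 0;
}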
27.2 Multiple Caches and Cache Coherence

(Figure: the centralized shared-memory multiprocessor of Figure 27.1, with a private cache placed between each of the p processors and the processor-to-memory network that leads to the m memory modules; a processor-to-processor network and parallel I/O are also shown.)
Private processor caches reduce memory access traffic through the
interconnection network but lead to challenging consistency problems.
Status of Data Copies

Figure 27.4 Various types of cached data blocks in a parallel processor with centralized main memory and private processor caches: a block may be held consistently in several caches and memory (multiple consistent), held consistently in a single cache (single consistent), held in a single cache in modified form (single inconsistent), or be present in a cache as an invalid copy.
A Snoopy Cache Coherence Protocol

(Several processors P, each with a private cache C, share a bus to memory.)

Figure 27.5 Finite-state control mechanism for a bus-based snoopy cache coherence protocol with write-back caches. Each cache line is in one of three states: Invalid, Shared (read-only), or Exclusive (writable).

Invalid → Shared: CPU read miss (signal read miss on bus)
Invalid → Exclusive: CPU write miss (signal write miss on bus)
Shared → Shared: CPU read hit
Shared → Exclusive: CPU write hit (signal write miss on bus)
Shared → Invalid: bus write miss observed
Exclusive → Exclusive: CPU read or write hit
Exclusive → Shared: bus read miss observed (write back cache line)
Exclusive → Invalid: bus write miss observed (write back cache line)
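A compact sketch of the same three-state controller as a transition function in C (the event names are illustrative; the state changes follow the figure):

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } Event;

static BlockState next_state(BlockState s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;        /* read miss: signal read miss on bus   */
        if (e == CPU_WRITE) return EXCLUSIVE;     /* write miss: signal write miss on bus */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE)      return EXCLUSIVE; /* write hit: signal write miss on bus */
        if (e == BUS_WRITE_MISS) return INVALID;   /* another cache wants to write        */
        return SHARED;                             /* CPU read hit or bus read miss: stay */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  return SHARED;    /* write back line, then share         */
        if (e == BUS_WRITE_MISS) return INVALID;   /* write back line, then invalidate    */
        return EXCLUSIVE;                          /* CPU read or write hit               */
    }
    return s;
}

int main(void) {
    BlockState s = INVALID;
    s = next_state(s, CPU_READ);        /* -> SHARED    */
    s = next_state(s, CPU_WRITE);       /* -> EXCLUSIVE */
    s = next_state(s, BUS_READ_MISS);   /* -> SHARED    */
    printf("final state: %d\n", s);     /* prints 1 (SHARED) */
    return 0;
}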
27.3 Implementing Symmetric Multiprocessors

Figure 27.6 Structure of a generic bus-based symmetric multiprocessor: computing nodes (typically 1-4 CPUs and caches per node), interleaved memory, and I/O modules attach through standard interfaces and bus adapters to a very wide, high-bandwidth bus.
Bus Bandwidth Limits Performance
Example 27.1
Consider a shared-memory multiprocessor built around a single bus with
a data bandwidth of x GB/s. Instructions and data words are 4 B wide, and
each instruction requires access to an average of 1.4 memory words
(including the instruction itself). The combined hit rate for caches is 98%.
Compute an upper bound on the multiprocessor performance in GIPS.
Address lines are separate and do not affect the bus data bandwidth.
Solution
Executing an instruction implies a bus transfer of 1.4 × 0.02 × 4 = 0.112 B.
Thus, an absolute upper bound on performance is x/0.112 = 8.93x GIPS.
Assuming a bus width of 32 B, no bus cycle or data going to waste, and
a bus clock rate of y GHz, the performance bound becomes 286y GIPS.
This bound is highly optimistic. Buses operate in the range 0.1 to 1 GHz.
Thus, a performance level approaching 1 TIPS (perhaps even ¼ TIPS) is
beyond reach with this type of architecture.
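A quick check of the arithmetic in this example (assuming a 32 B bus at y = 1 GHz):

#include <stdio.h>

int main(void) {
    double words_per_instr = 1.4, miss_rate = 0.02, word_bytes = 4.0;
    double bytes_per_instr = words_per_instr * miss_rate * word_bytes;  /* 0.112 B per instruction */
    double bus_width_bytes = 32.0, bus_clock_ghz = 1.0;                 /* assumed bus parameters  */
    double gips_bound = bus_width_bytes * bus_clock_ghz / bytes_per_instr;
    printf("bus bytes per instruction: %.3f\n", bytes_per_instr);
    printf("upper bound: %.0f GIPS\n", gips_bound);                     /* about 286 */
    return 0;
}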
Implementing Snoopy Caches

Figure 27.7 Main structure for a snoop-based cache coherence algorithm: the cache data array is served by two tag/state stores, the main tags and state store for the processor side and a duplicate tags and state store for the snoop side, so the processor-side and snoop-side cache controllers can perform their tag comparisons (=?) concurrently; address and command buffers couple the snoop side to the system bus.
27.4 Distributed Shared Memory

Figure 27.8 Structure of a distributed shared-memory multiprocessor: each of the p processors has its own memory and router, and the routers connect through an interconnection network with parallel input/output. In the illustrated scenario, one node executes "y := -1; z := 1" while another spins in "while z = 0 do x := x + y endwhile", with the shared variables x, y, and z residing in the memories of different nodes.
27.5 Directories to Guide Data Access

Figure 27.9 Distributed shared-memory multiprocessor with a cache, directory, and memory module associated with each processor: each of the p nodes combines a processor and cache, a directory, a memory, and communication and memory interfaces, and the nodes connect through an interconnection network with parallel input/output.
Directory-Based Cache Coherence

Figure 27.10 States and transitions for a directory entry in a directory-based cache coherence protocol (c is the requesting cache). A block's directory entry is Uncached, Shared (read-only), or Exclusive (writable).

Uncached → Shared: read miss (return value, set sharing set to {c})
Uncached → Exclusive: write miss (return value, set sharing set to {c})
Shared → Shared: read miss (return value, include c in sharing set)
Shared → Exclusive: write miss (invalidate all cached copies, set sharing set to {c}, return value)
Exclusive → Shared: read miss (fetch data from owner, return value, include c in sharing set)
Exclusive → Exclusive: write miss (fetch data from owner, request invalidation, return value, set sharing set to {c})
Exclusive → Uncached: data write-back (set sharing set to { })
27.6 Implementing Asymmetric Multiprocessors

Figure 27.11 Structure of a ring-based distributed-memory multiprocessor: computing nodes (typically 1-4 CPUs and associated memory), here nodes 0-3, each attach through a link to a ring network, with memory and connections to I/O controllers at each node.
(Figure 27.11, node detail: a node's processors and caches and its memories attach to the Scalable Coherent Interface (SCI), which links nodes 0-3 and leads to the interconnection network.)
28 Distributed Multicomputing
Computer architects’ dream: connect computers like toy blocks
• Building multicomputers from loosely connected nodes
• Internode communication is done via message passing
Topics in This Chapter
28.1 Communication by Message Passing
28.2 Interconnection Networks
28.3 Message Composition and Routing
28.4 Building and Using Multicomputers
28.5 Network-Based Distributed Computing
28.6 Grid Computing and Beyond
28.1 Communication by Message Passing

Figure 28.1 Structure of a distributed multicomputer: each of the p computing nodes combines memories and processors with a router, and the routers connect through an interconnection network with parallel input/output.
Router Design

Figure 28.2 The structure of a generic router: input channels feed link controllers (LC) and input queues (Q), a switch connects them to output queues and link controllers on the output channels, and a routing and arbitration unit (with an internal message queue) sets the switch; injection and ejection channels connect the local node.
Building Networks from Switches

Figure 28.3 Example 2 × 2 switch with point-to-point and broadcast connection capabilities: the switch can be set to straight through, crossed connection, lower broadcast, or upper broadcast.

(Figure 27.2, repeated: butterfly and Beneš networks, built from such switches, connecting processors to memories.)
Interprocess Communication via Messages

Figure 28.4 Use of send and receive message-passing primitives to synchronize two processes: process B reaches "receive x" and is suspended; when process A later executes "send x", the message crosses the network (the communication latency) and process B is awakened to continue.
28.2 Interconnection Networks

Figure 28.5 Examples of direct and indirect interconnection networks: (a) in a direct network, each node has an associated router and the routers are wired to one another directly; (b) in an indirect network, the nodes attach to a separate fabric of routers.
Direct Interconnection Networks

Figure 28.6 A sampling of common direct interconnection networks: (a) 2D torus, (b) 4D hypercube, (c) chordal ring, (d) ring of rings. Only routers are shown; a computing node is implicit for each router.
Indirect Interconnection Networks

Figure 28.7 Two commonly used indirect interconnection networks: (a) hierarchical buses (level-1, level-2, and level-3 buses), (b) omega network.
28.3 Message Composition and Routing

Figure 28.8 Messages and their parts for message passing: a message is divided into packets (first packet through last packet, with padding as needed); a transmitted packet carries a header, the data or payload, and a trailer, and is further divided into flow-control digits (flits).
Wormhole Switching

Figure 28.9 Concepts of wormhole switching.
(a) Two worms en route to their respective destinations: worm 1 is moving toward destination 1 while worm 2, headed for destination 2, is blocked.
(b) Deadlock due to circular waiting of four blocked worms, each blocked at the point of an attempted right turn.
28.4 Building and Using Multicomputers

Figure 28.10 A task system and schedules on 1, 2, and 3 computers.
(a) Static task graph: eight tasks A through H, with per-task execution times of 1 to 3 units, connected by dependences from the inputs to the outputs.
(b) Schedules on 1-3 computers: the same tasks laid out on a time axis (0 to 15), with the completion time shrinking as more computers are used.
Building Multicomputers from Commodity Nodes

Figure 28.11 Growing clusters using modular nodes.
(a) Current racks of modules: each module holds CPU(s), memory, and disks, plus expansion slots.
(b) Futuristic toy-block construction: modules (CPU, memory, disks) snap together via wireless connection surfaces.
28.5 Network-Based Distributed Computing

Figure 28.12 Network of workstations: each PC attaches, through its system or I/O bus, a fast network interface (NIC) with large memory to a network built of high-speed wormhole switches.
28.6 Grid Computing and Beyond
Computational grid is analogous to the power grid
Decouples the “production” and “consumption” of computational power
Homes don’t have an electricity generator; why should they have a computer?
Advantages of computational grid:
Near continuous availability of computational and related resources
Resource requirements based on sum of averages, rather than sum of peaks
Paying for services based on actual usage rather than peak demand
Distributed data storage for higher reliability, availability, and security
Universal access to specialized and one-of-a-kind computing resources
Still to be worked out: How to charge for computation usage