Vector processors
[Hennessy & Patterson, §B.1] Pipelines can be used to process
vectors as well as scalar quantities.
• In the vector case, a single instruction operates on more
than one set of operands.
• Multiple pipelines may be provided to allow different
operations to take place on different streams of data.
The cost/performance ratio of vector processors is impressive, but
only for problems that fit their mold.
Vector machines address the problem that large scientific programs
have large, active data sets, and thus poor data locality, which
makes cache performance quite bad.
In a vector machine, the compiler can often determine the access
pattern and pipeline the accesses efficiently.
Several characteristics of vector machines make them capable of
high performance:
• The computation of each result is independent of the
computation of previous results, so there are no pipeline
stalls due to hazards. How is this independence achieved?
• A single vector instruction can do the work of an entire
loop.
• Vector instructions process operands that have a regular
structure, and can thus be efficiently loaded into registers
from interleaved memory.
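For instance, a C loop like the one below (a sketch, not from the
text) has all three properties: independent iterations and
unit-stride accesses, so it maps directly onto vector instructions.

    /* Each iteration is independent: c[i] depends only on a[i] and
       b[i], never on c[i-1], so results never wait on earlier results.
       The unit-stride accesses also load efficiently from interleaved
       memory. */
    void vadd(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }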
Basic vector architecture
[H&P, §B.2] A vector machine usually has—
• an ordinary pipelined scalar unit, and
• a vector unit.
Most machines have integer, Boolean, and floating-point vector
instructions that operate on the data.
All vector machines built today are vector-register machines: All
instructions take their operands from vector registers except load and
store.
What would be the alternative to a vector-register machine?
Here is a diagram of a sample vector architecture. It is Hennessy and
Patterson's DLXV, and is loosely based on the Cray-1. Its scalar
portion is Hennessy and Patterson's DLX (see ECE 521).

[Figure: DLXV block diagram. Main memory connects through a vector
load/store unit to the vector registers and scalar registers, which
feed pipelined functional units for FP add/subtract, FP multiply,
FP divide, integer, and Boolean operations.]
The architecture has these components:
• Vector registers: 8 registers, each of which holds 64
doublewords.
• Vector functional units: Each unit is fully pipelined and can
start a new operation on each clock cycle. A control unit
must detect conflicts for functional units and data
accesses.
• Vector load/store unit: A vector memory unit that loads or
stores a vector to or from memory. Moves one word/cycle,
after a startup latency.
• Scalar registers: 32 general-purpose and 32 floating-point.
Here is a table, taken from Hennessy and Patterson (Fig. B.2), giving
the characteristics of several vector-register machines.

Processor | Year announced | Clock rate (MHz) | Registers | Elements per register (64-bit) | Functional units | Load-store units
Cray-1 | 1976 | 80 | 8 | 64 | 6: add, multiply, reciprocal, integer add, logical, shift | 1
Cray X-MP / Cray Y-MP | 1983 / 1988 | 120 / 166 | 8 | 64 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 2 load, 1 store
Cray-2 | 1985 | 166 | 8 | 64 | 5: FP add, FP multiply, FP reciprocal/sqrt, integer (add, shift, population count), logical | 1
Fujitsu VP100/200 | 1982 | 133 | 8–256 | 32–1024 | 3: FP or integer add/logical, multiply, divide | 2
Hitachi S810/820 | 1985 | 71 | 32 | 256 | 4: 2 integer add/logical, 1 multiply-add, 1 multiply/divide-add | 4
Convex C-1 | 1985 | 10 | 8 | 128 | 4: multiply, add, divide, integer/logical | 1
NEC SX/2 | 1984 | 160 | 8 + 8192 | 256 (variable) | 16: 4 integer add/logical, 4 FP multiply/divide, 4 FP add, 4 shift | 8
DLXV | 1990 | 200 | 8 | 64 | 5: multiply, divide, add, integer add, logical | 1
Cray C-90 | 1991 | 240 | 8 | 128 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 4
Convex C-4 | 1994 | 135 | 16 | 128 | 3: each is full integer, logical, and FP (including multiply-add) | —
NEC SX/4 | 1995 | 400 | 8 + 8192 | 256 (variable) | 16: 4 integer add/logical, 4 FP multiply/divide, 4 FP add, 4 shift | 8
Cray J-90 | 1995 | 100 | 8 | 64 | 4: FP add, FP multiply, FP reciprocal, integer/logical | —
Cray T-90 | 1996 | ≈500 | 8 | 128 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 4
All vector operations in DLXV are double-precision floating point.
Instruction set
There are five kinds of vector instructions, depending on whether the
operands and results are vectors or scalars:
Type                 Example
f1:  V → V           LV
f2:  V → S           POP
f3:  V × V → V       ADDV
f4:  S × V → V       ADDSV
f5:  V × V → S       S__V
Two special-purpose registers are also needed:
VLR:  the vector-length register, and
VM:   the vector-mask register.
Here are the vector instructions in DLXV:
Opcode   Operands      Function
ADDV     V1, V2, V3    Add elements of V2 and V3, then put each result in V1.
ADDSV    V1, V2, F0    Add F0 to each element of V2, then put each result in V1.
SUBV     V1, V2, V3    Subtract elements of V3 from V2, then put each result in V1.
SUBVS    V1, V2, F0    Subtract F0 from elements of V2, then put each result in V1.
SUBSV    V1, F0, V2    Subtract elements of V2 from F0, then put each result in V1.
MULTV    V1, V2, V3    Multiply elements of V2 and V3, then put each result in V1.
MULTSV   V1, F0, V2    Multiply F0 by each element of V2, then put each result in V1.
DIVV     V1, V2, V3    Divide elements of V2 by V3, then put each result in V1.
DIVVS    V1, V2, F0    Divide elements of V2 by F0, then put each result in V1.
DIVSV    V1, F0, V2    Divide F0 by elements of V2, then put each result in V1.
LV       V1, R1        Load vector register V1 from memory, starting at address R1.
SV       R1, V1        Store vector register V1 into memory, starting at address R1.
LVWS     V1, (R1,R2)   Load V1 from address in R1 with stride in R2, i.e., R1 + i*R2.
SVWS     (R1,R2), V1   Store V1 at address in R1 with stride in R2, i.e., R1 + i*R2.
LVI      V1, (R1+V2)   Load V1 with the vector whose elements are at R1 + V2(i); i.e., V2 is an index.
SVI      (R1+V2), V1   Store V1 to the vector whose elements are at R1 + V2(i).
CVI      V1, R1        Create an index vector by storing the values 0, 1*R1, 2*R1, …, 63*R1 into V1.
S__V     V1, V2        Compare (=, ≠, >, <, ≥, ≤) the elements of V1 and V2. If the condition is true, put a 1 in the corresponding bit; otherwise put a 0. Put the resulting bit vector into the vector-mask register (VM).
S__SV    F0, V1        Perform the same compare, but with scalar F0 as one operand.
POP      R1, VM        Count the 1s in VM and store the count in R1.
CVM                    Set the vector-mask register to all 1s.
MOVI2S   VLR, R1       Move the contents of R1 to the vector-length register.
MOVS2I   R1, VLR       Move the contents of the vector-length register to R1.
MOVF2S   VM, F0        Move the contents of F0 to the vector-mask register.
MOVS2F   F0, VM        Move the contents of the vector-mask register to F0.
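To make these semantics concrete, here is a minimal C sketch (function
and variable names are invented for illustration) of how ADDV and one
of the S__V compares behave, element by element, assuming operations
respect VLR and the mask bits in VM:

    #include <stdint.h>

    #define VLEN 64

    /* Illustrative DLXV state: 8 vector registers, plus VLR and VM. */
    static double   V[8][VLEN];
    static uint32_t VLR = VLEN;   /* vector-length register          */
    static uint64_t VM  = ~0ULL;  /* vector-mask register, 1 bit/elt */

    /* ADDV Vd,Vs1,Vs2: elementwise add of the first VLR elements,
       only where the corresponding mask bit is set. */
    static void addv(int d, int s1, int s2)
    {
        for (uint32_t i = 0; i < VLR; i++)
            if (VM & (1ULL << i))
                V[d][i] = V[s1][i] + V[s2][i];
    }

    /* SGTV Vs1,Vs2 (one of the S__V compares): set bit i of VM
       iff Vs1[i] > Vs2[i]. */
    static void sgtv(int s1, int s2)
    {
        uint64_t m = 0;
        for (uint32_t i = 0; i < VLR; i++)
            if (V[s1][i] > V[s2][i])
                m |= 1ULL << i;
        VM = m;
    }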
Let us take a look at a sample program. We want to calculate
Y := a*X + Y
X and Y are vectors, initially in memory, and a is a scalar. This is the
DAXPY (double-precision a times X plus Y) that forms the inner loop
of the Linpack benchmark’s Gaussian elimination routine.
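In C, the computation is just this loop (a sketch; the benchmark's
actual routine carries more context):

    /* DAXPY: Y := a*X + Y, for n-element vectors X and Y. */
    void daxpy(int n, double a, const double *X, double *Y)
    {
        for (int i = 0; i < n; i++)
            Y[i] = a * X[i] + Y[i];
    }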
For now, we will assume that the length of the vector is 64, the same
as the length of a vector register.
We assume that the starting addresses of X and Y are in RX and RY,
respectively.
Here is the code for DLX (non-vector architecture).

 1.        LD     F0, a          Load scalar a
 2.        ADDI   R4, RX, #512   Last address to load
    loop:
 3.        LD     F2, 0(RX)      Load X[i]
 4.        MULTD  F2, F0, F2     a*X[i]
 5.        LD     F4, 0(RY)      Load Y[i]
 6.        ADDD   F4, F2, F4     a*X[i] + Y[i]
 7.        SD     F4, 0(RY)      Store into Y[i]
 8.        ADDI   RX, RX, #8     Increment index for X
 9.        ADDI   RY, RY, #8     Increment index for Y
10.        SUB    R20, R4, RX    Compute bound
11.        BNZ    R20, loop      Check if done
Here is the DLXV code.

1.   LD      F0, a          Load scalar a
2.   LV      V1, RX         Load vector X
3.   MULTSV  V2, F0, V1     Vector-scalar multiply
4.   LV      V3, RY         Load vector Y
5.   ADDV    V4, V2, V3     Add
6.   SV      RY, V4         Store the result
In going from the scalar to the vector code, the number of instructions
executed is reduced from almost 600 to 6.
There are two reasons for this.
• Each vector operation works on all 64 elements at once, so one
vector instruction replaces an entire loop's worth of scalar
operations.
• The loop-overhead instructions (index increments, bound
computation, and the branch) disappear along with the loop.
What can you say about pipeline interlocks? How does the change
from scalar to vector code affect them?
Vector execution time
The execution time of a sequence of vector operations depends on—
• length of vectors being operated on,
• structural hazards (conflicts for resources) between
operations, and
• data dependencies.
If two instructions contain no data dependencies or structural
hazards, they could potentially begin execution on the same clock
cycle.
(Most architectures can only initiate one operation per clock cycle,
but for reasonably long vectors, the execution time of a sequence
depends much more on the length of the vectors than on whether
more than one operation can be initiated per cycle.)
Hennessy and Patterson introduce the term convoy to refer to the set
of vector instructions that could potentially begin execution in a single
clock period.
It provides an approximation to the time required for a sequence of
vector operations.
A chime is the time it takes to execute a convoy (the time is called
one chime, regardless of vector length).
Thus,
• a vector sequence that consists of m convoys executes in
m chimes;
• if the vector length is n, this is approximately m × n clock cycles.
Measuring in terms of chimes explicitly ignores processor-specific
overheads.
Because most architectures can initiate only one vector operation per
clock cycle, the chime count underestimates the actual execution time.
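As a minimal sketch (the name is invented), the approximation is just
a product:

    /* Chime approximation: m convoys on vectors of length n take about
       m chimes, i.e., roughly m * n clock cycles.  Startup latency and
       the one-operation-per-cycle issue limit are ignored. */
    static long chime_cycles(long m, long n)
    {
        return m * n;    /* e.g., 4 convoys, n = 64  =>  256 cycles */
    }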
Example: Consider this code:

LV      V1, RX        Load vector X
MULTSV  V2, F0, V1    Vector-scalar multiply
LV      V3, RY        Load vector Y
ADDV    V4, V2, V3    Add
SV      RY, V4        Store the result
How many chimes will this sequence take?
How many chimes per FLOP are there?
The chime approximation is reasonably accurate for long vectors.
For 64-element vectors, four chimes ≈ 256 clock cycles. The
overhead of issuing Convoy 2 in two separate clocks is small.
Startup time
The startup time is the chief overhead ignored by the chime model.
For load and store operations, it is usually longer than for other
operations. For loads:
• up to 50 clock cycles on some machines.
• 9 to 17 clock cycles on the Cray-1 and Cray X-MP.
• 12 clock cycles on DLXV.
For stores, startup time makes less difference.
• Stores do not produce results, so it is not necessary to wait
for them to finish.
• Unless, e.g., a load has to wait for a store to finish when
there is only one memory pipeline.
Here are the startup penalties for DLXV vector operations:
Operation         Startup penalty
Vector add        6
Vector multiply   7
Vector divide     20
Vector load       12
Assuming these times, how long will it really take to execute our
four-convoy example, above?
Convoy          Start time    1st-result time    Last-result time
1. LV           0             12                 11 + n
2. MULTSV LV    12 + n        12 + n + 12        24 + 2n
3. ADDV         25 + 2n       25 + 2n + 6        30 + 3n
4. SV           31 + 3n       31 + 3n + 12       42 + 4n
Thus, assuming that no other instruction can overlap the final SV, we
have an execution time of 42 + 4n.
For a vector of length 64, this evaluates to 42 + 4 × 64 = 298 clock
cycles.
The number of clock cycles per result is therefore 298/64 ≈ 4.66.
The chime approximation is 4 clock cycles per result. Thus the
execution time with startup is 1.16 times higher.
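A few lines of C confirm the arithmetic (the 42 + 4n figure comes
from the table above):

    #include <stdio.h>

    int main(void)
    {
        int n = 64;
        int total = 42 + 4 * n;                /* execution time: 298     */
        double per_result = (double)total / n; /* 298/64 = 4.66           */
        double ratio = per_result / 4.0;       /* vs. 4 chimes per result */

        printf("total = %d cycles, %.2f cycles/result, %.2f x chime\n",
               total, per_result, ratio);      /* 298, 4.66, 1.16 */
        return 0;
    }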
Effect of the memory system
To maintain a rate of one word per cycle, the memory system must
be able to produce/accept data this fast.
This is done by creating separate memory banks that can be
accessed in parallel. This is similar to interleaving.
Memory banks can be implemented in two ways:
• Parallel access. Access all the banks in parallel, latch the
results, then transfer them one by one. This is essentially
S-access interleaving.
• Independent bank phasing. On the first access, access all banks
in parallel, then transfer words one at a time, beginning a new
access in each bank as soon as its previous access finishes. This
is essentially C-access interleaving.
If an initiation rate of one word per clock cycle is to be maintained, it
must be true that—
# of memory banks ≥ memory access time in clock cycles.
Example: We want to fetch—
• a vector of 64 elements
• starting at byte address 136,
• with each memory access taking 6 “clocks.”
Then,
• How many memory banks do we need?
• What addresses are used to access the banks?
• When will each element arrive at the CPU?
We need 8 banks so that references can proceed at 1/clock cycle.
We will employ independent bank phasing.
Here is a table showing the memory addresses (in bytes) by
• bank number, and
• time at which access begins.
Beginning                        Bank
at cycle #      0     1     2     3     4     5     6     7
 0            192   136   144   152   160   168   176   184
 6            256   200   208   216   224   232   240   248
14            320   264   272   280   288   296   304   312
22            384   328   336   344   352   360   368   376
 …              …     …     …     …     …     …     …     …
Note that—
• Accesses begin 8 bytes apart, since the elements are 8-byte
doublewords.
• The exact time that a bank transmits its data is given by
  (its address – starting address)/8 + the memory latency.
  Memory latency is 6 cycles.
• The first access is to bank 1. Why?
• As with C-access interleaving, an address latch is required for
each memory bank.
• Why does the second go-round of accesses start at time 6
rather than time 8?
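A short C program (a sketch, using the parameters of this example)
reproduces the schedule in the table, including the bank-1 start and
the delivery formula above:

    #include <stdio.h>

    #define NBANKS  8
    #define WORD    8    /* bytes per doubleword          */
    #define LATENCY 6    /* memory access time, in cycles */

    int main(void)
    {
        long start = 136;            /* starting byte address */
        for (int i = 0; i < 64; i++) {
            long addr  = start + (long)i * WORD;
            int  bank  = (int)((addr / WORD) % NBANKS); /* elt 0 -> bank 1 */
            int  round = i / NBANKS;
            /* Round 0 starts every bank at cycle 0.  A bank can begin
               again once its first access completes (cycle 6); later
               rounds follow 8 cycles apart, gated by the 1-word/cycle
               transfer of each round's 8 words. */
            int  begin = (round == 0) ? 0 : LATENCY + NBANKS * (round - 1);
            /* Delivery time, per the formula above. */
            long deliver = (addr - start) / WORD + LATENCY;

            printf("elt %2d: addr %3ld, bank %d, access begins %2d, "
                   "delivered at %2ld\n", i, addr, bank, begin, deliver);
        }
        return 0;
    }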
Here is a time line showing when each access occurs.

Time      Action
0         Memory access (first round, all banks)
6         Next memory access; deliver previous 8 words
14        Next memory access; deliver previous 8 words
22        Next memory access; deliver previous 8 words
…         …
62–70     Deliver last 8 words