Vector Computers

Computer Architecture
Contents
1. Why Vector Processors?
2. Basic Vector Architecture
3. How Vector Processors Work
4. Vector Length and Stride
5. Effectiveness of Compiler Vectorization
6. Enhancing Vector Performance
7. Performance of Vector Processors
Vector Processors
I’m certainly not inventing vector processors. There
are three kinds that I know of existing today. They are
represented by the Illiac-IV, the (CDC) Star processor,
and the TI (ASC) processor. Those three were all
pioneering processors. . . . One of the problems of being
a pioneer is you always make mistakes and I never,
never want to be a pioneer. It’s always best to come
second when you can look at the mistakes the pioneers
made.
Seymour Cray
Public lecture at Lawrence Livermore Laboratories
on the introduction of the Cray-1 (1976)
Supercomputers
Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O
bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
The CDC 6600 (Cray, 1964) is regarded as the first supercomputer
Supercomputer Applications
Typical application areas
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)
All involve huge computations on large data sets
In the 70s–80s, Supercomputer ≈ Vector Machine
1. Why Vector Processors?
• A single vector instruction specifies a great deal of
work—it is equivalent to executing an entire loop.
• The computation of each result in the vector is
independent of the computation of other results in the
same vector and so hardware does not have to check for
data hazards within a vector instruction.
• Hardware need only check for data hazards between two
vector instructions once per vector operand, not once for
every element within the vectors.
• Vector instructions that access memory have a known
access pattern.
• Because an entire loop is replaced by a vector instruction
whose behavior is predetermined, control hazards that
would normally arise from the loop branch are
nonexistent.
2. Basic Vector Architecture
• There are two primary types of architectures
for vector processors: vector-register
processors and memory-memory vector
processors.
– In a vector-register processor, all vector
operations—except load and store—are among the
vector registers.
– In a memory-memory vector processor, all vector
operations are memory to memory.
Vector Memory-Memory versus Vector Register Machines
• Vector memory-memory instructions hold all vector operands in
main memory
• The first vector machines, CDC Star-100 (‘73) and TI ASC (‘71), were
memory-memory machines
• Cray-1 (’76) was first vector register machine
Example Source Code

for (i=0; i<N; i++)
{
  C[i] = A[i] + B[i];
  D[i] = A[i] - B[i];
}

Vector Memory-Memory Code

ADDV C, A, B
SUBV D, A, B

Vector Register Code

LV   V1, A
LV   V2, B
ADDV V3, V1, V2
SV   V3, C
SUBV V4, V1, V2
SV   V4, D
Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory architectures (VMMA) require
greater main memory bandwidth, why?
– All operands must be read in and out of memory
• VMMAs make it difficult to overlap execution of multiple
vector operations, why?
– Must check dependencies on memory addresses
• VMMAs incur greater startup latency
– Scalar code was faster on CDC Star-100 for vectors < 100 elements
– For Cray-1, vector/scalar breakeven point was around 2 elements
Apart from CDC follow-ons (Cyber-205, ETA-10) all major
vector machines since Cray-1 have had vector register
architectures
(we ignore vector memory-memory from now on)
[Figure: the basic structure of a vector-register architecture, VMIPS.]
Primary Components of VMIPS
• Vector registers — VMIPS has eight vector registers, and
each holds 64 elements. Each vector register must have
at least two read ports and one write port.
• Vector functional units — Each unit is fully pipelined and
can start a new operation on every clock cycle.
• Vector load-store unit —The VMIPS vector loads and
stores are fully pipelined, so that words can be moved
between the vector registers and memory with a
bandwidth of 1 word per clock cycle, after an initial
latency.
• A set of scalar registers —Scalar registers can also
provide data as input to the vector functional units, as well
as compute addresses to pass to the vector load-store
unit.
Vector Supercomputers
Epitomized by Cray-1, 1976:
Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory
Cray-1 (1976)

[Figure: Cray-1 block diagram. Eight 64-element vector registers (V0–V7) with vector mask and vector length registers; eight scalar registers (S0–S7) backed by 64 T registers; eight address registers (A0–A7) backed by 64 B registers. Functional units: FP Add, FP Mul, FP Recip; Int Add, Int Logic, Int Shift, Pop Cnt; Addr Add, Addr Mul. Single-port memory of 16 banks of 64-bit words plus 8-bit SECDED; memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz); 80 MW/sec data load/store; 320 MW/sec instruction buffer refill into 4 instruction buffers (64 bits × 16), with NIP/LIP/CIP instruction registers.]
Vector Programming Model

[Figure: the vector programming model. Scalar registers r0–r7; vector registers v0–v7, each holding elements [0] through [VLRMAX-1]; a vector length register (VLR). A vector arithmetic instruction such as ADDV v3, v1, v2 adds corresponding elements of v1 and v2 into v3. A vector load/store such as LV v1, r1, r2 moves VLR elements between memory and a vector register, with the base address in r1 and the stride in r2.]
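To make the model concrete, here is a minimal C sketch of vector-register semantics under the VLR. MVL, VReg, addv, and lv are illustrative names, not part of VMIPS; the stride argument mirrors LV's base/stride form.

#include <stdio.h>

#define MVL 64  /* maximum vector length; VMIPS registers hold 64 elements */

typedef struct { double e[MVL]; } VReg;   /* one vector register */

static int vlr = MVL;                     /* vector length register */

/* ADDV v3, v1, v2: element-wise add of the first VLR elements */
void addv(VReg *v3, const VReg *v1, const VReg *v2) {
    for (int i = 0; i < vlr; i++)
        v3->e[i] = v1->e[i] + v2->e[i];
}

/* LV v1, r1, r2: load VLR elements from base address r1 with stride r2 */
void lv(VReg *v1, const double *base, long stride) {
    for (int i = 0; i < vlr; i++)
        v1->e[i] = base[i * stride];
}

int main(void) {
    double a[MVL], b[MVL];
    for (int i = 0; i < MVL; i++) { a[i] = i; b[i] = 2.0 * i; }
    VReg v1, v2, v3;
    lv(&v1, a, 1);                        /* unit-stride loads */
    lv(&v2, b, 1);
    addv(&v3, &v1, &v2);
    printf("v3[10] = %g\n", v3.e[10]);    /* 10 + 20 = 30 */
    return 0;
}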
• In VMIPS, vector operations use the same names as MIPS
operations, but with the letter “V” appended.
Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
      LI     R4, 64
loop: L.D    F0, 0(R1)
      L.D    F2, 0(R2)
      ADD.D  F4, F2, F0
      S.D    F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ   R4, loop

# Vector Code
      LI     VLR, 64
      LV     V1, R1
      LV     V2, R2
      ADDV.D V3, V1, V2
      SV     V3, R3
Vector Instruction Set Advantages
• Compact
  – one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
• Scalable
  – can run same object code on more parallel pipelines or lanes
3. How Vector Processors Work
3.1 An Example
• Let’s take a typical vector problem, where X and Y are vectors and
a is a scalar:
• Y = a×X + Y
• This is the so-called SAXPY or DAXPY loop that forms the inner
loop of the Linpack benchmark.
• Example Show the code for MIPS and VMIPS for the DAXPY loop.
• Assume that the starting addresses of X and Y are in Rx and Ry,
and that the number of elements, or length, of a vector register (64)
matches the length of the vector operation.
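For reference, the loop itself in C (a direct transcription of Y = a×X + Y; daxpy is just an illustrative name):

/* The DAXPY loop in C: Y = a*X + Y, element by element. */
void daxpy(double a, const double *x, double *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}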
• Here is the MIPS code.

      L.D    F0,a        ;load scalar a
      DADDIU R4,Rx,#512  ;last address to load
Loop: L.D    F2,0(Rx)    ;load X(i)
      MUL.D  F2,F2,F0    ;a × X(i)
      L.D    F4,0(Ry)    ;load Y(i)
      ADD.D  F4,F4,F2    ;a × X(i) + Y(i)
      S.D    0(Ry),F4    ;store into Y(i)
      DADDIU Rx,Rx,#8    ;increment index to X
      DADDIU Ry,Ry,#8    ;increment index to Y
      DSUBU  R20,R4,Rx   ;compute bound
      BNEZ   R20,Loop    ;check if done
• Here is the VMIPS code for DAXPY.

      L.D     F0,a      ;load scalar a
      LV      V1,Rx     ;load vector X
      MULVS.D V2,V1,F0  ;vector-scalar multiply
      LV      V3,Ry     ;load vector Y
      ADDV.D  V4,V2,V3  ;add
      SV      Ry,V4     ;store the result
• The most dramatic comparison is that the vector
processor greatly reduces the dynamic instruction
bandwidth.
• Another important difference is the frequency of pipeline
interlocks. (Pipeline stalls are required only once per
vector operation, rather than once per vector element.)
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are
independent (=> no hazards!)

[Figure: a six-stage multiply pipeline computing V3 <- V1 * V2, with
operand pairs streaming in from V1 and V2 and results into V3.]
3.2 Vector Load-Store Units and Vector Memory Systems

Start-up penalties (in clock cycles) on VMIPS:

Operation           Start-up penalty
Vector add          6
Vector multiply     7
Vector divide       20
Vector load/store   12
To maintain an initiation rate of 1 word fetched or stored
per clock, the memory system must be capable of
producing or accepting this much data. This is usually
done by spreading accesses across multiple
independent memory banks.
Vector Memory System
Cray-1: 16 banks, 4 cycle bank busy time, 12 cycle latency
• Bank busy time: cycles between accesses to the same bank

[Figure: base and stride feed an address generator in front of the
vector registers; the generated addresses cycle across 16 memory
banks, numbered 0 through F.]
• Example
Suppose we want to fetch a vector of 64 elements starting
at byte address 136, and a memory access takes 6 clocks.
How many memory banks must we have to support one
fetch per clock cycle? With what addresses are the banks
accessed? When will the various elements arrive at the
CPU?
• Answer
Six clocks per access require at least six banks, but
because we want the number of banks to be a power of
two, we choose to have eight banks. Figure on next page
shows the timing for the first few sets of accesses for an
eight-bank system with a 6-clock-cycle access latency.
The CPU cannot keep all eight banks busy all the time because it is limited to
supplying one new address and receiving one data item each cycle.
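The bank arithmetic can be checked with a short C sketch (a hypothetical model of word-interleaved banks as in the figure): bank = (address / 8) mod 8, one address issued per clock, data arriving 6 clocks later. With stride-1 access, each bank is revisited only every 8 clocks, longer than its 6-clock busy time, so no conflicts arise.

#include <stdio.h>

int main(void) {
    const int banks = 8, latency = 6, word = 8;   /* 8 banks, 6-clock access, 8-byte words */
    const long start = 136;                       /* starting byte address from the example */
    for (int i = 0; i < 16; i++) {                /* first 16 of the 64 elements */
        long addr  = start + (long)i * word;
        int  bank  = (int)((addr / word) % banks);/* word-interleaved bank selection */
        int  issue = i;                           /* one new address issued per clock */
        int  ready = issue + latency;             /* clock when the word reaches the CPU */
        printf("elem %2d  addr %4ld  bank %d  ready at clock %2d\n",
               i, addr, bank, ready);
    }
    return 0;
}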
4. Two Real-World Issues: Vector Length and Stride
• What do you do when the vector length in a program is
not exactly 64?
• How do you deal with nonadjacent elements in vectors
that reside in memory?
4.1 Vector-Length Control
   do 10 i = 1,n
10 Y(i) = a × X(i) + Y(i)
n may not even be known until run time
• The solution is to create a vector-length
register (VLR), which controls the length of
any vector operation.
• The value in the VLR, however, cannot be
greater than the length of the vector registers
— maximum vector length (MVL).
• If the vector is longer than the maximum
length, a technique called strip mining is used.
Vector Stripmining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers,
“Stripmining”

for (i=0; i<N; i++)
  C[i] = A[i]+B[i];

      ANDI   R1, N, #63  ; N mod 64
      MTC1   VLR, R1     ; Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, #3  ; Multiply by 8
      DADDU  RA, RA, R2  ; Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1    ; Subtract elements
      LI     R1, #64
      MTC1   VLR, R1     ; Reset full length
      BGTZ   N, loop     ; Any more to do?

[Figure: the vector is processed as a short remainder piece followed
by full 64-element pieces.]
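The same strip-mining structure in C (a sketch; vadd_stripmined is an illustrative name): the first piece executes n mod 64 elements, every later piece a full 64.

#define MVL 64   /* vector register length */

/* Strip-mined C[i] = A[i] + B[i]: the first piece handles n mod 64
   elements, every later piece a full 64, mirroring the assembly above. */
void vadd_stripmined(double *c, const double *a, const double *b, int n) {
    int vl = n % MVL;
    if (vl == 0) vl = MVL;               /* avoid a zero-length first piece */
    for (int low = 0; low < n; low += vl, vl = MVL) {
        for (int i = 0; i < vl; i++)     /* one vector instruction's worth */
            c[low + i] = a[low + i] + b[low + i];
    }
}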
4.2 Vector Stride
   do 10 i = 1,100
     do 10 j = 1,100
       A(i,j) = 0.0
       do 10 k = 1,100
10       A(i,j) = A(i,j)+B(i,k)*C(k,j)
At the statement labeled 10 we could vectorize the
multiplication of each row of B with each column of C.
When an array is allocated memory, it is linearized and
must be laid out in either row-major or column-major
order. This linearization means that either the elements
in the row or the elements in the column are not
adjacent in memory.
Vector Stride
The distance separating elements that are to be gathered into a
single register is called the stride.
• The vector stride, like the vector starting address, can
be put in a general-purpose register.
• Then the VMIPS instruction LVWS (load vector with
stride) can be used to fetch the vector into a vector
register.
• Likewise, when a nonunit stride vector is being
stored, SVWS (store vector with stride) can be used.
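In C (row-major, so the roles are flipped relative to the column-major Fortran example), the inner product below touches B at unit stride and C at stride N; an LVWS with a stride of N × 8 bytes would gather the column. A sketch:

#define N 100

/* Row-major C stores B[i][k] with the k's adjacent; walking down a
   column of C touches elements N doubles (800 bytes) apart, so a
   strided load (LVWS) would be needed to gather C[k][j]. */
double dot_row_col(double B[N][N], double C[N][N], int i, int j) {
    double sum = 0.0;
    for (int k = 0; k < N; k++)
        sum += B[i][k] * C[k][j];   /* B: unit stride; C: stride N */
    return sum;
}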
5. Effectiveness of Compiler Vectorization
• Two factors affect the success with which a
program can be run in vector mode.
• The first factor is the structure of the program
itself. This factor is influenced by the
algorithms chosen and by how they are coded.
• The second factor is the capability of the
compiler.
Automatic Code Vectorization
for (i=0; i < N; i++)
  C[i] = A[i] + B[i];

[Figure: scalar sequential code executes load, load, add, store once
per iteration, one iteration after another; vectorized code issues the
loads, adds, and stores of many iterations together as vector
instructions, overlapping them in time.]

Vectorization is a massive compile-time reordering of operation
sequencing
⇒ requires extensive loop dependence analysis
6. Enhancing Vector Performance
In this section we present five techniques for improving the
performance of a vector processor:
• Chaining
• Conditionally Executed Statements
• Sparse Matrices
• Multiple Lanes
• Pipelined Instruction Start-Up
(1) Vector Chaining
• The concept of forwarding extended to vector registers
• Vector version of register bypassing
  – introduced with Cray-1

LV   V1
MULV V3,V1,V2
ADDV V5,V3,V4

[Figure: the load unit streams V1 from memory and chains into the
multiplier producing V3, which in turn chains into the adder
producing V5.]
Vector Chaining Advantage
• Without chaining, must wait for last element of result to be
written before starting dependent instruction

[Figure: Load, Mul, and Add execute strictly back to back in time.]

• With chaining, can start dependent instruction as soon as first
result appears

[Figure: Mul overlaps the tail of Load, and Add overlaps Mul.]
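The advantage can be quantified with back-of-envelope arithmetic (a sketch assuming the VMIPS start-up penalties above, multiply 7 and add 6, vector length 64, and one element per cycle once a pipeline is started):

#include <stdio.h>

int main(void) {
    int n = 64, mul_start = 7, add_start = 6;
    int unchained = (mul_start + n) + (add_start + n);  /* wait for last result */
    int chained   = mul_start + add_start + n;          /* add follows first result */
    printf("unchained: %d cycles, chained: %d cycles\n", unchained, chained);
    return 0;   /* prints 141 vs 77 */
}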
Implementations of Chaining
• Early implementations worked like forwarding,
but this restricted the timing of the source and
destination instructions in the chain.
• Recent implementations use flexible chaining, which requires
simultaneous access to the same vector register by different vector
instructions. This can be implemented either by adding more read and
write ports or by organizing the vector-register file storage into
interleaved banks, much as in the memory system.
(2) Vector Conditional Execution
Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++)
if (A[i]>0) then
A[i] = B[i];
Solution: Add vector mask (or flag) registers
– vector version of predicate registers, 1 bit per element
…and maskable vector instructions
– vector operation becomes NOP at elements where mask bit is clear
Code example:
CVM                ; Turn on all elements
LV      VA, RA     ; Load entire A vector
L.D     F0, #0     ; Load FP zero into F0
SGTVS.D VA, F0     ; Set bits in mask register where A>0
LV      VA, RB     ; Load B vector into A under mask
SV      VA, RA     ; Store A back to memory under mask
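In scalar C, the masked sequence above behaves like the following sketch (masked_overwrite is an illustrative name; n is assumed to be at most MVL):

#define MVL 64

/* Masked-vector semantics in scalar C (a sketch): the mask register
   is one bit per element; a maskable operation writes only where the
   bit is set, like the SGTVS.D / masked LV / SV sequence above. */
void masked_overwrite(double *a, const double *b, int n) {
    unsigned char mask[MVL];
    for (int i = 0; i < n; i++)
        mask[i] = (a[i] > 0.0);    /* SGTVS.D: set mask where A > 0 */
    for (int i = 0; i < n; i++)
        if (mask[i])               /* element is a NOP when bit is clear */
            a[i] = b[i];
}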
Masked Vector Instructions
Simple Implementation
– execute all N operations, turn off result writeback according to
mask
Density-Time Implementation
– scan mask vector and only execute elements with non-zero masks

[Figure: in the simple implementation, every element pair A[i], B[i]
flows through the pipeline and the mask bit M[i] gates the write
enable; in the density-time implementation, only the elements whose
mask bits are set are sent to the write data port.]
Compress/Expand Operations
• Compress packs non-masked elements from one vector register
contiguously at start of destination vector register
  – population count of mask vector gives packed vector length
• Expand performs inverse operation

[Figure: with mask bits set at positions 1, 4, 5, and 7, compress
gathers A[1], A[4], A[5], A[7] to the front of the destination;
expand scatters a packed vector back to those masked positions,
leaving the other elements unchanged.]

Used for density-time conditionals and also for general selection
operations
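A scalar C sketch of both operations (compress and expand are illustrative names):

#define MVL 64

/* Compress packs the elements whose mask bit is set to the front of
   dst and returns the packed length (the population count of the
   mask); expand is the inverse, scattering a packed vector back to
   the masked positions. */
int compress(double *dst, const double *src, const unsigned char *mask, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[k++] = src[i];
    return k;                            /* packed vector length */
}

void expand(double *dst, const double *src, const unsigned char *mask, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[i] = src[k++];  /* unmasked positions untouched */
}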
(3) Sparse Matrices
Vector Scatter/Gather
Want to vectorize loops with indirect accesses:
(index vector D designates the nonzero elements of C)
for (i=0; i<N; i++)
  A[i] = B[i] + C[D[i]];
Indexed load instruction (Gather)
LV      VD, RD       ; Load indices in D vector
LVI     VC, (RC,VD)  ; Load indirect from RC base
LV      VB, RB       ; Load B vector
ADDV.D  VA, VB, VC   ; Do add
SV      VA, RA       ; Store result
Vector Scatter/Gather
Scatter example:
for (i=0; i<N; i++)
  A[B[i]]++;

Is the following a correct translation?
LV   VB, RB       ; Load indices in B vector
LVI  VA, (RA,VB)  ; Gather initial A values
ADDV VA, VA, 1    ; Increment
SVI  VA, (RA,VB)  ; Scatter incremented values
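The answer is no in general: if B contains duplicate indices, each gathered copy is incremented independently and the scatter writes them all back to the same location, so that element gains only 1 instead of the duplicate count. A small C demonstration:

#include <stdio.h>

int main(void) {
    int A[4] = {0, 0, 0, 0};
    int B[3] = {2, 2, 2};          /* index 2 repeated three times */
    int VA[3];
    for (int i = 0; i < 3; i++) VA[i] = A[B[i]];      /* gather  */
    for (int i = 0; i < 3; i++) VA[i] = VA[i] + 1;    /* ADDV    */
    for (int i = 0; i < 3; i++) A[B[i]] = VA[i];      /* scatter */
    printf("A[2] = %d (the scalar loop gives 3)\n", A[2]);  /* prints 1 */
    return 0;
}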
(4) Multiple Lanes

Vector Instruction Execution: ADDV C,A,B

[Figure: with one pipelined functional unit, element pairs enter one
per cycle (A[6]+B[6] behind A[5]+B[5], and so on), producing C[0],
C[1], C[2], ... in sequence. With four pipelined functional units
(four lanes), four element pairs enter per cycle — e.g.,
A[24..27]+B[24..27] — and results C[0..3], C[4..7], C[8..11], ...
complete four at a time.]
Vector Unit Structure

[Figure: a functional unit is split across four lanes; lane 0 holds
vector register elements 0, 4, 8, ..., lane 1 holds elements 1, 5,
9, ..., lane 2 holds elements 2, 6, 10, ..., and lane 3 holds
elements 3, 7, 11, ... Each lane has its own vector-register file
partition, pipeline, and port to the memory subsystem.]

T0 Vector Microprocessor (1995)

[Figure: vector register elements striped over eight lanes; lane 0
holds elements 0, 8, 16, 24, lane 1 holds 1, 9, 17, 25, and so on up
to element 31.]
Lecture 12, Slide 47
Vector Instruction Parallelism
Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes

[Figure: the load unit, multiply unit, and add unit each work on a
different vector instruction at once; as one load drains, the next
issued load starts behind it, and likewise for the multiplies and
adds.]

Complete 24 operations/cycle while issuing 1 short instruction/cycle
(5) Pipelined Instruction Start-Up
• The simplest case to consider is when two vector
instructions access a different set of vector registers.
• For example, in the code sequence
ADDV.D V1,V2,V3
ADDV.D V4,V5,V6
• It becomes critical to reduce start-up overhead by allowing
the start of one vector instruction to be overlapped with the
completion of preceding vector instructions.
• An implementation can allow the first element of the
second vector instruction to immediately follow the last
element of the first vector instruction down the FP adder
pipeline.
Vector Startup
Two components of vector startup penalty
– functional unit latency (time through pipeline)
– dead time or recovery time (time before another vector instruction can
start down pipeline)
[Figure: pipeline diagram. Each element of the first vector
instruction passes through stages R, X, X, X, W; after its last
element enters the pipeline there is a dead time of several cycles
before the first element of the second vector instruction can start
down the pipeline. The functional unit latency is the time through
the pipeline; the dead time is the recovery gap between instructions.]
Dead Time and Short Vectors

[Figure: with no dead time, each 64-cycle-active vector instruction
is followed immediately by the next; with 4 cycles dead time, 4 idle
cycles separate them.]

• Cray C90, two lanes: 4 cycle dead time; maximum efficiency 94%
with 128 element vectors
• T0, eight lanes: no dead time; 100% efficiency with 8 element
vectors
• Example The Cray C90 has two lanes but requires 4 clock
cycles of dead time between any two vector instructions to
the same functional unit. For the maximum vector length of
128 elements, what is the reduction in achievable peak
performance caused by the dead time? What would be the
reduction if the number of lanes were increased to 16?
• Answer A maximum length vector of 128 elements is divided
over the two lanes and occupies a vector functional unit for
64 clock cycles. The dead time adds another 4 cycles of
occupancy, reducing the peak performance to 64/(64 + 4) =
94.1% of the value without dead time.
• If the number of lanes is increased to 16, maximum length
vector instructions will occupy a functional unit for only
128/16 = 8 cycles, and the dead time will reduce peak
performance to 8/(8 + 4) = 66.6% of the value without dead
time.
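The same arithmetic as a C helper (a sketch; dead_time_efficiency is an illustrative name):

#include <stdio.h>

/* A vector of length vl spread over `lanes` lanes occupies a unit
   for vl/lanes cycles; dead time adds idle cycles between
   instructions, so efficiency = busy / (busy + dead). */
double dead_time_efficiency(int vl, int lanes, int dead) {
    double busy = (double)vl / lanes;
    return busy / (busy + dead);
}

int main(void) {
    printf("2 lanes:  %.1f%%\n", 100 * dead_time_efficiency(128, 2, 4));
    printf("16 lanes: %.1f%%\n", 100 * dead_time_efficiency(128, 16, 4));
    return 0;   /* 94.1% and 66.7% (the text rounds the latter to 66.6%) */
}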
7. Performance of Vector Processors
Vector Execution Time
The execution time of a sequence of vector
operations primarily depends on three factors:
• the length of the operand vectors
• structural hazards among the operations
• data dependences
Convoy and Chime
• A convoy is the set of vector instructions that could potentially
begin execution together in one clock period.
– The instructions in a convoy must not contain any structural or data
hazards; if such hazards were present, the instructions in the potential
convoy would need to be serialized and initiated in different convoys.
• A chime is the unit of time taken to execute one
convoy.
– A chime is an approximate measure of execution time for a vector
sequence; a chime measurement is independent of vector length.
– A vector sequence that consists of m convoys executes in m chimes,
and for a vector length of n, this is approximately m × n clock cycles.
• Example Show how the following code sequence lays out in
convoys, assuming a single copy of each vector functional
unit:
LV      V1,Rx     ;load vector X
MULVS.D V2,V1,F0  ;vector-scalar multiply
LV      V3,Ry     ;load vector Y
ADDV.D  V4,V2,V3  ;add
SV      Ry,V4     ;store the result
• How many chimes will this vector sequence take?
• How many cycles per FLOP (floating-point operation) are
needed, ignoring vector instruction issue overhead?
• Answer The first convoy is occupied by the first LV instruction.
The MULVS.D is dependent on the first LV, so it cannot be in
the same convoy. The second LV instruction can be in the
same convoy as the MULVS.D. The ADDV.D is dependent on
the second LV, so it must come in yet a third convoy, and
finally the SV depends on the ADDV.D, so it must go in a
following convoy.
1. LV
2. MULVS.D LV
3. ADDV.D
4. SV
• The sequence requires four convoys and hence takes four
chimes. Since the sequence takes a total of four chimes and
there are two floating-point operations per result, the number
of cycles per FLOP is 2 (ignoring any vector instruction issue
overhead).
Start-up overhead
• The most important source of overhead ignored by
the chime model is vector start-up time.
• The start-up time comes from the pipelining latency
of the vector operation and is principally determined
by how deep the pipeline is for the functional unit
used.
Unit                 Start-up overhead (cycles)
Load and store unit  12
Multiply unit        7
Add unit             6
• Example Assume the start-up overhead for functional units is as
shown in the table on the previous page.
• Show the time that each convoy can begin and the total number of
cycles needed. How does the time compare to the chime approximation
for a vector of length 64?
• Answer
Each convoy adds its start-up overhead before streaming out 64
results: 12 cycles for the first LV, 12 for the MULVS.D/LV convoy
(dominated by the load), 6 for the ADDV.D, and 12 for the SV, for a
total of 42 cycles. The time per result for a vector of length 64 is
therefore 4 + (42/64) = 4.65 clock cycles, while the chime
approximation would be 4.
Running Time of a Strip-mined Loop
There are two key factors that contribute to the running time of a
strip-mined loop consisting of a sequence of convoys:
1. The number of convoys in the loop, which determines the number of
chimes. We use the notation Tchime for the execution time in chimes.
2. The overhead for each strip-mined sequence of convoys. This
overhead consists of the cost of executing the scalar code for
strip-mining each block, Tloop, plus the vector start-up cost for
each convoy, Tstart.
• The total running time for a vector sequence operating on a vector
of length n:

  Tn = ⌈n / MVL⌉ × (Tloop + Tstart) + n × Tchime
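The formula as a C helper (a sketch; vector_time is an illustrative name, and the Tloop = 15 used in the check is implied by the worked example that follows, where 660 = 200 × 3 + 4 × 15):

#include <stdio.h>

/* Total running time of a strip-mined vector sequence: ceil(n/MVL)
   segments each pay the loop and start-up overhead, and every
   element costs Tchime cycles. */
long vector_time(long n, long mvl, long t_loop, long t_start, long t_chime) {
    long segments = (n + mvl - 1) / mvl;   /* ceil(n / MVL) */
    return segments * (t_loop + t_start) + n * t_chime;
}

int main(void) {
    /* The A = B * s example: n=200, Tloop=15, Tstart=31, Tchime=3 */
    printf("%ld cycles\n", vector_time(200, 64, 15, 31, 3));  /* 784 */
    return 0;
}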
• Example What is the execution time on VMIPS for the vector
operation A = B × s, where s is a scalar and the length of the
vectors A and B is 200?
• Answer
– Assume the addresses of A and B are initially in Ra and
Rb, s is in Fs, and recall that for MIPS (and VMIPS) R0
always holds 0.
– The first iteration of the strip-mined loop will execute for a
vector length of (200 mod 64) = 8 elements, and the
following iterations will execute for a vector length of 64
elements.
– Since the vector length is either 8 or 64, we increment the
address registers by 8 × 8 = 64 after the first segment and
8 × 64 = 512 for later segments. The total number of bytes
in the vector is 8 × 200 = 1600, and we test for completion
by comparing the address of the next vector segment to the
initial address plus 1600.
Here is the actual code:

      DADDUI  R2,R0,#1600  ;total # bytes in vector
      DADDU   R2,R2,Ra     ;address of the end of A vector
      DADDUI  R1,R0,#8     ;loads length of 1st segment
      MTC1    VLR,R1       ;load vector length in VLR
      DADDUI  R1,R0,#64    ;length in bytes of 1st segment
      DADDUI  R3,R0,#64    ;vector length of other segments
Loop: LV      V1,Rb        ;load B
      MULVS.D V2,V1,Fs     ;vector * scalar
      SV      Ra,V2        ;store A
      DADDU   Ra,Ra,R1     ;address of next segment of A
      DADDU   Rb,Rb,R1     ;address of next segment of B
      DADDUI  R1,R0,#512   ;load byte offset next segment
      MTC1    VLR,R3       ;set length to 64 elements
      DSUBU   R4,R2,Ra     ;at the end of A?
      BNEZ    R4,Loop      ;if not, go back
• The three vector instructions in the loop are dependent and must go
into three convoys, hence Tchime = 3.
• Use our basic formula, with Tloop = 15:
  T200 = ⌈200/64⌉ × (Tloop + Tstart) + 200 × Tchime
       = 4 × (15 + Tstart) + 200 × 3 = 660 + 4 × Tstart
• The value of Tstart is given by Tstart = 12 + 7 + 12 = 31 (load
start-up of 12, multiply start-up of 7, store start-up of 12)
• So, the overall value becomes T200 = 660 + 4 × 31 = 784
• The execution time per element with all start-up costs is then
784/200 = 3.9, compared with a chime approximation of three.