Tutorial Support – Question B

Vector Processor Overview
2/6/2016
Tutorial Support – Question B.2 explained
Consider the following vector code run on a 200-MHz version of DLXV for a fixed
vector length of 64:
LV      V1,Ra
MULTV   V2,V1,V3
ADDV    V4,V1,V3
SV      Rb,V2
SV      Rc,V4
Ignore all strip-mining overhead, but assume that the store latency must be included in
the time to perform the loop. The entire sequence produces 64 results.
a)
Assuming no chaining and a single memory pipeline, how many chimes are
required? How many clock cycles per result (including both stores as one result)
does this vector sequence require, including start-up overhead?
Chimes give a rough estimate of how long the vector processor takes to run the
program. Most of the time goes to processing the vector elements themselves. For
a vector length of 64 with one result produced per clock cycle, an operation
takes 64 clock cycles plus its start-up time. For the first LV instruction this
is 64 + 12 = 76 cycles. In chime terms, the vector load takes 1 chime plus a
relatively small start-up overhead. Instructions that do not depend on each
other and do not compete for the same functional unit can execute together in a
convoy, so the overall time is approximately the number of convoys times the
vector length, plus a relatively small number of overhead cycles.
In the example for this question the convoys can be as follows:
1) LV
2) MULTV, ADDV
3) SV
4) SV
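The grouping above can be reproduced with a small sketch. This is only an illustration under simplifying assumptions: a greedy rule that starts a new convoy on a structural hazard (the functional unit is already busy in the current convoy) or, without chaining, on a dependence on a register written in the current convoy. The class `VecInstr`, the function `form_convoys` and the unit labels "mem", "mul" and "add" are hypothetical names, not part of DLXV.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VecInstr:
    name: str
    unit: str                 # hypothetical unit label: "mem", "mul" or "add"
    dest: Optional[str]       # vector register written, if any
    srcs: Tuple[str, ...]     # vector registers read


def form_convoys(instrs, mem_pipes=1, chaining=False):
    """Greedy convoy formation (simplified rule, see the assumptions above)."""
    convoys = [[]]
    for ins in instrs:
        current = convoys[-1]
        written = {i.dest for i in current if i.dest}
        busy = [i.unit for i in current]
        # Assume one pipeline per arithmetic unit, mem_pipes for memory.
        limit = mem_pipes if ins.unit == "mem" else 1
        structural = busy.count(ins.unit) >= limit
        # Without chaining, a consumer cannot share a convoy with its producer.
        data = (not chaining) and any(s in written for s in ins.srcs)
        if current and (structural or data):
            convoys.append([ins])
        else:
            current.append(ins)
    return [[i.name for i in c] for c in convoys]


PROGRAM = [
    VecInstr("LV", "mem", "V1", ()),
    VecInstr("MULTV", "mul", "V2", ("V1", "V3")),
    VecInstr("ADDV", "add", "V4", ("V1", "V3")),
    VecInstr("SV", "mem", None, ("V2",)),
    VecInstr("SV", "mem", None, ("V4",)),
]

print(form_convoys(PROGRAM))
# [['LV'], ['MULTV', 'ADDV'], ['SV'], ['SV']]
```

Calling `form_convoys(PROGRAM, chaining=True)` or `form_convoys(PROGRAM, mem_pipes=3, chaining=True)` shows how the grouping changes once chaining or extra memory pipelines are assumed.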
So this would be 4 convoys and hence 4 chimes to execute the program. If each
element takes 1 clock cycle per operation, the estimate is 4 times 64 = 256
clock cycles, which works out to 4 clock cycles per result. Adding the start-up
overhead of each convoy gives 256 + 12 (load) + 7 (multiply and add, limited by
the slower multiply) + 12 (store) + 12 (store) = 299 clock cycles.
To make this more straightforward and avoid silly mistakes, it is better to lay
the timing out in a table:
Convoy            Start    First result    Last result
--------------    -----    ------------    -----------
1. LV                 0              12             75
2. MULTV,ADDV        76              83            146
3. SV               147             159            222
4. SV               223             235            298
Performance = 299 cycles, so 299/64 = 4.67 cycles per result.
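As a cross-check, the timing for the unchained case can be recomputed with a few lines of Python. This is a minimal sketch, assuming each convoy starts the cycle after the previous convoy's last result and that a convoy's start-up is that of its slowest instruction; the names `STARTUP` and `unchained_timing` are illustrative only.

```python
VECTOR_LENGTH = 64
# Start-up times in cycles, as used in this tutorial.
STARTUP = {"LV": 12, "SV": 12, "MULTV": 7, "ADDV": 6}


def unchained_timing(convoys, n=VECTOR_LENGTH):
    """Return (start, first result, last result) per convoy and the total cycles."""
    rows = []
    start = 0
    for convoy in convoys:
        startup = max(STARTUP[op] for op in convoy)  # slowest op in the convoy
        first = start + startup
        last = first + n - 1                         # one result per cycle
        rows.append((start, first, last))
        start = last + 1                             # next convoy waits for completion
    return rows, start                               # cycles are counted from 0


rows, total = unchained_timing([["LV"], ["MULTV", "ADDV"], ["SV"], ["SV"]])
for row in rows:
    print(row)
print(total, round(total / VECTOR_LENGTH, 2))
# (0, 12, 75)
# (76, 83, 146)
# (147, 159, 222)
# (223, 235, 298)
# 299 4.67
```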
b)
If the vector sequence is chained, how many clock cycles per result does this
sequence require, including overhead?
Chaining is like forwarding for vectors. Instead of running one vector operation
to completion before starting a dependent operation, each element is passed to
the dependent operation as soon as it is available. Here we have only one memory
pipeline, so memory instructions contend for it and, as usual, the earlier
instruction has priority. The convoys are formed as in part a); the difference
is that an operation no longer has to finish completely before a dependent one
starts, unless it needs the memory pipeline, which stays busy until the previous
memory operation completes. So we will have
Convoy               Start    First result    Last result
-----------------    -----    ------------    -----------
1. LV,MULTV,ADDV         0              19             82
2. SV                   83              95            158
3. SV                  159             171            234
Here is a little chart to help visualize the operation a bit better (not to
scale). |--n--| is a start-up time of n cycles and ==== 64 ==== is the stream of
64 results, one per cycle.
1. LV     |--12--|==== 64 ====|
   MULTV          |--7--|==== 64 ====|
   ADDV           |--6-|==== 64 ====|
2. SV                                 |--12--|==== 64 ====|
3. SV                                                      |--12--|==== 64 ====|
The chart shows when each instruction is initiated and how, after its start-up
delay, results come out one per cycle and are consumed by the next instruction.
The performance here would be 235 cycles, so 235/64 = 3.67 cycles per result.
Note that convoys 1 and 2 of part a) are now joined into a single chained
convoy, but the start-up overhead of that convoy has grown to cover both of the
previous ones. This makes sense, because each start-up has to complete before
the final result of the last instruction is obtained. Also, MULTV and ADDV start
on the same clock here; many processors add one cycle because the bus can only
initiate a limited number of instructions simultaneously.
So what we have now in terms of performance is 3 convoys = 3 chimes.
Performance = (12 + 7 + 64) + (12 + 64) + (12 + 64) = 235 cycles, so 235/64 =
3.67 cycles per result.
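The chained timing can be checked the same way. The sketch below assumes that each convoy is described as a list of stages, that instructions in one stage start together, and that each stage is chained onto the previous one, so the start-up times of successive stages add up. The name `chained_timing` and the stage layout are illustrative assumptions, not DLXV terminology.

```python
VECTOR_LENGTH = 64
# Start-up times in cycles, as used in this tutorial.
STARTUP = {"LV": 12, "SV": 12, "MULTV": 7, "ADDV": 6}


def chained_timing(convoys, n=VECTOR_LENGTH):
    """Each convoy is a list of stages; each stage is a list of instructions."""
    rows = []
    start = 0
    for stages in convoys:
        # Chained stages: the slowest start-up of each stage is added in turn.
        startup = sum(max(STARTUP[op] for op in stage) for stage in stages)
        first = start + startup
        last = first + n - 1          # one result per cycle after the first
        rows.append((start, first, last))
        start = last + 1              # single memory pipeline: next convoy waits
    return rows, start


convoys = [
    [["LV"], ["MULTV", "ADDV"]],      # convoy 1: load chained into multiply/add
    [["SV"]],                         # convoy 2
    [["SV"]],                         # convoy 3
]
rows, total = chained_timing(convoys)
print(rows)
print(total, round(total / VECTOR_LENGTH, 2))
# [(0, 19, 82), (83, 95, 158), (159, 171, 234)]
# 235 3.67
```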
c)
Suppose DLXV had three memory pipelines and chaining. If there were no bank
conflicts in the accesses for the above loop, how many clock cycles are required
per result for this sequence?
With three memory pipelines the load and both stores never have to wait for
memory in this program, so all five instructions can be chained into a single
convoy. The chart would look like this:
1. LV,MULTV,SV   |--12--|--7--|--12--|==== 63 ====|
     ,ADDV,SV           |--6-|--12--|==== 63 ====|
(|--n--| is a start-up time of n cycles; the last result follows the first one
63 cycles later.)
and the table above would become
Convoy                       Start    First result    Last result
-------------------------    -----    ------------    -----------
1. LV,MULTV,ADDV,SV,SV           0              31             94
Performance = 95 cycles, so 95/64 = 1.48 cycles per result.
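One way to see where the 95 cycles come from is to follow the two chained paths through the single convoy and take the slower one. The sketch below is only an illustration, assuming the same start-up times as before and no bank conflicts; `PATHS` and `path_last_result` are hypothetical names.

```python
VECTOR_LENGTH = 64
# Start-up times in cycles, as used in this tutorial.
STARTUP = {"LV": 12, "SV": 12, "MULTV": 7, "ADDV": 6}

# The two chained paths through the single convoy.
PATHS = {
    "multiply path": ["LV", "MULTV", "SV"],
    "add path": ["LV", "ADDV", "SV"],
}


def path_last_result(path, n=VECTOR_LENGTH):
    """Cycle (counted from 0) at which the last element leaves the path."""
    first = sum(STARTUP[op] for op in path)  # chained start-ups add up
    return first + n - 1


finish = {name: path_last_result(p) for name, p in PATHS.items()}
total_cycles = max(finish.values()) + 1      # the slower path decides
print(finish)
print(total_cycles, round(total_cycles / VECTOR_LENGTH, 2))
# {'multiply path': 94, 'add path': 93}
# 95 1.48
```

The multiply path is the critical one because its start-up is one cycle longer than that of the add path.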