Vector Processor Overview 2/6/2016

Tutorial Support – Question B.2 explained

Consider the following vector code run on a 200-MHz version of DLXV for a fixed vector length of 64:

    LV     V1,Ra
    MULTV  V2,V1,V3
    ADDV   V4,V1,V3
    SV     Rb,V2
    SV     Rc,V4

Ignore all strip-mining overhead, but assume that the store latency must be included in the time to perform the loop. The entire sequence produces 64 results.

a) Assuming no chaining and a single memory pipeline, how many chimes are required? How many clock cycles per result (including both stores as one result) does this vector sequence require, including start-up overhead?

Chimes give a rough estimate of the time the vector processor takes to run the program. Most of that time is spent computing the vector elements. For a vector length of 64, with one result produced every clock cycle, a vector operation takes 64 clock cycles plus its start-up time. For the first LV instruction this is 64 + 12 = 76 cycles. In chime terms, the load vector operation takes 1 chime plus a relatively small amount of overhead.

Instructions that do not directly affect each other can execute at the same time in a convoy. The overall time is then the number of convoys times the vector length, plus a relatively small number of overhead cycles. Without chaining, MULTV and ADDV both read V1 and so must wait for LV to finish, and the two stores must share the single memory pipeline. The convoys for this question are therefore:

1) LV
2) MULTV, ADDV
3) SV
4) SV

This gives 4 convoys and hence 4 chimes to execute the program. If each element takes 1 clock cycle per operation, the estimated total is 4 times 64 = 256 clock cycles. This works out to about 4 clock cycles per result.
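The chime estimate above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `chime_estimate` and the list layout are assumptions made for this sketch, and the convoy grouping is the one given in the text.

```python
# Rough chime estimate for part a (no chaining, one memory pipeline).
# Names here are illustrative only; the convoy grouping comes from the text.
VECTOR_LENGTH = 64

convoys = [["LV"], ["MULTV", "ADDV"], ["SV"], ["SV"]]

def chime_estimate(convoys, vector_length):
    """Return (chimes, cycles, cycles_per_result), ignoring start-up overhead."""
    chimes = len(convoys)             # one chime per convoy
    cycles = chimes * vector_length   # each chime takes about vector_length cycles
    return chimes, cycles, cycles / vector_length

chimes, cycles, per_result = chime_estimate(convoys, VECTOR_LENGTH)
print(chimes, cycles, per_result)  # 4 256 4.0
```

The estimate deliberately leaves out start-up time, which is why it is lower than the exact cycle count.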
If we then add in the start-up overhead of each convoy, we get 256 + 12 (load) + 7 (multiply and add, limited by the multiply) + 12 (store) + 12 (store) = 299 clock cycles. To keep this straightforward and avoid silly mistakes, it is better to lay the timing out as a table:

Convoy            Start    First result    Last result
--------          -----    ------------    -----------
1. LV                 0              12             75
2. MULTV,ADDV        76              83            146
3. SV               147             159            222
4. SV               223             235            298

Performance = 299 cycles; 299/64 = 4.67 cycles per result.

b) If the vector sequence is chained, how many clock cycles per result does this sequence require, including overhead?

Chaining is like forwarding for vectors. Instead of running one vector operation to completion before starting the next operation that uses its output as an operand, each element is used as soon as it becomes available. Here we have only one memory pipeline, so memory instructions contend for it, and as usual the first instruction has priority. The convoys can still be formed as in part a; the difference is that an operation no longer has to finish completely before the next one starts, unless both are memory operations. So we have:

Convoy               Start    First result    Last result
--------             -----    ------------    -----------
1. LV,MULTV,ADDV         0              19             82
2. SV                   83              95            158
3. SV                  159             171            234

Here is a small chart to help visualize the operation (not to scale; each run of bars stands for 64 results):

1. LV      |___12___||||||||||||||||||
   MULTV            |___7___||||||||||||||||||
   ADDV             |___6___||||||||||||||||||
2. SV                                        |___12___||||||||||||||||||
3. SV                                                                  |___12___||||||||||||||||||

The chart shows when each instruction is initiated, and how results come out one after another and are used by the next instruction after its start-up delay.

The performance here is 235 cycles; 235/64 = 3.67 cycles per result.
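The two timing tables above follow the same rule: a convoy's first result appears after its start-up time, its last result arrives 63 cycles later, and the next convoy starts on the following cycle. A minimal Python sketch of that rule, assuming the start-up values used in the text (12 for loads and stores, 7 for multiply, 6 for add); the names `convoy_table` and `total_cycles` are made up for this sketch:

```python
VECTOR_LENGTH = 64
# Start-up values as used in the text above.
STARTUP = {"LV": 12, "SV": 12, "MULTV": 7, "ADDV": 6}

def convoy_table(startups, vector_length=VECTOR_LENGTH):
    """Return a list of (start, first_result, last_result), one per convoy."""
    rows = []
    start = 0
    for startup in startups:
        first = start + startup
        last = first + vector_length - 1
        rows.append((start, first, last))
        start = last + 1  # next convoy begins the cycle after the last result
    return rows

def total_cycles(rows):
    """Cycles are counted from 0, so the total is the last result plus one."""
    return rows[-1][2] + 1

# Part a: an unchained convoy is limited by its slowest start-up.
unchained = [STARTUP["LV"],
             max(STARTUP["MULTV"], STARTUP["ADDV"]),
             STARTUP["SV"],
             STARTUP["SV"]]

# Part b: in the chained convoy the LV and MULTV start-ups add up.
chained = [STARTUP["LV"] + STARTUP["MULTV"], STARTUP["SV"], STARTUP["SV"]]

for name, startups in (("unchained", unchained), ("chained", chained)):
    rows = convoy_table(startups)
    print(name, rows, total_cycles(rows))
```

Running it reproduces the Start, First result and Last result columns of both tables, with totals of 299 and 235 cycles.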
Note that convoys 1 and 2 of part a are now joined into a single chained convoy, but the overhead time of that convoy has increased to cover both of the previous ones. This makes sense, because each start-up overhead has to be paid before the final result of the last instruction is obtained. Also, the MULTV and the ADDV start on the same clock here; in many processors one extra cycle is added, because the bus can only initiate a limited number of instructions simultaneously.

In terms of performance, we now have 3 convoys = 3 chimes.

Performance = (12 + 7 + 64) + (12 + 64) + (12 + 64) = 235 cycles; 235/64 = 3.67 cycles per result.

c) Suppose DLXV had three memory pipelines and chaining. If there were no bank conflicts in the accesses for the above loop, how many clock cycles are required per result for this sequence?

With 3 memory pipelines, no instruction in this program has to wait for a memory pipeline to become available. The chart would look like this (not to scale; the bars stand for the 63 results that follow the first):

1. LV,MULTV,SV   |___12___|___7___|___12___||||||||||||||||||
     ,ADDV,SV             |___6___|___12___||||||||||||||||||

and the table above would become:

Convoy      Start    First result    Last result
--------    -----    ------------    -----------
1.              0              31             94

Performance = 95 cycles; 95/64 = 1.48 cycles per result.
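The first result at cycle 31 in part c comes from the longer of the two chained paths: LV, MULTV, SV costs 12 + 7 + 12 = 31 cycles of start-up, while LV, ADDV, SV costs 12 + 6 + 12 = 30. A minimal Python sketch of that reasoning, assuming the start-up values used in the text; the name `chained_convoy_cycles` is made up for this sketch:

```python
VECTOR_LENGTH = 64
# Start-up values as used in the text above.
STARTUP = {"LV": 12, "MULTV": 7, "ADDV": 6, "SV": 12}

# With three memory pipelines and chaining, both dependency chains
# run inside a single convoy.
chains = [["LV", "MULTV", "SV"], ["LV", "ADDV", "SV"]]

def chained_convoy_cycles(chains, vector_length=VECTOR_LENGTH):
    """Return (first_result, last_result, total_cycles) for one chained convoy."""
    # The slowest chain decides when the whole convoy delivers its first result.
    first = max(sum(STARTUP[op] for op in chain) for chain in chains)
    last = first + vector_length - 1
    return first, last, last + 1

first, last, total = chained_convoy_cycles(chains)
print(first, last, total, round(total / VECTOR_LENGTH, 2))  # 31 94 95 1.48
```

Taking the maximum over the chains is what makes the multiply's start-up of 7, and not the add's 6, appear in the total.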