if statement

For the endless labs we were doing for this course, we were supposed to
implement and optimize this FIR filter code (Point version):
float FIR_Point_CPP(dm float *InputBuffer, pm float *FIR_Coeff, float *current){
    float sum = 0.0;                        // Initialize sum as 0.0
    for (int i=0; i<FIR_COEFF_SIZE; i++){
[08C6EB] r2=0xff;
[08C6EC] r2=r4+r2, i12=r8;
[08C6ED] r0=m5;
[08C6EE] lcntr=0x40, do (pc,0x7) until lce;
        sum += *current * FIR_Coeff[i];     // calculate the result
[08C6EF] r12=r12+1, i4=r12;
[08C6F0] r1=pm(i12,m14);
[08C6F1] r8=dm(m5,i4);
[08C6F2] f1=f1*f8;
[08C6F3] f0=f1+f0;
[08C6F4] compu(r2,r12);
[08C6F5] if le r12=r4;
        current++;
        // check whether current exceeds the buffer size
        if (current >= (InputBuffer + DATABLOCKSIZE - 1))
            current = InputBuffer;          // if yes, wrap it to the buffer head
    }
    return sum;
[08C6F6] i12=dm(m7,i6);
[08C6F7] jump (m14,i12) (db);
[08C6F8] rframe;
[08C6F9] nop;
}
Q1: For this piece of code, do you think it is a Software Loop or Hardware Loop?
(2 marks)
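A : Hardware loop. The listing sets up the loop with
lcntr=0x40, do (pc,0x7) until lce; which is the zero-overhead hardware loop.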
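Q2: How many cycles does it take to execute the for loop based on the mixed
code? (4 marks)
A : The loop counter 0x40 gives 64 iterations; the 0x7 in do (pc,0x7) is the
offset to the last instruction of the loop, not a count to subtract. The loop
body is 7 instructions (08C6EF to 08C6F5). Taking the compare as 2 cycles gives
8 cycles per iteration, so 64 * 8 = 512 cycles.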
Q3: When Release Mode was turned on, we expected the optimizer to apply all the
fancy optimization techniques (e.g. dual fetch, multifunction instructions,
unrolling, SIMD, etc.) for us. However, the mixed code stayed the same. We have
already identified that the if statement is what prevents the optimization.
Discuss TWO possible alternative solutions to the code that allow the optimizer
to work the way we want. (6 Marks)
A1 :
int k = (int)(current - InputBuffer);   // offset of current within the buffer
for (int j = k; j >= DATABLOCKSIZE; j -= DATABLOCKSIZE)
{
    current = InputBuffer;              // wrap to the buffer head
}
A2 :
// wrap with a modulo on the offset instead of an if
current = InputBuffer + ((current - InputBuffer) % DATABLOCKSIZE);
Possible solutions:
1. The compiler might not like the mix of pointer and index access for the two
arrays. Make both arrays use either the pointer method or the index method.
2. Use bitwise AND or OR operations to replace the if statement:
current = InputBuffer + ((current - InputBuffer) &
(NUMBLOCK*DATABLOCKSIZE));
3. Use the run-time C function from Analog Devices to tell the compiler to use
the hardware circular buffer. Pseudo-code would be good enough.
current = (float *) __builtin_circptr(current, 1, InputBuffer,
DATABLOCKSIZE);
and then delete
current++;
if (current >= (InputBuffer + DATABLOCKSIZE - 1))
current = InputBuffer;
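A minimal host-side sketch of solutions 1 and 2 combined, under stated assumptions: DATABLOCKSIZE is a power of two, so the wrap can be written as a mask of DATABLOCKSIZE - 1 (how NUMBLOCK is defined is not shown in the expression above, so it is not used here). The sizes, the function names fir_if and fir_mask, and the dropped dm/pm qualifiers are all hypothetical choices made so the code compiles with any C compiler. The reference version here wraps at InputBuffer + DATABLOCKSIZE, so that every slot of the buffer is used.

```c
#include <assert.h>

#define DATABLOCKSIZE  8   /* hypothetical size; must be a power of two */
#define FIR_COEFF_SIZE 4   /* hypothetical number of taps */

/* Reference: pointer walk with an explicit wrap test inside the loop. */
static float fir_if(const float *buf, const float *coeff, const float *current)
{
    float sum = 0.0f;
    for (int i = 0; i < FIR_COEFF_SIZE; i++) {
        sum += *current * coeff[i];
        current++;
        if (current >= buf + DATABLOCKSIZE)   /* wrap to the buffer head */
            current = buf;
    }
    return sum;
}

/* Index access for both arrays; the wrap is a bitwise AND, with no branch. */
static float fir_mask(const float *buf, const float *coeff, const float *current)
{
    /* The mask trick is only valid when the size is a power of two. */
    assert((DATABLOCKSIZE & (DATABLOCKSIZE - 1)) == 0);

    unsigned start = (unsigned)(current - buf);   /* offset of current in buf */
    float sum = 0.0f;
    for (unsigned i = 0; i < FIR_COEFF_SIZE; i++)
        sum += buf[(start + i) & (DATABLOCKSIZE - 1u)] * coeff[i];
    return sum;
}
```

Both functions visit the same samples in the same order, so they should return the same sum for any starting position. Whether a given SHARC compiler then maps the masked index onto a hardware circular buffer is not something this sketch can show; it only removes the branch from the loop body.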
Q4: Assume that the code was perfectly optimized. Write the optimized ASM code.
How many cycles does the for loop take? What is the ratio of the optimized code
(from Q4) to the unoptimized code (from Q2)? (8 Marks)
A : After optimization, the pm fetch, the dm fetch, the multiply and the add
run in parallel in a single multifunction instruction, and the compare and the
if disappear from the loop body.
The possible answer would be:
We expect the optimizer to enable the hardware circular buffer, dual fetch,
loop unrolling and SIMD, in addition to the hardware loop (already enabled).
The possible optimized ASM code would be:
[08C933] r4=dm(i4,m4), r1=pm(i12,m12);
[08C934] f8=f1*f2, r2=pm(i12,0x2);
[08C935] lcntr=0xf, do (pc,0x2) until lce;
[08C936] f8=f2*f4, f12=f8+f12, r4=dm(i4,m4), r2=pm(i12,m12);
[08C937] f8=f2*f4, f12=f8+f12, r4=dm(i4,m4), r2=pm(i12,m12);
[08C938] f1=f2*f4, f2=f8+f12;
[08C939] f12=f1+f2;
Since we are only considering the loop, loading the S registers into the R
registers is not counted. But we won’t deduct marks if you count it.
That gives 2 cycles for filling the pipeline, 1 cycle for setting up the
hardware loop, 2 cycles for the loop body, and 2 final cycles for exiting the
loop. The loop runs 15 (0xf) times.
2 + 1 + 2 * 15 + 2 = 35
Ratio = 35 / 512 = 6.84%
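The arithmetic above can be checked with a short sketch. The function names are hypothetical, and the 8 cycles per iteration used for the unoptimized loop is an assumption: it is the value implied by the 512-cycle figure divided by the 64 (0x40) iterations.

```c
#include <stdio.h>

/* Optimized loop: pipeline fill + loop setup + body * iterations + loop exit. */
static int optimized_cycles(int pipeline_fill, int loop_setup,
                            int body_cycles, int iterations, int loop_exit)
{
    return pipeline_fill + loop_setup + body_cycles * iterations + loop_exit;
}

/* Unoptimized loop: every iteration costs the same number of cycles. */
static int unoptimized_cycles(int iterations, int cycles_per_iteration)
{
    return iterations * cycles_per_iteration;
}

/* Ratio of optimized to unoptimized cycle counts, in percent. */
static double ratio_percent(int optimized, int unoptimized)
{
    return 100.0 * (double)optimized / (double)unoptimized;
}

/* Prints the three figures used in the answer. */
static void print_summary(void)
{
    int opt = optimized_cycles(2, 1, 2, 0xf, 2);
    int unopt = unoptimized_cycles(0x40, 8);   /* 8 cycles/iteration is assumed */
    printf("optimized=%d unoptimized=%d ratio=%.2f%%\n",
           opt, unopt, ratio_percent(opt, unopt));
}
```

Calling print_summary() from a main function prints the two cycle counts and the ratio, so a different assumption about the cycles per iteration can be tried by changing one argument.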