For the endless labs we were doing in this course, we were supposed to implement and optimize this FIR filter code (Point version). The mixed C/assembly listing is:

    float FIR_Point_CPP(dm float *InputBuffer, pm float *FIR_Coeff, float *current){
        float sum = 0.0; // Initialize sum as 0.0
        for (int i=0; i<FIR_COEFF_SIZE; i++){
    [08C6EB]    r2=0xff;
    [08C6EC]    r2=r4+r2, i12=r8;
    [08C6ED]    r0=m5;
    [08C6EE]    lcntr=0x40, do (pc,0x7) until lce;
            sum += *current * FIR_Coeff[i]; // calculate the result
    [08C6EF]    r12=r12+1, i4=r12;
    [08C6F0]    r1=pm(i12,m14);
    [08C6F1]    r8=dm(m5,i4);
    [08C6F2]    f1=f1*f8;
    [08C6F3]    f0=f1+f0;
            current++;
            //check if the current exceed the buffer size
            if (current >= (InputBuffer + DATABLOCKSIZE - 1))
    [08C6F4]    compu(r2,r12);
                current = InputBuffer; // if yes, wrap it to the buffer head
    [08C6F5]    if le r12=r4;
        }
        return sum;
    [08C6F6]    i12=dm(m7,i6);
    [08C6F7]    jump (m14,i12) (db);
    [08C6F8]    rframe;
    [08C6F9]    nop;
    }

Q1: For this piece of code, do you think it is a Software Loop or a Hardware Loop? (2 marks)

A: Software loop

Note: the listing contains "lcntr=0x40, do (pc,0x7) until lce;", which sets up a hardware loop, and the possible answer to Q4 below describes the hardware loop as already enabled, so this answer looks incorrect.

Q2: How many cycles does it take to execute the for loop, based on the mixed code? (4 marks)

A: comp takes 2 cycles, 0x40 - 0x7 = 57, 8 + 3 + 57 = 68 cycles

Note: 0x40 is the loop count (64 iterations) and 0x7 is the offset to the last instruction of the loop body, so subtracting one from the other is not meaningful. The possible answer to Q4 uses 512 cycles for this loop, which is consistent with 64 iterations * 8 cycles per iteration.

Q3: When Release Mode was turned on, we expected the optimizer to apply all the fancy optimization techniques (e.g. dual fetch, multifunction instructions, unrolling, SIMD, etc.) for us. However, the mixed code stays the same. We have already identified that it is the if statement that prevents the optimization. Discuss TWO possible alternative solutions that allow the optimizer to work the way we want. (6 marks)

A1:

    int k = inBuffer + current;
    for(int j = k; j >= DATABLOCKSIZE;; ) { current = InputBuffer; }

A2:

    for ( ; current % DATABLOCKSIZE; ;) { current = InputBuffer; }

Possible solutions:

1. The compiler might not like mixing the pointer and index methods for the two arrays. Make both arrays use either the pointer method or the index method.

2. Use AND or OR operations to replace the if statement:

    current = InputBuffer + ((current - InputBuffer) & (NUMBLOCK*DATABLOCKSIZE))

   Note: the expression is given as in the answer key. For a power-of-two buffer length, a wrap-around mask is normally the buffer length minus 1.

3. Use the run-time C function from Analog Devices to tell the compiler to use a hardware circular buffer. Pseudo-code would be good enough:

    currentCoeff = (float *) __builtin_circptr(currentCoeff, 1, InputBuffer, DATABLOCKSIZE);

   and then delete:

    current++;
    if (current >= (InputBuffer + DATABLOCKSIZE - 1))
        current = InputBuffer;

Q4: Assume that the code was perfectly optimized. Please write the optimized ASM code. How many cycles does the for loop take? What is the ratio of unoptimized code (from Q1) / optimized code (from Q4)? (8 marks)

A: after optimization, pm, dm, * and + will be in parallel, no comp or if statement, 08C6F4 to 08C6ED maybe in one instruction, so 3 + 57 = 60 cycles

The possible answer would be:

We expect the hardware loop (already enabled), hardware circular buffer, dual fetch, loop unrolling and SIMD to be enabled by the optimizer. The possible optimized ASM code would be:

    [08C933] r4=dm(i4,m4), r1=pm(i12,m12);
    [08C934] f8=f1*f2, r2=pm(i12,0x2);
    [08C935] lcntr=0xf, do (pc,0x2) until lce;
    [08C936] f8=f2*f4, f12=f8+f12, r4=dm(i4,m4), r2=pm(i12,m12);
    [08C937] f8=f2*f4, f12=f8+f12, r4=dm(i4,m4), r2=pm(i12,m12);
    [08C938] f1=f2*f4, f2=f8+f12;
    [08C939] f12=f1+f2;

Since we are only considering the loop, the loading of the S registers into the R registers is not counted. But we won't deduct marks if you count them.

Cycle count:
- 2 cycles for filling the pipeline
- 1 cycle for setting up the hardware loop
- 2 cycles for the loop body, which runs 15 (0xf) times
- 2 cycles for exiting the loop

2 + 1 + 2 * 15 + 2 = 35 cycles

Ratio = 35 / 512 = 6.84%, i.e. the optimized loop takes about 6.84% of the unoptimized cycle count. Expressed as unoptimized / optimized, as the question asks, that is 512 / 35, or about 14.6.
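The idea behind possible solution 2 in Q3 (replacing the if statement with a bitwise operation) can be sketched in portable C. This is only an illustration under stated assumptions: it uses array indices instead of the lab's pointers, it assumes a power-of-two buffer length so that the mask is the length minus 1, and BUF_LEN, NUM_TAPS and all function names are hypothetical stand-ins rather than names from the lab. It shows that the two wrap-around methods give the same result; whether the SHARC optimizer then produces the expected hardware-loop code is not verified here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes for this sketch; the lab defines its own
   DATABLOCKSIZE and FIR_COEFF_SIZE elsewhere. */
#define BUF_LEN  256u   /* must be a power of two for the mask variant */
#define NUM_TAPS 64u

/* Variant A: wrap with a compare-and-branch, in the style of the
   original loop. start must be less than BUF_LEN. */
float fir_branch(const float *buf, const float *coeff, size_t start)
{
    float sum = 0.0f;
    size_t idx = start;
    for (size_t i = 0; i < NUM_TAPS; i++) {
        sum += buf[idx] * coeff[i];
        idx++;
        if (idx >= BUF_LEN)   /* the branch under discussion */
            idx = 0;
    }
    return sum;
}

/* Variant B: wrap with a bitwise AND, so the loop body has no branch.
   start must be less than BUF_LEN. */
float fir_mask(const float *buf, const float *coeff, size_t start)
{
    /* The mask trick only works for power-of-two lengths. */
    assert((BUF_LEN & (BUF_LEN - 1u)) == 0u);

    float sum = 0.0f;
    size_t idx = start;
    for (size_t i = 0; i < NUM_TAPS; i++) {
        sum += buf[idx] * coeff[i];
        idx = (idx + 1u) & (BUF_LEN - 1u);
    }
    return sum;
}

/* Returns 1 if both variants agree for every start position, else 0.
   The test data are small integers, so the float sums are exact. */
int wrap_variants_agree(void)
{
    float buf[BUF_LEN];
    float coeff[NUM_TAPS];

    for (size_t n = 0; n < BUF_LEN; n++)
        buf[n] = (float)(n % 7u);
    for (size_t k = 0; k < NUM_TAPS; k++)
        coeff[k] = (float)(k % 5u);

    for (size_t s = 0; s < BUF_LEN; s++) {
        if (fir_branch(buf, coeff, s) != fir_mask(buf, coeff, s))
            return 0;
    }
    return 1;
}
```

Calling wrap_variants_agree() from a test harness is enough to check that the branch-free version is a drop-in replacement. Note that the lab code wraps at InputBuffer + DATABLOCKSIZE - 1, whereas this sketch wraps at the full buffer length, so the two are not cycle-for-cycle or index-for-index identical.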