Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September 2000 Background assumed Familiarity with SHARC 2106X architecture Familiarity with SHARC programmer’s model for registers Some assembly experience An interest in beating the compiler in those special cases when you need the last drop of blood out of the CPU :-) 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 2 / 45 + B14 To be tackled What’s causing the problem – General limitations of instruction sets How to recognize when you might be coming up against SHARC architecture limitations A process for optimizing the SHARC parallelism – Example -- Temperature conversion – Bonus if time permits -- Average and instantaneous power 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 3 / 45 + B14 Efficient Move for 68k -- MOVEQ.L Want instruction to work with 1 memory FETCH – 16 bits available to describe operation 5 bits taken up to say MOVEQ.L instruction and not something else 3 bits taken up for the 8 possible destination data registers ONLY 8 bits left to describe value – Value = + 127 to - 128 -- NOTHING ELSE – Value is sign extended to 32 bits 0 1 1 1 D D D 0P P P P P P P P 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 4 / 45 + B14 Same basic issues on SHARC You can’t do EVERYTHING with ALL possible resources Compute/dreg<->DM/dreg<->PM – 3 bits opcode – 2 bits for direction of memory ops – ONLY 12 bits available to describe 4 DAG registers – 8 bits to describe which registers used for destination/source – 23 bits to describe Compute operations 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 5 / 45 + B14 When are DSP instructions valid? You are going to customize – When can you use the DSP instructions? – Most -- From Monday to Friday – Some Only between 9:00 a.m. and 9:00 p.m. Check against architecture 21k -- Parallel ops MUST be able to do this – – – – Can it be fetched in one cycle (op-code size) Can it be executed in one cycle (resource question) Can it execute without conflicting with other instructions? Then PROBABLY legal HOWEVER -- The designers had the final decision and you have to live by that decision! 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 6 / 45 + B14 You can’t do parallel Memory to UREG ops Note you need 8-bits to describe just one UREG out of all possible UREGs Dm(<addr>) = ureg – instruction = ? Bits – addr described in 32 bits – UREG description needs 8 bits JUST enough instruction bits to allow dm(<offset>, Ireg) = Ureg – NOTE that maximum number of bits to describe the offset even if offset = 1 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 7 / 45 + B14 Pipeline considerations -- REAL ISSUE! R2 = R1 + R3, R3 = dm(I2, M2), pm(I8,M9) = R2 The R2 in R2 = R1 + R3 is not the R2 in pm(I8,M9) = R2 The R3 in R2 = R1 + R3 is not the R3 in R3 = dm(I2, M2) --------------------------------------------------------------You can do R3 = dm(I2, M2), pm(I8,M9) = R2 but you can’t do R3 = dm(I2, M2), dm(I3,M3) = R2 even though it look like the data bus is free for accesses at begin and end of a cycles because it ain’t. Memory accesses take the WHOLE cycle to complete Introduction to highly parallel SHARC code 3/12/2016 Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 8 / 45 + B14 Compute operations Only 23 bits available Requires 1 destination and 2 sources ONLY work on data registers as there is not enough instruction bits to describe 3 uregs R1 = R2 + R3 ALLOWED R2 = R3 + 2 NOT ALLOWED I1 = I2 + I3 NOT ALLOWED Compute operations can be made conditional, and also combined with UREG to UREG moves (instead of memory operations) NO PARKING BETWEEN 8:30 and 9:30 IF R IN THE MONTH R1 = R2 + R3 can sometimes be ILLEGAL 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 9 / 45 + B14 Under best conditions If instruction described the right way – 1 data memory access (in or out) with a REQUIRED post modification operation possibly with a modify register containing the value 0 – 1 program memory access (in or out) PROVIDED that the NEXT instruction being fetched is stored in the instruction cache – 1 compute operation on data registers (EXCEPT for certain multi-function instructions with specific registers) 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 10 / 45 + B14 Introduction to PPPPIC Professor’s Personal Process for Parallel Instruction Coding Basic code development -- any system Write the “C” code for the function void Convert(float *temperature, int N) which converts an array of temperatures measured in “Celsius” (Canadian Market) to “Fahrenheit” (Tourist Trade) Convert the code to ADSP 21061/68K etc. assembly code, following the standard coding and documentation practices, or just use the compiler to do the job for you 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 12 / 45 + B14 Standard “C” code void Convert(float *temperature, int N) { int count; for (count = 0; count < N; count++) { *temperature = (*temperature) * 9 / 5 + 32; temperature++ } 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 13 / 45 + B14 Process for developing custom code Rewrite the “C” code using “LOAD/STORE” techniques -- 2106X is essentially super-scaler RISC Write the assembly code using a hardware loop – Check that end of loop label is in the correct place REWRITE the assembly code using registers and instructions that COULD be used in parallel IF you could find the correct optimization approach Move algorithm to “Resource Usage Chart” Optimize (Attempt to) Compare and contrast time -- include set up and loop control time -- was it worth the effort? 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 14 / 45 + B14 21061-style load/store “C” code void Convert(register float *temperature, register int N) { register int count; register float *pt = temperature; register float scratch; for (count = 0; count < N; count++) { scratch = *pt; scratch = scratch * (9 / 5); scratch = scratch + 32; *pt = scratch; pt++; } 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 15 / 45 + B14 Process for developing custom code Rewrite the “C” code using “LOAD/STORE” techniques Write the assembly code using a hardware loop – Check that end of loop label is in the correct place REWRITE the assembly code using registers and instructions that COULD be used in parallel IF you could find the correct optimization approach Move algorithm to “Resource Usage Chart” Optimize (Attempt to) Compare and contrast time -- include set up and loop control time -- was it worth the effort? 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 16 / 45 + B14 All assembly code routines REQUIRE PROLOGUE – Appropriate defines to make easy reading of code – Saving of non-volatile registers CODE BODY -- what you want to do – Try to plan ahead for parallel operations – Know which 21k “multi-functions” are valid with which registers. EPILOGUE – Recover non-volatile registers 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 17 / 45 + B14 Straight conversion -- PROLOGUE // void Convert(reg float *temperature, reg int N) { .segment/pm seg_pmco; .global _Convert; _Convert: // register int count = GARBAGE; #define count scratchR1 // register float *pt = temperature; #define pt scratchDMpt pt = INPAR1; // float scratch = GARBAGE; #define scratchF2 F2 // For the CURRENT code -- no non-volatile // registers are needed -- may not remain true 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 18 / 45 + B14 Straight conversion of BODY and EPILOGUE // for (count = 0; count < N; count++) { LCNTR = INPAR2, DO LOOP_END UNTIL LDE: // scratch = *pt; scratchF2 = dm(pt, 0); // Not ++ as pt re-used // scratch = scratch * (9 / 5); // INPAR1 (R4) is dead -- can reuse as F4 #define constantF4 F4 // Must be float constantF4 = 1.8 // No division needed, Use register constant scratchF2 = scratchF2 * constantF4; // scratch = scratch + 32, Register constant; #define F0_32 F0 // Must be float F0_32 = 32.0; scratchF2 = scratchF2 + F0_32; // *pt = scratch; pt++; LOOP_END: dm(pt, 1) = scratchF2; 5 magic lines of code used to return -- EPILOGUE Introduction to highly parallel SHARC code 3/12/2016 Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 19 / 45 + B14 Process for developing custom code Rewrite the “C” code using “LOAD/STORE” techniques Write the assembly code using a hardware loop – Check that end of loop label is in the correct place REWRITE the assembly code using registers and instructions that COULD be used in parallel IF you could find the correct optimization approach Move algorithm to “Resource Usage Chart” Optimize (Attempt to) Compare and contrast time -- include set up and loop control time -- was it worth the effort? 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 20 / 45 + B14 Speed rules for memory access scratch = dm(pt, 0); // Not ++ as to be re-used dm(pt, 1) = scratch; CAN’T USE Use of constants as modifiers is not allowed -- not enough bits in the opcode for parallel ops! Must use Modify registers already defined scratch = dm(pt, zeroDM); // Not ++ as to be re-used dm(pt, plus1DM) = scratch; 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 21 / 45 + B14 Speed rules IF you want adds and multiplies to occur on the same line F1 = F2 * F3, F4 = F5 + F6; – Want to do as a single instruction – Not enough bits in the opcode • Register description 4 + 4 + 4 + 4 + 4 + 4 (bits) • Plus how many bits for operation description? Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7) Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15) – Rearrange register usage for this instruction to work – Register description 4 + 2 + 2 + 4 + 2 + 2 (bits) • Inconvenient rather than really limiting -- can still use more than half of the SHARC data registers in 1 instruction 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 22 / 45 + B14 When to worry about the register assignment #define count scratchR1 #define pt scratchDMpt #define scratchF2 F2 LCNTR = INPAR2, DO LOOP_END UNTIL LDE: scratchF2 = dm(pt, 0); // Not ++ as to be re-used // INPAR1 (R4) is dead -- can reuse #define constantF4 F4 // Must be float constantF4 = 1.8; scratchF2 = scratchF2 * constantF4 // Parallel later #define F0_32 F0 F0_32 = 32.0; scratchF2 = scratchF2 + F0_32; LOOP_END: 3/12/2016 // Must be float // Parallel later dm(pt, 1) = F0_32; Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 23 / 45 + B14 Check on required register use #define count scratchR1 #define pt scratchDMpt #define scratchF2 F2 LCNTR = INPAR2, DO LOOP_END UNTIL LDE: scratchF2 = dm(pt, zeroDM); Any special requirements here on F2?? // INPAR1 (R4) is dead -- can reuse #define constantF4 F4 // Must be float constantF4 = 1.8; scratchF2 = scratchF2 * constantF4 Fn = F(0,1,2 or 3) * F(4,5,6 or 7), #define F0_32 F0 // Must be float F0_32 = 32.0; scratchF2 = scratchF2 + F0_32; Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15) LOOP_END: dm(pt, plus1DM) = scratchF2; 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 24 / 45 + B14 Register re-assignment -- Step 1 #define count scratchR1 #define pt scratchDMpt #define scratchF2 F2 -- APPEARS OKAY LCNTR = INPAR2, DO LOOP_END UNTIL LDE: scratchF2 = dm(pt, zeroDM); // INPAR1 (R4) is dead -- can reuse #define constantF4 // Must be float -- APPEARS OKAY constantF4 = 1.8; scratchF2 = scratchF2 * constantF4 -- APPEARS OKAY Fn = F(0,1,2 or 3) * F(4,5,6 or 7), #define F0_32 F0 // Must be float F0_32 = 32.0; -- WRONG to use F0 scratchF2 = scratchF2 + F0_32; -- WRONG to use F2 Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15) LOOP_END: dm(pt, plus1DM) = scratchF2; 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 25 / 45 + B14 Register re-assignment -- Step 2 #define count scratchR1 #define pt scratchDMpt #define scratchF2 F2 LCNTR = INPAR2, DO LOOP_END UNTIL LDE: scratchF2 = dm(pt, zeroDM); // INPAR1 (R4) is dead -- can reuse #define constantF4 F4 // Must be float constantF4 = 1.8; scratchF8 = scratchF2 * constantF4 FOR LATER USE answer must be in F(8, 9, 10 or 11) #define F12_32 F12 // INPAR3 is available F12_32 = 32.0; scratchF2 = scratchF8 + F12_32 ; Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15) LOOP_END: dm(pt, plus1DM) = scratchF2; 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 26 / 45 + B14 MOVE “CONSTANT” OPERATIONS #define count scratchR1 #define pt scratchDMpt #define scratchF2 F2 LCNTR = INPAR2, DO LOOP_END UNTIL LDE: scratchF2 = dm(pt, zeroDM); // INPAR1 (R4) is dead -- can reuse #define constantF4 F4 // Must be float constantF4 = 1.8; MOVE OUTSIDE LOOP scratchF8 = scratchF2 * constantF4 answer must be in F(8, 9, 10 or 11) #define F12_32 F12 // INPAR3 is available F12_32 = 32.0; MOVE OUTSIDE LOOP scratchF2 = scratchF8 + F12_32 ; Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15) LOOP_END: dm(pt, plus1DM) = scratchF2; 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 27 / 45 + B14 Process for developing custom code Rewrite the “C” code using “LOAD/STORE” techniques Write the assembly code using a hardware loop – Check that end of loop label is in the correct place REWRITE the assembly code using registers and instructions that COULD be used in parallel IF you could find the correct optimization approach Move algorithm to “Resource Usage Chart” Optimize (Attempt to) Compare and contrast time -- include set up and loop control time -- was it worth the effort? 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 28 / 45 + B14 Resource Chart -- Basic code ADDER MULTIPLIER DM ACCESS PM ACCESS _Convert: pt = INPAR1; F12_32 = 32.0 // bring constants outside the loop F4_1_8 = 1.8 LCNTR = INPAR2, DO LOOP_END UNTIL LCE; F2 = dm(pt, ZERODM) F8 = F2 * F4_1_8 F2 = F8 + F12_32 LOOP_END: dm(pt, PLUS1DM) = F2 5 magic lines of “C” Time = 4 + N * 4 + 5 + 5 to do the call 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 29 / 45 + B14 Process for developing custom code Rewrite the “C” code using “LOAD/STORE” techniques Write the assembly code using a hardware loop – Check that end of loop label is in the correct place REWRITE the assembly code using registers and instructions that COULD be used in parallel IF you could find the correct optimization approach Move algorithm to “Resource Usage Chart” Optimize (Attempt to) Compare and contrast time -- include set up and loop control time -- was it worth the effort? 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 30 / 45 + B14 Un-roll the loop Temporarily straight line your code Key technique for deciding where parallel operations are possible Careful -- will re-roll the straight line code later and then the number of parallel operations in the loop is important. Final code may requiring different loops coded for different values of the loop size – Loop size N = 3p where p is an integer – N = 3p + 1 etc 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 31 / 45 + B14 Step 1 -- unroll the loop -- 5 times here ADDER MULTIPLIER DM ACCESS F2 = dm(pt, ZERODM) F8 = F2 * F4_1_8 F2 = F8 + F12_32 dm(pt, PLUS1DM) = F2 F2 = dm(pt, ZERODM) F8 = F2 * F4_1_8 F2 = F8 + F12_32 dm(pt, PLUS1DM) = F2 F2 = dm(pt, ZERODM) F8 = F2 * F4_1_8 F2 = F8 + F12_32 dm(pt, PLUS1DM) = F2 F2 = dm(pt, ZERODM) F8 = F2 * F4_1_8 F2 = F8 + F12_32 dm(pt, PLUS1DM) = F2 F2 = dm(pt, ZERODM) F8 = F2 * F4_1_8 F2 = F8 + F12_32 dm(pt, PLUS1DM) = F2 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca R1 M1 A1 W1 R2 M2 A2 W2 R3 M3 A3 W3 R4 M4 A4 W4 R5 M5 A5 W5 32 / 45 + B14 Step 2 -- Identify resource usage in SOURCE and DESTINATION stages of the instructions -- then try to move the instructions into compound (super-scalar) operations ADDER MULTIPLIER DM ACCESS F2 = dm(pt, ZERODM) F8 = F2 * F4_1_8 F2 = F8 + F12_32 dm(pt, PLUS1DM) = F2 F2 = dm(pt, ZERODM) F8 = F2 * F4_1_8 F2 = F8 + F12_32 dm(pt, PLUS1DM) = F2 3/12/2016 Decode(Mem) SRC DEST Writeback(F2) SRC Decode(F2,F4) DEST Writeback(F8) Decode(F8,F4) SRC DEST Writeback(F2) SRC Decode(F2) DEST Writeback(Mem) Decode(Mem) SRC DEST Writeback(F2) Decode(F2,F4) SRC Writeback(F8) DEST Decode(F8,F4) SRC DEST Writeback(F2) Decode(F2) SRC Writeback(Mem) DEST Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 33 / 45 + B14 Step 3 -- Carefully check what instructions can be moved for earlier execution ADDER MULTIPLIER F8 = F2 * F4_1_8 F2 = F8 + F12_32 NO NO F8 = F2 * F4_1_8 F2 = F8 + F12_32 3/12/2016 DM ACCESS F2 = dm(pt, ZERODM) SRC Decode(Mem) DEST Writeback(F2) SRC Decode(F2,F4) DEST Writeback(F8) SRC Decode(F8,F4) DEST Writeback(F2) SRC dm(pt, PLUS1DM) = F2 Decode(F2) DEST Writeback(Mem) SRC F2 = dm(pt, ZERODM) Decode(Mem) DEST Writeback(F2) SRC Decode(F2,F4) DEST Writeback(F8) SRC Decode(F8,F4) DEST Writeback(F2) SRC dm(pt, PLUS1DM) = F2 Decode(F2) DEST Writeback(Mem) Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 34 / 45 + B14 Memory resource availability Move up F2 = dm(pt, ZERODM) from second loop into first loop – Okay since F2 is in use as source in one part of the proposed compound instruction and destination in another F8 = F2 * F4, dm (Ix, My) = F2 However now we have a possible conflict about which F2 should be used for the dm(pt, plus1DM) = F2 instruction at end of the first loop especially if the final code is going to involve multiple loops all intertwined and executing simultaneously 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 35 / 45 + B14 Step 3A -- What’s up, Doc? ADDER MULTIPLIER DM ACCESS F2 = dm(pt, ZERODM) F8 = F2 * F4_1_8 F2 = F2 = F8 + F12_32 F8 = F2 = F2 = F8 = NO dm(pt, PLUS1DM) = F2 F2 = dm(pt, ZERODM) NO F8 = F2 * F4_1_8 F2 = F8 + F12_32 dm(pt, PLUS1DM) = F2 3/12/2016 SRC Decode(Mem) Writeback(F2) DEST SRC Decode(F2,F4) DEST Writeback(F8) Decode(F8,F4) SRC DEST Writeback(F2) Decode(F2) SRC DEST Writeback(Mem) Decode(Mem) SRC DEST Writeback(F2) SRC Decode(F2,F4) DEST Writeback(F8) SRC Decode(F8,F4) Writeback(F2) DEST Decode(F2) SRC DEST Writeback(Mem) Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 36 / 45 + B14 Step 4 -- Solution -- Use F9 (after saving) Any data destination is allowed for parallel +/* ADDER MULTIPLIER F9 = F8 + F12_32 F9 = F8 + F12_32 3/12/2016 DM ACCESS F2 = dm(pt, ZERODM) SRC Decode(Mem) DEST Writeback(F2) SRC F8 = F2 * F4_1_8 Decode(F2,F4) DEST Writeback(F8) SRC Decode(F8,F4) DEST Writeback(F9) SRC dm(pt, PLUS1DM) = F9 Decode(F9) DEST Writeback(Mem) SRC F2 = dm(pt, ZERODM) Decode(Mem) DEST Writeback(F2) SRC F8 = F2 * F4_1_8 Decode(F2,F4) DEST Writeback(F8) SRC Decode(F8,F4) DEST Writeback(F9) SRC dm(pt, PLUS1DM) = F9 Decode(F9) DEST Writeback(Mem) Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 37 / 45 + B14 Step 5 -- Faster solution than original But no one resource is in full use Limiting resource should be data memory access ADDER MULTIPLIER F9 = F8 + F12_32 F9 = F8 + F12_32 F9 = F8 + F12_32 F9 = F8 + F12_32 STALL STALL STALL F9 = F8 + F12_32 F8 = F2 * F4_1_8 F8 = F2 * F4_1_8 F8 = F2 * F4_1_8 F8 = F2 * F4_1_8 STALL STALL STALL F8 = F2 * F4_1_8 DM ACCESS F2 = dm(pt, ZERODM) F2 = dm(pt, ZERODM) dm(pt, PLUS1DM) = F9 dm(pt, PLUS1DM) = F9 F2 = dm(pt, ZERODM) F2 = dm(pt, ZERODM) STALL dm(pt, PLUS1DM) = F9 dm(pt, PLUS1DM) = F9 F2 = dm(pt, ZERODM) dm(pt, PLUS1DM) = F9 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca R1 M1, R2 A1, M2 W1, A2 W2 R3 M3, R4 A3, M4 W3, A4 W4 R5 M5 A5 W5 38 / 45 + B14 Step 6 -- unroll the loop a bit more ADDER MULTIPLIER F9 = F8 + F12_32 F9 = F8 + F12_32 F9 = F8 + F12_32 F9 = F8 + F12_32 F9 = F8 + F12_32 F9 = F8 + F12_32 F9 = F8 + F12_32 F9 = F8 + F12_32 3/12/2016 F8 = F2 * F4_1_8 F8 = F2 * F4_1_8 F8 = F2 * F4_1_8 F8 = F2 * F4_1_8 F8 = F2 * F4_1_8 F8 = F2 * F4_1_8 F8 = F2 * F4_1_8 F8 = F2 * F4_1_8 DM ACCESS F2 = dm(pt, ZERODM) F2 = dm(pt, ZERODM) R1 M1, R2 A1, M2 dm(pt, PLUS1DM) = F9 W1, A2 dm(pt, PLUS1DM) = F9 W2 F2 = dm(pt, ZERODM) R3 F2 = dm(pt, ZERODM) M3, R4 F2 = dm(pt, ZERODM) A3, M4, R5 dm(pt, PLUS1DM) = F9 W3, A4, M5 dm(pt, PLUS1DM) = F9 W4, A5 dm(pt, PLUS1DM) = F9 W5 F2 = dm(pt, ZERODM) R6 F2 = dm(pt, ZERODM) M6, R7 F2 = dm(pt, ZERODM) A6, M7, R8 dm(pt, PLUS1DM) = F9 W6 A7, M8 dm(pt, PLUS1DM) = F9 W7, A8 dm(pt, PLUS1DM) = F9 W9 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 39 / 45 + B14 Now to “re-roll the loop” Execution involves overlapped loop components where the loop counter has the value p, p+1 and p+2 Where the original loop went around N times, there are now three stages associated with the any “re-rolled loop” 1) Fill the ALU pipeline 2) Overlap N - 2 times around the loop 3) Empty the ALU pipeline 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 40 / 45 + B14 Step -- Final code version ADDER _Convert: MULTIPLIER DM ACCESS Modify(CTOPofSTACK, -1); dm(FP, -2) = R9; pt = INPAR1; F12_32 = 32.0 // bring constants outside the loop F4_1_8 = 1.8 F2 = dm(pt, ZERODM) R1 F8 = F2 * F4_1_8 F2 = dm(pt, ZERODM) M1, R2 F9 = F8 + F12_32 F8 = F2 * F4_1_8 A1, M2 F9 = F8 + F12_32 dm(pt, PLUS1DM) = F9 W1, A2 dm(pt, PLUS1DM) = F9 W2 LCNTR = (N-2)/3, DO LOOP_END UNTIL LCE; F2 = dm(pt, ZERODM) R3 F8 = F2 * F4_1_8 F2 = dm(pt, ZERODM) M3, R4 F9 = F8 + F12_32 F8 = F2 * F4_1_8 A3, M4, R5 F2 = dm(pt, ZERODM) F9 = F8 + F12_32 W3, A4, M5 F8 = F2 * F4_1_8 dm(pt, PLUS1DM) = F9 dm(pt, PLUS1DM) = F9 W4, A5 F9 = F8 + F12_32 LOOP_END: dm(pt, PLUS1DM) = F9 W5 R9 = dm(FP, -2); 5 magic lines of C 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 41 / 45 + B14 Speed improvements BEFORE ANY PARALLELISM WAS INTRODUCED START 4 NOW LOOP + N*4 ENTRY +5 with 2-fold loop unfolding START 4+7 NOW EXIT +5 = 14 + 4 * N LOOP EXIT + (N – 2) * 5 / 2 + 5 + 8 = 24 + 2.5 * N ENTRY +5 with 3-fold loop unfolding START 4+5 LOOP EXIT + (N – 2) * 6 / 3 + 5 + 1 = 16 + 2 * N WARNING -- ENTRY +5 Will need 3 different coding situations N = 3p, 3p + 1, 3p + 2 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 42 / 45 + B14 Question to Ask We now know the final code Should we have made the substitution F2 to F9? Who cares -- do it anyway as more likely to be necessary rather than unnecessary in most algorithms! – No real disadvantage since we can probably overlap the save and recovery of the nonvolatile R9 with other instructions! 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 43 / 45 + B14 Parallelism requires Standard Code Development Custom Code development – Rewrite with specialized resources – Move to “resource chart” – Unroll the loop – Adjust code – Re-roll the loop – Check if worth the effort • Probably NOT -- Remember that this code runs in the middle of a lot of other code!!!!! 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 44 / 45 + B14 Resources for more detail Smith, M. R., "Code Optimization Techniques for DSP Applications", 9th IEEE DSP (DSP2000) Workshop, Hunt, Texas, October 2000. Smith, M. R. "The SHARC in the C", Circuit Cellar Online Magazine, April 2000. Smith, M. R., "Code Optimization Techniques -- the case of 'The SHARC versus the Minnow’ -- Part 1 -- The Minnow's Viewpoint", accepted for publication in Electronic Design Magazine, September 2000. Smith, M. R., "Code Optimization Techniques -- the case of 'The SHARC versus the Minnow" -- Part 2 -- The Byte of the SHARC", accepted for publication in Electronic Design Magazine October 2000. Smith, M. R. and L. E. Turner, "Are you hurting your data through a lack of bit cushions? --Tthe effect of finite precision in embedded systems", based on an SHARC99 paper, submitted January 2000 for publication in Circuit Cellar Online. 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 45 / 45 + B14 Another example Probably not enough time to cover in the workshop Calculate instantaneous and average power of a complex signal // short ints are 16-bit values on this machine short int Power(short int real[ ], short int imag[ ], short int power[ ], short int Npts) { short int count = 0; short int totalpower = 0; short int re_power, im_power; for (count = 0; count < Npts; count++) { re_power = real[count] * real[count]; im_power = imag[count] * imag[count]; power[count] = re_power + im_power; totalpower += re_power + im_power; } return (totalpower / Npts); } 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 47 / 45 + B14 Code rewritten to provide VisualDSP compiler the opportunity to some parallel optimization including using multiple data busses 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 48 / 45 + B14 float Power(float dm *real, float pm *imag, float dm *power, short int Npts) { short int count = 0; float totalpower = 0; float re_power, im_power; float temp; // Following unrolled code works for Npts divisible by 2 if ( (Npts % 2) != 0 ) exit (0); for (count = 0; count < Npts / 2; count++) { re_power = *real++; im_power = *imag++; temp=re_power*re_power+im_power*im_power; *power++ = temp; totalpower += temp; re_power = *real++; im_power = *imag++; temp=re_power*re_power+im_power*im_power; *power++ = temp; totalpower += temp; } return (totalpower / Npts); } 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 49 / 45 + B14 r13=dm(i0,dm_one); // real [ ] on dm // Hardware loop lcntr=r11, do(pc,_L$816004-1)until lce; // Access to imag[ ] data along pm bus as wanted r3=pm(i8,pm_one); // imag[ ] on pm F8=F13*F13; F12=F3*F3; F13=F8+F12; // Part of second part of the loop r3=pm(i8,pm_one); // imag[ ] on pm F10=F10+F13; dm(i1,dm_one)=r13; // power[ ] on dm r13=dm(i0,dm_one); // real [ ] on dm F9=F13*F13; F14=F3*F3; F13=F9+F14; dm(i1,dm_one)=r13; F10=F10+F13; // power[ ] on dm r13=dm(i0,dm_one); !end loop // real [ ] on dm _L$816004 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 50 / 45 + B14 VisualDSP Compiler generates code using program memory bus for data movement, but does not do any optimizing. Hand optimizing can reduce these 14 lines generated by the compiler to just 7 without getting particularly fancy 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 51 / 45 + B14 Hand optimizing the compiler output 14 cycles reduced to 7 lcntr=r11, do(pc,_L$816004-1)until lce; // Dual access along dm and pm data busses r13=dm(i0,dm_one), r3=pm(i8,pm_one); // pm_zero contains zero to surpress the auto-incrementing mode F8=F13*F13, r1=dm(i0,dm_one), r4=pm(i8,pm_zero); // The value in F1 must be passed over to F5 in order to // prepare for the combined multiplication and addition // operation F12=F3*F3, F5 = F1; // Accessing pm memory is an alternate approach to preparing // for parallel multiplication and addition operations // One cycle overhead first time round the loop. F9=F1*F5, F13=F8+F12, r2=pm(i8,pm_one); F14=F2*F4, F10=F10+F13, dm(i1,dm_one)=r13; F13=F9+F14; F10=F10+F13, dm(i1,dm_one)=r13; !end loop _L$816004: Introduction to highly parallel SHARC code 3/12/2016 Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 52 / 45 + B14 Process for developing custom code Rewrite the “C” code using “LOAD/STORE” techniques Write the assembly code using a hardware loop – Check that end of loop label is in the correct place REWRITE the assembly code using registers and instructions that COULD be used in parallel IF you could find the correct optimization approach Move algorithm to “Resource Usage Chart” Optimize (Attempt to) Compare and contrast time -- include set up and loop control time -- was it worth the effort? 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 53 / 45 + B14 Generate the resource usage chart Here are the 7 cycles needed during EVERY calculation of the power 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 54 / 45 + B14 2 cycles / calculation on average after pipelining IF you ignore DM Memory Operations -- 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 55 / 45 + B14 Know the processor characteristics Need two extra DM cycles or equivalent Reorder the code to give – 1 extra DM cycle in parallel with a register to register move But R1 = R2 form of operation is a UREG to UREG move and will not fit into the instruction So REPLACE UREG to UREG move with a COMPUTE OPERATION R1 = PASS R2 End up with 2.5 cycles/calculation instead of original 7 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 56 / 45 + B14 Final code -- testing a pain 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 57 / 45 + B14 More luck than judgement Unlike the first “easier” Temperature Conversion code, this “hard” example actually optimizes much more, especially in term of overall code length. This particular length of code happens to work REGARDLESS of the size of N 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 58 / 45 + B14 Resources for more detail Smith, M. R., "Code Optimization Techniques for DSP Applications", 9th IEEE DSP (DSP2000) Workshop, Hunt, Texas, October 2000. Smith, M. R. "The SHARC in the C", Circuit Cellar Online Magazine, April 2000. Smith, M. R., "Code Optimization Techniques -- the case of 'The SHARC versus the Minnow’ -- Part 1 -- The Minnow's Viewpoint", accepted for publication in Electronic Design Magazine, September 2000. Smith, M. R., "Code Optimization Techniques -- the case of 'The SHARC versus the Minnow" -- Part 2 -- The Byte of the SHARC", accepted for publication in Electronic Design Magazine October 2000. Smith, M. R. and L. E. Turner, "Are you hurting your data through a lack of bit cushions? --Tthe effect of finite precision in embedded systems", based on an SHARC99 paper, submitted January 2000 for publication in Circuit Cellar Online. 3/12/2016 Introduction to highly parallel SHARC code Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca 59 / 45 + B14