Coding highly parallel instructions on ADSP2106X

Generation of highly parallel
code for 2106X processors
An introduction
Developed by M. R. Smith
Presented by S. Lei
SHARC2000 Workshop, Boston, September 2000
Background assumed
Familiarity with SHARC 2106X
architecture
Familiarity with SHARC
programmer’s model for registers
Some assembly experience
An interest in beating the compiler
in those special cases when you
need the last drop of blood out of
the CPU :-)
3/12/2016
Introduction to highly parallel SHARC code
Copyright M. Smith and S. Lei Contact smith@enel.ucalgary.ca
2 / 45 + B14
To be tackled
What’s causing the problem
– General limitations of instruction sets
How to recognize when you might
be coming up against SHARC
architecture limitations
A process for optimizing the SHARC
parallelism
– Example -- Temperature conversion
– Bonus if time permits
-- Average and instantaneous power
Efficient Move for 68k -- MOVEQ.L
• Want instruction to work with 1 memory FETCH
  – 16 bits available to describe operation
• 5 bits taken up to say MOVEQ.L instruction and not something else
• 3 bits taken up for the 8 possible destination data registers
• ONLY 8 bits left to describe value
  – Value = +127 to -128 -- NOTHING ELSE
  – Value is sign extended to 32 bits
0 1 1 1 D D D 0 P P P P P P P P
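The bit layout above can be sketched in C. This is a minimal illustration of how the 16-bit word is packed, not an assembler; the function names `moveq_encode` and `moveq_value` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack MOVEQ #value,Dn into the 16-bit pattern 0111 DDD 0 PPPPPPPP. */
uint16_t moveq_encode(unsigned dreg, int value)
{
    assert(dreg <= 7);                       /* 3 bits: D0..D7 */
    assert(value >= -128 && value <= 127);   /* 8 bits, signed */
    return (uint16_t)(0x7000u | (dreg << 9) | ((unsigned)value & 0xFFu));
}

/* Recover the 32-bit value that is loaded: sign-extend the low 8 bits. */
int32_t moveq_value(uint16_t opcode)
{
    int32_t low = opcode & 0xFF;
    return (low >= 128) ? low - 256 : low;
}
```

Anything outside -128 to +127 simply does not fit in the 8 value bits, which is why the assertion rejects it rather than truncating.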
Same basic issues on SHARC
• You can’t do EVERYTHING with ALL possible resources
• Compute/dreg<->DM/dreg<->PM
  – 3 bits opcode
  – 2 bits for direction of memory ops
  – ONLY 12 bits available to describe 4 DAG registers
  – 8 bits to describe which registers used for destination/source
  – 23 bits to describe Compute operations
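As a quick check on the field widths listed above, the sketch below adds them up. They total 48 bits, the width of one SHARC instruction word, and 12 bits shared by 4 DAG registers leaves 3 bits each. The enum and function names are hypothetical, and the code says nothing about where each field sits in the word.

```c
#include <assert.h>

/* Field widths quoted on the slide for Compute/dreg<->DM/dreg<->PM. */
enum {
    OPCODE_BITS    = 3,
    DIRECTION_BITS = 2,
    DAG_BITS       = 12,
    DREG_BITS      = 8,
    COMPUTE_BITS   = 23
};

/* Total number of bits consumed by the five fields. */
int instruction_bits(void)
{
    return OPCODE_BITS + DIRECTION_BITS + DAG_BITS + DREG_BITS + COMPUTE_BITS;
}

/* Bits left to name each of the 4 DAG registers (2 index + 2 modify). */
int bits_per_dag_register(void)
{
    assert(DAG_BITS % 4 == 0);
    return DAG_BITS / 4;
}
```

With 3 bits per DAG register, each field can pick from only 8 registers, so not every index or modify register can appear in this instruction form.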
When are DSP instructions valid?
You are going to customize
– When can you use the DSP instructions?
– Most -- From Monday to Friday
– Some -- Only between 9:00 a.m. and 9:00 p.m.
• Check against architecture
• 21k -- Parallel ops MUST be able to do this
  – Can it be fetched in one cycle (op-code size)?
  – Can it be executed in one cycle (resource question)?
  – Can it execute without conflicting with other instructions?
  – Then PROBABLY legal
• HOWEVER -- The designers had the final decision and you have to live by that decision!
You can’t do parallel Memory to UREG ops
• Note you need 8 bits to describe just one UREG out of all possible UREGs
dm(<addr>) = ureg
  – instruction = ? bits
  – addr described in 32 bits
  – UREG description needs 8 bits
• JUST enough instruction bits to allow dm(<offset>, Ireg) = Ureg
  – NOTE that the offset always takes the maximum number of bits, even if offset = 1
Pipeline considerations -- REAL ISSUE!
R2 = R1 + R3, R3 = dm(I2, M2), pm(I8,M9) = R2
The R2 in R2 = R1 + R3 is not the R2 in pm(I8,M9) = R2
The R3 in R2 = R1 + R3 is not the R3 in R3 = dm(I2, M2)
--------------------------------------------------------------
You can do R3 = dm(I2, M2), pm(I8,M9) = R2
but you can’t do R3 = dm(I2, M2), dm(I3,M3) = R2
even though it looks like the data bus is free for accesses at the beginning and end of a cycle, because it ain’t.
Memory accesses take the WHOLE cycle to complete
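One way to picture "the R2 here is not the R2 there" is a small C model of a single cycle. It assumes that every source register is sampled before any destination is updated, which is how the slide's statement is read here. The `Regs` struct and `parallel_step` are hypothetical names for illustration only.

```c
#include <assert.h>

typedef struct { int r1, r2, r3; } Regs;

/* One cycle of: R2 = R1 + R3, R3 = dm(...), pm(...) = R2
 * All sources are read from a snapshot taken at the start of the cycle;
 * destinations are written afterwards. */
void parallel_step(Regs *r, int dm_word, int *pm_word)
{
    assert(r != 0 && pm_word != 0);
    const Regs old = *r;          /* values at the start of the cycle */
    r->r2 = old.r1 + old.r3;      /* the add sees the OLD R3 */
    r->r3 = dm_word;              /* the load replaces R3 at the end of the cycle */
    *pm_word = old.r2;            /* the store sends out the OLD R2 */
}
```

Starting with R1 = 1, R2 = 100, R3 = 10 and a data-memory word of 7, the model leaves R2 = 11 and R3 = 7, and writes 100 to program memory.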
Compute operations
• Only 23 bits available
• Requires 1 destination and 2 sources
• ONLY work on data registers as there are not enough instruction bits to describe 3 uregs
• R1 = R2 + R3    ALLOWED
• R2 = R3 + 2     NOT ALLOWED
• I1 = I2 + I3    NOT ALLOWED
• Compute operations can be made conditional, and also combined with UREG to UREG moves (instead of memory operations)
• NO PARKING BETWEEN 8:30 and 9:30 IF R IN THE MONTH
R1 = R2 + R3 can sometimes be ILLEGAL
Under best conditions
• If instruction described the right way
  – 1 data memory access (in or out) with a REQUIRED post modification operation, possibly with a modify register containing the value 0
  – 1 program memory access (in or out) PROVIDED that the NEXT instruction being fetched is stored in the instruction cache
  – 1 compute operation on data registers (EXCEPT for certain multi-function instructions with specific registers)
Introduction to PPPPIC
Professor’s Personal Process for Parallel Instruction Coding
Basic code development -- any system
• Write the “C” code for the function
  void Convert(float *temperature, int N)
  which converts an array of temperatures measured in “Celsius” (Canadian Market) to “Fahrenheit” (Tourist Trade)
• Convert the code to ADSP 21061/68K etc. assembly code, following the standard coding and documentation practices, or just use the compiler to do the job for you
Standard “C” code
void Convert(float *temperature, int N) {
    int count;
    for (count = 0; count < N; count++) {
        *temperature = (*temperature) * 9 / 5 + 32;
        temperature++;
    }
}
Process for developing custom code
• Rewrite the “C” code using “LOAD/STORE” techniques -- 2106X is essentially superscalar RISC
• Write the assembly code using a hardware loop
  – Check that end of loop label is in the correct place
• REWRITE the assembly code using registers and instructions that COULD be used in parallel IF you could find the correct optimization approach
• Move algorithm to “Resource Usage Chart”
• Optimize (Attempt to)
• Compare and contrast time -- include set up and loop control time -- was it worth the effort?
21061-style load/store “C” code
void Convert(register float *temperature, register int N) {
    register int count;
    register float *pt = temperature;
    register float scratch;
    for (count = 0; count < N; count++) {
        scratch = *pt;
        scratch = scratch * (9.0f / 5.0f);   // float constants: integer 9 / 5 would be 1
        scratch = scratch + 32;
        *pt = scratch;
        pt++;
    }
}
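A note on the scaling step: when the factor nine-fifths is written with integer literals and parentheses, C evaluates it in integer arithmetic before the multiply. The sketch below contrasts the two spellings. The function names are hypothetical and exist only for this comparison.

```c
#include <assert.h>

/* Integer literals: 9 / 5 is evaluated as int and truncates to 1,
 * so the temperature is multiplied by 1 instead of 1.8. */
float scale_with_int_literals(float celsius)
{
    assert(9 / 5 == 1);               /* documents the truncation */
    return celsius * (9 / 5);
}

/* Float literals: 9.0f / 5.0f is 1.8f, the factor the conversion needs. */
float scale_with_float_literals(float celsius)
{
    return celsius * (9.0f / 5.0f);
}
```

The unparenthesized form `(*temperature) * 9 / 5` does not have this problem, because the multiply by the float happens first and the division is then done in floating point.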
All assembly code routines REQUIRE
• PROLOGUE
  – Appropriate defines to make easy reading of code
  – Saving of non-volatile registers
• CODE BODY -- what you want to do
  – Try to plan ahead for parallel operations
  – Know which 21k “multi-functions” are valid with which registers.
• EPILOGUE
  – Recover non-volatile registers
Straight conversion -- PROLOGUE
// void Convert(reg float *temperature, reg int N) {
.segment/pm seg_pmco;
.global _Convert;
_Convert:
// register int count = GARBAGE;
#define count scratchR1
// register float *pt = temperature;
#define pt scratchDMpt
pt = INPAR1;
// float scratch = GARBAGE;
#define scratchF2 F2
// For the CURRENT code -- no non-volatile
// registers are needed -- may not remain true
Straight conversion of BODY and EPILOGUE
// for (count = 0; count < N; count++) {
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
// scratch = *pt;
scratchF2 = dm(pt, 0);                // Not ++ as pt re-used
// scratch = scratch * (9 / 5);
// INPAR1 (R4) is dead -- can reuse as F4
#define constantF4 F4                 // Must be float
constantF4 = 1.8;                     // No division needed, Use register constant
scratchF2 = scratchF2 * constantF4;
// scratch = scratch + 32, Register constant;
#define F0_32 F0                      // Must be float
F0_32 = 32.0;
scratchF2 = scratchF2 + F0_32;
// *pt = scratch; pt++;
LOOP_END:
dm(pt, 1) = scratchF2;
5 magic lines of code used to return -- EPILOGUE
Speed rules for memory access
scratch = dm(pt, 0);          // Not ++ as to be re-used
dm(pt, 1) = scratch;
CAN’T USE
Use of constants as modifiers is not allowed -- not enough bits in the opcode for parallel ops!
Must use Modify registers already defined
scratch = dm(pt, zeroDM);     // Not ++ as to be re-used
dm(pt, plus1DM) = scratch;
Speed rules IF you want adds and multiplies to occur on the same line
• F1 = F2 * F3, F4 = F5 + F6;
  – Want to do as a single instruction
  – Not enough bits in the opcode
    • Register description 4 + 4 + 4 + 4 + 4 + 4 (bits)
    • Plus how many bits for operation description?
• Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
• Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
  – Rearrange register usage for this instruction to work
  – Register description 4 + 2 + 2 + 4 + 2 + 2 (bits)
    • Inconvenient rather than really limiting -- can still use more than half of the SHARC data registers in 1 instruction
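The register groups above can be turned into a small checker, which is handy when planning register assignments by hand. This is a sketch that encodes only the two rules quoted on this slide. The function names are hypothetical, and the real assembler may apply further restrictions.

```c
#include <assert.h>
#include <stdbool.h>

/* True when  Fn = F(mx) * F(my), Fm = F(ax) + F(ay)  uses source registers
 * from the groups quoted above. Destinations n and m may be any of F0..F15. */
bool parallel_mul_add_ok(int n, int mx, int my, int m, int ax, int ay)
{
    assert(n >= 0 && n <= 15 && m >= 0 && m <= 15);
    bool mul_ok = (mx >= 0 && mx <= 3) && (my >= 4 && my <= 7);
    bool add_ok = (ax >= 8 && ax <= 11) && (ay >= 12 && ay <= 15);
    return mul_ok && add_ok;
}

/* Register-field cost: six 4-bit fields when any register may be named ... */
int unrestricted_register_bits(void) { return 6 * 4; }

/* ... versus a 4-bit destination plus two 2-bit sources per operation. */
int restricted_register_bits(void) { return 2 * (4 + 2 + 2); }
```

The unrestricted form needs 24 bits for registers alone, which already exceeds the 23 bits available for a compute operation. The restricted form needs 16, leaving room for the operation description. For example, `parallel_mul_add_ok(8, 2, 4, 2, 8, 12)` accepts F8 = F2 * F4, F2 = F8 + F12.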
When to worry about the register assignment
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, 0);                // Not ++ as to be re-used
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                 // Must be float
constantF4 = 1.8;
scratchF2 = scratchF2 * constantF4;   // Parallel later
#define F0_32 F0                      // Must be float
F0_32 = 32.0;
scratchF2 = scratchF2 + F0_32;        // Parallel later
LOOP_END:
dm(pt, 1) = scratchF2;
Check on required register use
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, zeroDM);           -- Any special requirements here on F2??
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                 // Must be float
constantF4 = 1.8;
scratchF2 = scratchF2 * constantF4;   -- Fn = F(0,1,2 or 3) * F(4,5,6 or 7)
#define F0_32 F0                      // Must be float
F0_32 = 32.0;
scratchF2 = scratchF2 + F0_32;        -- Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
LOOP_END: dm(pt, plus1DM) = scratchF2;
Register re-assignment -- Step 1
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2                  -- APPEARS OKAY
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                 // Must be float -- APPEARS OKAY
constantF4 = 1.8;
scratchF2 = scratchF2 * constantF4;   -- APPEARS OKAY: Fn = F(0,1,2 or 3) * F(4,5,6 or 7)
#define F0_32 F0                      // Must be float
F0_32 = 32.0;                         -- WRONG to use F0
scratchF2 = scratchF2 + F0_32;        -- WRONG to use F2: Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
LOOP_END: dm(pt, plus1DM) = scratchF2;
Register re-assignment -- Step 2
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                 // Must be float
constantF4 = 1.8;
scratchF8 = scratchF2 * constantF4;   -- FOR LATER USE answer must be in F(8, 9, 10 or 11)
#define F12_32 F12                    // INPAR3 is available
F12_32 = 32.0;
scratchF2 = scratchF8 + F12_32;       -- Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
LOOP_END: dm(pt, plus1DM) = scratchF2;
MOVE “CONSTANT” OPERATIONS
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                 // Must be float
constantF4 = 1.8;                     -- MOVE OUTSIDE LOOP
scratchF8 = scratchF2 * constantF4;   -- answer must be in F(8, 9, 10 or 11)
#define F12_32 F12                    // INPAR3 is available
F12_32 = 32.0;                        -- MOVE OUTSIDE LOOP
scratchF2 = scratchF8 + F12_32;       -- Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
LOOP_END: dm(pt, plus1DM) = scratchF2;
Resource Chart -- Basic code
            ADDER               MULTIPLIER          DM ACCESS               PM ACCESS
_Convert:
  pt = INPAR1;
  F12_32 = 32.0           // bring constants outside the loop
  F4_1_8 = 1.8
  LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
                                                    F2 = dm(pt, ZERODM)
                                F8 = F2 * F4_1_8
            F2 = F8 + F12_32
LOOP_END:                                           dm(pt, PLUS1DM) = F2
  5 magic lines of “C”
Time = 4 + N * 4 + 5 + 5 to do the call
Un-roll the loop
Temporarily straight line your code
• Key technique for deciding where parallel operations are possible
• Careful -- will re-roll the straight line code later and then the number of parallel operations in the loop is important.
• Final code may require different loops coded for different values of the loop size
  – Loop size N = 3p where p is an integer
  – N = 3p + 1 etc
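The loop-size cases above (N = 3p, 3p + 1 and so on) can be sketched in C as a loop unrolled three times followed by a short tail for the leftover elements. This is an illustration of the idea only, not the SHARC code. `convert_unrolled3` and `c_to_f` are hypothetical names, and the constants 1.8 and 32 are those of the temperature example.

```c
#include <assert.h>
#include <stddef.h>

static float c_to_f(float c) { return c * 1.8f + 32.0f; }

/* Convert n Celsius values in place, three per trip round the main loop. */
void convert_unrolled3(float *t, size_t n)
{
    assert(t != NULL || n == 0);
    size_t i = 0;

    /* Main loop: handles the largest multiple of 3 that fits in n. */
    for (; i + 3 <= n; i += 3) {
        t[i]     = c_to_f(t[i]);
        t[i + 1] = c_to_f(t[i + 1]);
        t[i + 2] = c_to_f(t[i + 2]);
    }

    /* Tail: 0, 1 or 2 leftover elements (N = 3p, 3p + 1, 3p + 2). */
    for (; i < n; i++) {
        t[i] = c_to_f(t[i]);
    }
}
```

In hand-written assembly the tail would typically be separate straight-line code for each remainder, which is why the slide warns that different loops may be needed for different values of N.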
Step 1 -- unroll the loop -- 5 times here
ADDER               MULTIPLIER          DM ACCESS
                                        F2 = dm(pt, ZERODM)     R1
                    F8 = F2 * F4_1_8                            M1
F2 = F8 + F12_32                                                A1
                                        dm(pt, PLUS1DM) = F2    W1
                                        F2 = dm(pt, ZERODM)     R2
                    F8 = F2 * F4_1_8                            M2
F2 = F8 + F12_32                                                A2
                                        dm(pt, PLUS1DM) = F2    W2
                                        F2 = dm(pt, ZERODM)     R3
                    F8 = F2 * F4_1_8                            M3
F2 = F8 + F12_32                                                A3
                                        dm(pt, PLUS1DM) = F2    W3
                                        F2 = dm(pt, ZERODM)     R4
                    F8 = F2 * F4_1_8                            M4
F2 = F8 + F12_32                                                A4
                                        dm(pt, PLUS1DM) = F2    W4
                                        F2 = dm(pt, ZERODM)     R5
                    F8 = F2 * F4_1_8                            M5
F2 = F8 + F12_32                                                A5
                                        dm(pt, PLUS1DM) = F2    W5
Step 2 -- Identify resource usage in SOURCE and DESTINATION stages of the instructions -- then try to move the instructions into compound (super-scalar) operations
ADDER               MULTIPLIER          DM ACCESS               SRC              DEST
                                        F2 = dm(pt, ZERODM)     Decode(Mem)      Writeback(F2)
                    F8 = F2 * F4_1_8                            Decode(F2,F4)    Writeback(F8)
F2 = F8 + F12_32                                                Decode(F8,F4)    Writeback(F2)
                                        dm(pt, PLUS1DM) = F2    Decode(F2)       Writeback(Mem)
                                        F2 = dm(pt, ZERODM)     Decode(Mem)      Writeback(F2)
                    F8 = F2 * F4_1_8                            Decode(F2,F4)    Writeback(F8)
F2 = F8 + F12_32                                                Decode(F8,F4)    Writeback(F2)
                                        dm(pt, PLUS1DM) = F2    Decode(F2)       Writeback(Mem)
Step 3 -- Carefully check what instructions can be moved for earlier execution
ADDER               MULTIPLIER          DM ACCESS               SRC              DEST
                                        F2 = dm(pt, ZERODM)     Decode(Mem)      Writeback(F2)
                    F8 = F2 * F4_1_8                            Decode(F2,F4)    Writeback(F8)
F2 = F8 + F12_32                                                Decode(F8,F4)    Writeback(F2)
                                        dm(pt, PLUS1DM) = F2    Decode(F2)       Writeback(Mem)
                                        F2 = dm(pt, ZERODM)     Decode(Mem)      Writeback(F2)
                    F8 = F2 * F4_1_8                            Decode(F2,F4)    Writeback(F8)
F2 = F8 + F12_32                                                Decode(F8,F4)    Writeback(F2)
                                        dm(pt, PLUS1DM) = F2    Decode(F2)       Writeback(Mem)
(Two entries in the ADDER / MULTIPLIER columns, between the two iterations, are marked NO)
Memory resource availability
• Move up F2 = dm(pt, ZERODM) from second loop into first loop
  – Okay since F2 is in use as source in one part of the proposed compound instruction and destination in another
    F8 = F2 * F4, dm (Ix, My) = F2
• However now we have a possible conflict about which F2 should be used for the
  dm(pt, plus1DM) = F2
  instruction at end of the first loop, especially if the final code is going to involve multiple loops all intertwined and executing simultaneously
Step 3A -- What’s up, Doc?
[Resource usage chart: ADDER / MULTIPLIER / DM ACCESS columns for the two unrolled iterations (F2 = dm(pt, ZERODM); F8 = F2 * F4_1_8; F2 = F8 + F12_32; dm(pt, PLUS1DM) = F2) with their SRC Decode and DEST Writeback stages, annotated with "F2 =" and "F8 =" destination markers and two entries marked NO]
Step 4 -- Solution -- Use F9 (after saving)
Any data destination is allowed for parallel +/*
ADDER               MULTIPLIER          DM ACCESS               SRC              DEST
                                        F2 = dm(pt, ZERODM)     Decode(Mem)      Writeback(F2)
                    F8 = F2 * F4_1_8                            Decode(F2,F4)    Writeback(F8)
F9 = F8 + F12_32                                                Decode(F8,F4)    Writeback(F9)
                                        dm(pt, PLUS1DM) = F9    Decode(F9)       Writeback(Mem)
                                        F2 = dm(pt, ZERODM)     Decode(Mem)      Writeback(F2)
                    F8 = F2 * F4_1_8                            Decode(F2,F4)    Writeback(F8)
F9 = F8 + F12_32                                                Decode(F8,F4)    Writeback(F9)
                                        dm(pt, PLUS1DM) = F9    Decode(F9)       Writeback(Mem)
Step 5 -- Faster solution than original
But no one resource is in full use
Limiting resource should be data memory access
ADDER               MULTIPLIER          DM ACCESS
                                        F2 = dm(pt, ZERODM)     R1
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M1, R2
F9 = F8 + F12_32    F8 = F2 * F4_1_8                            A1, M2
F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W1, A2
                                        dm(pt, PLUS1DM) = F9    W2
                                        F2 = dm(pt, ZERODM)     R3
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M3, R4
F9 = F8 + F12_32    F8 = F2 * F4_1_8    STALL                   A3, M4
F9 = F8 + F12_32    STALL               dm(pt, PLUS1DM) = F9    W3, A4
STALL               STALL               dm(pt, PLUS1DM) = F9    W4
STALL               STALL               F2 = dm(pt, ZERODM)     R5
STALL               F8 = F2 * F4_1_8                            M5
F9 = F8 + F12_32                                                A5
                                        dm(pt, PLUS1DM) = F9    W5
Step 6 -- unroll the loop a bit more
ADDER               MULTIPLIER          DM ACCESS
                                        F2 = dm(pt, ZERODM)     R1
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M1, R2
F9 = F8 + F12_32    F8 = F2 * F4_1_8                            A1, M2
F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W1, A2
                                        dm(pt, PLUS1DM) = F9    W2
                                        F2 = dm(pt, ZERODM)     R3
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M3, R4
F9 = F8 + F12_32    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     A3, M4, R5
F9 = F8 + F12_32    F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9    W3, A4, M5
F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W4, A5
                                        dm(pt, PLUS1DM) = F9    W5
                                        F2 = dm(pt, ZERODM)     R6
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M6, R7
F9 = F8 + F12_32    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     A6, M7, R8
F9 = F8 + F12_32    F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9    W6, A7, M8
F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W7, A8
                                        dm(pt, PLUS1DM) = F9    W8
Now to “re-roll the loop”
• Execution involves overlapped loop components where the loop counter has the value p, p+1 and p+2
• Where the original loop went around N times, there are now three stages associated with any “re-rolled loop”
  1) Fill the ALU pipeline
  2) Overlap N - 2 times around the loop
  3) Empty the ALU pipeline
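The three stages above can be sketched in C as a software-pipelined loop. This is only an illustration of the fill / overlap / empty structure for the temperature example, not the SHARC code. It assumes separate read and write indices (`rd`, `wr`), processes one element per trip, and reuses the register names F2, F8 and F9 as ordinary variables. `convert_pipelined` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Convert n Celsius values in place with overlapped read, multiply, add, write. */
void convert_pipelined(float *t, size_t n)
{
    assert(t != NULL || n == 0);
    if (n < 2) {                       /* too short to overlap anything */
        if (n == 1) t[0] = t[0] * 1.8f + 32.0f;
        return;
    }

    size_t rd = 0, wr = 0;             /* read index runs ahead of write index */
    float f2, f8, f9;

    /* 1) Fill the pipeline */
    f2 = t[rd++];                      /* read element 1     */
    f8 = f2 * 1.8f;                    /* multiply element 1 */
    f2 = t[rd++];                      /* read element 2     */

    /* 2) Overlap N - 2 times: finish one element while starting later ones */
    for (size_t k = 0; k < n - 2; k++) {
        f9 = f8 + 32.0f;               /* add for element k+1      */
        f8 = f2 * 1.8f;                /* multiply for element k+2 */
        f2 = t[rd++];                  /* read element k+3         */
        t[wr++] = f9;                  /* write element k+1        */
    }

    /* 3) Empty the pipeline */
    f9 = f8 + 32.0f;
    t[wr++] = f9;                      /* element N-1 */
    f8 = f2 * 1.8f;
    f9 = f8 + 32.0f;
    t[wr++] = f9;                      /* element N   */
}
```

Inside the overlap loop the add, multiply, read and write all belong to different elements, so none depends on another result from the same trip. That independence is what lets them share one instruction on a machine with parallel units.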
Step -- Final code version
            ADDER               MULTIPLIER          DM ACCESS
_Convert:
  Modify(CTOPofSTACK, -1);
  dm(FP, -2) = R9;
  pt = INPAR1;
  F12_32 = 32.0           // bring constants outside the loop
  F4_1_8 = 1.8
                                                    F2 = dm(pt, ZERODM)     R1
                                F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M1, R2
            F9 = F8 + F12_32    F8 = F2 * F4_1_8                            A1, M2
            F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W1, A2
                                                    dm(pt, PLUS1DM) = F9    W2
  LCNTR = (N-2)/3, DO LOOP_END UNTIL LCE;
                                                    F2 = dm(pt, ZERODM)     R3
                                F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M3, R4
            F9 = F8 + F12_32    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     A3, M4, R5
            F9 = F8 + F12_32    F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9    W3, A4, M5
            F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W4, A5
LOOP_END:                                           dm(pt, PLUS1DM) = F9    W5
  R9 = dm(FP, -2);
  5 magic lines of C
Speed improvements
                                         START    LOOP                 EXIT      ENTRY
BEFORE ANY PARALLELISM WAS INTRODUCED    4        + N * 4              + 5       + 5      = 14 + 4 * N
NOW with 2-fold loop unfolding           4 + 7    + (N – 2) * 5 / 2    + 5 + 8   + 5      = 24 + 2.5 * N
NOW with 3-fold loop unfolding           4 + 5    + (N – 2) * 6 / 3    + 5 + 1   + 5      = 16 + 2 * N
WARNING -- Will need 3 different coding situations
N = 3p, 3p + 1, 3p + 2
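To judge whether the effort pays off for a given array size, the three cycle counts quoted on this slide can be evaluated directly. The sketch below implements those closed-form totals as a simple model; the function names are hypothetical. Fractional cycle counts appear when N - 2 is not a multiple of the unfolding factor, which is one reason separate code is needed for the different remainders.

```c
#include <assert.h>

/* Cycle-count models taken from the totals on the slide. */
double cycles_basic(int n) { assert(n >= 0); return 14.0 + 4.0 * n; }
double cycles_2fold(int n) { assert(n >= 2); return 24.0 + 2.5 * n; }
double cycles_3fold(int n) { assert(n >= 2); return 16.0 + 2.0 * n; }

/* Smallest N in [2, n_max] where `fast` beats the basic loop, or -1 if none. */
int break_even(double (*fast)(int), int n_max)
{
    for (int n = 2; n <= n_max; n++) {
        if (fast(n) < cycles_basic(n)) {
            return n;
        }
    }
    return -1;
}
```

For N = 100 the model gives 414 cycles for the basic loop, 274 with 2-fold unfolding and 216 with 3-fold unfolding. For very short arrays the extra start-up and exit cost of the unfolded versions eats much of the gain.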
Question to Ask
• We now know the final code
• Should we have made the substitution F2 to F9?
• Who cares -- do it anyway as more likely to be necessary rather than unnecessary in most algorithms!
  – No real disadvantage since we can probably overlap the save and recovery of the non-volatile R9 with other instructions!
Parallelism requires
Standard Code Development
Custom Code development
– Rewrite with specialized resources
– Move to “resource chart”
– Unroll the loop
– Adjust code
– Re-roll the loop
– Check if worth the effort
• Probably NOT -- Remember that this code
runs in the middle of a lot of other code!!!!!
Resources for more detail
Smith, M. R., "Code Optimization Techniques for DSP Applications", 9th IEEE DSP (DSP2000) Workshop, Hunt, Texas, October 2000.
Smith, M. R., "The SHARC in the C", Circuit Cellar Online Magazine, April 2000.
Smith, M. R., "Code Optimization Techniques -- the case of 'The SHARC versus the Minnow' -- Part 1 -- The Minnow's Viewpoint", accepted for publication in Electronic Design Magazine, September 2000.
Smith, M. R., "Code Optimization Techniques -- the case of 'The SHARC versus the Minnow' -- Part 2 -- The Byte of the SHARC", accepted for publication in Electronic Design Magazine, October 2000.
Smith, M. R. and L. E. Turner, "Are you hurting your data through a lack of bit cushions? -- The effect of finite precision in embedded systems", based on a SHARC99 paper, submitted January 2000 for publication in Circuit Cellar Online.
Another example
Probably not enough time to
cover in the workshop
Calculate instantaneous and average
power of a complex signal
// short ints are 16-bit values on this machine
short int Power(short int real[ ], short int imag[ ],
short int power[ ], short int Npts) {
short int count = 0;
short int totalpower = 0;
short int re_power, im_power;
for (count = 0; count < Npts; count++) {
re_power = real[count] * real[count];
im_power = imag[count] * imag[count];
power[count] = re_power + im_power;
totalpower += re_power + im_power;
}
return (totalpower / Npts);
}
Code rewritten to provide the VisualDSP compiler the opportunity to do some parallel optimization, including using multiple data busses
float Power(float dm *real, float pm *imag,
float dm *power, short int Npts) {
short int count = 0;
float totalpower = 0;
float re_power, im_power;
float temp;
// Following unrolled code works for Npts divisible by 2
if ( (Npts % 2) != 0 ) exit (0);
for (count = 0; count < Npts / 2; count++) {
re_power = *real++;
im_power = *imag++;
temp=re_power*re_power+im_power*im_power;
*power++ = temp;
totalpower += temp;
re_power = *real++;
im_power = *imag++;
temp=re_power*re_power+im_power*im_power;
*power++ = temp;
totalpower += temp;
}
return (totalpower / Npts);
}
r13=dm(i0,dm_one);              // real [ ] on dm
// Hardware loop
lcntr=r11, do(pc,_L$816004-1)until lce;
// Access to imag[ ] data along pm bus as wanted
r3=pm(i8,pm_one);               // imag[ ] on pm
F8=F13*F13;
F12=F3*F3;
F13=F8+F12;
// Part of second part of the loop
r3=pm(i8,pm_one);               // imag[ ] on pm
F10=F10+F13;
dm(i1,dm_one)=r13;              // power[ ] on dm
r13=dm(i0,dm_one);              // real [ ] on dm
F9=F13*F13;
F14=F3*F3;
F13=F9+F14;
dm(i1,dm_one)=r13;              // power[ ] on dm
F10=F10+F13;
r13=dm(i0,dm_one);              // real [ ] on dm
!end loop
_L$816004:
VisualDSP Compiler generates code
using program memory bus for data
movement, but does not do any
optimizing.
Hand optimizing can reduce these
14 lines generated by the compiler
to just 7 without getting particularly
fancy
Hand optimizing the compiler output
14 cycles reduced to 7
lcntr=r11, do(pc,_L$816004-1)until lce;
// Dual access along dm and pm data busses
r13=dm(i0,dm_one), r3=pm(i8,pm_one);
// pm_zero contains zero to suppress the auto-incrementing mode
F8=F13*F13, r1=dm(i0,dm_one), r4=pm(i8,pm_zero);
// The value in F1 must be passed over to F5 in order to
// prepare for the combined multiplication and addition
// operation
F12=F3*F3, F5 = F1;
// Accessing pm memory is an alternate approach to preparing
// for parallel multiplication and addition operations
// One cycle overhead first time round the loop.
F9=F1*F5, F13=F8+F12, r2=pm(i8,pm_one);
F14=F2*F4, F10=F10+F13, dm(i1,dm_one)=r13;
F13=F9+F14;
F10=F10+F13, dm(i1,dm_one)=r13;
!end loop
_L$816004:
Generate the resource usage chart
Here are the 7 cycles needed during
EVERY calculation of the power
2 cycles / calculation on average after pipelining
IF you ignore DM Memory Operations --
Know the processor characteristics
• Need two extra DM cycles or equivalent
• Reorder the code to give
  – 1 extra DM cycle in parallel with a register to register move
• But R1 = R2 form of operation is a UREG to UREG move and will not fit into the instruction
• So REPLACE UREG to UREG move with a COMPUTE OPERATION
  R1 = PASS R2
• End up with 2.5 cycles/calculation instead of original 7
Final code -- testing a pain
More luck than judgement
Unlike the first “easier”
Temperature Conversion code, this
“hard” example actually optimizes
much more, especially in terms of
overall code length.
This particular length of code
happens to work REGARDLESS of
the size of N