SHARC02SquishDSPTalk

Squish-DSP
Application of a Project Management Tool
to manage
low-level DSP processor resources
M. Smith,
University of Calgary, Canada
smithmr @ ucalgary.ca
Series of Talks and Workshops
CACHE-DSP – Talk on a simple process tool
to identify cache conflicts in DSP code.
SQUISH-DSP – Talk on using a project
management tool to automate identification
of parallel DSP processor instructions.
SHARC Ecology 101 – Workshop showing how
to systematically write parallel 2106X code.
SHARC Ecology 201 – Workshop on
SQUISH-DSP and CACHE-DSP tools.
Squish-DSP Tool
smithmr@ucalgary.ca
2/28
Scope of Talk
Overview of hand optimization of code
Paradigm shift in microprocessor
resource scheduling
Project Management Tool Application
Translating ‘microprocessor’ language
into a ‘business’ format
Examples and limitations
Better optimization from VisualDSP code
Future directions
Standard “C” code
void Convert(float *temperature, int N) {
int count;
for (count = 0; count < N; count++) {
*temperature = (*temperature) * 9 / 5 + 32;
temperature++;
}
}
2106X-style load/store “C” code
void Convert(register float *temperature, register int N) {
register int count;
register float *pt = temperature; // Ireg <- Dreg
register float scratch;
for (count = 0; count < N; count++) {
scratch = *pt;
scratch = scratch * (9.0f / 5.0f); // float constants: (9 / 5) is integer division and gives 1
scratch = scratch + 32; // Order of Ops
*pt = scratch;
pt++;
}
}
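The 'Order of Ops' remark can be checked directly. In C, grouping the scale factor as (9 / 5) with integer literals evaluates to 1 before the multiply ever happens, while the float form gives 1.8. A minimal sketch; the function names scale_int_grouped and scale_float_grouped are hypothetical and exist only for this illustration:

```c
/* Integer literals: (9 / 5) is evaluated first, in integer arithmetic, and is 1. */
float scale_int_grouped(float celsius) {
    return celsius * (9 / 5) + 32;
}

/* Float literals: (9.0f / 5.0f) is approximately 1.8, as intended. */
float scale_float_grouped(float celsius) {
    return celsius * (9.0f / 5.0f) + 32.0f;
}
```

For 100 degrees the first version returns 132 and the second returns a value very close to 212.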
Check on required register use
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, zeroDM);
// Any special requirements here on F2??
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4
// Must be float
constantF4 = 1.8;
scratchF2 = scratchF2 * constantF4;
// Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
#define F0_32 F0
// Must be float
F0_32 = 32.0;
scratchF2 = scratchF2 + F0_32;
// Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
LOOP_END:
dm(pt, plus1DM) = scratchF2;
Resource Chart -- Basic code
Chart columns: ADDER | MULTIPLIER | DM ACCESS | PM ACCESS
_Convert:
pt = INPAR1;
F12_32 = 32.0
// bring constants outside the loop
F4_1_8 = 1.8
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
F2 = dm(pt, ZERODM)
F8 = F2 * F4_1_8
F2 = F8 + F12_32
LOOP_END:
dm(pt, PLUS1DM) = F2
5 magic lines of “C”
Time = 4 + N * 4 + 5 + 5 to do the call
Unroll the loop -- 5 times here
Step  ADDER              MULTIPLIER          DM ACCESS
R1                                           F2 = dm(pt, ZERODM)
M1                       F8 = F2 * F4_1_8
A1    F2 = F8 + F12_32
W1                                           dm(pt, PLUS1DM) = F2
R2                                           F2 = dm(pt, ZERODM)
M2                       F8 = F2 * F4_1_8
A2    F2 = F8 + F12_32
W2                                           dm(pt, PLUS1DM) = F2
R3                                           F2 = dm(pt, ZERODM)
M3                       F8 = F2 * F4_1_8
A3    F2 = F8 + F12_32
W3                                           dm(pt, PLUS1DM) = F2
R4                                           F2 = dm(pt, ZERODM)
M4                       F8 = F2 * F4_1_8
A4    F2 = F8 + F12_32
W4                                           dm(pt, PLUS1DM) = F2
R5                                           F2 = dm(pt, ZERODM)
M5                       F8 = F2 * F4_1_8
A5    F2 = F8 + F12_32
W5                                           dm(pt, PLUS1DM) = F2
Parallelism causes Register/Resource Conflicts
Resources: ADDER, MULTIPLIER, DM ACCESS
Instruction                SRC (Decode)    DEST (Writeback)
F2 = dm(pt, ZERODM)        Mem             F2
F8 = F2 * F4_1_8           F2, F4          F8
F2 = F8 + F12_32           F8, F12         F2
dm(pt, PLUS1DM) = F2       F2              Mem
F2 = dm(pt, ZERODM)        Mem             F2
F8 = F2 * F4_1_8           F2, F4          F8
F2 = F8 + F12_32           F8, F12         F2
dm(pt, PLUS1DM) = F2       F2              Mem
Moving the second iteration's 'F2 =' and 'F8 =' operations up in parallel with the first iteration: NO -- both iterations write F2 and F8.
Unroll the loop a bit more
ADDER              MULTIPLIER          DM ACCESS               Steps
                                       F2 = dm(pt, ZERODM)     R1
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M1, R2
F9 = F8 + F12_32   F8 = F2 * F4_1_8                            A1, M2
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9    W1, A2
                                       dm(pt, PLUS1DM) = F9    W2
                                       F2 = dm(pt, ZERODM)     R3
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M3, R4
F9 = F8 + F12_32   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     A3, M4, R5
F9 = F8 + F12_32   F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9    W3, A4, M5
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9    W4, A5
                                       dm(pt, PLUS1DM) = F9    W5
                                       F2 = dm(pt, ZERODM)     R6
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M6, R7
F9 = F8 + F12_32   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     A6, M7, R8
F9 = F8 + F12_32   F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9    W6, A7, M8
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9    W7, A8
                                       dm(pt, PLUS1DM) = F9    W8
Final code version
_Convert:
Modify(CTOPofSTACK, -1);
dm(FP, -2) = R9;
pt = INPAR1;
F12_32 = 32.0    // bring constants outside the loop
F4_1_8 = 1.8
ADDER              MULTIPLIER          DM ACCESS               Steps
                                       F2 = dm(pt, ZERODM)     R1
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M1, R2
F9 = F8 + F12_32   F8 = F2 * F4_1_8                            A1, M2
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9    W1, A2
                                       dm(pt, PLUS1DM) = F9    W2
LCNTR = (N-2)/3, DO LOOP_END UNTIL LCE;
                                       F2 = dm(pt, ZERODM)     R3
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M3, R4
F9 = F8 + F12_32   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     A3, M4, R5
F9 = F8 + F12_32   F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9    W3, A4, M5
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9    W4, A5
LOOP_END:                              dm(pt, PLUS1DM) = F9    W5
R9 = dm(FP, -2);
5 magic lines of C
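The overlap of read, multiply, add and write steps in this schedule can be mimicked in plain C. The sketch below is only an illustration of the idea, not the 2106X code: the function names convert_simple and convert_pipelined are hypothetical, the variables f2, f8 and f9 merely echo the register names on the slide, and the stages are executed in write/add/multiply/read order inside each pass on the assumption that every stage sees the values produced in the previous pass.

```c
#include <stddef.h>

/* Reference version: each element is read, scaled, offset and written in turn. */
void convert_simple(float *t, size_t n) {
    for (size_t i = 0; i < n; i++) {
        t[i] = t[i] * 1.8f + 32.0f;
    }
}

/* Pipelined version: in one pass the write, add, multiply and read stages
 * work on four different elements. The first three passes fill the
 * pipeline and the last three drain it. */
void convert_pipelined(float *t, size_t n) {
    float f2 = 0.0f;  /* value just read         (R stage) */
    float f8 = 0.0f;  /* value after the multiply (M stage) */
    float f9 = 0.0f;  /* value after the add      (A stage) */

    for (size_t step = 0; step < n + 3; step++) {
        /* Stages run last-to-first so each one consumes the result
         * the previous pass left behind. */
        if (step >= 3) {
            t[step - 3] = f9;            /* W: store element step-3 */
        }
        if (step >= 2 && step < n + 2) {
            f9 = f8 + 32.0f;             /* A: element step-2 */
        }
        if (step >= 1 && step < n + 1) {
            f8 = f2 * 1.8f;              /* M: element step-1 */
        }
        if (step < n) {
            f2 = t[step];                /* R: load element step */
        }
    }
}
```

Both functions should produce the same array contents, up to floating-point rounding; the pipelined form simply makes the fill, steady-state and drain phases visible.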
Real Life is not made up of ‘short loops’
Probably using DSP-intelligent
compiler as a starting point
Longer loops -- more tasks to make
parallel
Many different opportunities for task
ordering
Complicated resource management and
register dependency issues
Need a tool to help get the product
‘out the door’
Business Management Tool
One evening went looking for a ‘tree’
program to manage the scheduling of
microprocessor resources.
In frustration, decided to take the 2106X tasks
and put them into Microsoft Project.
By mistake, found that I had developed a
very useful microprocessor management
tool, especially with the MS Project GUI!
Question -- how to get it to function in a
systematic manner?
MS Project -- 21XXX processor
Requires a paradigm shift
Business project concept -- One
person can’t be doing two tasks in the
same time slot.
This becomes: one data bus can’t be
transferring two data items at the same time.
Handled by identifying the ‘processor
resources’ needed to complete each
‘basic task’.
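The 'one person, one task per time slot' rule maps onto a simple check: give each basic task a set of processor resources and refuse to place two tasks in the same slot if the sets overlap. A minimal sketch under that assumption; the resource flags and the function resources_conflict are hypothetical names for this illustration, not part of Squish-DSP:

```c
/* One bit per processor resource (illustrative selection only). */
enum {
    RES_ALU = 1u << 0,  /* adder */
    RES_MUL = 1u << 1,  /* multiplier */
    RES_DM  = 1u << 2,  /* DM data bus */
    RES_PM  = 1u << 3   /* PM data bus */
};

/* Two tasks cannot share a time slot if they need a common resource. */
int resources_conflict(unsigned task_a, unsigned task_b) {
    return (task_a & task_b) != 0;
}
```

Under this model, two DM accesses conflict, while a DM access and a PM access can share a slot.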
MS Project -- 21XXX processor
Business project concept.
If you delay building a wall (Task A), then you
must delay painting it (Task B) HOWEVER
If you build the wall earlier, you could paint
it earlier, but you don’t have to.
Might make more sense to delay Task B so
that Task C can be done earlier
since doing Task C allows Task D to be completed
in parallel with Task B
so that the whole project is finished earlier.
Simple Example
1) F6 = dm(I4, M4);
10) F1 = F2 * F4, F8 = F8 + F12, F12 = pm(I12, M12);
16) F5 = F3 * F6, F8 = F8 + F12, F12 = pm(I12, M12);
Might be able to move Task 1 in parallel with
any instruction 2 through 15 BUT not in
parallel with 16
If Task 10 moves earlier, so can Task 16, BUT
not before Task 10
In Task 10 ‘F12=….’ can be made parallel with
‘F6=….’, BUT Task 10 ‘F8=….’ can’t!
SquishDSP -- parser
1) F6 = dm(I4, M4);
10) F1 = F2 * F4, F8 = F8 + F12, F12 = pm(I12, M12);
16) F5 = F3 * F6, F8 = F8 + F12, F12 = pm(I12, M12);
Task 16 split into 3 atomic tasks
F12 = pm(I12, M12) -- PMBUS resource, must come
after ‘F12=…’ from Task 10, and after ‘F8=…’ in
current Task
F8 = F8 + F12 -- ALU resource, must come after ‘F8=…’
and ‘F12=…’ from Task 10
F5 = F3 * F6 -- MULTIPLIER resource, must come
after ‘F6=…’ from Task 1
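Once instructions are split into atomic tasks, each with one resource and a list of dependencies, rescheduling can be sketched as a greedy pass that puts every task in the earliest slot where its dependencies are met and its resource is free. This is a stand-in for illustration, not Microsoft Project's leveling algorithm or the Squish-DSP implementation. All names (AtomicTask, level_tasks, two_iterations) are hypothetical, and the 'lag' field encodes an assumption: a task that needs a result must sit in a later cycle (lag 1), while a task that only overwrites a register may share the cycle of the last reader (lag 0).

```c
#include <stddef.h>

enum Resource { RES_ALU, RES_MUL, RES_DM, RES_COUNT };

#define MAX_DEPS   2
#define MAX_CYCLES 64

typedef struct {
    const char *text;    /* instruction fragment */
    enum Resource res;   /* the one resource this atomic task occupies */
    int ndeps;           /* number of dependencies used below */
    int dep[MAX_DEPS];   /* indices of EARLIER tasks in the array */
    int lag[MAX_DEPS];   /* 1: needs the result; 0: same cycle allowed */
    int cycle;           /* filled in by level_tasks() */
} AtomicTask;

/* Places tasks in program order; returns the schedule length in cycles,
 * or -1 if the schedule does not fit in MAX_CYCLES. */
int level_tasks(AtomicTask *t, size_t n) {
    unsigned char busy[MAX_CYCLES][RES_COUNT] = {{0}};
    int length = 0;

    for (size_t i = 0; i < n; i++) {
        int c = 0;
        /* Earliest cycle allowed by the dependencies. */
        for (int d = 0; d < t[i].ndeps; d++) {
            int earliest = t[t[i].dep[d]].cycle + t[i].lag[d];
            if (earliest > c) c = earliest;
        }
        /* Slide later until the resource is free. */
        while (c < MAX_CYCLES && busy[c][t[i].res]) c++;
        if (c == MAX_CYCLES) return -1;
        busy[c][t[i].res] = 1;
        t[i].cycle = c;
        if (c + 1 > length) length = c + 1;
    }
    return length;
}

/* Two iterations of the temperature loop as atomic tasks. */
AtomicTask two_iterations[8] = {
    {"F2 = dm(pt, ZERODM)",  RES_DM,  0, {0, 0}, {0, 0}, 0}, /* R1 */
    {"F8 = F2 * F4_1_8",     RES_MUL, 1, {0, 0}, {1, 0}, 0}, /* M1 needs R1 */
    {"F9 = F8 + F12_32",     RES_ALU, 1, {1, 0}, {1, 0}, 0}, /* A1 needs M1 */
    {"dm(pt, PLUS1DM) = F9", RES_DM,  1, {2, 0}, {1, 0}, 0}, /* W1 needs A1 */
    {"F2 = dm(pt, ZERODM)",  RES_DM,  1, {1, 0}, {0, 0}, 0}, /* R2 not before M1 */
    {"F8 = F2 * F4_1_8",     RES_MUL, 2, {4, 2}, {1, 0}, 0}, /* M2 needs R2, not before A1 */
    {"F9 = F8 + F12_32",     RES_ALU, 2, {5, 3}, {1, 0}, 0}, /* A2 needs M2, not before W1 */
    {"dm(pt, PLUS1DM) = F9", RES_DM,  1, {6, 0}, {1, 0}, 0}  /* W2 needs A2 */
};
```

Calling level_tasks(two_iterations, 8) places the eight tasks in five cycles: R1 / M1, R2 / A1, M2 / W1, A2 / W2. Visiting tasks in program order keeps the sketch short; a real scheduler would also consider reordering, as the wall-and-paint example suggests.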
Preparation for Microsoft Project
.asm code broken up into sub-tasks,
with dependencies within and between
instructions recognized
Reformatted as Microsoft Project
Text file
Rescheduled within Microsoft Project,
either automatically or using GUI
interface
Reformatted as .asm code with
increased parallelism
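The 'reformat as text file' step could look roughly like the sketch below, which writes one atomic task as a tab-separated row. The column order (ID, name, duration, predecessors, resource) and the choice of one day per processor cycle are assumptions made for this illustration; the slides do not give the layout Squish-DSP actually uses, and a matching import map would be needed on the Microsoft Project side. TaskRow and format_task_row are hypothetical names.

```c
#include <stdio.h>

typedef struct {
    int id;                    /* task number */
    const char *name;          /* instruction fragment */
    const char *predecessors;  /* e.g. "2" or "2,5"; "" if none */
    const char *resource;      /* e.g. "ALU", "MULTIPLIER", "DMBUS" */
} TaskRow;

/* Writes "ID<TAB>name<TAB>1d<TAB>predecessors<TAB>resource" into buf.
 * Returns the number of characters snprintf would have written. */
int format_task_row(char *buf, size_t size, const TaskRow *row) {
    return snprintf(buf, size, "%d\t%s\t1d\t%s\t%s",
                    row->id, row->name, row->predecessors, row->resource);
}
```

For a task with ID 3, name "F9 = F8 + F12_32", predecessor "2" and resource "ALU", the row holds those four fields plus the fixed duration "1d", separated by tabs.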
Example GUI screen capture
[Screen capture callouts: instructions broken into ATOMIC TASKS; ATOMIC TASKS showing RESOURCE and DEPENDENCIES; ATOMIC TASKS with RESOURCE CONFLICTS]
Task scheduling after ‘LEVELING’
Initial ‘C’ code
Code from ‘Visual-DSP’
VisualDSP unrolled loop 3 times
Code from SQUISH-DSP
12 VisualDSP cycles squished to 8
Final version of code (loop change)
Final SQUISH
12 VisualDSP cycles squished to 6
Advantages and Limitations
Current version intended to handle the
inner critical loop of algorithm
Not handling ‘Cache’ conflicts
Not optimized for instructions in delay
slots in jumps and conditional jumps
Not optimized for multiple DAG delays
e.g. I4 = …. ; DM(I4, M2) = ; I5 =…
Moving to ‘task profile management’
macros with Primavera PV3 Tool
Conclusion
SquishDSP is a prototype scheduling
tool to identify and reschedule
microprocessor resource operations in
parallel
Already useful in current form for
‘inner DSP loops’
Microsoft Project used for concept
work but Primavera PV3 tool offers
more long term promise
Acknowledgements
Financial support of Natural Sciences and
Engineering Research Council (NSERC) of
Canada and University of Calgary
Financial support from Analog Devices.
Dr. Mike Smith is
ADI University Professor 2001/2002
Future financial support from
Alberta Provincial Government through
Alberta Software Engineering Research
Consortium (ASERC)