paper - Electrical and Computer Engineering

advertisement
Recent Advances in DSP Programmable Processors and
Compilation
C. Gebotys
Department of Electrical and Computer Engineering,
University of Waterloo, Waterloo, Ontario Canada N2L 3G1
cgebotys@optimal.vlsi.uwaterloo.ca
Abstract
Today's DSP designers until recently were faced with
the task of coding increasingly complex applications
using inefficient compilers or tedious hand-generated
assembly code. To make matters worse the products
demanded low cost, extremely high performance and
low power especially in the area of wireless
communications and consumer markets. This paper
examines the architectural features of several
popular DSP processors over the past 10 years,
hilighting code compilation for performance, price,
power. Characteristics of a recently announced DSP
processor will be examined, to show the positive
impact on the code compilation task. The goals of
this paper are to show how DSP processors have
evolved over the past decade and to present some
evidence that current architectures are much
improved. This research is important for developing
a general methodology for code compilation for
embedded DSP programmable processors with
performance, price and power constraints.
1. Introduction
Currently, there is a high demand for digital signal
processsing (DSP) processors with high performance,
low code size, low power dissipation and low energy
consumption in many areas such as the
telecommunications, information technology and
automotive industries. In particular these DSP
processors are targeting markets such as wireless
software configurable terminals or radios, third
generation wireless systems and basestations, speech
coding, speech synthesis, speech recognition, wireless
internet, mutimedia and network and data
communication. Low power consumption is important
for reliability and low cost production as well as
device portability and miniaturization. This push for
higher performance and lower power is combined
with extremely low cost requirements for
implementation of the DSP processors in embedded
applications in the consumer market. This market is
well known for being very cost sensitive and
competitive. These types of applications rely heavily
on DSP processors and every year deal with
applications which are increasing in complexity and
have shorter lifetimes. To make matters worse it is
not a straight forward process to choose a DSP
processor, write code for these new applications , and
get the product out of the door. In fact in the DSP
area, compilers are still not that popular and most
DSP programmers are still coding at the assembly
level. This greatly slows down the process of
mapping new applications onto these new DSP cores
or processor chips. Thus great difficulty lies in time
to market. The problem is that generally unlike
general purpose processors, DSP processors have
compilers which generate such inefficient code that
most designers must resort to hand coded assembly.
There are good reasons why compilers are hard to
develop for some DSP processors. However in this
millenium the use of efficient DSP compilers will be
and is a necessity. And in order to build these
compilers it will be vital to have “good” DSP
processor architectures. The premise of this paper is
that “good” DSP processor architectures have just
recently been achieved and efficient DSP compiler
tools will be prevalent. In fact it’s widely believed
that excellent VLSI technology has produced
excellent hardware and architectures, however the
next design evolution will occur at the software
level[29]. In effect allowing one hardware platform
support a number of different applications based upon
software and it’s basis/ease of design. Good software
which can be efficiently compiled and can be
efficiently implemented for applications is crucial for
embedded systems.
Previous compilers for general purpose processors
have focused on performance[30]. The evolution of
these processors moved from CISC architectures to
the RISC processors[28]. This evolution was mainly
driven by the compilers which could not efficiently
make use of many complex instructions of the CISC
architecture. RISC approaches provided simpler
instructions the compiler had an easier job and this
left more room on chip for registers. Overall
performance was easier to achieve with these
architectures[28]. The compilers for the RISC general
purpose processors typically generate instructions,
and then allocate registers.
Most of the previous work on reducing power and
energy consumption in processors has focused on
hardware solutions to the problem[27]. However,
power is so important even in the general purpose
processor area that techniques which save only a few
percentage points of power dissipation are viewed as
important[25]. It has been stated that with every new
generation of Intel processors the power problem has
increased 3 times even though the VLSI process used
has a 6 times improvement in power dissipation[25].
Embedded systems designers frequently have no
control over the hardware aspects of the pre-designed
processor cores with which they work and so,
software-based power and/or energy minimization
techniques play a useful role in meeting design
constraints.
Price is an important driving criteria for DSP
processors more so than general purpose processors.
In fact for many embedded systems applications the
size of memory, number of functional units in
datapath etc can be optimized to reduce cost for a
particular application. This support to customize a
processor core is evident in industry today[26].
However price is also reflected in code size (which
determines memory size or if external memory chips
or modules would be required) for typical DSP
applications. This is an important criteria, code size,
which also impacts power as well as price.
2. General Features DSP Processors
This section will generally highlight some major
differences of DSP and general purpose processors
(GPPs). These points arise from DSP applications
themselves together with the tight performance, price
and power constraints.
In general DSP processors in the past implement
applications which involves filters where data is
multiplied by fixed coefficients, and accumulated.
This recurrent theme of many digital filters lead to the
datapath heavily designed with multipliers and
adders. Furthermore due to tight price constraints 16
bit I/O data was typically sufficient range for voice
and other types of DSP data unlike GPPs. Data
processing proceeded with techniques which used
fixed-point data representation (often fractional data),
data scaling, and data rounding throughout the
application. This produced sufficient accuracy
without excessive cost of using 64 bit data or floating
point representation. Addressing was also quite
different from GPPs. DSP applications typically deal
with array data structures, circular buffers, etc so
these types of addressing including more exotic types
such as bit-reversed addressing (which was ideal for
efficient implementation of fast fourier transforms)
were directly supported. Some relevant functional
features of DSP processors (in addition to those in
[12,13]) which differentiate themselves from GPPs
are generally outlined below:


The DSP processor evolution will be discussed in this
paper with particular emphasis on code compilation.
The DSP processors used to illustrate the evolution
are the TI’s TMS320C2x processor[6], the TI’s
TMS320C6x processor[3], and finally to the most
recently announced Star*core 140 processor core
jointly developed by Motorola and Lucent[7]. The
paper will first give a general description of DSP
features, outlining some differences with general
purpose processors. Only fixed point DSP processors
will be discussed and not the more costly floating
point DSP processors. Next a comparison of these
three processors will be performed in general and
later proceed to detail problems and improvements
made with each architecture with respect to
compilation. Conclusions regarding DSP processor
architecture and compilation techniques for this
millenium will be provided.





Word sizes vary, 8 bit radar/microcontrollers,
16/32 bit vocoders and general DSP, etc.
Addressing memory to access arrays of data, or
circular buffers, or accessing data in a fixed
pattern such as bit reversed addressing.
Multiplying data by constant coefficient and
accumulating the results (typical filter structure).
Rounding of data
Finding the minimum or maximum of the
absolute value of two words
find leading number of 0’s or 1’s in the word (or
left most non-redundant sign bit), and use this to
determine how many bits to shift to increase
dynamic range without overflowing. This is often
called exponent or normalization and can be
applied to an array of words (block exponent
calculation).
Saturated arithmetic as opposed to arithmetic
with overflow. In saturation mode, if the
numerical result overflows/underflows just set
the result to the maximum/minimum number
which fits in the word size.
In general price pressures of DSP applications forces
compiler to be efficient with respect to price (code
size), performance, and power. Code size should be
minimized so that applications where possible can fit
solely onto the chip and external memory does not
have to be used. This impacts cost significantly in two
ways, cost of additional memory chip and power
dissipation. The increase in power dissipation
increases the cost of packaging especially when
power exceeds 30W [25]. Meeting performance
constraints are difficult since applications are
continually increasing in complexity with higher
throughput and lower latency requirements. This
impacts the compiler since full utilization of the
datapath (or functional units in the datapath) or
maximum parallelism is crucial to achieving this
bandwidth. Performance must be efficiently met. If a
larger than necessary DSP processor is selected for an
application, excess power will be unnecessarily
dissipated leading to higher costs (from packaging,
chip etc). Typically power, price and performance are
three important criterions that impact code
compilation and DSP architecture design. In this
paper we will concentrate on code compilation and
how DSP architectures impact this compilation
process with respect to these three criteria,
specifically price, performance, and power.
2.1 Introduction to 3 DSP Processors
This section will introduce the three DSP processors
of interest. General features, philosophies, and
architecture contents will be described. The
architecture will be described with datapath, register
files, and memory access units. Delay slots and other
details will be discussed in section 3.
Table 1 provides a general overview of the three
processors including their approximate year of
introduction (Rel year or release year). The
architecture type (Arch type) is provided, and a brief
look at functional units (Func units) including address
generation units (Agu) and units which perform
operations on data (Fus). Two of the C6x Fu’s also
provided Agu functions. The SC140 had additional
registers (++) for example index registers for indexed
addressing. The C6x architecture is more of a risc
type architecture, meeting 5/7 of risc characteristics
where as the SC140 meets 6/7 of cisc characteristics
both outlined in [19] for GPPs.
Rel year
Arch type
Func units
C2x
1987[12]
Single
issue, cisc
mac, alu,
1Agu
C6x
1997[3]
VLIW, risc
SC140
1999[7]
Mult issue,
some cisc
features
4 homog.
Fu, 2 Agu
2 sets of 4
non-homog
Fus
Execution
Single
Up to 8 Up to 8
set length
word
words
words
Registers
specialized 32 homog. 16 data, 16
addr/data
address,++
Table 1. General Features of 3 DSP processors.
Table 2 provides a more detailed comparison of these
three processors. In this table we try to compare the
operations per cycle provided by each processor
architecture. Although the number of (#) Fus appears
to be fewer in row 2 of table 2 for SC140 it turns out
to have higher functionality when we compare
operations per cycle. This is because the SC140
architecture has vertical as well as horizontal
parallelism, whereas the C6x only has horizontal
parallelism. As an example consider performing the
following operation: d0 = round( d1 * d2 + d0) ,
where d0,d1,d2 are data registers. More than two
instructions and cycles are required in the C6x ( a
multiply, since mac was not a supported instruction,
an addition, and a number of instructions to perform
rounding) whereas in the SC140 only one instruction
and one cycle is required using instruction: macr
d1,d2,d0 . This instruction performs a mutliplyaccumulate (mac) followed by rounding the data
result. Each Fu is composed of three interconnected
subunits (where each Fu performs up to three
sequential operations, one on each subunit). The
subunits are a multiplier-accumulator unit with
saturation capability and a bit field unit with a barrel
shifter. Note that typically rounding and scaling
methods are typically set by control registers.
Performing a number of sequential operations with
one instruction was also employed by an in-house
core developed by Mitel[24] at the time before the
C6x was announced. Essentially to balance the
latency discrepancy between memory and logic gates
on the chip, more operations could afford to be
performed while memory was being accessed thus
balancing clock periods. Furthermore this provides
better efficiency with code size as well, drastically
reducing number of instructions. However this CISC
approach may complicate the compilers task. This
will be discussed later in this paper.
#fus, #agus
C2x
C6x
SC140
1,1
6,2
4,2
#instr/cyc
1
8
6
#opns/instr 2
1,1
3,1
#opns/cyc
2
8
14
#RF,#r/RF 0,0
2,16
1,16
#units/RF
0
4,5
6
#ma/agu
1
1
1-4
Iword size
16
32
16
rword size
16
32(&16,40) 40
Table 2. Detailed Features of 3 processors.
The C2x processor was an architecture heavily
optimized for small code size, yet simple execution
where most instructions are single cycle. It provides a
number of parallel instructions including an addition
and multiplication (which was not a multiplyaccumulate but accumulated the previous product and
in parallel performed the next multiplication). With
VLSI technology at that time a faster clock period
was attained rather than implementing mac’s which
accumulate the just computed product. No data
register files were used, just specialized registers, 32
–bit accumulator, 32-bit product register and 16-bit
input to multiplier, t, register. The architecture (which
had 16-bit program and data busses) provided
separate instructions to store the upper or lower half
of the a or p registers back to memory. Since no
address fields were required for registers and address
register pointers were used to address memory
operands, this architecture was heavily optimized for
small code size. For example some instructions were
purely opcode (apac).
The C6x processor was a move towards risc,
homogeneous type of architecture compared to the
C2x, with the philosophy that simpler instructions
would be used and the compiler could then
concentrate on attaining sufficient parallelism for
high performance. Although in the GPP area, VLIW
architectures never did fly, it was believed that VLSI
technology was now ready to build these architectures
and DSP provided necessary algorithmic parallelism
inherent in most filtering type applications. It was
designed to achieve higher performance through it’s
VLIW architecture, relying heavily on compiler
support. However register files built to support the
large number of functional units had to be split into
two banks of 32-bit registers (due to speed
considerations). To further minimize interconnect
complexity between these two register files only one
32-bit data value per cycle could cross over from one
register file to the units attached to the other register
file. Long words of 40-bits were supported by
utilizing two adjacent 32-bit registers. The units in the
datapath one each side consisted of one multiplier,
one
Agu
which
also
could
perform
additions/subtractions on data registers, one alu with
shift, branch, and bit manipulation capabilities , and
finally one more alu with compare and saturation
capabilities. Some cisc instructions were supported
including two instructions, add2/sub2 (also called
subword instructions[23]), provided capability to
perform additions or subtractions on 16-bit words
within each 32-bit register. Register files were
homogeneous, so a register could serve as an address
or data or index register.
The SC140 architecture was designed with a blend of
risc and cisc empolyed only when it directly supports
DSP programmers (for example during mac’s,
macr’s, etc). A move was made towards
homogeneous Fus (each with same capabilities) in
order to support full utilization or parallelization
within the application. For example, four mac’s could
now be executed in parallel. The memory bandwidth
was increased with this architecture, in effect
balancing the datapath (which consisted of 4 alus).
Each Agu can access 4 aligned words from memory
using one move.4f instruction. It also supported
hardware
loop
execution.
Registers
were
heterogeneous, separating address, index and data
registers, yet allowing one single data register file
acessible by all Fus. Each register was 40 bits, thus
directly supporting accumulations and multiplications
with good accuracy.
3. Compilation Difficulties and Analysis
This section will outline what specific features of
each DSP processor made compilation difficult. To
trace the evolution, we’ll show how past problems
were avoided in the next generation of architectures
and highlight the new problems introduced with each
architectural feature. Difficulty in compilation is
reflected in difficulty in attaining performance, code
size, power and energy objectives. Discussion of
difficulties compiler had with utilizing functional
units, attaining parallelism, attaining code size
optimization, and other DSP issues is presented along
with empirical data and examples.
3.1 C2x Processor
The difficulties with code compilation of the C2x
processor clearly lay in the instruction set architecture
(isa) itself. The isa was extremely heterogeneous but
more importantly made the instruction selection
compiler task tightly interdependent upon the register
allocation compiler task. These two steps of
compilation traditionally are always performed
separately in a GPP code compiler[19]. Additional
problems with code compilation stemmed from tight
coupling of parallelism with instruction selection as
well, although in many cases parallelism can be
achieved through combining instructions if data has
been allocated efficiently (which was often not the
case). A second obvious problem with compilers
designed for this processor was addressing code.
Since there were no register files, memory addressing
was performed quite often even to temporarily store
and access data. Address registers could be
incremented or decremented by 1, however any other
offset was penalized by requiring extra instruction to
perform the computation or load the address register
with a new value.
In summary the compiler task was made difficult by
resulting:
 Instruction selection tightly coupled with register
allocation Address register allocation .
 Address register allocation or addressing code
generation: relied on these memory accesses
heavily because very few registers in datapth and
those which existed were for specific purposes
(ie. accumulator, input to multiplier, etc).
To further understand the first problem consider a
multiplication operation whose output is added to
another value. This type of computation involves two
operations yet there are four possiblities for this
mapping shown in figure 1. Choice of how to
implement this functionality determines what
instructions will be generated in addition to where
data is stored. For example the multiplication result
can be kept in register ‘p’ (output of multiplier), or
transferred to register ‘a’ (using additional ‘pac’
instruction), or stored back into memory (‘pac’ and
‘sacl m’) and later restored to register ‘a’ (‘lac m’
instruction), or stored in memory (‘pac’ and ‘sacl m’
and accessed later as memory operand (‘add m’).
Researchers have tried to tackle this code generation
problem using a number of techniques[24,5,14] but it
is still believed to be NP-complete problem[18].
Nevertheless a code generation technique in [20]
found that up to 2 times improvement in performance
could be attained due to propre instruction selection
alone (independent of addressing code generation)
compared to the ‘C’ compiler results. In fact this
instruction set architecture still remains a challenge
for code generation.
Figure 1: Example of instruction selection for C2x.
Efficient address generation for C2x was crucial to
efficient compilation as well. This was challenging
because only increment, decrement addressing was
efficiently supported. Other types of addressing
where a value greater than ‘1’ was needed to
increment the address register had a penalty on code
size and performance (apart from single index register
whose contents could be used as offset at no cost,
except loading of it’s value).
Statistics were collected using ‘C’ programs
describing the discrete cosine transform, fast fourier
transform, elliptical wave filter, and other typical
DSP filters compiled for the C2x. Using TI’s ‘C’
compiler the set of filter benchmarks showed that on
average 40% of all instructions were strictly
addressing computations. Researchers studied this
problem from two perspectives, one was determining
how best to store data in memory so as to ease the
address generation[5]. The other researchers
discovered a polynomial time solution for
determining optimal basic block addressing for a
given data layout [2]. This solution utilized network
flow theory to solve the address generation problem
and reduced the amount of addressing code in
applications up to 6 times over compiler generated
code. However optimal results for index-addressing,
loops, etc remains NP-complete.
3.2 C6x Processor
The previous problems of C2x , discussed in the
preceding section were removed in the C6x
architecture, specifically improvements were:
 removed address generation costs by supporting
in instruction word a 5 bit offset for post/preincre/decrementing addressing.
 Simpler risc-like instructions were supported
along with register files to largely remove the
tight coupling between instruction selection and
register allocation. More registers (register files)
and homogeneous (not specailized) registers.
Every could act as a data or address register.
Functional
Units
.L1
.S1
Register
File A
Figure 3. Penalty of dual register files in C62xx.
.M1
.D1
Memory
.D2
.M2
Register
File B
.S2
.L2
Figure 2. C6x Datapath Architecture.
Yet increased parallelism made possible by the
VLIW implementation made code compilation a
challenging job. In addition to other problems with
code compilation arising from the architecture
outlined below. New compiler problems or Compiler
task made difficult by resulting:
 VLIW: Scheduling operations for maximum
parallelism
 Dual register files
 Heterogeneous Fus: Different operations
supported by each functional unit.
 Complex delay slots from deep pipeline.
 Code size for VLIW execution.
The partitioning of register files into two sides (A and
B) made code generation difficult. This partitioning
allowed only one value from a register file on one
side to be accessed per clock cycle by functional units
on the opposite side. This placed constraints on both
code size and performance. One way to remove this
1-cycle-1-value limitation was to utilize move
instructions, ‘mv’, (if a Fu was available for
implementing this instruction) which transferred the
data from one side of the register file to a register on
the other register file side. This often creates a
disadvantage in code size as well as performance.
Consider figure 2 below where in (a) three operations
can be executed in parallel, one multiply, and two
subtracts. In (b) after register allocation is performed,
because both subtractions are accessing a different
register from opposite sides, they must be scheduled
on different cycles since there is only one bus.
Delay slots are more complex in the C6x processor.
Most instructions have zero delay slots except for the
multiply, load, and branch instructions which had
delay slots of 1,4, and 5 cycles respectively. Properly
utilizing these delay slots (not just filling them with
nops) had a significant impact on performance. For
example in figure 2 , the partial code (generated from
‘C’ compiler) on the left hand side consists of 7
instructions and executes in 10 cycles.
By
rescheduling this code on the right hand side the
performance can be improved. This code on the right
hand side requires only 6 cycles (and has 6
instructions). However the code is more complex to
analyze since the value of register A3 used in lines 3
and 4 is the old value and in line 6 register A3 holds
the new value which was successfully loaded.
Figure 4. Complexity of delay slots in C62xx.
One disadvantage to heterogeneous functional units is
that full utilization is difficult. For example if our
DSP code can execute 4 multiplies in parallel, only 2
can be scheduled in parallel, since there are only two
multiplies in the VLIW datapath and the other 6
functional units will be idle. General instruction
scheduling is now heavily constrained by feasible sets
of operations due to heterogeneous Fus.
Code size is also an important feature to examine. In
the C6x architecture, the last bit or bit ‘p’ of each
instruction is ‘1’ if that instruction is executed in
parallel with the previous instruction during cycle ‘i’
else if the ‘p’ bit is ‘0’ this instruction starts on cycle
‘i+1’. Instructions executing in parallel are called an
execution set (or packet). Eight aligned words (or 8
instructions) at a time are fetched from memory.
However an execution set cannot cross an 8-word
boundary. (The opcode determines which functional
unit will execute which instruction.) For example 3
execution sets of 4, 6, and 6 non-nop instructions
respectively will require 3 8-word aligned fetch
accesses. In this example only 16 non-nop
instructions are used, yet 24 instructions are accessed
(over three cycles, not two cycles) from memory
because the execution sets cannot cross an 8-word
boundary.
parallel in one cycle from memory (‘move.4f’).
Execution sets were allowed to cross fetch
boundaries.
Unified Program/Data Memory
PD
Multiplication of coefficents, which are heavily
utilized in most DSP filters, can be handled by a 5 bit
immediate in the ‘mpy’ insruction, however 5 bits is
rarely practical. Alternatively coefficients have to be
moved from memory into registers first (using ‘mvk’
instruction) and then it can be used in a multiply
instruction. One disadvantage here is that a register is
required for the coefficient, possibly causing a
register spill to accommodate the coefficient.
PA
A1
3.3 StarCore Processor
The sc140 DSP processor core, the most recently
announced architecture removed several problems of
the previous C6x architecture, specifically:
 No crossover busses were used: a single data
register file accessible by all Fus.
 No heterogeneous Fus: all Fus had same
functionality and capabilities.
 Code size: Execution sets could now cross fetch
boundaries
The architecture is illustrated in figure 5. The 4 Alus
are multiplier-accumulator units with saturation
capability and each has a bit field unit as well for
shifting/rounding. Connections between the datapath
and memory are 64 bits wide. For example this allows
each of 2 Agus to access 4-16bit fractional words in
D1
64
128
32
32
D2
64
32
128
Address
Generator
Register File
Program
Sequencer
2 AGU
128
Statistics from studying over 35 DSP filters compiled
from ‘C’ codes using TI’s compiler provided the
following results. The M units were used 3% per
cycle, L and S units ranging from 2 to 7% utilization
and the D2 unit was heavily used at 29% per cycle
compared to the D1 unit at 3%. The crossover busses,
transporting data from one register file to the next,
were used 5% of the time per cycle on average
(heavier usage than some Fus). The compiler was
quite conservative and delay slots were often filled
with ‘nop’s in the compiled output. A local
improvement scheduler (based on extensions to
scheduling mode in [11]) was able to improve
performance of the C6x compiled code by 50%
strictly by rescheduling instructions and not renaming
registers, indicating that the compiler not extracting
close to optimal parallelism. In summary parallelism
was restricted by heterogeneous Fus, dual register
files and delay slots.
A2
Data ALU
Register File
BMU
4 ALU
Instruction Bus
Figure 5. SC140 Architecture .
Utilizing the SC140 architecture, the three tasks,
instruction selection, scheduling, and register
allocation, were more independent, thus easing the
compilers job. Furthermore most data computation
instructions were single cycle, required no delay slots,
and could be performed by any of the four Fus in the
datapath. Thus parallelism was independent of the
type of instructions since each Fu could execute the
same set of instructions. The following issues
remained as a challenge for the Star*core compiler:
 Prefix Grouping Overhead: required only when
serial grouping could not be utilized due to :
 Upper bank of Register File: An extra word
is required in execution set if register(s)
from the upper bank of the register file were
used.
 Instruction word extensions: For example
immediate addressing required an extra word
 High memory bandwidth available: aligned
multiword memory accesses, up to 4 words per
Agu per cycle.
Two schemes, serial grouping and prefix grouping,
were used for defining execution sets in the SC140.
The first scheme is the serial scheme (similar to ‘p’
bit used in C6x, except fetch boundaries could be
crossed). As shown in figure 6 two bits in the opcode
were used to identify which instructions were to be
executed in parallel with the previous instruction.
However to further optimize on code size this mode
was supported only if lower bank registers were used
(where each would require only a 3 bit address). In
the case where one or more registers in the execution
set were from the upper bank of registers a prefix
instruction grouping was used where two extra 16-bit
words are required, one to identify upper bank
registers, the other to denote prefix grouping
information.
Instruction 1
Serial Grouping
00
Prefix word
Prefix Grouping
Instruction 1
Instruction 2
00
Instruction 2
Instruction 3
01
hidden by utilizing higher memory bandwidth
(loading in parallel more than one word at a time,
move.2f or move.4f) or by utilizing several accesses
of the same coefficient creating an overall savings in
code size. An optimization approach is described in
[31] which performs this code size reduction in
polynomial time.
Instruction 3
011
Figure 6. Instruction grouping in SC140.
Immediate addressing was supported in SC140 thus
allowing multiplication of coefficients in filters
without usage of a register. However since
instructions were 16-bit (not 32-bits like in C6x), an
extra word (instruction word extension) along with
the additional prefix word was required in the
execution set. The compiler must be careful with it’s
decision to use immediate or registered addressing.
For example, since the execution set is fixed at 8
words, a 2 word prefix plus 4 words (providing one
instruction per Alu) plus 2 words for Agus allows
100% utilization of the datapath (or full parallelism).
However if a ‘macr’ instruction uses immediate
addressing, an extra word will be required, thus one
unit out of 6 must remain idle. Furthermore if the
compiler is not efficient at using the lower register
bank and registers from the upper register bank are
also used, a 2nd unit will be forced to be idle because
of the maximum execution size of 8 words.
Figure 7 illustrates the impact of immediate
addressing on code size and indirectly on
performance. On the left hand side, the compiler
generated code is shown (instructions executed in
parallel are illustrated inbetween square brackets [])
which has 7 instructions, requiring a code size of 12
words (prefix grouping used in 2nd and 3rd cycles).
Immediate addressing is used, for example #-6554
represents an immediate value of -6554, requiring an
instruction word extension along with prefix
grouping. At most two instruction word extensions
are allowed per execution set. By choosing when to
use immediate versus registered addressing one can
often improve code size. In figure 7 the code example
shown in the middle has a savings of three instruction
word extensions and two prefix words thus reducing
code size to 8 words. This savings is possible since
serial grouping can now be used in place of prefix
grouping. In some cases this modification can even
improve performance. Examine the code shown on
right hand side of figure 7, where rescheduling has
occurred to save one cycle. Note that the actual
loading of the coefficients into the registers are not
shown in the figure. The cost of this load could be
Figure 7. Code size improvement in SC140.
Memory bandwidth must also be optimized to take
advantage of the SC140 architecture. Since each Agu
can access a maximum of 4 aligned words from
memory, careful memory layout should be performed.
Furthermore since 4 Fus are also available in the
datapath, techniques such as loop unrolling in many
cases are simple yet ideal methods for utilizing this
memory bandwidth. Consider figure 8 which
illustrates code generated in the body of the loop. On
the leftt hand side the code represents a biquad filter.
By unrolling the loop, one can change single moves,
‘move.f’, to multiple moves, ‘move.2f’, thus utilizing
the higher memory bandwidth available in this DSP
architecture. The right hand side of figure 8 illustrates
this idea by unrolling the biquad loop once. In
practise one can unroll four times, to fully utilize the
memory bandwidth. However in general loop
unrolling increases the code size, thus careful
memory layout should also be investigated to utilize
the available memory bandwidth.
The C6x by comparison improved upon the C2x
problems however suffered from dual register file
interconnection structure and heterogeneous Fus
hindering compiler performance. Empirical evidence
here showed that very low utilization of functional
units, ranging on average from 2% to 6% utilization
per cycle, and under utilized delay slots. Claims were
supported by again empirical results which showed
up to 61% improvement in performance could be
attained through utilizing complex local rescheduling
through delay slots. However even with this increased
parallelism, functional unit utilization was low.
Figure 8. Unrolling loops in sc140 for max
bandwidth.
Other features of this architecture which greatly aid
compilation are the single cycle execution of most
instructions, limited delay slots (generally zero delay
slots for all instructions except control type of
instructions) and homogeneous Fus. These
characteristics greatly aid loop pipelining and
instruction selection, both of which the compiler
performs very well.
4. Conclusions
In summary, the evolution of DSP processors has
been discussed starting with the C2x , through the
C6x and finally to the sc140. Identification of what
architectural features of the DSP processor made
compilation difficult were presented. Each claim was
backed up through empirical results or specific
examples.
In general the C2x suffered from a processor
architecture whose specialized registers created
compiler difficulties. In effect this architecture made
instruction selection and register allocation tasks
highly interdependent. These two steps generally are
done separately in compiler technology leading to
inefficiencies in code generation for the C2x.
Empirical results supported this claim, showing that
compiler-generated code could be further optimized
to provide 2Xs improvement in performance (with no
addressing code). Furthermore the C2x compiler was
not sophisticated enough to generate efficient
addressing code. Empirical results supported this 2 nd
claim indicating that on average 40% of the compilergenerated code size was solely composed of
addressing code and through use of a post-compiler
optimization technique reductions in addressing code
alone of up to 6Xs was possible.
It has been argued that many of these past problems
with architectures which make compilation difficult
have in general been eliminated with the sc140
design. Specifically the SC140 architecture supports a
single data register file, mostly single cycle execution
(limited instructions with delay slots), homogeneous
Fus, compacter code size (16 bit instructions &
execution sets which cross fetch boundaries) and
higher memory bandwidth. Those challenges
remaining for the SC140, generally impact code size
as opposed to performance which is preferrable.
Power dissipation is generally related to efficient
instruction usage, and good hardware architecture
design practices.
Other features of this new
generation of DSP processors which remain to be
utilized by compilers are special DSP addressing
modes (modular addressing ), hardware loop
execution, etc.
In conclusion, architectural features which ease
compiler design include: homogeneous Fus, single
register files, limited delay slots, and cisc only where
it makes sense for DSP programming. Current
challenges for compilers with these newer
architectures include utilization of specialized
addressing, hardware loop execution, and utilization
of available high memory bandwidth. It is clear that
future DSP architectures will be easier to compile to,
which should proliferate the usage of DSP
programming in ‘C/C++’ , in effect greatly
eliminating the DSP assembly programmers and
drastically decreasing the time to market for future
products/markets.
References
[1] M. Lee, V. Tiwari, S. Malik and M. Fujita. “Power
Analysis and Minimization Techniques for Embedded
DSP Software”. IEEE Trans. on VLSI Design, Vol.5,
No.1, March 1997, p123-135.
[2] C.Gebotys, “a minimum cost circulation approach to
DSP address code generation “, IEEE Trans on CAD,
Vol. 18, No.6, June 1999, pp 726-741.
[3] Texas Instruments. TMS320C62xx CPU and Instr. Set
Ref. Guide,. TI Inc., 1997.
[4] P.Marwedel, G.Goossens, Eds., Code Generation for
Embedded Processors, Norwell, MA, Kluwer 1995.
[5] S.Liao, S.Devadas, K.Keutzer, S.Tjiang, A.Wang
“storage assignment to decrease code size, ACM
SIGPLAN conf. Programming lang. Des. And impl.
PLDI 1995.
[6] Texas Instruments. TMS320C2x User’s Guide , TI
Inc., 1993.
[7] Motorola and Lucent, Star*core 140 Specifications,
Rev. 0.63, September 1999.
[8] Catherine H. Gebotys and Robert J. Gebotys. “An
Empirical Comparison of Algorithmic, Instruction and
Architectural Power Prediction Models for High
Performance Embedded DSP Processors”. ISLPED,
August 1998, p121-123.
[9] C.Gebotys,
R.Gebotys,
S.Wiratunga
“Power
Minimization derived from Architectural-usage of
VLIW processors”, Proceedings of Design
Automation Conference, June 2000.
[10] C.Gebotys, R.Gebotys, ”Statistically-based prediction
of power dissip. For complex Emb. Dsp Processors”,
Microprocessors & Microsystems Journal, 1999.
[11] C.Gebotys,
“Throughput-optimized architectural
synthesis”, IEEE Trans. VLSI, Sept. 1993.
[12] E.Lee, “Programmable DSP Architectures: Part I”,
IEEE ASSP Magazine, October 1988, p4-19.
[13] E.Lee, “Programmable DSP Architectures: Part II”,
IEEE ASSP Magazine, January 1989, p4-14.
[14] G.Araujo,S.Malik, M.Lee “Using register-transfer
paths in code generation for heterogeneous memoryregister architectures” DAC, 1996.
[15] R.Leupers, P. Marwedel “time-constrained code
compaction for DSPs”, ISSS 1995.
[16] S.Liao, S.Devadas, K.Keutzer,S.Tjiang, A.Wang
“code optimization techniques for embedded dsp
microprocessors”
[17] W.Lin, C.Lee, P.Chow “an optimizing compiler for
the TMS320C25 DSP chip”, ICSPAT Oct 1994, p I689-694.
[18] Garey and Johnson, Computers and Intractability,
New York- Freeman, 1979.
[19] A. Appel, Modern Compiler Implementation in C,
Cambridge University Press, 1998.
[20] C.Gebotys, “an efficient model for dsp code
generation: performance, code size, estimated energy”,
Int’l Symp on Sys Synthesis, 1997.
[21] C.Gebotys, R. Gebotys “complexities in DSP software
compilation: performance, code size, power,
retargetability” , HICSS, 1998.
[22] C. Gebotys, R. Gebotys, “Statistically-based
prediction of power dissipation for complex embedded
DSP processors”, Microprocessors and Microsystems,
Elsevier, 1999.
[23] Adve, et al. “changing interaction of compiler and
architecture”, IEEE Computer, p51-58, December
1997.
[24] Discussions about ‘Midas’ in-house processor core
with Alex Tulai, Mitel Corp, Ottawa Ont Canada,
1997.
[25] D.Singh, Intel, Manager of Microprocessors,
Personnel Communication, 1997.
[26] Telesica, www.telesica.com, DATE 2000.
[27] A.Chandrakasan, R.Broderson, Low Power Digital
CMOS design, Kluwer Aca. Pub, Dordrecht, 1995
[28] Hennessy and Patterson, Computer Architecture: a
quantitative approach, Morgan and Kauffman, 1990.
[29] I.Bolsens, Mod. Plenary: Keynote Session:
Connected, Smart Devices – Computing beyond the
desktop, DATE2000, Sigda, 2000.
[30] Y-T.S.Li, S.Malik, “Performance analysis of
embedded software using implicit path enumeration:
IEEE Trans. CAD 16(2),1997,p1477-1487.
[31] C.Gebotys “an optimized approach to immediate
versus registered addressing in sc140”, Techn. Rept.,
Dept E&CE, University of Waterloo, 2000.
Download