Using Advanced RISC Machines (ARM) Ltd’s RISC Microprocessor Architecture for Cost-Effective Embedded Signal
Processing Performance
Dave Walsh, Advanced RISC Machines Ltd (ARM), Cambridge, UK.
Introduction
ARM has rapidly built a global reputation in the embedded microprocessor industry by delivering high performance 32-bit RISC designs with the cost effectiveness, power efficiency, robust development environments and ease of use that have allowed them to replace 8- and 16-bit microprocessors in high volume embedded designs. ARM’s products are sourced through a unique partnership between ARM and the world’s most powerful semiconductor companies - this kind of backup offers OEMs the modern, flexible choice of custom ASIC and ASSP implementations their business environment demands.

ARM operates at the convergence of the communication, computer and consumer markets. Signal processing is a key technology in such markets. This paper describes the effective way ARM’s RISC architecture has been used to implement signal processing systems in volume consumer products. It examines the technical details that make the ARM architecture a suitable target for DSP software, and uses real-world examples to illustrate the benefits of this approach.

Delivering DSP Performance
Typical DSP systems contain both microcontroller and DSP functionality. Several alternatives are available for developers who wish to implement systems containing both control and DSP processing: use a combined microcontroller and DSP, use a DSP alone or use a microcontroller alone. The alternatives must each be measured against the criteria by which future solutions will be judged, namely lower cost, higher performance and shorter time to market.

Microprocessor plus DSP
The most widely used solution in products today links a control processing engine with a traditional DSP processor. Several such products are already available today from ARM partners, for example modems and telecommunication chipsets.

This approach is common, and using an ARM microprocessor provides significant benefits in terms of power efficiency and total system cost. When considered in absolute terms, however, it has the disadvantages of requiring multiple memory systems, combined with inefficient interprocessor communication and duplicated processing functionality. This results in higher costs via
• high total component count / silicon area.
• complex design.
• complex software development process.
These problems will remain if a two processor solution is used in the future.

DSP Alone
An alternative approach is to use the DSP to undertake all tasks, including those currently done on the control processor, as it is already doing most of the difficult processing work in the system. Although this solves some problems (mainly those related to developing systems with multiple processors rather than system cost or overall performance) the software development requirements placed upon both control processing and DSP architectures vary significantly. This has significant implications for a DSP-only approach:
• DSP architecture evolution to date has used DSP performance as the major driving factor, not cost or a total system requirement.
• Fixed point DSPs are very poorly supported by high level language compilers.
• DSPs have poor code density for control code.
• DSPs do not efficiently support decision making or low level bit manipulation.

Although some DSPs with enhanced microcontroller functionality are now being announced, these are largely the result of adding enhanced bit manipulation capabilities to a traditional DSP ‘multiply-accumulate’ architecture to allow them to address tasks such as convolutional encoding. A significant step has yet to be taken in terms of full development environment and high level language compiler support before DSPs will offer an alternative to a microprocessor.

In today’s systems there is proportionally much more control code than DSP code (sometimes as much as 30:1), making DSPs a poor target for a single processor solution. This will be further exacerbated when, for example, PDA-type value-added applications become the norm in telephone and pager systems - an established architecture with good high level language and operating system support will be needed.

Additionally, more sophisticated DSP algorithms such as text-to-speech and voice recognition make heavy use of RISC style instructions, and only use a small amount of traditional DSP processing capability. This further strengthens the case for building future systems on an established, robust, predominantly microprocessor-based architecture.

Using a DSP alone for a total system control and DSP processing solution increases memory costs due to poor code density, lengthens time-to-market because of poor high level language support and increases the product lifecycle costs incurred by low level language program maintenance.

Microprocessor Alone
Of the two types of processor (control and DSP) used in systems today, the control processor seems the more likely to be suitable for a single ‘control plus DSP’ processor target, particularly given the increasing importance of control code and easy system development described above.
Traditional embedded microcontrollers do not have the
performance needed to perform today’s DSP tasks, and further
increases in the complexity of DSP algorithms will make them
even less suitable.
CISC processors developed for desktop applications can
achieve the required performance levels, but at a prohibitive
cost in silicon area, power consumption and code density.
Traditional RISC processors can also perform DSP tasks
because of their fast instruction cycle time. While they do not
have fully optimum datapaths for DSP, a significant amount of
signal processing can be performed by RISC architectures
because of their basic speed. In particular, ARM processor
designs have certain architectural features (described below)
which suit them to DSP work. ARM is a proven market leader
for embedded RISC - its processors are small, very power
efficient and offer excellent code density. In addition they are
sourced in a modern, flexible manner that enables both ASSP
and ASIC solutions with the backup of some of the world’s
largest semiconductor companies.
In terms of DSP MIPS (defined as millions of sustained
multiply-accumulate operations per second), ARM processors
offer a cost effective solution for current and future DSP
performance requirements. The architectural features which
enable this performance are described below.
Key ARM Architectural Features for DSP
The success of ARM processors in embedded systems is due
to their small die area, good power efficiency and good code
density (for minimum system cost).
Additionally the following features significantly enhance their
DSP performance compared with alternative architectures:
Barrel Shifter
This can be used in parallel with data processing operations to
provide scaling, multiply and divide operations.
‘MUL’ instruction
A Booth’s multiplier (8 bits per cycle with early termination) is
incorporated into most designs, and instruction set support is
provided for Multiply Accumulate operations.
Auto-update load/store instructions
The ARM RISC philosophy confers excellent data addressing
support, which is ideal for building DSP data structures.
Auto-update load/store multiple instructions
The ARM data addressing support also offers efficient data
transfer using a single instruction to move multiple data
words.
Fast Interrupt response
ARM uses a set of banked registers for real-time system
performance under interrupt loading. ARM also uses an on-chip
bus standard called AMBA which makes it easy to build
custom solutions using standard peripherals like timers, serial
ports, infra-red and PCMCIA interfaces.
Good C Compiler and development path
With an efficient C compiler, some DSP routines can remain
in C, which speeds and eases development while minimizing
cost of ownership of the end application.
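
The first two features above can be modelled in a few lines. The Python sketch below is illustrative only and is not ARM code: the function names are hypothetical, registers are modelled as 32-bit unsigned integers, and the cycle model assumes that the 8-bit-per-cycle multiplier terminates early once the remaining upper bits of the multiplier operand are all zeros or all ones.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32-bit unsigned values


def lsl(value, shift):
    """Logical shift left truncated to 32 bits (the barrel shifter)."""
    return (value << shift) & MASK32


def add_shifted(rn, rm, shift):
    """Model of ADD Rd, Rn, Rm, LSL #shift: Rd = Rn + (Rm << shift)."""
    return (rn + lsl(rm, shift)) & MASK32


def rsb_shifted(rn, rm, shift):
    """Model of RSB Rd, Rn, Rm, LSL #shift: Rd = (Rm << shift) - Rn."""
    return (lsl(rm, shift) - rn) & MASK32


def multiplier_cycles(rs):
    """Cycles used by an 8-bit-per-cycle multiplier for operand rs.

    Assumption: the multiplier stops as soon as the bits not yet
    consumed are all zeros or all ones, so the result is 1 to 4.
    """
    rs &= MASK32
    for cycles in (1, 2, 3):
        upper = rs >> (8 * cycles)
        all_ones = MASK32 >> (8 * cycles)
        if upper == 0 or upper == all_ones:
            return cycles
    return 4


x = 1234
print(add_shifted(x, x, 2) == 5 * x)   # shift and add in one operation
print(rsb_shifted(x, x, 3) == 7 * x)   # shift and reverse-subtract
print(multiplier_cycles(5), multiplier_cycles(0x12345678))
```

Under these assumptions a small coefficient costs a single multiplier cycle, while a multiplication by a constant such as 5 or 7 needs no multiplier at all because the shift is folded into the add or subtract.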
Performance of ARM Processors
ARM processors are ideal candidates for implementing DSP
systems using an embedded RISC. ARM processors are
gaining design wins in systems where they are replacing both
microprocessor and DSP functions (i.e. applications such as
cordless telephones and consumer video products which are at
the low-medium end of the DSP performance spectrum).
The processing options available from ARM microprocessor
implementations cover a range of today’s DSP performance
points. ARM’s aim is to offer a single architecture solution
for the full range of microprocessor and DSP tasks. Part of
this solution is the Piccolo DSP co-processor described
elsewhere in these conference proceedings.
Real-world examples are now provided to highlight each of
the features described above.
JPEG Digital Camera.
In this example, an ARM microprocessor was able to
outperform a microcontroller plus DSP solution while
reducing system cost and easing the development cycle. The
application is a digital still-image camera, and the key
performance criterion was the ability to compress the captured
image in a time acceptable to the user while minimising
system cost.
Figure 1: DSP and RISC Performance from ARM Processing Solutions

Figure 1 shows ARM processing solutions, together with the DSP and RISC MIPS they each achieve and the process technology at which each is targeted. ARM microprocessor solutions currently available range from 0-230 MIPS, measured using the Dhrystone 2.1 test suite.

The main stages of JPEG compression are:
• Colour conversion
– Typically RGB to YCrCb
• Downsampling
– Y is typically sampled at 1:1 resolution
– But Cr and Cb are sampled at 1:4
• Discrete Cosine Transform (DCT)
– Similar to FFT - a spatial frequency conversion. Few images have a great deal of high frequency information.
• Quantisation
– Rounds sample values to nearest quantisation value
• Huffman compression
– Run-length encoding to compress zeros
– Uses shorter codes for common values
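
Several of these stages are simple enough to sketch directly. The Python below is a minimal illustration, not the camera's implementation: the function names are hypothetical, the colour weights are the commonly used JFIF values (an assumption, since the paper does not give its coefficients), and the DCT and the Huffman code tables are omitted.

```python
def clamp8(value):
    """Limit a value to the 8-bit range 0..255."""
    return max(0, min(255, value))


def rgb_to_ycbcr(r, g, b):
    """Colour conversion of one 8-bit RGB pixel to (Y, Cb, Cr)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return clamp8(round(y)), clamp8(round(cb)), clamp8(round(cr))


def downsample_2x2(plane):
    """Downsampling at 1:4: one rounded average per 2x2 block of samples."""
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, len(plane[0]) - 1, 2)
        ]
        for y in range(0, len(plane) - 1, 2)
    ]


def quantise(sample, step):
    """Quantisation: sample / step rounded to nearest, ties away from zero."""
    magnitude = (abs(sample) + step // 2) // step
    return magnitude if sample >= 0 else -magnitude


def zero_runs(values):
    """Run-length stage: (number of preceding zeros, non-zero value) pairs.

    Trailing zeros are dropped here; a real encoder would emit an
    end-of-block code instead.
    """
    pairs, run = [], 0
    for value in values:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs


print(rgb_to_ycbcr(255, 255, 255))
print(downsample_2x2([[10, 20], [30, 40]]))
print([quantise(s, 8) for s in (17, -20, 3)])
print(zero_runs([5, 0, 0, 3, 0, 1, 0, 0]))
```

The `quantise` helper adds half the step before dividing, which is the same rounding decision as comparing a remainder against half the divisor.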
Analysis of the operations required on an image size of 768 x 512 pixels shows that substantial DSP processing is required:
• Colour conversion & downsampling
– 3.5 million multiplications
• Discrete Cosine Transforms (DCT)
– 0.75 million multiplications
• Quantisation
– 0.6 million divisions
• Huffman compression
– Bitwise compression of 0.6 million 16-bit integers

ARM’s Barrel Shifter, used to perform “constant” multiplication, makes DSP like performance possible.

When multiplying by a constant value, it is possible to replace the general multiply with a fixed sequence of adds and subtracts which have the same effect. In many cases this can be quicker. For instance, multiply by 5 could be achieved using a single instruction:

    ADD   Rd, Rm, Rm, LSL #2    ; Rd = Rm + (Rm * 4) = Rm * 5

This ADD version is obviously better than the MUL version below:

    MOV   Rs, #5
    MUL   Rd, Rm, Rs

The ‘cost’ of the general multiply includes the instructions needed to load the constant into a register as well as the multiply itself.

The difficulty in using a sequence of arithmetic instructions is that the constant must be decomposed into a set of operations which can be done by one instruction each. Consider multiply by 105:

    105 == 128 - 23
        == 128 - (16 + 7)
        == 128 - (16 + (8 - 1))

    RSB   Rd, Rm, Rm, LSL #3    ; Rd = Rm*8 - Rm = Rm*7
    ADD   Rd, Rd, Rm, LSL #4    ; Rd = Rm*7 + Rm*16 = Rm*23
    RSB   Rd, Rd, Rm, LSL #7    ; Rd = Rm*128 - Rm*23 = Rm*105

Or, decomposing differently:

    105 == 15 * 7
        == (16 - 1) * (8 - 1)

    RSB   Rt, Rm, Rm, LSL #4    ; Rt = Rm*15 (tmp reg)
    RSB   Rd, Rt, Rt, LSL #3    ; Rd = Rt*7 = Rm*105

The second method is the optimal solution (fairly easy to find for small values such as 105).

The ARM C compiler supports optimised decomposing, but restricts the amount of searching it performs in order to minimise the impact on compilation time. The current version of armcc has a cut-off so that it uses a normal MUL if the number of instructions used in the multiply-by-constant sequence exceeds some number N. This is to avoid the sequence becoming too long. Here are some other examples:

• r0 = r1 * 127
    RSB   r0, r1, r1, LSL #7    ; x128 - x1 -> x127
• r0 = r1 * 85
    ADD   r0, r1, r1, LSL #4    ; x17
    ADD   r0, r0, r0, LSL #2    ; x17 * 5 -> x85
• r0 = r1 * 139
    ADD   r0, r1, r1, LSL #2    ; x5
    RSB   r0, r0, r0, LSL #3    ; x40 - x5 -> x35
    RSB   r0, r1, r0, LSL #2    ; x35 * 4 -> x140, - x1 -> x139

All this multiplication is achieved through judicious use of the ARM’s barrel shifter. In fact, the barrel shifter speeds up all parts of JPEG:
• Compare sample with half a divisor (quantisation)
– CMP sample, divisor, LSR #1
• Load huffman code for ‘value’
– LDR code, [dc_table, value, LSL #2]
• ‘OR’ new huffman code with existing buffer
– put_buffer |= (code << scrap)
– ORR put_buffer, put_buffer, code, LSL shift

Every free barrel shift represents an advantage of ARM over other architectures. The performance of an ARM7 system on various JPEG tasks is presented in figure 2 below.

Figure 2: ARM7 JPEG Performance

Cordless Telephones.
This example shows how an ARM7TDMI processor is being used to implement DSP functions for both cordless handsets and base stations. Examples of this use are the DECT and PHS standards. The information presented below is gathered from a DECT implementation.
Typical functions required are CCITT compliant ADPCM data conversion using the G.726 ADPCM algorithm for both handset and base station, with ETS 300 175-8 echo cancellation / soft suppression for the base station.

G.726 ADPCM
ITU Recommendation G.726 provides a very strictly defined algorithm for the implementation of four different rates of data transmission using an adaptive technique for reduction of bandwidth requirement in the air interface of a radio telephone system.

The ARM has a 32 bit arithmetic and logic unit and is capable of much higher resolution than that afforded by 16 bit processors. As the G.726 algorithm relies upon the limitations of a 16 bit accumulator based processor, the ARM has to perform extra functions to ensure that it discards data bits which would be lost in a 16 bit accumulator. The use of a floating point format in G.726 introduces arithmetical inaccuracies which must be simulated by any implementation of G.726 not using the exact floating point structure defined in the specification.

There are considerable opportunities for optimisation of the algorithm for ARM. Initial tests of compilation of third party C code showed that an ARM running at 50 MHz would have been required just for ADPCM conversion. Further work on the C code to optimise from a ‘naive’ state to an architecture ‘aware’ state brought the processor requirement down to 20 MHz. At this stage the code was converted to assembly language. The final total processing requirement is approximately 46% of the capacity of an ARM running at 20 MHz with all ADPCM functions passing validation tests supplied by the ITU consultative committee. The total requirement for both encoding and decoding ADPCM is approximately 9.2 MIPS, with the encoding function requiring slightly more processing power than the decoding function.

ETS 300 175-8 Echo Cancellation
ETS 300 175-8 provides a recommendation of the characteristics of the echo cancellation function within a DECT base station, but gives considerably more flexibility in implementation than the ADPCM specification.

Figure 3: Echo Cancellation Functions

• Non-linear processor (NLP):
This function block monitors both the line-in speech and the speech from the handset. While the signal level from the handset is above a defined clipping level (Vsup) it passes through the NLP unchanged. If however the line-in signal (Lrin) is higher than a defined level, all signals below the clipping level Vsup are clamped to zero.
• Echo soft-suppress processor (SSP):
The echo soft-suppress processor has a very similar function to the NLP with the exception that the signal from the handset microphone is used as the activation signal for the SSP.
• Echo cancellation processor (ECP):
The echo cancellation process block attempts to build an estimation of the echo paths from Line-out back to the input of Line-In. The echo arises from impedance mismatches at the connection of the base station to the network junction; this is called hybrid echo and may be louder than the far-end speech signal. There is also a smaller amount of network echo with a longer delay which would be generated in the national telephone network.

The increase in performance which would be gained by translating the NLP and SSP blocks to ARM assembler would not be significant as a proportion of total processor load, so the algorithms are implemented as optimised C code. The maintainability and speed of development of such an implementation illustrates one advantage of the efficient ARM C compiler. The echo cancellation processor section of the signal processing saw most effort applied in the search for efficient and effective algorithms.

Figure 4: Signal before and after echo cancellation

The algorithm implemented is an LMS filter, which is effectively a Finite Impulse Response filter with a feedback section which allows the filter to train itself to determine the delay and attenuation of the echo path. The LMS filter can cope with multiple echo paths within its control time-span, in tests eliminating both Hybrid and Network echoes simultaneously.
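
The structure of such a filter can be sketched briefly. The Python below is a floating-point illustration of a generic LMS update, not the DECT implementation described in this paper (which would use fixed-point arithmetic): the function name, the tap count, the step size `mu` and the test signals are all assumptions made for the example.

```python
import random


def lms_echo_canceller(far_end, line_in, taps=8, mu=0.05):
    """Return (echo-cancelled samples, final FIR coefficients).

    far_end -- samples sent towards the line (the source of the echo)
    line_in -- samples received back, containing an echo of far_end
    """
    weights = [0.0] * taps
    history = [0.0] * taps      # recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, line_in):
        history = [x] + history[:-1]
        # FIR section: a multiply-accumulate loop estimates the echo.
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate
        # Feedback section: move each coefficient to reduce the error.
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights


# Made-up test signal: the echo is the far-end signal delayed by three
# samples and attenuated to half amplitude.
rng = random.Random(1)
far_end = [rng.choice((-1.0, 1.0)) for _ in range(2000)]
line_in = [0.0] * 3 + [0.5 * x for x in far_end[:-3]]
residual, weights = lms_echo_canceller(far_end, line_in)
print([round(w, 3) for w in weights])
```

In this noise-free example the filter should train itself so that the coefficient at delay 3 settles near the echo attenuation of 0.5 while the others decay towards zero, which is the sense in which the filter determines the delay and attenuation of the echo path.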
The echo cancellation process makes extensive use of the ARM’s ‘multiply accumulate’ instruction, and has been optimised through partially unrolling the FIR evaluation and coefficient update loop. This increases operational speed at the expense of code size. An initial implementation in C working on a single element for each loop iteration required in excess of 7.7 MIPS. Taking the process of unrolling the calculation loop to its logical extent would produce a function approximately 1850 bytes long which could process 8 kHz signals using 3.92 MIPS. The chosen implementation is a good compromise given the rapid increase in size for small gains in speed. The performance of each processing section is detailed below:
Non-linear processor requires 0.264 MIPS for an 8 kHz sampling rate.
Soft-suppress processor requires 0.224 MIPS for an 8 kHz sampling rate.
Echo cancellation requires 4.048 MIPS for an 8 kHz sampling rate.

The final system implementation in hardware also makes extensive use of ARM ‘AMBA’ peripherals to allow easy development of a custom solution - timer interrupts are associated with the ARM’s Fast Interrupt reQuest (FIQ) in the processor, and keypress or other non-critical system events are associated with the standard IRQ, illustrating the suitability of the ARM processor architecture for system implementations.

Software Modem.
This example is taken from an implementation of an ARM-based modem within a multimedia Set Top Box design. The modem code would typically be used in this application to form a slow backchannel transmitting information such as pay-per-view and electronic program guide requests.

The modem software has to compensate for variable line noise, echo and phase jitter, providing both DSP-like and microcontroller functionality.

A software-only implementation reduces system costs by removing the need to use a modem chipset; in addition to the telephone line interface which would be required in any implementation, a single codec (ADC/DAC converter) is the only additional component required.

Standards to be implemented in such a system are:
V.22bis, V.22, V.23, Bell 212A
V.42 error correction
V.25 call progression
DTMF transmission & dialling
Off-hook detection

A software library and peripheral hardware were developed to implement the modem standards listed, and the system processor was an ARM7TDMI core with a 4KB cache, memory protection unit and AMBA peripheral bus. The ARM7TDMI core is a fully static design and offers flexible system clocking in order to reduce system power consumption.

A quantified estimate of the benefit of the multiplier within the ARM7TDMI can be shown by evaluating a block which is heavily multiply/accumulate intensive, the complex linear equaliser function. This function is computed at the symbol rate of 600 Hz. For the purpose of evaluation, single-cycle access memory was assumed as all of the DSP routines are compact, leading to a high cache hit rate. Many of the functions can be optimised to operate entirely within the ARM register set, allowing a single multiple register load at the entry to the function and a single multiple register store at the completion of the function. The multiplier in the ARM7TDMI is shown (see figure 5) to give considerable benefit in the DSP-intensive sections of the code. This function type forms a significant part of the code.

Figure 5: Performance comparisons with different multipliers

The total processing requirement for the modem code listed above is 12 MIPS with a multiplier unit. This would rise to 21 MIPS if the multiplier is not present.

Software Application Libraries
ARM has introduced a series of Software Application Libraries which provide software components to assist in the development of signal processing applications using ARM processors. All the software routines described in this paper, along with many others, are available in this library. Full information on the library can be found on the ARM World-Wide-Web site www.arm.com.