Using Advanced RISC Machines (ARM) Ltd’s RISC Microprocessor Architecture for Cost-Effective Embedded Signal Processing Performance

Dave Walsh, Advanced RISC Machines Ltd (ARM), Cambridge, UK.

Introduction

ARM has rapidly built a global reputation in the embedded microprocessor industry by delivering high performance 32-bit RISC designs with the cost effectiveness, power efficiency, robust development environments and ease of use that have allowed them to replace 8- and 16-bit microprocessors in high volume embedded designs. ARM’s products are sourced through a unique partnership between ARM and the world’s most powerful semiconductor companies - this kind of backup offers OEMs the modern, flexible choice of custom ASIC and ASSP implementations their business environment demands.

ARM operates at the convergence of the communication, computer and consumer markets. Signal processing is a key technology in such markets. This paper describes the effective way ARM’s RISC architecture has been used to implement signal processing systems in volume consumer products. It examines the technical details that make the ARM architecture a suitable target for DSP software, and uses real-world examples to illustrate the benefits of this approach.

Delivering DSP Performance

Typical DSP systems contain both microcontroller and DSP functionality. Several alternatives are available for developers who wish to implement systems containing both control and DSP processing: use a combined microcontroller and DSP, use a DSP alone or use a microcontroller alone. The alternatives must each be measured against the criteria by which future solutions will be judged, namely lower cost, higher performance and shorter time to market.

Microprocessor plus DSP

The most widely used solution in products today links a control processing engine with a traditional DSP processor. Several such products are already available today from ARM partners, for example modems and telecommunication chipsets.

This approach is common, and using an ARM microprocessor provides significant benefits in terms of power efficiency and total system cost. When considered in absolute terms, however, it has the disadvantages of requiring multiple memory systems, combined with inefficient interprocessor communication and duplicated processing functionality. This results in higher costs via:
• high total component count / silicon area.
• complex design.
• complex software development process.

These problems will remain if a two processor solution is used in the future.

DSP Alone

An alternative approach is to use the DSP to undertake all tasks, including those currently done on the control processor, as it is already doing most of the difficult processing work in the system. Although this solves some problems (mainly those related to developing systems with multiple processors rather than system cost or overall performance), the software development requirements placed upon control processing and DSP architectures vary significantly. This has significant implications for a DSP-only approach:
• DSP architecture evolution to date has used DSP performance as the major driving factor, not cost or a total system requirement.
• Fixed point DSPs are very poorly supported by high level language compilers.
• DSPs have poor code density for control code.
• DSPs do not efficiently support decision making or low level bit manipulation.

Although some DSPs with enhanced microcontroller functionality are now being announced, these are largely the result of adding enhanced bit manipulation capabilities to a traditional DSP ‘multiply-accumulate’ architecture to allow them to address tasks such as convolutional encoding. A significant step has yet to be taken in terms of full development environment and high level language compiler support before DSPs will offer an alternative to a microprocessor.

In today’s systems there is proportionally much more control code than DSP code (sometimes as much as 30:1), making DSPs a poor target for a single processor solution. This will be further exacerbated when, for example, PDA-type value-added applications become the norm in telephone and pager systems - an established architecture with good high level language and operating system support will be needed.

Additionally, more sophisticated DSP algorithms such as text-to-speech and voice recognition make heavy use of RISC style instructions, and only use a small amount of traditional DSP processing capability. This further strengthens the case for building future systems on an established, robust, predominantly microprocessor-based architecture.

Using a DSP alone for a total system control and DSP processing solution increases memory costs due to poor code density, lengthens time-to-market because of poor high level language support and increases the product lifecycle costs incurred by low level language program maintenance.

Microprocessor Alone

Of the two types of processor (control and DSP) used in systems today, the control processor seems the more likely to be suitable for a single ‘control plus DSP’ processor target, particularly given the increasing importance of control code and easy system development described above.

Traditional embedded microcontrollers do not have the performance needed to perform today’s DSP tasks, and further increases in the complexity of DSP algorithms will make them even less suitable. CISC processors developed for desktop applications can achieve the required performance levels, but at a prohibitive cost in silicon area, power consumption and code density. Traditional RISC processors can also perform DSP tasks because of their fast instruction cycle time. Although their datapaths are not fully optimised for DSP, RISC architectures can perform a significant amount of signal processing because of their basic speed.
In particular, ARM processor designs have certain architectural features (described below) which suit them to DSP work. ARM is a proven market leader for embedded RISC - its processors are small, very power efficient and offer excellent code density. In addition they are sourced in a modern, flexible manner that enables both ASSP and ASIC solutions with the backup of some of the world’s largest semiconductor companies.

Performance of ARM Processors

ARM processors are ideal candidates for implementing DSP systems using an embedded RISC. ARM processors are gaining design wins in systems where they are replacing both microprocessor and DSP functions (i.e. applications such as cordless telephones and consumer video products which are at the low-medium end of the DSP performance spectrum). The processing options available from ARM microprocessor implementations cover a range of today’s DSP performance points. ARM’s aim is to offer a single architecture solution for the full range of microprocessor and DSP tasks. Part of this solution is the Piccolo DSP co-processor described elsewhere in these conference proceedings.

Figure 1: DSP and RISC Performance from ARM Processing Solutions

Figure 1 above shows ARM processing solutions, together with the DSP and RISC MIPS they each achieve plus the process technology at which they are targeted. ARM microprocessor solutions currently available range from 0-230 MIPS, measured using the Dhrystone 2.1 test suite. In terms of DSP MIPS (defined as millions of sustained multiply-accumulate operations per second), ARM processors offer a cost effective solution for current and future DSP performance requirements. The architectural features which enable this performance are described below.

Key ARM Architectural Features for DSP

The success of ARM processors in embedded systems is due to their small die area, good power efficiency and good code density (for minimum system cost). Additionally the following features significantly enhance their DSP performance compared with alternative architectures:
• Barrel Shifter: This can be used in parallel with data processing operations to provide scaling, multiply and divide operations.
• ‘MUL’ instruction: A Booth’s multiplier (8 bits per cycle with early termination) is incorporated into most designs, and instruction set support is provided for multiply-accumulate operations.
• Auto-update load/store instructions: The ARM RISC philosophy confers excellent data addressing support, which is ideal for building DSP data structures.
• Auto-update load/store multiple instructions: The ARM data addressing support also offers efficient data transfer using a single instruction to move multiple data words.
• Fast interrupt response: ARM uses a set of banked registers for real-time system performance under interrupt loading. ARM also uses an on-chip bus standard called AMBA which makes it easy to build custom solutions using standard peripherals like timers, serial ports, infra-red and PCMCIA interfaces.
• Good C compiler and development path: With an efficient C compiler, some DSP routines can remain in C, which speeds and eases development while minimizing cost of ownership of the end application.

Real-world examples are now provided to highlight each of the features described above.

JPEG Digital Camera

In this example, an ARM microprocessor was able to outperform a microcontroller plus DSP solution while reducing system cost and easing the development cycle. The application is a digital still-image camera, and the key performance criterion was the ability to compress the captured image in a time acceptable to the user while minimising system cost.

The main stages of JPEG compression are:
• Colour conversion
  - Typically RGB to YCrCb
• Downsampling
  - Y is typically sampled at 1:1 resolution
  - But Cr and Cb are sampled at 1:4
• Discrete Cosine Transform (DCT)
  - Similar to FFT - a spatial frequency conversion
  - Few images have a great deal of high frequency information
• Quantisation
  - Rounds sample values to nearest quantisation value
• Huffman compression
  - Run-length encoding to compress zeros
  - Uses shorter codes for common values

Analysis of the operations required on an image size of 768 x 512 pixels shows that substantial DSP processing is required:
• Colour conversion & downsampling - 3.5 million multiplications
• Discrete Cosine Transforms (DCT) - 0.75 million multiplications
• Quantisation - 0.6 million divisions
• Huffman compression - bitwise compression of 0.6 million 16-bit integers

ARM’s barrel shifter, used to perform “constant” multiplication, makes DSP-like performance possible. When multiplying by a constant value, it is possible to replace the general multiply with a fixed sequence of adds and subtracts which have the same effect. In many cases this can be quicker. For instance, multiply by 5 could be achieved using a single instruction:

    ADD Rd, Rm, Rm, LSL #2 ; Rd = Rm + (Rm * 4) = Rm * 5

This ADD version is obviously better than the MUL version below:

    MOV Rs, #5
    MUL Rd, Rm, Rs

The ‘cost’ of the general multiply includes the instructions needed to load the constant into a register as well as the multiply itself. The difficulty in using a sequence of arithmetic instructions is that the constant must be decomposed into a set of operations which can be done by one instruction each. Consider multiply by 105:

    105 == 128 - 23 == 128 - 32 + 9 == 128 - 32 + 8 + 1

    ADD Rd, Rm, Rm, LSL #3 ; Rd = Rm*9
    SUB Rd, Rd, Rm, LSL #5 ; Rd = Rm*9 - Rm*32
    ADD Rd, Rd, Rm, LSL #7 ; Rd = Rm*9 - Rm*32 + Rm*128 = Rm*105

Or, decomposing differently:

    105 == 15 * 7 == (16 - 1) * (8 - 1)

    RSB Rt, Rm, Rm, LSL #4 ; Rt = Rm*15 (tmp reg)
    RSB Rd, Rt, Rt, LSL #3 ; Rd = Rt*7 = Rm*105

The second method is the optimal solution (fairly easy to find for small values such as 105). The ARM C compiler supports optimised decomposition, but restricts the amount of searching it performs in order to minimise the impact on compilation time. The current version of armcc has a cut-off so that it uses a normal MUL if the number of instructions used in the multiply-by-constant sequence exceeds some number N. This is to avoid the sequence becoming too long. Here are some other examples:

• r0 = r1 * 127
    RSB r0, r1, r1, LSL #7 ; x128 - x1 -> x127
• r0 = r1 * 85
    ADD r0, r1, r1, LSL #4 ; x17
    ADD r0, r0, r0, LSL #2 ; x17 * 5 -> x85
• r0 = r1 * 139
    ADD r0, r1, r1, LSL #2 ; x5
    RSB r0, r0, r0, LSL #3 ; x40 - x5 -> x35
    RSB r0, r1, r0, LSL #2 ; x35 * 4 -> x140 - x1 -> x139

All this multiplication is achieved through judicious use of the ARM’s barrel shifter. In fact, the barrel shifter speeds up all parts of JPEG:
• Compare sample with half a divisor (quantisation)
  - CMP sample, divisor, LSR #1
• Load Huffman code for ‘value’
  - LDR code, [dc_table, value, LSL #2]
• ‘OR’ new Huffman code with existing buffer
  - put_buffer |= (code << scrap)
  - ORR put_buffer, put_buffer, code, LSL shift

Every free barrel shift represents an advantage of ARM over other architectures. The performance of an ARM7 system on various JPEG tasks is presented in Figure 2 below.

Figure 2: ARM7 JPEG Performance

Cordless Telephones

This example shows how an ARM7TDMI processor is being used to implement DSP functions for both cordless handsets and base stations. Examples of this use are the DECT and PHS standards. The information presented below is gathered from a DECT implementation.

Typical functions required are CCITT compliant ADPCM data conversion using the G.726 ADPCM algorithm for both handset and base station, with ETS 300 175-8 echo cancellation / soft suppression for the base station.

G.726 ADPCM

ITU Recommendation G.726 provides a very strictly defined algorithm for the implementation of four different rates of data transmission using an adaptive technique for reduction of bandwidth requirement in the air interface of a radio telephone system. The ARM has a 32 bit arithmetic and logic unit and is capable of much higher resolution than that afforded by 16 bit processors. As the G.726 algorithm relies upon the limitations of a 16 bit accumulator based processor, the ARM has to perform extra functions to ensure that it discards data bits which would be lost in a 16 bit accumulator. The use of a floating point format in G.726 introduces arithmetical inaccuracies which must be simulated by any implementation of G.726 not using the exact floating point structure defined in the specification.

There are considerable opportunities for optimisation of the algorithm for ARM. Initial tests of compilation of third party C code showed that an ARM running at 50 MHz would have been required just for ADPCM conversion. Further work on the C code to optimise from a ‘naive’ state to an architecture ‘aware’ state brought the processor requirement down to 20 MHz. At this stage the code was converted to assembly language. The final total processing requirement is approximately 46% of the capacity of an ARM running at 20 MHz, with all ADPCM functions passing validation tests supplied by the ITU consultative committee. The total requirement for both encoding and decoding ADPCM is approximately 9.2 MIPS, with the encoding function requiring slightly more processing power than the decoding function.

ETS 300 175-8 Echo Cancellation

ETS 300 175-8 provides a recommendation of the characteristics of the echo cancellation function within a DECT base station, but gives considerably more flexibility in implementation than the ADPCM specification.

Figure 3: Echo Cancellation Functions

• Non-linear processor (NLP): This function block monitors both the line-in speech and the speech from the handset. While the signal level from the handset is above a defined clipping level (Vsup) it passes through the NLP unchanged. If however the line-in signal (Lrin) is higher than a defined level, all signals below the clipping level Vsup are clamped to zero.
• Echo soft-suppress processor (SSP): The echo soft-suppress processor has a very similar function to the NLP with the exception that the signal from the handset microphone is used as the activation signal for the SSP.
• Echo cancellation processor (ECP): The echo cancellation process block attempts to build an estimation of the echo paths from Line-out back to the input of Line-in. The echo arises from impedance mismatches at the connection of the base station to the network junction; this is called hybrid echo and may be louder than the far-end speech signal. There is also a smaller amount of network echo with a longer delay which would be generated in the national telephone network.

The increase in performance which would be gained by translating the NLP and SSP blocks to ARM assembler would not be significant as a proportion of total processor load, so the algorithms are implemented as optimised C code. The maintainability and speed of development of such an implementation illustrate one advantage of the efficient ARM C compiler.

The echo cancellation processor section of the signal processing saw most effort applied in the search for efficient and effective algorithms. The algorithm implemented is an LMS filter, which is effectively a Finite Impulse Response filter with a feedback section which allows the filter to train itself to determine the delay and attenuation of the echo path. The LMS filter can cope with multiple echo paths within its control time-span, in tests eliminating both hybrid and network echoes simultaneously.

Figure 4: Signal before and after echo cancellation

The echo cancellation process makes extensive use of the ARM’s ‘multiply accumulate’ instruction, and has been optimised through partially unrolling the FIR evaluation and coefficient update loop. This increases operational speed at the expense of code size. An initial implementation in C working on a single element for each loop iteration required in excess of 7.7 MIPS. Taking the process of unrolling the calculation loop to its logical extent would produce a function approximately 1850 bytes long which could process 8 kHz signals using 3.92 MIPS. The chosen implementation is a good compromise given the rapid increase in size for small gains in speed. The performance of each processing section is detailed below:
• Non-linear processor requires 0.264 MIPS for an 8 kHz sampling rate.
• Soft-suppress processor requires 0.224 MIPS for an 8 kHz sampling rate.
• Echo cancellation requires 4.048 MIPS for an 8 kHz sampling rate.

The final system implementation in hardware also makes extensive use of ARM ‘AMBA’ peripherals to allow easy development of a custom solution - timer interrupts are associated with the ARM’s Fast Interrupt reQuest (FIQ) in the processor, and keypress or other non-critical system events are associated with the standard IRQ, illustrating the suitability of the ARM processor architecture for system implementations.

Software Modem

This example is taken from an implementation of an ARM-based modem within a multimedia set-top box design. The modem code would typically be used in this application to form a slow backchannel transmitting information such as pay-per-view and electronic program guide requests.

The modem software has to compensate for variable line noise, echo and phase jitter, providing both DSP-like and microcontroller functionality. A software-only implementation reduces system costs by removing the need to use a modem chipset; in addition to the telephone line interface which would be required in any implementation, a single codec (ADC/DAC converter) is the only additional component required. Standards to be implemented in such a system are:
• V.22bis, V.22, V.23, Bell 212A
• V.42 error correction
• V.25 call progression
• DTMF transmission & dialling
• Off-hook detection

A software library and peripheral hardware were developed to implement the modem standards listed, and the system processor was an ARM7TDMI core with a 4KB cache, memory protection unit and AMBA peripheral bus. The ARM7TDMI core is a fully static design and offers flexible system clocking in order to reduce system power consumption.

A quantified estimate of the benefit of the multiplier within the ARM7TDMI can be shown by evaluating a block which is heavily multiply/accumulate intensive, the complex linear equaliser function. This function is computed at the symbol rate of 600 Hz. For the purpose of evaluation, single-cycle access memory was assumed, as all of the DSP routines are compact, leading to a high cache hit rate. Many of the functions can be optimised to operate entirely within the ARM register set, allowing a single multiple register load at the entry to the function and a single multiple register store at the completion of the function. The multiplier in the ARM7TDMI is shown (see Figure 5) to give considerable benefit in the DSP-intensive sections of the code. This function type forms a significant part of the code.

Figure 5: Performance comparisons with different multipliers

The total processing requirement for the modem code listed above is 12 MIPS with a multiplier unit. This would rise to 21 MIPS if the multiplier is not present.

Software Application Libraries

ARM has introduced a series of Software Application Libraries which provide software components to assist in the development of signal processing applications using ARM processors. All the software routines described in this paper, along with many others, are available in this library. Full information on the library can be found on the ARM World-Wide-Web site www.arm.com.
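
The multiply-by-constant sequences discussed in the JPEG Digital Camera example can be checked with a small model. The sketch below is not ARM code: it is a Python model of 32-bit registers, and the helper names (lsl, add, rsb, mul5 and so on) are hypothetical names introduced here for illustration. The x139 routine follows the arithmetic given in the instruction comments (x5, then x35, then x140 - x1).

```python
MASK32 = 0xFFFFFFFF  # assumption: model ARM registers as 32-bit unsigned values


def lsl(x, n):
    """Barrel-shifter logical shift left, truncated to 32 bits."""
    return (x << n) & MASK32


def add(rn, op2):
    """ADD Rd, Rn, Op2 -> Rn + Op2 (mod 2^32)."""
    return (rn + op2) & MASK32


def rsb(rn, op2):
    """RSB Rd, Rn, Op2 -> Op2 - Rn (reverse subtract, mod 2^32)."""
    return (op2 - rn) & MASK32


def mul5(rm):
    return add(rm, lsl(rm, 2))        # Rm + Rm*4


def mul105(rm):
    rt = rsb(rm, lsl(rm, 4))          # Rm*16 - Rm = Rm*15
    return rsb(rt, lsl(rt, 3))        # Rt*8 - Rt = Rt*7 = Rm*105


def mul127(r1):
    return rsb(r1, lsl(r1, 7))        # x128 - x1


def mul85(r1):
    r0 = add(r1, lsl(r1, 4))          # x17
    return add(r0, lsl(r0, 2))        # x17 * 5


def mul139(r1):
    r0 = add(r1, lsl(r1, 2))          # x5
    r0 = rsb(r0, lsl(r0, 3))          # x40 - x5 = x35
    return rsb(r1, lsl(r0, 2))        # x140 - x1


def check(limit=1000):
    """Compare every sequence against a plain multiply, modulo 2^32."""
    table = {5: mul5, 105: mul105, 127: mul127, 85: mul85, 139: mul139}
    samples = list(range(limit)) + [0x7FFFFFFF, 0x80000000, MASK32]
    for value in samples:
        for constant, fn in table.items():
            assert fn(value) == (value * constant) & MASK32
    return True
```

Calling check() exercises small values and a few wrap-around cases. Because both the shift-and-add sequences and MUL work modulo 2^32, the results agree even when the product overflows.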
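
The behaviour described for the non-linear processor (NLP) in the cordless telephone example can be illustrated with a per-sample sketch. This is a simplification under stated assumptions: the paper refers to signal levels, which a real implementation would probably track with some form of level estimate, whereas this sketch compares instantaneous sample magnitudes. The function name nlp and the parameter line_in_threshold are hypothetical; only Vsup and the line-in signal come from the text.

```python
def nlp(handset, line_in, vsup, line_in_threshold):
    """Hypothetical per-sample non-linear processor.

    handset           -- sample from the handset path
    line_in           -- sample from the line-in path (Lrin in the text)
    vsup              -- clipping level from the text
    line_in_threshold -- assumed name for the 'defined level' on line-in
    """
    if abs(handset) > vsup:
        # Above the clipping level: pass through unchanged.
        return handset
    if abs(line_in) > line_in_threshold:
        # Line-in is active: clamp low-level handset signal to zero.
        return 0
    # Line-in is quiet: low-level signal is left alone.
    return handset
```

For example, with vsup = 100 and line_in_threshold = 50, a handset sample of 20 is clamped to zero while line-in is at 200, but passes unchanged while line-in is at 10.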
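
The LMS filter used for echo cancellation can be sketched as an FIR evaluation followed by a coefficient update on each sample. The code below is a floating-point illustration rather than the fixed-point ARM implementation described in the paper. The tap count, the step size mu, the synthetic echo path and the random test signal are all assumptions chosen only so that the example is self-contained.

```python
import random


def lms_echo_canceller(line_out, line_in, taps=8, mu=0.05):
    """Adapt an FIR model of the echo path; return (residual, weights)."""
    weights = [0.0] * taps
    history = [0.0] * taps            # recent line-out samples, newest first
    residual = []
    for x, d in zip(line_out, line_in):
        history = [x] + history[:-1]
        # FIR evaluation: a multiply-accumulate over all taps.
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate          # line-in with the estimated echo removed
        # Coefficient update (the feedback section of the filter).
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights


# Assumed echo path with two reflections at different delays,
# loosely standing in for hybrid and network echo.
ECHO_PATH = [0.0, 0.0, 0.5, 0.0, 0.0, -0.25, 0.0, 0.0]

rng = random.Random(1)
line_out = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
line_in = [
    sum(ECHO_PATH[k] * line_out[n - k] for k in range(len(ECHO_PATH)) if n - k >= 0)
    for n in range(len(line_out))
]

residual, weights = lms_echo_canceller(line_out, line_in)
```

In this noise-free setting the weights settle on the assumed echo path and the residual decays towards zero. The two inner loops over the taps are the part that the paper describes as partially unrolled on the ARM to trade code size for speed.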
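
The processing figures quoted for the cordless telephone and software modem examples can be cross-checked with simple arithmetic. The sketch below assumes, as the ADPCM figures in the paper imply, that the capacity of a 20 MHz ARM is counted as 20 MIPS. The combined base-station total is an inference made here, on the further assumption that ADPCM and all three echo functions run on the same processor; it is not a figure from the paper.

```python
CLOCK_MHZ = 20.0  # assumption: capacity counted as 1 MIPS per MHz

# ADPCM: approximately 46% of a 20 MHz ARM.
adpcm_mips = 0.46 * CLOCK_MHZ

# Echo functions at an 8 kHz sampling rate, as listed in the text.
echo_mips = {"nlp": 0.264, "ssp": 0.224, "ecp": 4.048}
echo_total = sum(echo_mips.values())

# Inferred total if one processor runs ADPCM plus all echo functions.
base_station_total = adpcm_mips + echo_total

# Modem: 12 MIPS with the multiplier, 21 MIPS without it.
multiplier_benefit = 21.0 / 12.0

print(f"ADPCM:        {adpcm_mips:.3f} MIPS")
print(f"Echo total:   {echo_total:.3f} MIPS")
print(f"Base station: {base_station_total:.3f} MIPS")
print(f"Modem load without multiplier: x{multiplier_benefit:.2f}")
```

The ADPCM figure reproduces the 9.2 MIPS quoted in the text, and the modem ratio shows that removing the multiplier would raise the quoted load by a factor of 1.75.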