Recent Advances in DSP Programmable Processors and Compilation

C. Gebotys
Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
cgebotys@optimal.vlsi.uwaterloo.ca

Abstract

Until recently, DSP designers were faced with the task of coding increasingly complex applications using inefficient compilers or tedious hand-generated assembly code. To make matters worse, the products demanded low cost, extremely high performance, and low power, especially in the wireless communications and consumer markets. This paper examines the architectural features of several popular DSP processors over the past 10 years, highlighting code compilation for performance, price, and power. Characteristics of a recently announced DSP processor will be examined to show their positive impact on the code compilation task. The goals of this paper are to show how DSP processors have evolved over the past decade and to present some evidence that current architectures are much improved. This research is important for developing a general methodology for code compilation for embedded DSP programmable processors with performance, price, and power constraints.

1. Introduction

Currently, there is high demand for digital signal processing (DSP) processors with high performance, low code size, low power dissipation, and low energy consumption in many areas such as the telecommunications, information technology, and automotive industries. In particular, these DSP processors target markets such as software-configurable wireless terminals or radios, third-generation wireless systems and basestations, speech coding, speech synthesis, speech recognition, wireless internet, multimedia, and network and data communication. Low power consumption is important for reliability and low-cost production as well as for device portability and miniaturization.
This push for higher performance and lower power is combined with extremely low cost requirements for implementation of DSP processors in embedded applications in the consumer market. This market is well known for being very cost sensitive and competitive. These types of applications rely heavily on DSP processors and every year deal with applications which are increasing in complexity and have shorter lifetimes. To make matters worse, it is not a straightforward process to choose a DSP processor, write code for these new applications, and get the product out the door. In fact, in the DSP area compilers are still not that popular, and most DSP programmers are still coding at the assembly level. This greatly slows down the process of mapping new applications onto these new DSP cores or processor chips, so great difficulty lies in time to market. The problem is that, unlike general purpose processors, DSP processors generally have compilers which generate such inefficient code that most designers must resort to hand-coded assembly. There are good reasons why compilers are hard to develop for some DSP processors. However, in this millennium the use of efficient DSP compilers is a necessity, and in order to build these compilers it will be vital to have "good" DSP processor architectures. The premise of this paper is that "good" DSP processor architectures have just recently been achieved and that efficient DSP compiler tools will become prevalent. In fact, it is widely believed that excellent VLSI technology has produced excellent hardware and architectures, but that the next design evolution will occur at the software level[29], in effect allowing one hardware platform to support a number of different applications based upon software and its ease of design. Good software which can be efficiently compiled and efficiently implemented for applications is crucial for embedded systems.
Previous compilers for general purpose processors have focused on performance[30]. The evolution of these processors moved from CISC architectures to RISC processors[28]. This evolution was mainly driven by the compilers, which could not efficiently make use of many of the complex instructions of the CISC architectures. RISC approaches provided simpler instructions, so the compiler had an easier job, and this left more room on chip for registers. Overall performance was easier to achieve with these architectures[28]. The compilers for RISC general purpose processors typically generate instructions first and then allocate registers. Most of the previous work on reducing power and energy consumption in processors has focused on hardware solutions to the problem[27]. However, power is so important even in the general purpose processor area that techniques which save only a few percentage points of power dissipation are viewed as important[25]. It has been stated that with every new generation of Intel processors the power problem has increased 3 times, even though the VLSI process used has a 6 times improvement in power dissipation[25]. Embedded systems designers frequently have no control over the hardware aspects of the pre-designed processor cores with which they work, and so software-based power and/or energy minimization techniques play a useful role in meeting design constraints. Price is a more important driving criterion for DSP processors than for general purpose processors. In fact, for many embedded systems applications the size of memory, the number of functional units in the datapath, etc. can be optimized to reduce cost for a particular application. This support for customizing a processor core is evident in industry today[26]. However, price is also reflected in code size (which determines memory size, or whether external memory chips or modules are required) for typical DSP applications. Code size is thus an important criterion which impacts power as well as price.

2.
General Features of DSP Processors

This section highlights some major differences between DSP and general purpose processors (GPPs). These points arise from DSP applications themselves together with the tight performance, price, and power constraints. In general, DSP processors in the past implemented applications involving filters, where data is multiplied by fixed coefficients and accumulated. This recurrent theme of many digital filters led to datapaths heavily built around multipliers and adders. Furthermore, due to tight price constraints, 16-bit I/O data typically offered sufficient range for voice and other types of DSP data, unlike in GPPs. Data processing proceeded with techniques which used fixed-point data representation (often fractional data), data scaling, and data rounding throughout the application. This produced sufficient accuracy without the excessive cost of using 64-bit data or floating point representation. Addressing was also quite different from GPPs. DSP applications typically deal with array data structures, circular buffers, etc., so these types of addressing, including more exotic types such as bit-reversed addressing (ideal for efficient implementation of fast Fourier transforms), were directly supported. Some relevant functional features of DSP processors (in addition to those in [12,13]) which differentiate them from GPPs are outlined below. The DSP processor evolution will be discussed in this paper with particular emphasis on code compilation. The DSP processors used to illustrate the evolution are TI's TMS320C2x processor[6], TI's TMS320C6x processor[3], and finally the most recently announced Star*Core 140 processor core jointly developed by Motorola and Lucent[7]. The paper first gives a general description of DSP features, outlining some differences with general purpose processors. Only fixed-point DSP processors will be discussed, not the more costly floating point DSP processors.
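To make the bit-reversed addressing mode concrete, the sketch below (illustrative Python, not tied to any particular processor's addressing hardware) reverses the low-order bits of an index, which is exactly the visit order an in-place radix-2 FFT needs:

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index`, e.g. 6 = 110b -> 011b = 3."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift LSB out of index, into result
        index >>= 1
    return result

# Visit order for an 8-point FFT (3 address bits):
order = [bit_reverse(i, 3) for i in range(8)]  # [0, 4, 2, 6, 1, 5, 3, 7]
```

A DSP with hardware bit-reversed addressing produces this sequence for free in its address generation unit, instead of spending datapath instructions on it.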
Next a comparison of these three processors is performed in general, and later sections detail the problems and improvements of each architecture with respect to compilation. Conclusions regarding DSP processor architecture and compilation techniques for this millennium are then provided. The differentiating features include:
- Word sizes vary: 8-bit for radar/microcontrollers, 16/32-bit for vocoders and general DSP, etc.
- Addressing memory to access arrays of data or circular buffers, or accessing data in a fixed pattern such as bit-reversed addressing.
- Multiplying data by constant coefficients and accumulating the results (the typical filter structure).
- Rounding of data.
- Finding the minimum or maximum of the absolute value of two words.
- Finding the number of leading 0's or 1's in a word (or the left-most non-redundant sign bit), and using this to determine how many bits to shift to increase dynamic range without overflowing. This is often called exponent calculation or normalization, and can be applied to an array of words (block exponent calculation).
- Saturated arithmetic as opposed to arithmetic with overflow: in saturation mode, if the numerical result overflows/underflows, the result is simply set to the maximum/minimum number which fits in the word size.

In general, the price pressures of DSP applications force the compiler to be efficient with respect to price (code size), performance, and power. Code size should be minimized so that, where possible, applications can fit solely on the chip and external memory does not have to be used. This impacts cost significantly in two ways: the cost of an additional memory chip and power dissipation. The increase in power dissipation increases the cost of packaging, especially when power exceeds 30W [25]. Meeting performance constraints is difficult since applications are continually increasing in complexity, with higher throughput and lower latency requirements.
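Two of the features above, saturated arithmetic and exponent/normalization, can be sketched in a few lines (a simplified 16-bit Python model; the word width and the treatment of zero are assumptions, not any specific processor's semantics):

```python
WORD_MIN, WORD_MAX = -2**15, 2**15 - 1   # assumed 16-bit signed word

def sat_add(a, b):
    """Saturating add: clamp to the word range instead of wrapping."""
    return max(WORD_MIN, min(WORD_MAX, a + b))

def norm_shift(x):
    """Redundant sign bits of x: how far x can shift left without
    overflowing (the per-word 'exponent' used for normalization)."""
    if x == 0:
        return 15
    n = 0
    while WORD_MIN <= (x << (n + 1)) <= WORD_MAX:
        n += 1
    return n

def block_exp(words):
    """Block exponent: the safe common shift for a whole array of words."""
    return min(norm_shift(w) for w in words)
```

For example, sat_add(30000, 10000) clamps to 32767 rather than wrapping negative, and block_exp scales an array by its largest-magnitude element.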
This impacts the compiler, since full utilization of the datapath (or of the functional units in the datapath), i.e. maximum parallelism, is crucial to achieving this bandwidth. Performance must be efficiently met. If a larger than necessary DSP processor is selected for an application, excess power will be unnecessarily dissipated, leading to higher costs (from packaging, chip, etc.). Typically, power, price, and performance are the three important criteria that impact code compilation and DSP architecture design. In this paper we concentrate on code compilation and how DSP architectures impact this compilation process with respect to these three criteria.

2.1 Introduction to 3 DSP Processors

This section introduces the three DSP processors of interest. General features, philosophies, and architecture contents are described, covering the datapath, register files, and memory access units. Delay slots and other details are discussed in section 3. Table 1 provides a general overview of the three processors, including their approximate year of introduction (Rel year, or release year). The architecture type (Arch type) is provided, along with a brief look at functional units (Func units), including address generation units (Agu) and units which perform operations on data (Fus). Two of the C6x Fus also provide Agu functions. The SC140 has additional registers (++), for example index registers for indexed addressing. The C6x architecture is more of a risc-type architecture, meeting 5/7 of the risc characteristics, whereas the SC140 meets 6/7 of the cisc characteristics, both as outlined in [19] for GPPs.

            C2x                 C6x                         SC140
Rel year    1987[12]            1997[3]                     1999[7]
Arch type   single issue, cisc  VLIW, risc                  mult. issue, some cisc features
Func units  mac, alu, 1 Agu     2 sets of 4 non-homog. Fus  4 homog. Fus, 2 Agu
Exec set    single word         up to 8 words               up to 8 words
Registers   specialized         32 homog. addr/data         16 data, 16 address, ++

Table 1.
General Features of 3 DSP Processors.

Table 2 provides a more detailed comparison of these three processors; in this table we compare the operations per cycle provided by each processor architecture. Although the number of (#) Fus appears to be fewer in row 2 of table 2 for the SC140, it turns out to have higher functionality when we compare operations per cycle. This is because the SC140 architecture has vertical as well as horizontal parallelism, whereas the C6x only has horizontal parallelism. As an example, consider performing the following operation: d0 = round(d1 * d2 + d0), where d0, d1, d2 are data registers. More than two instructions and cycles are required on the C6x (a multiply, since mac was not a supported instruction, an addition, and a number of instructions to perform rounding), whereas on the SC140 only one instruction and one cycle is required, using the instruction: macr d1,d2,d0. This instruction performs a multiply-accumulate (mac) followed by rounding of the data result. Each Fu is composed of three interconnected subunits (where each Fu performs up to three sequential operations, one on each subunit): a multiplier-accumulator unit with saturation capability and a bit field unit with a barrel shifter. Note that rounding and scaling methods are typically set by control registers. Performing a number of sequential operations with one instruction was also employed by an in-house core developed by Mitel[24] before the C6x was announced. Essentially, to balance the latency discrepancy between memory and logic gates on the chip, more operations could afford to be performed while memory was being accessed, thus balancing clock periods. Furthermore, this provides better code size efficiency as well, drastically reducing the number of instructions. However, this CISC approach may complicate the compiler's task, as discussed later in this paper.
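The macr behaviour can be modelled in a few lines. This is an illustrative Q15 fixed-point sketch in Python; the Q31 accumulator width and the round-to-nearest rule are simplifying assumptions (the real SC140 accumulator is 40 bits wide):

```python
def macr_q15(acc_q31, d1_q15, d2_q15):
    """Multiply-accumulate with rounding: round(d1 * d2 + acc).
    Q15 operands, Q31 accumulator, result rounded back to Q15."""
    acc = acc_q31 + ((d1_q15 * d2_q15) << 1)  # Q15*Q15 -> Q30, <<1 -> Q31
    return (acc + (1 << 15)) >> 16            # round to nearest Q15

# 0.5 * 0.5 + 0.0 = 0.25 in Q15 (16384 encodes 0.5, 8192 encodes 0.25):
macr_q15(0, 16384, 16384)
```

The point of the comparison in the text is that this whole body, multiply, accumulate, and round, is a single one-cycle instruction on the SC140, but several instructions on the C6x.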
            C2x   C6x          SC140
#fus, #agus 1,1   6,2          4,2
#instr/cyc  1     8            6
#opns/instr 2     1,1          3,1
#opns/cyc   2     8            14
#RF, #r/RF  0,0   2,16         1,16
#units/RF   0     4,5          6
#ma/agu     1     1            1-4
Iword size  16    32           16
rword size  16    32(&16,40)   40

Table 2. Detailed Features of 3 Processors.

The C2x processor was an architecture heavily optimized for small code size yet simple execution, where most instructions are single cycle. It provides a number of parallel instructions, including an addition and multiplication (which was not a multiply-accumulate, but accumulated the previous product and in parallel performed the next multiplication). With the VLSI technology of that time, a faster clock period was attained this way rather than by implementing macs which accumulate the just-computed product. No data register files were used, just specialized registers: a 32-bit accumulator (a), a 32-bit product register (p), and a 16-bit input-to-multiplier register (t). The architecture (which had 16-bit program and data busses) provided separate instructions to store the upper or lower half of the a or p registers back to memory. Since no address fields were required for registers, and address register pointers were used to address memory operands, this architecture was heavily optimized for small code size. For example, some instructions were purely opcode (apac). The C6x processor was a move towards a risc, homogeneous type of architecture compared to the C2x, with the philosophy that simpler instructions would be used and the compiler could then concentrate on attaining sufficient parallelism for high performance. Although in the GPP area VLIW architectures never did fly, it was believed that VLSI technology was now ready to build these architectures, and DSP provided the necessary algorithmic parallelism inherent in most filtering-type applications. The C6x was designed to achieve higher performance through its VLIW architecture, relying heavily on compiler support.
However, the register files built to support the large number of functional units had to be split into two banks of 32-bit registers (due to speed considerations). To further minimize interconnect complexity between these two register files, only one 32-bit data value per cycle could cross over from one register file to the units attached to the other register file. Long words of 40 bits were supported by utilizing two adjacent 32-bit registers. The units in the datapath on each side consisted of one multiplier, one Agu which could also perform additions/subtractions on data registers, one alu with shift, branch, and bit manipulation capabilities, and finally one more alu with compare and saturation capabilities. Some cisc instructions were supported, including two instructions, add2/sub2 (also called subword instructions[23]), which provided the capability to perform additions or subtractions on 16-bit words within each 32-bit register. Register files were homogeneous, so a register could serve as an address, data, or index register. The SC140 architecture was designed with a blend of risc and cisc, the latter employed only when it directly supports DSP programmers (for example for macs, macrs, etc.). A move was made towards homogeneous Fus (each with the same capabilities) in order to support full utilization or parallelization within the application. For example, four macs could now be executed in parallel. The memory bandwidth was increased in this architecture, in effect balancing the datapath (which consisted of 4 alus). Each Agu can access 4 aligned words from memory using one move.4f instruction. The architecture also supported hardware loop execution. Registers were heterogeneous, separating address, index, and data registers, yet allowing one single data register file accessible by all Fus. Each register was 40 bits, thus directly supporting accumulations and multiplications with good accuracy.

3.
Compilation Difficulties and Analysis

This section outlines which specific features of each DSP processor made compilation difficult. To trace the evolution, we show how past problems were avoided in the next generation of architectures and highlight the new problems introduced with each architectural feature. Difficulty in compilation is reflected in difficulty in attaining performance, code size, power, and energy objectives. Discussion of the difficulties the compiler had with utilizing functional units, attaining parallelism, attaining code size optimization, and other DSP issues is presented, along with empirical data and examples.

3.1 C2x Processor

The difficulties with code compilation for the C2x processor clearly lay in the instruction set architecture (isa) itself. The isa was extremely heterogeneous, but more importantly it made the instruction selection compiler task tightly interdependent with the register allocation compiler task. These two steps of compilation are traditionally performed separately in a GPP code compiler[19]. Additional problems with code compilation stemmed from tight coupling of parallelism with instruction selection as well, although in many cases parallelism can be achieved by combining instructions if data has been allocated efficiently (which was often not the case). A second obvious problem for compilers designed for this processor was addressing code. Since there were no register files, memory addressing was performed quite often, even to temporarily store and access data. Address registers could be incremented or decremented by 1; however, any other offset was penalized by requiring an extra instruction to perform the computation or to load the address register with a new value. In summary, the compiler task was made difficult by:
- Instruction selection tightly coupled with register allocation
- Address register allocation
Address register allocation, or addressing code generation, relied heavily on memory accesses because there were very few registers in the datapath, and those which existed were for specific purposes (i.e. accumulator, input to multiplier, etc.). To further understand the first problem, consider a multiplication operation whose output is added to another value. This type of computation involves two operations, yet there are four possibilities for this mapping, shown in figure 1. The choice of how to implement this functionality determines what instructions will be generated in addition to where data is stored. For example, the multiplication result can be kept in register 'p' (the output of the multiplier), or transferred to register 'a' (using an additional 'pac' instruction), or stored back into memory ('pac' and 'sacl m') and later restored to register 'a' ('lac m' instruction), or stored in memory ('pac' and 'sacl m') and accessed later as a memory operand ('add m'). Researchers have tried to tackle this code generation problem using a number of techniques[24,5,14], but it is still believed to be an NP-complete problem[18]. Nevertheless, a code generation technique in [20] found that up to 2 times improvement in performance could be attained due to proper instruction selection alone (independent of addressing code generation) compared to the 'C' compiler results. In fact, this instruction set architecture still remains a challenge for code generation.

Figure 1: Example of instruction selection for C2x.

Efficient address generation for the C2x was crucial to efficient compilation as well. This was challenging because only increment/decrement addressing was efficiently supported. Other types of addressing, where a value greater than '1' was needed to increment the address register, carried a penalty in code size and performance (apart from a single index register whose contents could be used as an offset at no cost, except for the loading of its value).
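The interaction between data layout and addressing cost can be seen in a toy model of this problem (a deliberate simplification: one address register, free post-increment/decrement by 1, and one extra instruction charged for any larger jump; variable names and sequences here are hypothetical):

```python
def addressing_cost(layout, accesses):
    """Extra addressing instructions for an access sequence, given where
    each variable sits in memory. Steps of +/-1 are free (auto-inc/dec);
    any other step costs one instruction (simplified C2x-like model)."""
    pos = {v: i for i, v in enumerate(layout)}
    return sum(1 for prev, cur in zip(accesses, accesses[1:])
               if abs(pos[cur] - pos[prev]) > 1)

seq = ["a", "c", "b", "a", "d"]
addressing_cost(["a", "b", "c", "d"], seq)   # naive layout: 2 extra instructions
addressing_cost(["d", "a", "c", "b"], seq)   # better layout: 1 extra instruction
```

Choosing the layout that makes consecutive accesses adjacent is the flavor of the data-placement optimization discussed next.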
Statistics were collected using 'C' programs describing the discrete cosine transform, fast Fourier transform, elliptical wave filter, and other typical DSP filters compiled for the C2x. Using TI's 'C' compiler, this set of filter benchmarks showed that on average 40% of all instructions were strictly addressing computations. Researchers have studied this problem from two perspectives. One was determining how best to store data in memory so as to ease the address generation[5]. Other researchers discovered a polynomial time solution for determining optimal basic block addressing for a given data layout[2]. This solution utilized network flow theory to solve the address generation problem and reduced the amount of addressing code in applications by up to 6 times over compiler-generated code. However, obtaining optimal results for index addressing, loops, etc. remains NP-complete.

3.2 C6x Processor

The problems of the C2x discussed in the preceding section were removed in the C6x architecture. Specifically, the improvements were:
- Address generation costs were removed by supporting, in the instruction word, a 5-bit offset for post/pre-increment/decrement addressing.
- Simpler risc-like instructions were supported, along with register files, largely removing the tight coupling between instruction selection and register allocation.
- More registers (register files) and homogeneous (not specialized) registers: every register could act as a data or address register.

Figure 2. C6x Datapath Architecture (register files A and B, each feeding units .L, .S, .M, and .D, with the .D units connected to memory).

Figure 3. Penalty of dual register files in C62xx.

Yet the increased parallelism made possible by the VLIW implementation made code compilation a challenging job, in addition to other compilation problems arising from the architecture, outlined below.
The new compiler problems, i.e. the compiler tasks made difficult by the architecture, were:
- VLIW: scheduling operations for maximum parallelism.
- Dual register files.
- Heterogeneous Fus: different operations supported by each functional unit.
- Complex delay slots from the deep pipeline.
- Code size for VLIW execution.

The partitioning of the register files into two sides (A and B) made code generation difficult. This partitioning allowed only one value from the register file on one side to be accessed per clock cycle by functional units on the opposite side, which placed constraints on both code size and performance. One way to remove this 1-cycle-1-value limitation was to utilize move instructions, 'mv' (if a Fu was available for implementing this instruction), which transferred the data from one register file to a register in the register file on the other side. This often creates a disadvantage in code size as well as performance. Consider figure 3, where in (a) three operations can be executed in parallel: one multiply and two subtracts. In (b), after register allocation is performed, because both subtractions access a different register from the opposite side, they must be scheduled on different cycles, since there is only one crossover bus. Delay slots are more complex in the C6x processor. Most instructions have zero delay slots, except for the multiply, load, and branch instructions, which have delay slots of 1, 4, and 5 cycles respectively. Properly utilizing these delay slots (not just filling them with nops) has a significant impact on performance. For example, in figure 4 the partial code (generated by the 'C' compiler) on the left hand side consists of 7 instructions and executes in 10 cycles. By rescheduling this code, as on the right hand side, the performance can be improved: the code on the right hand side requires only 6 cycles (and has 6 instructions).
However, the rescheduled code is more complex to analyze, since the value of register A3 used in lines 3 and 4 is the old value, while in line 6 register A3 holds the new value which was successfully loaded.

Figure 4. Complexity of delay slots in C62xx.

One disadvantage of heterogeneous functional units is that full utilization is difficult. For example, if our DSP code can execute 4 multiplies in parallel, only 2 can be scheduled in parallel, since there are only two multipliers in the VLIW datapath, and the other 6 functional units will be idle. General instruction scheduling is now heavily constrained by feasible sets of operations due to the heterogeneous Fus. Code size is also an important feature to examine. In the C6x architecture, the last bit, or 'p' bit, of each instruction is '1' if that instruction is executed in parallel with the previous instruction during cycle 'i'; if the 'p' bit is '0', the instruction starts on cycle 'i+1'. Instructions executing in parallel are called an execution set (or packet). Eight aligned words (or 8 instructions) at a time are fetched from memory; however, an execution set cannot cross an 8-word boundary. (The opcode determines which functional unit will execute which instruction.) For example, 3 execution sets of 4, 6, and 6 non-nop instructions respectively will require 3 8-word aligned fetch accesses. In this example only 16 non-nop instructions are used, yet 24 instructions are accessed (over three cycles, not two) from memory, because the execution sets cannot cross an 8-word boundary. Multiplication by coefficients, which are heavily utilized in most DSP filters, can be handled by a 5-bit immediate in the 'mpy' instruction; however, 5 bits is rarely practical.
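The fetch-packet arithmetic in this example can be checked with a small model (an assumption-laden sketch: every instruction occupies one word and all padding is nops):

```python
def fetch_words(set_sizes, fetch=8):
    """Total words fetched when an execution set may not cross an
    aligned `fetch`-word boundary (C6x-style): a set that does not fit
    in the current packet forces nop padding up to the next boundary."""
    used, total = 0, 0
    for size in set_sizes:
        if used + size > fetch:      # pad the rest of this packet with nops
            total += fetch - used
            used = 0
        total += size
        used += size
    return total + ((fetch - used) if used else 0)

fetch_words([4, 6, 6])   # 24 words fetched for only 16 useful instructions
```

If sets could cross boundaries (as in the SC140, discussed next), the same three sets would need only the 16 words they actually contain.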
Alternatively, coefficients have to be moved from memory into registers first (using the 'mvk' instruction) before being used in a multiply instruction. One disadvantage here is that a register is required for the coefficient, possibly causing a register spill to accommodate it. Statistics from studying over 35 DSP filters compiled from 'C' code using TI's compiler provided the following results. The M units were used 3% per cycle, the L and S units ranged from 2 to 7% utilization, and the D2 unit was heavily used at 29% per cycle compared to the D1 unit at 3%. The crossover busses, transporting data from one register file to the other, were used 5% of the time per cycle on average (heavier usage than some Fus). The compiler was quite conservative, and delay slots were often filled with nops in the compiled output. A local improvement scheduler (based on extensions to the scheduling model in [11]) was able to improve the performance of the C6x compiled code by 50% strictly by rescheduling instructions, without renaming registers, indicating that the compiler was not extracting close to optimal parallelism. In summary, parallelism was restricted by heterogeneous Fus, dual register files, and delay slots.

3.3 StarCore Processor

The SC140 DSP processor core, the most recently announced architecture, removed several problems of the previous C6x architecture. Specifically:
- No crossover busses were used: a single data register file is accessible by all Fus.
- No heterogeneous Fus: all Fus have the same functionality and capabilities.
- Code size: execution sets can now cross fetch boundaries.

The architecture is illustrated in figure 5. The 4 Alus are multiplier-accumulator units with saturation capability, and each has a bit field unit as well for shifting/rounding. Connections between the datapath and memory are 64 bits wide; for example, this allows each of the 2 Agus to access four 16-bit fractional words in parallel in one cycle from memory ('move.4f').

Figure 5. SC140 Architecture (unified program/data memory, 2 Agus with an address generator register file, a program sequencer, and a data ALU register file feeding 4 ALUs and a BMU).

Utilizing the SC140 architecture, the three tasks of instruction selection, scheduling, and register allocation are more independent, thus easing the compiler's job. Furthermore, most data computation instructions are single cycle, require no delay slots, and can be performed by any of the four Fus in the datapath. Thus parallelism is independent of the type of instructions, since each Fu can execute the same set of instructions. The following issues remained as challenges for the Star*Core compiler:
- Prefix grouping overhead: required whenever serial grouping cannot be utilized, due to:
  - Upper bank of register file: an extra word is required in the execution set if register(s) from the upper bank of the register file are used.
  - Instruction word extensions: for example, immediate addressing requires an extra word.
- High memory bandwidth available: aligned multiword memory accesses, up to 4 words per Agu per cycle.

Two schemes, serial grouping and prefix grouping, are used for defining execution sets in the SC140. The first scheme is the serial scheme (similar to the 'p' bit used in the C6x, except that fetch boundaries can be crossed). As shown in figure 6, two bits in the opcode are used to identify which instructions are to be executed in parallel with the previous instruction. However, to further optimize code size, this mode is supported only if lower bank registers are used (where each requires only a 3-bit address). In the case where one or more registers in the execution set are from the upper bank, prefix instruction grouping is used, where two extra 16-bit words are required: one to identify the upper bank registers, the other to denote prefix grouping information.
Figure 6. Instruction grouping in SC140 (serial grouping marks parallel instructions with opcode bits; prefix grouping adds a prefix word ahead of the set).

Immediate addressing is supported in the SC140, thus allowing multiplication by coefficients in filters without the use of a register. However, since instructions are 16 bits (not 32 bits as in the C6x), an extra word (instruction word extension), along with the additional prefix word, is required in the execution set. The compiler must be careful with its decision to use immediate or registered addressing. For example, since the execution set is limited to 8 words, a 2-word prefix plus 4 words (providing one instruction per Alu) plus 2 words for the Agus allows 100% utilization of the datapath (or full parallelism). However, if a 'macr' instruction uses immediate addressing, an extra word will be required, so one unit out of 6 must remain idle. Furthermore, if the compiler is not efficient at using the lower register bank, and registers from the upper register bank are also used, a second unit will be forced to be idle because of the maximum execution set size of 8 words. Figure 7 illustrates the impact of immediate addressing on code size and, indirectly, on performance. On the left hand side, the compiler-generated code is shown (instructions executed in parallel are illustrated between square brackets []), which has 7 instructions, requiring a code size of 12 words (prefix grouping used in the 2nd and 3rd cycles). Immediate addressing is used; for example, #-6554 represents an immediate value of -6554, requiring an instruction word extension along with prefix grouping.
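The word budget described above can be tallied with a small helper. This is a simplified model built only from the grouping rules as described here (one word per instruction, one per extension, a 2-word prefix when the upper bank forces prefix grouping); real SC140 encoding has more cases:

```python
MAX_SET_WORDS = 8   # an execution set is at most 8 words

def exec_set_words(n_alu, n_agu, n_imm_ext=0, upper_bank=False):
    """Words one execution set occupies: one per instruction, one per
    immediate (instruction word) extension, plus a 2-word prefix when
    upper-bank registers force prefix grouping."""
    return n_alu + n_agu + n_imm_ext + (2 if upper_bank else 0)

exec_set_words(4, 2, upper_bank=True)               # 8: full parallelism fits
exec_set_words(4, 2, n_imm_ext=1, upper_bank=True)  # 9: one unit must go idle
```

This is why trading an immediate for a preloaded coefficient register, as in figure 7, can restore full datapath utilization.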
At most two instruction word extensions are allowed per execution set. By choosing when to use immediate versus registered addressing, one can often improve code size. In figure 7, the code example shown in the middle has a savings of three instruction word extensions and two prefix words, reducing the code size to 8 words. This savings is possible because serial grouping can now be used in place of prefix grouping. In some cases this modification can even improve performance: examine the code shown on the right-hand side of figure 7, where rescheduling has occurred to save one cycle. Note that the actual loading of the coefficients into the registers is not shown in the figure; the cost of this load can be hidden as discussed above.

Figure 7. Code size improvement in SC140.

Memory bandwidth must also be optimized to take advantage of the SC140 architecture. Since each AGU can access a maximum of 4 aligned words from memory, careful memory layout should be performed. Furthermore, since 4 FUs are also available in the datapath, techniques such as loop unrolling are in many cases simple yet ideal methods for utilizing this memory bandwidth. Consider figure 8, which illustrates code generated in the body of a loop. On the left-hand side the code represents a biquad filter. By unrolling the loop, one can change single moves, 'move.f', to multiple moves, 'move.2f', thus utilizing the higher memory bandwidth available in this DSP architecture. The right-hand side of figure 8 illustrates this idea by unrolling the biquad loop once. In practice one can unroll four times to fully utilize the memory bandwidth. However, loop unrolling generally increases code size, so careful memory layout should also be investigated to utilize the available memory bandwidth.

By comparison, the C6x improved upon the C2x's problems but suffered from its dual register file interconnection structure and heterogeneous FUs, which hindered compiler performance.
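The loop-unrolling idea above can be illustrated in Python rather than SC140 assembly. This is a hedged analogy only: consuming two coefficient/sample pairs per iteration stands in for packing two single 'move.f' loads into one 'move.2f' double-word access, and the dot-product loop is a stand-in for the filter inner loop.

```python
# Hedged analogy (Python, not SC140 assembly): the rolled loop consumes
# one coefficient/sample pair per iteration (one 'move.f'-style load per
# operand), while the unrolled-by-2 loop consumes two pairs per
# iteration, analogous to a single 'move.2f' double-word access.

def dot_rolled(coeffs, samples):
    acc = 0.0
    for i in range(len(coeffs)):
        acc += coeffs[i] * samples[i]      # one narrow load per operand
    return acc

def dot_unrolled2(coeffs, samples):
    assert len(coeffs) % 2 == 0            # alignment/length precondition
    acc = 0.0
    for i in range(0, len(coeffs), 2):
        c0, c1 = coeffs[i], coeffs[i + 1]  # would map to one move.2f
        s0, s1 = samples[i], samples[i + 1]
        acc += c0 * s0 + c1 * s1
    return acc

c = [0.5, -0.25, 0.125, 1.0]
x = [1.0, 2.0, 3.0, 4.0]
assert dot_rolled(c, x) == dot_unrolled2(c, x)
```

As in the text, the unrolled version halves the number of memory operations at the cost of a larger loop body, which is the code-size trade-off the compiler must weigh.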
For the C6x, empirical evidence showed very low utilization of functional units, ranging on average from 2% to 6% per cycle, and under-utilized delay slots. These claims were again supported by empirical results, which showed that up to 61% improvement in performance could be attained through complex local rescheduling into delay slots. However, even with this increased parallelism, functional unit utilization remained low.

Figure 8. Unrolling loops in SC140 for maximum bandwidth.

Other features of the SC140 architecture which greatly aid compilation are the single-cycle execution of most instructions, limited delay slots (generally zero delay slots for all instructions except control-type instructions) and homogeneous FUs. These characteristics greatly aid loop pipelining and instruction selection, both of which the compiler performs very well.

4. Conclusions

In summary, the evolution of DSP processors has been discussed, starting with the C2x, through the C6x and finally to the SC140. The architectural features of each DSP processor that made compilation difficult were identified, and each claim was backed up by empirical results or specific examples. In general, the C2x suffered from a processor architecture whose specialized registers created compiler difficulties. In effect, this architecture made the instruction selection and register allocation tasks highly interdependent. These two steps are generally done separately in compiler technology, leading to inefficiencies in code generation for the C2x. Empirical results supported this claim, showing that compiler-generated code could be further optimized to provide a 2X improvement in performance (with no addressing code). Furthermore, the C2x compiler was not sophisticated enough to generate efficient addressing code.
Empirical results supported this second claim, indicating that on average 40% of the compiler-generated code size was composed solely of addressing code, and that through the use of a post-compiler optimization technique, reductions of up to 6X in the addressing code alone were possible. It has been argued that many of these past architectural problems which make compilation difficult have in general been eliminated with the SC140 design. Specifically, the SC140 architecture supports a single data register file, mostly single-cycle execution (with a limited number of instructions having delay slots), homogeneous FUs, more compact code size (16-bit instructions and execution sets which cross fetch boundaries) and higher memory bandwidth. The challenges remaining for the SC140 generally impact code size as opposed to performance, which is preferable. Power dissipation is generally related to efficient instruction usage and good hardware architecture design practices. Other features of this new generation of DSP processors which remain to be exploited by compilers are the special DSP addressing modes (modulo addressing), hardware loop execution, etc. In conclusion, architectural features which ease compiler design include: homogeneous FUs, single register files, limited delay slots, and CISC features only where they make sense for DSP programming. Current challenges for compilers with these newer architectures include utilization of specialized addressing, hardware loop execution, and the available high memory bandwidth. It is clear that future DSP architectures will be easier to compile to, which should proliferate the use of C/C++ for DSP programming, largely eliminating the need for DSP assembly programmers and drastically decreasing the time to market for future products.

References

[1] M. Lee, V. Tiwari, S. Malik and M. Fujita, "Power Analysis and Minimization Techniques for Embedded DSP Software", IEEE Trans. on VLSI Systems, Vol. 5, No. 1, March 1997, pp. 123-135.
[2] C. Gebotys, "A minimum cost circulation approach to DSP address-code generation", IEEE Trans. on CAD, Vol. 18, No. 6, June 1999, pp. 726-741.
[3] Texas Instruments, TMS320C62xx CPU and Instruction Set Reference Guide, TI Inc., 1997.
[4] P. Marwedel and G. Goossens, Eds., Code Generation for Embedded Processors, Norwell, MA: Kluwer, 1995.
[5] S. Liao, S. Devadas, K. Keutzer, S. Tjiang and A. Wang, "Storage assignment to decrease code size", ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), 1995.
[6] Texas Instruments, TMS320C2x User's Guide, TI Inc., 1993.
[7] Motorola and Lucent, Star*Core 140 Specifications, Rev. 0.63, September 1999.
[8] C. H. Gebotys and R. J. Gebotys, "An Empirical Comparison of Algorithmic, Instruction and Architectural Power Prediction Models for High Performance Embedded DSP Processors", ISLPED, August 1998, pp. 121-123.
[9] C. Gebotys, R. Gebotys and S. Wiratunga, "Power Minimization Derived from Architectural Usage of VLIW Processors", Proc. Design Automation Conference, June 2000.
[10] C. Gebotys and R. Gebotys, "Statistically-based prediction of power dissipation for complex embedded DSP processors", Microprocessors and Microsystems, 1999.
[11] C. Gebotys, "Throughput-optimized architectural synthesis", IEEE Trans. on VLSI, September 1993.
[12] E. Lee, "Programmable DSP Architectures: Part I", IEEE ASSP Magazine, October 1988, pp. 4-19.
[13] E. Lee, "Programmable DSP Architectures: Part II", IEEE ASSP Magazine, January 1989, pp. 4-14.
[14] G. Araujo, S. Malik and M. Lee, "Using register-transfer paths in code generation for heterogeneous memory-register architectures", DAC, 1996.
[15] R. Leupers and P. Marwedel, "Time-constrained code compaction for DSPs", ISSS, 1995.
[16] S. Liao, S. Devadas, K. Keutzer, S. Tjiang and A. Wang, "Code optimization techniques for embedded DSP microprocessors".
[17] W. Lin, C. Lee and P. Chow, "An optimizing compiler for the TMS320C25 DSP chip", ICSPAT, October 1994, pp. I-689-694.
[18] M. Garey and D. Johnson, Computers and Intractability, New York: Freeman, 1979.
[19] A. Appel, Modern Compiler Implementation in C, Cambridge University Press, 1998.
[20] C. Gebotys, "An efficient model for DSP code generation: performance, code size, estimated energy", Int'l Symp. on System Synthesis, 1997.
[21] C. Gebotys and R. Gebotys, "Complexities in DSP software compilation: performance, code size, power, retargetability", HICSS, 1998.
[22] C. Gebotys and R. Gebotys, "Statistically-based prediction of power dissipation for complex embedded DSP processors", Microprocessors and Microsystems, Elsevier, 1999.
[23] Adve et al., "Changing interaction of compiler and architecture", IEEE Computer, pp. 51-58, December 1997.
[24] Discussions about the 'Midas' in-house processor core with Alex Tulai, Mitel Corp., Ottawa, Ontario, Canada, 1997.
[25] D. Singh, Intel, Manager of Microprocessors, Personal Communication, 1997.
[26] Telesica, www.telesica.com, DATE 2000.
[27] A. Chandrakasan and R. Brodersen, Low Power Digital CMOS Design, Kluwer Academic Publishers, Dordrecht, 1995.
[28] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1990.
[29] I. Bolsens, plenary keynote session: "Connected, Smart Devices - Computing beyond the Desktop", DATE 2000, SIGDA, 2000.
[30] Y-T. S. Li and S. Malik, "Performance analysis of embedded software using implicit path enumeration", IEEE Trans. CAD, 16(2), 1997, pp. 1477-1487.
[31] C. Gebotys, "An optimized approach to immediate versus registered addressing in SC140", Technical Report, Dept. of E&CE, University of Waterloo, 2000.