A multimedia-evaluation of the Infineon TriCore

Ari Wahyudi, Amos R. Omondi, and Thambipilai Srikanthan
School of Computer Engineering
Nanyang Technological University
N4 Nanyang Avenue, SINGAPORE 639798
Ph801819@ntu.edu.sg

Abstract: This paper reports on our evaluation of the Infineon TriCore, a DSP-controller processor, on multimedia applications. We studied the signal-processing and multimedia hardware support of the TriCore. The evaluation is part of a larger study that aims to use the TriCore as a building block for a multimedia multiprocessor. Various signal-processing and multimedia benchmark programs were coded in assembly language and C and then run on the TriCore TC10GP and on the Intel Pentium-II, a typical general-purpose processor. We then carried out a cost-performance analysis and comparison of the two processors. Our experiments showed that the TriCore is well suited to the designs we envisage. We also comment on the TriCore’s suitability for embedded multimedia processing.

Keywords: multimedia, control, signal processing.

Introduction

Multimedia computation is one of the major driving forces in the development of high-performance processors, mainly because a typical multimedia application requires real-time signal-processing capability. The Infineon TriCore, a relatively new microprocessor, takes a distinctive approach in its architecture and implementation [7]. This study evaluates both the architecture and the implementation of the TriCore on multimedia applications.

The paper has four further sections. The second section discusses the different natures of signal and control processing, the requirements of multimedia processing, and how the TriCore’s features meet these requirements. The third section reports on our experiments on the TriCore and the Intel Pentium-II [6]. The fourth section discusses the TriCore’s suitability as an embedded processor and as a building block for a multiprocessor system. The fifth section is a concluding summary.
Relevant features of the TriCore

Although the TriCore RISC-DSP promises an opportunity for increased performance, there has been no independent evaluation of multimedia applications running on this processor, nor has there been much discussion of its performance and cost effectiveness. This study provides such information. For comparative evaluation, the other processor we chose was the Pentium-II MMX, as its multimedia extension (MMX) is well known and has been the subject of several evaluations, for example [1, 2].

Control processing and general-purpose computing differ from digital signal processing (DSP) in several respects:

- Data structures in DSP applications are often in the form of vectors, whilst control applications use more conventional data structures and arithmetic (with enhancements for bit operations and string processing).
- Data and program addressing in DSP are commonly more regular and access certain locations repeatedly.
- Program control in DSP is oriented towards fast execution of tight loops of code; program branching is not so complex, and most programs are developed in assembly language. A controller, on the other hand, typically responds to more random and nondeterministic inputs, so operation is highly data-dependent and program branching is more complex.

Most controllers do not perform DSP tasks very well, and most DSPs do not perform controller tasks very well. The TriCore design aims to correct this disparity.

Pirsch [3] states that the important characteristics of a multimedia processor are intensive computation for highly regular operations, intensive I/O or memory access, data reusability and locality, high control complexity in less computationally intensive tasks, and frequent use of small integer operands. A typical way to obtain higher processing power for multimedia processing is to exploit parallelism, either data-level or instruction-level.
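Data-level parallelism on small integer operands can be illustrated with a short emulation. The Python sketch below is an illustration only, not MMX or TriCore code: it packs four 16-bit values into one 64-bit word and adds two such words lane by lane, so that a carry out of one lane never spills into its neighbour. The function names and the choice of wrap-around (rather than saturating) arithmetic are assumptions made for the example.

```python
# Illustrative sketch only: emulates a packed (SIMD) addition on a 64-bit word.
# The function names are hypothetical; this is not MMX or TriCore code.

LANE_BITS = 16                      # width of one packed operand
LANES = 4                           # 4 x 16 bits = one 64-bit word
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack four unsigned 16-bit values into one integer (lane 0 is lowest)."""
    assert len(values) == LANES
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a packed word back into its four unsigned 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Lane-wise wrap-around addition; carries never cross a lane boundary."""
    sums = [(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))]
    return pack(sums)


a = pack([1, 2, 3, 0xFFFF])
b = pack([10, 20, 30, 1])
print(unpack(packed_add(a, b)))     # [11, 22, 33, 0]
```

In hardware the four lane additions happen in a single instruction; the loop here only emulates that behaviour one lane at a time.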
The Pentium with MMX technology implements data-level parallelism through single-instruction multiple-data (SIMD) processing, with eight 64-bit registers, each of which holds 8x8-bit, 4x16-bit, 2x32-bit, or 1x64-bit data. The TriCore also uses SIMD techniques, with sixteen 32-bit data registers on which to perform packed-data operations. However, the computational organisation of packed data for the multiply-accumulate operation on the TriCore is slightly different from that of the Pentium-II MMX.

The general requirements of multimedia processing can be handled with typical DSP functionality, and the high control-complexity requirement of multimedia processing fits the functionality of a controller. The Infineon TriCore, a DSP-RISC processor, was developed in an attempt to combine in a single core the capabilities of a microcontroller, a digital signal processor, and a general-purpose processor. The TriCore has four main features:

(a) Integrated microcontroller and DSP in a single core. The processor is able to perform both control processing and signal processing efficiently. Thus, a wider range of multimedia applications with different computational complexities can be handled easily.
(b) Low interrupt latency and fast context-switching capability. Fast context switching allows clean, fast, and efficient processing of multiple tasks on one engine. This feature provides good support for complex multimedia applications in which several multimedia tasks run on a single processor.
(c) Support for interfacing peripherals and adding custom logic to the core, leading to more flexibility in realising an embedded multimedia system.
(d) Powerful I/O support. In the TriCore, the I/O capability is provided by a special I/O processor, the Peripherals Control Processor (PCP). The PCP handles inter-peripheral, I/O, and data transfers (from/to memory) without loading the CPU, thus leaving the CPU free for other processing tasks.
These four features together make the TriCore a suitable processor for high-performance, cost-effective multimedia processing.

Evaluating TriCore

In this section we discuss the evaluations that we carried out, the results, and the analysis.

Methodology

The evaluations were performed on TriCore TC10-GP and Intel Pentium-II MMX 233 MHz processors. The Intel Pentium-II was in a PC running the Windows 95 operating system, and the TC10-GP was evaluated using a TriCore development board (TriBoard). Benchmark programs were downloaded into the processor and debugged using the JTAG parallel-port interface that is available on the TriBoard.

We used benchmark programs, consisting of signal-processing kernels and multimedia applications, to evaluate the processors. Two versions of each code were used in the experiment: a version written in C with conventional arithmetic and an optimised version. The latter were mostly written in assembly language to facilitate the use of the processors’ specific multimedia/signal-processing instructions (MMX and SIMD). The codes for the Pentium-II were compiled using the Intel C/C++ compiler; for the TriCore the codes were compiled using the HighTec GNU development tool. The C codes for both processors were compiled using the optimisation-for-speed option.

We used the MMX programming support provided by the Intel C/C++ V4.5 compiler to code the benchmark programs to use MMX instructions; this support made the MMX programming easier. Since the TriCore GNU tools V1.3 do not provide support for packed data (SIMD) or other TriCore DSP instructions, the assembly codes generated from the original C codes were edited and optimised to use those instructions. Complete summaries of the hardware, software, and tuning parameters for the systems evaluated are shown in Table 2.
Measurement method

We used the Pentium’s RDTSC (Read Time Stamp Counter) instruction to measure the execution time of the codes; on the TriCore, the execution time was obtained by reading the processor’s system timer register 0. This method gives the execution time in CPU clock cycles. The execution time for each code was obtained by executing the code three times inside a loop and then selecting one execution time. This was done to ensure that the program and data caches of the processors were loaded with the appropriate program and data and, therefore, that the execution time reported from the experiment represents the processor’s best performance.

Metrics

The main parameters of concern in the experiment are the clock-cycle efficiency, the execution speed, and the speed-up obtained from the SIMD/MMX/DSP hardware support of the processors. The implementation costs of the two processors relative to their performance are also analysed. The cost metrics used in the experiment are chip area, number of transistors, and power consumption.

Benchmarks

We used more signal-processing kernels and multimedia applications than have been used in similar analyses by other researchers. We did this to ensure that the study would provide a more reliable assessment of the systems being evaluated. The SIMD/MMX codes were modified to use assembly language with processor-specific optimisation; some of the modifications were taken from Intel and Infineon libraries and application notes, to ensure that the implementations were optimal and reliable. The other kernel implementations (those not provided by the Intel and Infineon libraries and application notes) were optimised by adapting the optimisation strategies used for the kernels in the libraries and application notes. Table 1 shows the implementation parameters of the kernels and multimedia applications used in the evaluation.

Table 1 Summary of kernels and applications

| Benchmark | Implementation parameters |
|---|---|
| FIR | Integer 16-bit data type, len. 13, 140 pt |
| IIR | Integer 16-bit data type, len. 13, 25 coef., 140 pt |
| MatVect | Matrix [512][512] and vector [512] multiplication (16-bit integer) |
| VecDotP | Two vector [512] dot product (16-bit integer) |
| LMS | Integer 16-bit data type, 351 samples, filter order: 20 |
| ADPCM | Integer 16-bit data type, 6000 samples, test file: “chk.wav” |
| FFT 1D | Complex, 16-bit integer data types, 4096 pt |
| FFT 2D | Complex, 16-bit integer data types, 16x16 pt |
| MPEG-2 | 3 frames, YUV, 4:2:0, 256x256 pixel, 256 color, test file: “pingpong” |

Finite Impulse Response (FIR):

$$y(n) = \sum_{k=0}^{M-1} c_k \, x(n-k)$$

Our FIR filters, in both the SIMD/MMX and the non-SIMD/non-MMX implementations, use 16-bit integers. The optimisations used in the FIR implementation on the Pentium-II are the MMX instruction set, placing the instruction sequence in a ‘good’ order (one that allows execution of up to three instructions per cycle), loop unrolling, and optimal alignment of data in memory. The TriCore has powerful multiply-accumulate instructions that perform very well with packed data types. The TriCore also allows a 64-bit data-loading operation in parallel with the packed multiply-accumulate operation. This capability enables the processor to perform two true 16-bit integer multiply-accumulate operations in one cycle. The optimisations used for the FIR filter on the TriCore are in loading/storing packed data, packed arithmetic, and zero-overhead loops.

Infinite Impulse Response (IIR):

$$y(n) = \sum_{q=0}^{Q-1} b_q \, x(n-q) - \sum_{p=0}^{P-1} a_p \, y(n-p)$$

Both types of code (basic arithmetic and optimised) perform the computation with 16-bit integers. For both processors, the implementation strategy used for this filter is similar to that used for the FIR filter.

Matrix-vector arithmetic, consisting of vector dot-product and matrix-vector multiplication. Our implementations use 16-bit integers. The implementation strategy for this algorithm on both processors is similar to the core strategy used for the other algorithms above that perform multiply-accumulate operations.

Least Mean Square (LMS) Adaptive Filter. This algorithm attempts to find an optimum set of filter parameters based on the time-varying input and output signals:

$$y(n) = \sum_{q=0}^{Q-1} b_q(k) \, x(n-q)$$

where b(k) are the time-varying coefficients of the filter. The filter implementation used here is based on [4] and uses multiply-accumulate operations. The SIMD/MMX version of this code (for both processors) was obtained by optimising only the multiply-accumulate part of the algorithm.

ADPCM G.722. This is a speech-encoding standard for compressing and decompressing speech and audio signals whose frequencies range from 50 Hz to 7000 Hz. As with the LMS filter implementation, the only parts of this algorithm that were optimised are the looping and the multiply-accumulate operations.

Fast Fourier Transform (FFT). This is an efficient algorithm for computing the discrete Fourier transform (DFT) of a sequence:

$$X(k) = X_{ev}(n) + W_{N/2}^{k} \, X_{od}(n)$$

where X_ev represents the even-indexed elements and X_od represents the odd-indexed elements. We use an in-place, radix-2, decimation-in-time FFT for the experiment. The implementations use 16-bit integers.

MPEG-2 Compression. This is a standardised compression method for moving images. The primary parts of MPEG are the discrete cosine transform (DCT), quantisation, motion estimation, Huffman coding, and run-length coding.
We used the MSSG (MPEG Software Simulation Group) code for the MPEG-2 implementation. For both processors, the optimisations were applied to the motion-estimation and DCT components. The DCT implementations on both processors use 16-bit integer data and SIMD arithmetic. Block-distance calculation (a major part of motion estimation) on the Pentium-II was easily realised with the MMX instruction set. On the TriCore, SIMD operations cannot be used to optimise the block-distance calculation; this is because of data-alignment restrictions on the TriCore, which do not allow sequences of data to be loaded or stored unless the source or target memory location is aligned to four. The optimisation used for the block-distance calculation on the TriCore therefore uses the TriCore’s abs instruction to compute the absolute difference of two pixels.

Results and discussion

The measurements were repeated several times for each algorithm. For the large and complex algorithms (MPEG, LMS, and ADPCM), the measurements produced slightly different results on each run; we give the smallest numbers obtained from several measurements. In what follows, we use the term relative performance of TriCore to Pentium to mean the ratio of the performance of an algorithm on the TriCore to the performance of the same algorithm on the Pentium-II. Raw performance here is measured in terms of number of cycles.

Table 2 Machine, software, and baseline tuning parameters

| Parameter | Pentium-II | TriCore TC10GP |
|---|---|---|
| Hardware parameters | | |
| Model number | Pentium-II 233 | TC10GP |
| Clock speed | 233 MHz | 16 MHz (on board) |
| FPU | Integrated | None |
| Primary cache | 16 KB (Inst.), 16 KB (Data) | 8 or 16 KB (Inst.) *, 0 or 16 KB (Data) ** |
| Secondary cache | 512 KB | None |
| Other cache | None | None |
| Memory (internal) | None | 0 or 8 KB inst. SRAM *, 16 or 32 KB data SRAM ** |
| Memory (external) | 64 MB (DRAM) | 4 MB (SDRAM), 2 MB (Flash) |
| Other hardware | None | None |
| Software parameters | | |
| O/S and version | Windows 95 | None |
| Compilers and version | Intel C Compiler V4.5 | GNU/TriCore V1.3 |
| Other software | Intel VTune V4.5, Microsoft Visual Studio 6.0 | Tasking tool |

*: Can be configured in two ways: (1) 8 KB instruction SRAM and 8 KB instruction cache; (2) 16 KB instruction cache only (no scratch-pad instruction SRAM).
**: Can be configured in two ways: (1) 32 KB data SRAM only (no cache); (2) 16 KB data SRAM and 16 KB data cache.

Tables 3 and 4 show the basic results from the experiments. The results show that for some programs, the MMX, SIMD, and DSP support on both processors provides significant speed-up over traditional implementations. For the TriCore, the speed-up ranges from 1.63 to 12.27 for the kernels and from 1.1 to 1.62 for the applications; for the Pentium-II, the speed-up ranges from 1.33 to 3.47 for the kernels and from 1.01 to 1.73 for the applications. This suggests that the TriCore has a better architecture (ISA), although the results are slightly affected by the quality of the compilers used, since the Intel C/C++ compiler technology for Intel machines is more mature than the GNU C compiler for the TriCore.

Table 3 Execution times (clock cycles) and speed-up of benchmarks

| Benchmark | Pentium unoptimised | Pentium MMX | Pentium speed-up | TC10-GP unoptimised | TC10-GP SIMD | TC10-GP speed-up |
|---|---|---|---|---|---|---|
| MatVect | 1757121 | 729298 | 2.41 | 3937432 | 729405 | 5.4 |
| VecDotP | 1272 | 367 | 3.47 | 1930 | 336 | 5.74 |
| FIR | 3958 | 1269 | 3.12 | 17513 | 2052 | 8.65 |
| IIR | 8106 | 2583 | 3.14 | 36276 | 2957 | 12.27 |
| FFT 1D | 793075 | 595229 | 1.33 | 1250987 | 664345 | 1.88 |
| FFT 2D | 42437 | 31812 | 1.33 | 53074 | 32612 | 1.63 |
| LMS | 99138 | 98516 | 1.01 | 241854 | 169382 | 1.43 |
| ADPCM | 4379791 | 3988417 | 1.1 | 6437605 | 5823432 | 1.1 |
| MPEG-2 | 144345486 | 83646350 | 1.73 | 149460673 | 92349803 | 1.62 |
| Average | 16825598 | 9899315 | 2.07 | 17937483 | 11086036 | 4.40 |

The TriCore with SIMD and hand-optimised codes generally provides better speed-up than the Pentium-II with MMX support, except for MPEG-2, for which the Pentium-II with MMX achieves a speed-up of 1.73, while the TriCore is able to achieve only 1.62. The most significant factor behind the better MPEG-2 speed-up on the Pentium-II is the optimisation performed on the motion-estimation part; the motion-estimation part on the TriCore could not be optimised with SIMD/packed-data arithmetic. We examined each component of MPEG-2 and found that on the TriCore, the factor that most influenced speed-up is the optimisation of the forward DCT part; however, the fact that most of the computation is in the motion estimation means that the speed-up for the overall operation is quite small.

Except for the vector dot-product, the signal-processing kernels and applications require fewer CPU cycles on the Pentium than on the TriCore, mainly because the Pentium has a more aggressive implementation. In the vector dot-product kernel, the TriCore has the better performance: the relative performance of the TriCore to the Pentium-II is 1.09; and in the matrix-vector multiplication kernel, the TriCore’s performance is nearly the same as the Pentium’s. The TriCore seems to perform very well in algorithms with tight looping and highly regular multiply-accumulate operations, such as the vector dot-product and matrix-vector multiplication algorithms.

For both one- and two-dimensional FFT, the TriCore’s performance is close to that of the Pentium: the TriCore’s relative performance is 0.8960 for the one-dimensional FFT and 0.9755 for the two-dimensional FFT. The speed-up of the optimised FFT code on the TriCore is much better than that on the Pentium: 1.88 and 1.63 on the TriCore, and 1.33 and 1.33 on the Pentium-II, for 1-D and 2-D FFT respectively.
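The FFT figures quoted above follow directly from the cycle counts in Table 3. The Python sketch below is an illustration only: the function names are hypothetical, and it assumes that performance is taken as inversely proportional to cycle count, so that the relative performance of the TriCore to the Pentium-II is the Pentium-II cycle count divided by the TriCore cycle count. That assumption is consistent with the reported figures.

```python
# Illustrative sketch only; function names are hypothetical.
# Cycle counts are the FFT rows of Table 3: (unoptimised, optimised).

CYCLES = {
    "FFT 1D": {"pentium": (793075, 595229), "tricore": (1250987, 664345)},
    "FFT 2D": {"pentium": (42437, 31812), "tricore": (53074, 32612)},
}


def speedup(unoptimised_cycles, optimised_cycles):
    """Speed-up of the optimised code over the unoptimised code."""
    return unoptimised_cycles / optimised_cycles


def relative_performance(pentium_cycles, tricore_cycles):
    """Performance of TriCore relative to Pentium-II (fewer cycles is better)."""
    return pentium_cycles / tricore_cycles


for name, row in CYCLES.items():
    p_unopt, p_opt = row["pentium"]
    t_unopt, t_opt = row["tricore"]
    print(name,
          "Pentium speed-up %.2f" % speedup(p_unopt, p_opt),
          "TriCore speed-up %.2f" % speedup(t_unopt, t_opt),
          "relative performance %.4f" % relative_performance(p_opt, t_opt))
```

A relative performance below 1 therefore means that the TriCore needs more cycles than the Pentium-II for the optimised code.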
The source code reveals that the higher speed-up for the FFT code on the TriCore results not just from the use of the SIMD instruction set: there is also a reduction in data-register requirements that would otherwise cause memory transfers.

In running the FIR and IIR filters, the implementation on the Pentium-II requires fewer cycles than that on the TriCore: the TriCore’s relative performance is 0.6184 for the FIR and 0.8735 for the IIR. In the other applications (LMS and ADPCM), the results also show that the Pentium-II MMX requires fewer cycles than the TriCore: the TriCore’s relative performance to the Pentium is 0.5816 for LMS and 0.6849 for ADPCM. The more irregular structure of the LMS and ADPCM algorithms, and their fewer tight loops with multiply-accumulate operations, mean that they can be executed on the Pentium in fewer cycles than on the TriCore. LMS and ADPCM have relatively fewer signal-processing instructions than the other kernels and applications, which also means that only a small speed-up could be achieved from the optimisation performed on both processors.

On the TriCore, the inner-loop kernels of the FIR and IIR filters were first coded to use the zero-overhead-loop instruction, an instruction that eliminates the overhead of the conditional jump that is normally at the end of an instruction sequence running in a loop; this is achieved by setting the address register to point automatically to a certain memory location. However, the implementation with that instruction ran more slowly than the one without. The FIR and IIR filters were then coded to use loop unrolling. It seems that for relatively small filter lengths (fifteen, as we used in the experiment), the zero-overhead-loop instruction generates some overhead because of the initialisation required. The loop-unrolling implementation requires more static code space than the zero-overhead-loop implementation, but from our observation the amount of extra space was not very significant.
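The loop-unrolling idea can be sketched in a few lines. The Python code below is an illustration only and is not the benchmark code, which was written in C and assembly language; the function names, the unroll factor of two, and the zero-padding of the input are assumptions made for the example. Each iteration of the unrolled inner loop performs two multiply-accumulates, which is the shape that a packed operation on two 16-bit operands would consume.

```python
# Illustrative sketch only: the function names are hypothetical and this is
# not the benchmark code, which was written in C and assembly language.

def fir_rolled(coeffs, samples):
    """Direct-form FIR, y[n] = sum_k c[k] * x[n - k], one tap per iteration."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k in range(len(coeffs)):
            if n - k >= 0:              # samples before x[0] are taken as zero
                acc += coeffs[k] * samples[n - k]
        out.append(acc)
    return out


def fir_unrolled_by_two(coeffs, samples):
    """Same filter with the inner loop unrolled by two (two MACs per pass)."""
    taps = list(coeffs)
    if len(taps) % 2:
        taps.append(0)                  # pad to an even number of taps
    # Prepend zeros so that x[n - k] is always a valid index.
    padded = [0] * (len(taps) - 1) + list(samples)
    out = []
    for n in range(len(samples)):
        newest = n + len(taps) - 1      # position of x[n] inside `padded`
        acc = 0                         # Python integers do not overflow
        for k in range(0, len(taps), 2):
            acc += taps[k] * padded[newest - k]
            acc += taps[k + 1] * padded[newest - k - 1]
        out.append(acc)
    return out


coeffs = [1, 2, 3]
samples = [1, 0, 0, 4]
print(fir_rolled(coeffs, samples))            # [1, 2, 3, 4]
print(fir_unrolled_by_two(coeffs, samples))   # [1, 2, 3, 4]
```

The unrolled version halves the number of loop-control steps at the price of a longer loop body, which mirrors the code-space trade-off described above.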
Table 4 TriCore relative performance to Pentium-II

| Algorithm | Relative performance |
|---|---|
| MatVect | 1.0000 |
| VecDotP | 1.0923 |
| FIR | 0.6184 |
| IIR | 0.8735 |
| FFT 1D | 0.8960 |
| FFT 2D | 0.9755 |
| LMS | 0.5816 |
| ADPCM | 0.6849 |
| MPEG-2 | 0.9058 |

The TriCore provides a true multiply-accumulate instruction and can also carry out such an instruction in parallel with a 64-bit data-transfer operation (memory-to-register or register-to-memory). On the other hand, data packing and unpacking operations in MMX generate overheads in the execution of the multiply-accumulate operation, as does the need for shift operations to prevent data overflow. On the TriCore, the shift operations are not necessary, because the computations are arranged to use the wider (64-bit) accumulator register. The maximum throughput of the Pentium with MMX is eight multiply-accumulate operations on 16-bit numbers in three cycles [5], i.e. about 2.67 operations per cycle, whereas the TriCore is able to execute at most two multiply-accumulate operations on 16-bit numbers in one cycle. Combined with special addressing modes, the zero-overhead loop, and the wider accumulator register, the TriCore can achieve a clock-cycle efficiency comparable to that of the Pentium-II but with less code space. The packed-data arithmetic combined with parallel data loading is a powerful DSP feature of the TriCore.

Table 5 shows CPI information for both processors for certain kernels. (CPI is defined as the number of CPU clock cycles for a program divided by its dynamic instruction count.) The CPIs for the Pentium-II were obtained using the Intel VTune performance analyser. For the TriCore, because no such tool was available, we manually counted the number of dynamic instructions of certain codes and combined this with the CPU cycle information to obtain the CPIs.
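The CPI definition above amounts to a single division. As a hedged illustration, the Python sketch below applies it to the cycle counts and dynamic instruction counts (IC) reported in Table 5 for the FIR and vector dot-product kernels; the function and variable names are hypothetical.

```python
# Illustrative sketch only; names are hypothetical.
# (CPU cycles, dynamic instruction count) as reported in Table 5.

KERNELS = {
    "FIR": {"Pentium-II MMX": (1269, 2040), "TriCore TC10GP": (2052, 2806)},
    "VecDotP": {"Pentium-II MMX": (367, 631), "TriCore TC10GP": (336, 651)},
}


def cpi(cpu_cycles, instruction_count):
    """Cycles per instruction: CPU clock cycles / dynamic instruction count."""
    return cpu_cycles / instruction_count


for kernel, machines in KERNELS.items():
    for machine, (cycles, ic) in machines.items():
        print("%-8s %-15s CPI = %.3f" % (kernel, machine, cpi(cycles, ic)))
```

A CPI below 1 indicates that, on average, more than one instruction completes per clock cycle.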
Table 5 CPI of optimised codes

| Kernel | Pentium-II MMX CPU cycles | Pentium-II MMX IC | Pentium-II MMX CPI | TriCore TC10GP CPU cycles | TriCore TC10GP IC | TriCore TC10GP CPI |
|---|---|---|---|---|---|---|
| FIR | 1269 | 2040 | 0.622 | 2052 | 2806 | 0.731 |
| IIR | 2583 | 4039 | 0.639 | 2957 | 4560 | 0.648 |
| MatVect | 729298 | 394482 | 1.849 | 729405 | 330760 | 2.205 |
| VecDotP | 367 | 631 | 0.582 | 336 | 651 | 0.516 |

Based on the CPIs, the Pentium-II generally achieves better instruction-level parallelism than the TriCore. Nevertheless, these figures do not indicate that the Pentium has either a better architecture or a better implementation. The Pentium-II microarchitecture is highly superscalar, with much deeper pipelines, and so naturally ought to perform better in terms of clock cycles. A complete comparison of the two microprocessors should factor in the cost of realising the Pentium’s aggressive implementation. When this is done, the TriCore is shown to have the superior architecture.

For the vector dot-product kernel, the TriCore shows a better CPI than the Pentium-II. This is because the multiply-accumulate operation can be performed very efficiently on the TriCore. Although the TriCore has a smaller data-register width (32-bit) than the Pentium with MMX (64-bit), for a pure multiply-accumulate operation the TriCore’s performance is comparable to that of the Pentium-II MMX. This attests to the superiority of the TriCore’s architecture in this regard.

Cost-performance analysis

Table 6 lists the cost variables used in our experiment. The costs are measured in three ways: number of transistors on the core, die size, and power dissipation. We are interested in these parameters in order to factor out implementation (micro-architecture) and realisation (technology) from architecture comparisons and also in order to evaluate the TriCore for embedded multimedia applications.

Table 6 Cost parameters

| Parameter | Pentium-II | TriCore |
|---|---|---|
| Area | 202 mm2 | 5 mm2 |
| Num. of transistors | 38.5 million ** | 5 million * |
| Power dissipation | 34.8 Watt | 1.5 mW/MHz |
| Process technology | 0.35 micron | 0.25 micron |
| Clock rate | 233 MHz | 66 MHz |
| Voltage | 2.8 V | 2.25 V |

* approximate, ** includes L2 cache

Because of the differences apparent in Table 6, the raw numbers obtained cannot be used other than to compare raw performance. To make broader and more meaningful comparisons, we scaled the figures so that both sets correspond to 0.25 micron, 233 MHz, and 2.25 V. The scaling is not very precise but is adequate for broad comparisons. The results of the scaling are shown in Table 7.

Table 7 Scaled cost parameters

| Parameter | Pentium-II | TriCore |
|---|---|---|
| Area | 131 mm2 | 5 mm2 |
| Num. of transistors | 38.5 million | 5 million |
| Power dissipation | 20.988 Watt | 0.3495 Watt |

Table 8 Cost: performance of Pentium-II

| Benchmark | CPU cycles (C) | Cost1*C (mm2 x 10-7) | Cost2*C (million x 10-7) | Cost3*C (Watt x 10-7) |
|---|---|---|---|---|
| FIR | 1269 | 0.0166 | 0.0049 | 0.0027 |
| IIR | 2583 | 0.0338 | 0.0099 | 0.0054 |
| MatVect | 658744 | 8.6295 | 2.5362 | 1.3826 |
| VecDotP | 367 | 0.0048 | 0.0014 | 0.0008 |
| LMS | 100610 | 1.3180 | 0.3873 | 0.2112 |
| ADPCM | 3988417 | 52.2483 | 15.3554 | 8.3709 |
| FFT 1D | 595229 | 7.7975 | 2.2916 | 1.2493 |
| FFT 2D | 31812 | 0.4167 | 0.1225 | 0.0668 |
| MPEG | 83646350 | 1095.7672 | 322.0384 | 176.5570 |
| Average | | 129.5814 | 38.0831 | 20.7607 |
| Average *** | | 8.8082 | 2.5887 | 1.4112 |

Table 9 Cost: performance of TriCore TC10GP

| Benchmark | CPU cycles (C) | Cost1*C (mm2 x 10-7) | Cost2*C (million x 10-7) | Cost3*C (Watt x 10-7) |
|---|---|---|---|---|
| FIR | 2052 | 0.00103 | 0.00103 | 0.00007 |
| IIR | 2957 | 0.00148 | 0.00148 | 0.00010 |
| MatVect | 729405 | 0.36470 | 0.36470 | 0.02549 |
| VecDotP | 336 | 0.00017 | 0.00017 | 0.00001 |
| LMS | 169382 | 0.08469 | 0.08469 | 0.00592 |
| ADPCM | 6289027 | 2.91172 | 2.91172 | 0.20353 |
| FFT 1D | 664345 | 0.33217 | 0.33217 | 0.02322 |
| FFT 2D | 32612 | 0.01631 | 0.01631 | 0.00114 |
| MPEG | 92349803 | 46.17490 | 46.17490 | 3.22763 |
| Average | | 5.54302 | 5.54302 | 0.38746 |
| Average *** | | 0.46403 | 0.46403 | 0.03244 |

*** average without MPEG-2
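The Cost*C columns of the two cost: performance tables can be reproduced as the product of a scaled cost and a cycle count, multiplied by 10^-7 for display. The Python sketch below is an illustration only, with hypothetical names; it uses the scaled costs and the FIR cycle counts reported in the tables and also checks that the TriCore’s scaled power follows from 1.5 mW/MHz at 233 MHz. It does not attempt to reproduce the scaling of the Pentium-II figures, whose exact method is not stated here.

```python
# Illustrative sketch only; names are hypothetical. Scaled costs and cycle
# counts are the values reported in the cost tables for the FIR kernel.

SCALED_COST = {
    "Pentium-II": {"area_mm2": 131.0, "transistors_million": 38.5, "power_watt": 20.988},
    "TriCore": {"area_mm2": 5.0, "transistors_million": 5.0, "power_watt": 0.3495},
}
FIR_CYCLES = {"Pentium-II": 1269, "TriCore": 2052}
DISPLAY_SCALE = 1e-7                  # the tables list cost * cycles * 10^-7


def cost_times_cycles(cost, cycles):
    """Cost: performance figure; a smaller value is better."""
    return cost * cycles * DISPLAY_SCALE


def power_at_clock(mw_per_mhz, clock_mhz):
    """Power in watts for a part specified in mW per MHz."""
    return mw_per_mhz * clock_mhz / 1000.0


print(power_at_clock(1.5, 233))       # TriCore power scaled to 233 MHz, in watts
for machine, costs in SCALED_COST.items():
    for metric, cost in costs.items():
        value = cost_times_cycles(cost, FIR_CYCLES[machine])
        print("%-10s %-20s %.5f" % (machine, metric, value))
```

For the FIR kernel the area-based figure comes out at roughly 0.0166 for the Pentium-II against roughly 0.00103 for the TriCore, even though the TriCore needs more cycles.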
Using these scaled figures, we then obtained the cost: performance figures given in Tables 8 and 9; here, cost1, cost2, and cost3 are chip area, number of transistors, and power, respectively. The results show that the TriCore consistently has the better cost: performance ratios. Put another way, for the same cost, the TriCore has the better performance. These results suggest that the quality of the TriCore’s architecture and implementation is appropriate for the uses we have in mind: embedded multimedia processing, with both single and multiple processors. We next comment on other features of the TriCore that fit in with these goals.

TriCore-based systems

The TriCore has several features to support a multiprocessor system: a fast context-switching capability, powerful I/O support (via a Peripherals Control Processor), and a bus-sharing mechanism.

An inter-node communication mechanism (blocking or non-blocking) in a multiprocessor system typically requires task switching between the data-processing task and the data-communication handler task. In an application with frequent inter-node data transfers, the fast context-switching feature of the TriCore would reduce the CPU time taken up by the inter-node communication task. Fast context switching may also be used to support multithreading within a single processor running different multimedia applications.

The PCP (Peripherals Control Processor) performs tasks that in a traditional computer system are normally performed by a combination of a DMA controller and its supporting CPU interrupt service routine. The PCP improves the responsiveness of interrupt service in data-transfer and data-capture operations.

The bus-sharing mechanism enables two TriCore processors to be connected with their external bus shared, without the need for additional glue logic. This feature enables direct write/read access to a peripheral, memory, or register on the other host, which in turn allows fast data transfer without the need to load both CPUs with the data-transfer task. Bus sharing also means additional overheads for the external bus unit, but if frequently accessed data or code are located in the internal memory, and the external memory is used to store only the data that will be transferred, then the external bus overhead would be greatly reduced.

[Figure 1 TriCore-multiprocessor architecture: one group of blocks, block 1 to block n; each block contains TriCore 1 and TriCore 2 with a shared bus and memory, and each processor has an I/O port connected to Global BUS 1 or Global BUS 2.]

Figure 1 shows an example of a simple organisation for a TriCore-based parallel processor. The system has several blocks, each of which consists of two TriCore processors. Inter-processor communication within one block is performed by using the memory and bus-sharing mechanism, and inter-block communication is performed through the PCP-controlled I/O port. The global bus is used for inter-block data transfers. Since each block has two TriCore processors, and each processor has its own I/O port and PCP, each block can be connected to two global buses. The regular structure of this architecture enables it to be easily expanded to use more blocks or groups of blocks.

The realisation of the TriCore processor has another useful feature: it allows on-chip peripherals, and these may be custom-designed. As the core processor takes up a very small area, it is possible essentially to have a custom-designed system-on-chip (SOC); for us this means an SOC that is optimised for a variety of multimedia functions. We are in the process of designing several such peripherals, starting with a highly optimised MPEG-2 encoder/decoder.

Summary

We have described some important features of the TriCore processor and given the results of an evaluation.
Those results include a comparison with a conventional microprocessor, and they show the TriCore to have a highly efficient architecture and implementation that is well suited to embedded multimedia applications. The next stage of our work will consist of the design of a multiprocessor and on-chip peripherals, as indicated above.

References:

[1] Bhargava, R., et al., “Evaluating MMX technology using DSP and multimedia applications”, Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-31), 1998, pp. 37-46.
[2] Gaborit, L., et al., “Evaluating microprocessor multimedia extensions for the real-time simulation of RBF networks”, Proceedings of the Seventh International Conference on Microelectronics for Neural, Fuzzy and Bio-Inspired Systems (MicroNeuro '99), 1999, pp. 217-221.
[3] Pirsch, P., et al., “Implementation of Media Processors”, IEEE Signal Processing Magazine, July 1997.
[4] Embree, P.M., “C Algorithms for Real-Time DSP”, Prentice Hall, 1995.
[5] Intel, “MMX Application Notes”, http://developer.intel.com/drg/mmx/appnotes/
[6] “Pentium(r) II Processor Developer Home Page”, http://developer.intel.com/design/pentiumii/
[7] “TriCore Architecture Manual V1.2”, Infineon Technologies AG.