A multimedia-evaluation of the Infineon TriCore
Ari Wahyudi, Amos R. Omondi, and Thambipilai Srikanthan
School of Computer Engineering
Nanyang Technological University
N4 Nanyang Avenue
SINGAPORE 639798
Ph801819@ntu.edu.sg
Abstract: This paper reports on our evaluation of the Infineon TriCore, a DSP-controller processor, on
multimedia applications. We studied the signal-processing and multimedia hardware support of the TriCore.
The evaluation is part of a larger study that aims to use the TriCore as a building block for a
multimedia multiprocessor. Various signal-processing and multimedia benchmark programs were
coded in assembly and C and then run on the TriCore TC10GP and on the Intel Pentium-II, a
typical general-purpose processor. We then carried out a cost-performance analysis and comparison of
the two processors. Our experiments showed that the TriCore is well suited to the designs we envisage.
We also comment on the TriCore’s suitability for embedded multimedia processing.
Keywords: multimedia, control, signal processing.
Introduction
Multimedia computation is one of the major
driving forces in the development of high
performance processors, mainly because a typical
multimedia application requires real time signal
processing capability. The Infineon TriCore, a
relatively new microprocessor, utilises a unique
approach in its architecture and implementation
[7]. This study evaluates both the architecture and
implementation of the TriCore on multimedia
applications.
The rest of this paper consists of four sections. The
second section discusses the different natures of
signal and control processing, multimedia-processing
requirements, and how the TriCore’s
features meet these requirements. The third
section reports on our experiments on the TriCore
and the Intel Pentium-II [6]. The fourth section
discusses the TriCore’s suitability as an
embedded processor and as a building block for a
multiprocessor system. The fifth section is a
concluding summary.
Relevant features of the TriCore
Although the TriCore RISC-DSP promises
opportunity for performance increase, there has
been no independent evaluation of multimedia
applications running on this processor, nor has
there been much discussion of its performance
and cost effectiveness. This study provides such
information.
For comparative evaluation, the other processor
we chose was the Pentium-II MMX, as its
multimedia extension (MMX) is well known and
has been the subject of several evaluations, for
example [1, 2].
Control processing and general-purpose
computing differ from digital-signal
processing (DSP) in several respects. Data
structures in DSP applications are often in the form
of vectors, whilst control applications use more
conventional data structures and arithmetic (with
enhancements for bit operations and string
processing). Data and program addressing in DSP
are commonly more regular and access certain
locations repeatedly. Program control in DSP is
oriented towards fast execution of tight loops of
code; program branching is not so complex, and
most programs are developed in assembly
language. On the other hand, a controller
typically responds to more random and
non-deterministic inputs, so operation is highly
data-dependent and program branching is more
complex. Most controllers do not perform DSP
tasks very well, and most DSPs do not perform
controller tasks very well. The TriCore design
aims to correct this disparity.
Pirsch [3] states that the important characteristics
of a multimedia processor are intensive
computation for highly regular operations,
intensive I/O or memory access, data reusability
and locality, high control complexity in less
computationally intensive tasks, and frequent use of
small integer operands.
A typical way to obtain higher processing power
for multimedia processing is by exploiting
parallelism, either data-level or instruction-level.
The Pentium with MMX technology implements
data-level parallelism through single-instruction
multiple-data (SIMD) processing, with
eight 64-bit registers that each hold 8x8-bit,
4x16-bit, 2x32-bit, or 1x64-bit data. The TriCore
also utilises SIMD techniques, with sixteen 32-bit
data registers on which to perform packed-data
operations. However, the computational
organisation of packed data for the
multiply-accumulate operation on the TriCore is slightly
different from that of the Pentium-II MMX.
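The packed-data idea described above can be illustrated with a minimal sketch in portable C. It emulates, in scalar code, two 16-bit lanes held in one 32-bit word and a multiply-accumulate over both lanes into a 64-bit accumulator. This is not MMX or TriCore code: the function names (pack16x2, lane_hi, lane_lo, packed_mac) and the lane layout are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word.
 * The hi/lo lane layout is an assumption of this sketch. */
uint32_t pack16x2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Extract the lanes again. The uint16_t -> int16_t conversion relies on
 * two's-complement wrap-around, which mainstream compilers provide. */
int16_t lane_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
int16_t lane_lo(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }

/* One packed multiply-accumulate step:
 * acc += a.hi * b.hi + a.lo * b.lo, kept in a 64-bit accumulator. */
int64_t packed_mac(int64_t acc, uint32_t a, uint32_t b)
{
    assert(sizeof(int) >= 4); /* products of two 16-bit values must fit in int */
    acc += (int32_t)lane_hi(a) * (int32_t)lane_hi(b);
    acc += (int32_t)lane_lo(a) * (int32_t)lane_lo(b);
    return acc;
}
```

For example, packed_mac(0, pack16x2(3, -2), pack16x2(4, 5)) accumulates 3*4 + (-2)*5. A real SIMD unit performs the two lane products in a single instruction; the sketch only shows the data organisation.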
The general requirements of multimedia
processing can be handled with typical DSP
functionality, and the high control-complexity
requirement of multimedia processing fits
the functionality of a controller. The Infineon
TriCore, a DSP-RISC processor, was developed
in an attempt to combine in a single core the
capabilities of a microcontroller, a digital-signal
processor, and a general-purpose processor. The
TriCore has four main features:
(a) Integrated microcontroller and DSP in a single
core. The processor is able to perform both
control processing and signal processing efficiently.
Thus, a wider range of multimedia applications
with different computational complexities can be
handled easily.
(b) Low interrupt latency and fast context
switching. The fast context switching
allows clean, fast, and efficient processing of
multiple tasks on one engine. This feature
provides good support for complex multimedia
applications with several multimedia tasks running
on a single processor.
(c) Support for interfacing peripherals and
adding custom logic to the core, leading to more
flexibility in realising an embedded multimedia
system.
(d) Powerful I/O support. In the TriCore, the I/O
capability is provided by a special I/O processor,
the peripherals control processor (PCP). The PCP
handles inter-peripheral, I/O, and data transfers
(from/to memory) without loading the CPU, thus
leaving the CPU free to do other processing tasks.
These four features together make the TriCore a
suitable processor for high-performance,
cost-effective multimedia processing.
Evaluating TriCore
In this section we discuss the evaluations that we
carried out, the results, and the analysis.
Methodology
The evaluations were performed on TriCore
TC10-GP and Intel Pentium-II MMX 233 MHz
processors. The Intel Pentium-II was in a PC
running the Windows 95 operating system, and the
TC10-GP was evaluated using a TriCore
development board (TriBoard). Benchmark
programs were downloaded into the processor
and debugged using the JTAG parallel-port
interface that is available on the TriBoard.
We used benchmark programs, consisting of
signal-processing kernels and multimedia
applications, to evaluate the processors. Two
versions of each code were used in the experiment:
a version written in C with conventional arithmetic
and an optimised version; the latter
was mostly written in assembly language to
facilitate the use of the specific multimedia/signal-processing
instructions (MMX and SIMD) of the
processors. The codes for the Pentium-II were
compiled using the Intel C/C++ compiler; for the
TriCore the codes were compiled using the
HighTec GNU development tool. The C codes for
both processors were compiled using the
optimise-for-speed option.
We used the MMX programming support provided
by the Intel C/C++ V4.5 compiler to code the
benchmark programs to use MMX instructions;
this support made MMX programming easier.
Since the TriCore GNU tools V1.3
do not provide support for packed-data (SIMD)
or other TriCore DSP instructions, the assembly
codes generated from the original C codes were
edited and optimised to use those instructions.
A complete summary of the hardware, software, and
tuning parameters for the systems evaluated is
shown in Table 2.
Measurement method
We used the Pentium’s RDTSC (Read Time Stamp
Counter) instruction to measure the execution
time of the codes; on the TriCore, the execution
time was obtained by reading the
processor’s system timer register 0. This method
allows us to obtain the execution time in terms of
CPU clock cycles.
Execution times for each code were obtained by
executing the code three times inside a loop, and
then one execution time was selected. This was
done to ensure that the program and data caches
of the processors were loaded with the
appropriate program/data and, therefore, that
the execution time reported from the experiment
represents the processor’s best performance.
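The measurement procedure can be sketched as a small C harness. This is an illustrative sketch, not the code used in the experiments: it assumes that the selected time is the smallest of the runs, and it takes the cycle counter as a function pointer so that RDTSC on the Pentium-II or the system timer on the TriCore could be plugged in. The names measure_cycles, fake_counter, and dummy_kernel are hypothetical; the last two are stand-ins that make the sketch self-contained.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*cycle_counter_fn)(void); /* reads a free-running cycle counter */
typedef void (*kernel_fn)(void);            /* the code being timed */

/* Run the kernel `runs` times and return the smallest elapsed cycle count.
 * Early runs warm the caches; the minimum approximates best-case time. */
uint64_t measure_cycles(kernel_fn kernel, cycle_counter_fn read_counter, int runs)
{
    assert(kernel != 0 && read_counter != 0 && runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = read_counter();
        kernel();
        uint64_t elapsed = read_counter() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Hypothetical stand-ins for demonstration only: a software "counter" that
 * advances by 5 per read, and a "kernel" that costs exactly 100 ticks. */
static uint64_t fake_ticks;
uint64_t fake_counter(void) { fake_ticks += 5; return fake_ticks; }
void dummy_kernel(void) { fake_ticks += 100; }
```

A call such as measure_cycles(dummy_kernel, fake_counter, 3) mirrors the three-iteration loop described above; on real hardware the counter function would wrap the processor-specific timer read.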
Metrics
The main parameters of concern in the
experiment are the clock-cycle efficiency, the
execution speed, and the speed-up obtained from
the SIMD/MMX/DSP hardware support of the
processors. The implementation costs of the two
processors relative to their performance are also
analysed. The cost metrics used in the experiment
are chip area, number of transistors, and power
consumption.
Benchmarks
We used more signal-processing kernels and
multimedia applications than have been used in
similar analyses by other researchers. We did this
so that the study would provide a
more reliable assessment of the systems being
evaluated.
The SIMD/MMX codes were modified to use
assembly language with processor-specific
optimisation; some of the modifications were
taken from Intel and Infineon libraries and
application notes, in order to ensure that the
implementations were optimum and reliable. The
other kernel implementations (those not provided by
the Intel and Infineon libraries and application notes)
were optimised by adapting the optimisation
strategies used for the kernels in the
libraries and application notes. Table 1
shows the implementation
parameters of the kernels and multimedia
applications used in the evaluation:
Finite Impulse Response (FIR):
$$y(n) = \sum_{k=0}^{M-1} c_k \, x(n-k)$$
Our FIR filters, in both the SIMD/MMX and the
non-SIMD/non-MMX implementations, use 16-bit integers.
Optimisations used in the FIR implementation on
the Pentium-II are the MMX instruction set, placing
the instruction sequence in a ‘good’ order (one that allows
execution of up to three instructions per cycle),
loop unrolling, and optimal alignment of data in
memory. The TriCore has powerful multiply-accumulate
instructions that perform very well
with packed data types. The TriCore also allows a
64-bit data-loading operation in parallel with the
packed multiply-accumulate operation. This
capability enables the processor to perform two
true 16-bit integer multiply-accumulate
operations in one cycle. Optimisations used for
the FIR filter on the TriCore are in loading/storing
packed data, packed arithmetic, and zero-overhead loops.
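As a point of reference for the FIR equation above, a straightforward (non-SIMD) version in C might look like the following. It is a sketch under stated assumptions, not the benchmark code: the function name fir16 is hypothetical, samples before x[0] are taken to be zero, and the output is left in a 64-bit accumulator because the scaling back to 16 bits is not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* y[n] = sum_{k=0}^{M-1} c[k] * x[n-k] for 16-bit samples and coefficients.
 * Samples before x[0] are assumed to be zero (an assumption of this sketch). */
void fir16(const int16_t *x, size_t n_samples,
           const int16_t *c, size_t m_taps,
           int64_t *y)
{
    assert(x != NULL && c != NULL && y != NULL && m_taps > 0);
    for (size_t n = 0; n < n_samples; n++) {
        int64_t acc = 0;
        /* Inner multiply-accumulate loop: the part that the packed MAC
         * instructions and loop optimisations target. */
        for (size_t k = 0; k < m_taps && k <= n; k++)
            acc += (int32_t)c[k] * (int32_t)x[n - k];
        y[n] = acc;
    }
}
```

With x = {1, 2, 3, 4} and c = {1, 1}, each output is the sum of the current and previous sample. The inner loop is a sequence of 16-bit multiply-accumulates over consecutive data, which is why packed arithmetic maps onto it so directly.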
Infinite Impulse Response (IIR):
$$y(n) = \sum_{q=0}^{Q-1} b_q \, x(n-q) - \sum_{p=1}^{P-1} a_p \, y(n-p)$$
Both types of code (basic arithmetic and
optimised) perform the computation with 16-bit
integers. For both processors, the implementation
strategy used for this filter is similar to that
used for the FIR filter.
Matrix-vector arithmetic, consisting of vector
dot-product and matrix-vector multiplication. Our
implementations use 16-bit integers.
The implementation strategy for this algorithm on
both processors is similar to the core strategy
used for the other algorithms above that perform
multiply-accumulate operations.
Table 1 Summary of kernels and applications
FIR       Integer 16-bit data type, len. 13, 140 pt
IIR       Integer 16-bit data type, len. 13, 25 coef., 140 pt
MatVect   Matrix [512][512] and vector [512] multiplication (16-bit integer)
VecDotP   Two vector[512] dot product (16-bit integer)
LMS       Integer 16-bit data type, 351 samples, filter order: 20
ADPCM     Integer 16-bit data type, 6000 samples, test file: “chk.wav”
FFT 1D    Complex, 16-bit integer data types, 4096 pt
FFT 2D    Complex, 16-bit integer data types, 16x16 pt
MPEG-2    3 frames, YUV, 4:2:0, 256x256 pixel, 256 color, test file: “pingpong”
Least Mean Square (LMS) Adaptive Filter. This
algorithm attempts to find an optimum set of
filter parameters based on the time-varying input
and output signals:
$$y(n) = \sum_{q=0}^{Q-1} b_q(k) \, x(n-q)$$
where $b_q(k)$ are the time-varying coefficients of the
filter. The filter implementation used here is
based on [4] and utilises multiply-accumulate
operations. The SIMD/MMX version of this code
(for both processors) was modified by optimising
only the multiply-accumulate part of the
algorithm.
ADPCM G.722. This is a speech-encoding
standard for compressing and decompressing
speech and audio signals whose frequencies range
from 50 Hz to 7000 Hz. As with the LMS filter
implementation, the only parts of this algorithm
that were optimised are the looping and
multiply-accumulate operations.
Fast Fourier Transform (FFT). This is an
efficient algorithm for computing the discrete Fourier
transform (DFT) of a sequence:
$$X(k) = X_{ev}(n) + W_{N/2}^{k} \, X_{od}(n)$$
where $X_{ev}$ represents the even-indexed elements and
$X_{od}$ represents the odd-indexed elements. We use an
in-place, radix-2, decimation-in-time FFT for the
experiment. The implementations use 16-bit
integers.
MPEG-2 Compression. This is a standardised
compression method for moving images. The
primary parts of MPEG are the discrete cosine
transform (DCT), quantisation, motion
estimation, Huffman coding, and run-length
coding. We used the MSSG (MPEG Software
Simulation Group) code for the MPEG-2
implementation. For both processors, the
optimisations were done for the motion-estimation
and DCT components. The DCT
implementations on both processors use 16-bit
integer data and SIMD arithmetic. Block-distance
calculation (a major part of motion estimation) on
the Pentium-II was easily realised with the MMX
instruction set. On the TriCore, SIMD operations
cannot be used to optimise the block-distance
calculation because of data-alignment
restrictions: the TriCore does not allow
sequences of data to be loaded or stored if the
source or target memory location is not aligned to
four. So the optimisation used for the block-distance
calculation on the TriCore uses the TriCore’s
abs instruction to compute the absolute difference
of two pixels.
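The block-distance calculation discussed above can be sketched in C as a sum of absolute pixel differences over a block. This is an illustrative scalar version, not the MSSG routine or the optimised TriCore code; the function name block_distance and its parameters (stride, width, height) are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a current block and a reference
 * block, both stored row by row with `stride` bytes between rows. */
uint32_t block_distance(const uint8_t *cur, const uint8_t *ref,
                        size_t stride, size_t width, size_t height)
{
    assert(cur != NULL && ref != NULL && stride >= width);
    uint32_t sad = 0;
    for (size_t row = 0; row < height; row++) {
        for (size_t col = 0; col < width; col++) {
            int diff = (int)cur[row * stride + col] - (int)ref[row * stride + col];
            sad += (uint32_t)abs(diff); /* one absolute difference per pixel pair */
        }
    }
    return sad;
}
```

Because the reference block can start at any pixel offset during the motion search, its rows are generally not aligned to four-byte boundaries, which is the situation in which the packed loads described above cannot be used and the work falls back to per-pixel absolute differences.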
Results and discussion
The measurements were repeated several times
for each algorithm. For the large and complex
algorithms (MPEG, LMS, and ADPCM), the
measurements produced slightly different results
for each algorithm; we give the smallest numbers
obtained from several measurements. In what
follows, we shall use the term relative
performance of TriCore to Pentium, to mean the
ratio of performance of an algorithm on TriCore
to performance of the same algorithm on
Pentium-II. Raw performance here is measured in
terms of number of cycles.
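The two derived quantities used in the discussion, speed-up and relative performance, can be written out as a short C sketch. Since performance is measured in cycles (fewer is better), the relative performance of the TriCore to the Pentium-II is taken here as Pentium-II cycles divided by TriCore cycles; this reading reproduces the vector dot-product figures reported in the results (speed-ups of 3.47 and 5.74, relative performance of 1.0923). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Speed-up of the optimised (MMX/SIMD) code over the unoptimised code
 * on the same processor: unoptimised cycles / optimised cycles. */
double speed_up(uint64_t unoptimised_cycles, uint64_t optimised_cycles)
{
    assert(optimised_cycles > 0);
    return (double)unoptimised_cycles / (double)optimised_cycles;
}

/* Relative performance of TriCore to Pentium-II for one algorithm.
 * Fewer cycles means higher performance, so the ratio is inverted. */
double relative_performance(uint64_t pentium_cycles, uint64_t tricore_cycles)
{
    assert(tricore_cycles > 0);
    return (double)pentium_cycles / (double)tricore_cycles;
}
```

For the vector dot-product, speed_up(1272, 367) gives the Pentium-II speed-up, speed_up(1930, 336) the TriCore speed-up, and relative_performance(367, 336) the relative performance of the optimised codes.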
Table 2 Machine, software, and baseline tuning parameters
Parameter               Pentium II                    TriCore TC10GP
Hardware parameters:
Model number            Pentium-II 233                TC10GP
Clock speed             233 MHz                       16 MHz (on board)
FPU                     Integrated                    None
Primary cache           16 KB (Inst.)                 8 or 16 KB (Inst.) *
                        16 KB (Data)                  0 or 16 KB (Data) **
Secondary cache         512 KB                        None
Other cache             None                          None
Memory (internal)       None                          0 or 8 KB inst. SRAM *
                        None                          16 or 32 KB data SRAM **
Memory (external)       64 MB (DRAM)                  4 MB (SDRAM), 2 MB (Flash)
Other hardware          None                          None
Software parameters:
O/S and version         Windows 95                    None
Compilers and version   Intel C Compiler V4.5         GNU/TriCore V1.3
Other software          Intel VTune V4.5,             Tasking tool
                        Microsoft Visual Studio 6.0
*: Can be configured in two ways:
1: 8 KB instruction SRAM, 8 KB instruction cache
2: 16 KB program instruction cache only (no scratch-pad instruction SRAM)
**: Can be configured in two ways:
1: 32 KB data SRAM only (no cache)
2: 16 KB data SRAM and 16 KB data cache
Tables 3 and 4 show the basic results from the
experiments. The results show that for some
programs, the MMX, SIMD, and DSP support on
both processors provides significant speed-up over
traditional implementations. For the TriCore, the
speed-up ranges from 1.63 to 12.27 for the
kernels and 1.1 to 1.62 for the applications; for
the Pentium-II, the speed-up ranges from 1.33
to 3.47 for the kernels and 1.01 to 1.73 for the
applications. This suggests that the TriCore has a
better architecture (ISA), although the results are
slightly affected by the quality of the compilers
used, since the Intel C/C++ compiler technology
for Intel machines is more mature than the GNU C
compiler for the TriCore.
The TriCore with SIMD and hand-optimised
codes generally provides better speed-up than the
Pentium-II with MMX support, except for
MPEG-2, for which the Pentium-II with MMX
achieves a speed-up of 1.73, while the TriCore is
only able to achieve 1.62. The most significant
factor behind the better MPEG-2 speed-up
on the Pentium-II is the optimisation
performed on the motion-estimation part; the
motion-estimation part on the TriCore could not be
optimised with SIMD/packed-data arithmetic.
We examined each component of MPEG-2
and found that on the TriCore, the factor that most
influenced speed-up is the optimisation of the
forward DCT part; however, the fact that most of
the computation is in the motion estimation
means that the speed-up for the overall operation
is quite small.
Table 3 Execution times and speed-up of benchmarks
                         Pentium                    TC10-GP
Benchmark  Type          Clk. cycles   Speed-up    Clk. cycles   Speed-up
MatVect    Unoptimised   1757121       2.41        3937432       5.4
           MMX/SIMD      729298                    729405
VecDotP    Unoptimised   1272          3.47        1930          5.74
           MMX/SIMD      367                       336
FIR        Unoptimised   3958          3.12        17513         8.65
           MMX/SIMD      1269                      2052
IIR        Unoptimised   8106          3.14        36276         12.27
           MMX/SIMD      2583                      2957
FFT 1D     Unoptimised   793075        1.33        1250987       1.88
           MMX/SIMD      595229                    664345
FFT 2D     Unoptimised   42437         1.33        53074         1.63
           MMX/SIMD      31812                     32612
LMS        Unoptimised   99138         1.01        241854        1.43
           MMX/SIMD      98516                     169382
ADPCM      Unoptimised   4379791       1.1         6437605       1.1
           MMX/SIMD      3988417                   5823432
MPEG-2     Unoptimised   144345486     1.73        149460673     1.62
           MMX/SIMD      83646350                  92349803
Average    Unoptimised   16825598      2.07        17937483      4.40
           MMX/SIMD      9899315                   11086036
Except for the vector dot-product, the
signal-processing kernels and applications require fewer
CPU cycles on the Pentium than on the TriCore,
mainly because the Pentium has a more
aggressive implementation. For the vector dot-product
kernel, the TriCore has the better performance:
the TriCore’s relative performance to the Pentium-II is
1.09; and for the matrix-vector multiplication kernel,
the TriCore’s performance is nearly the same as the
Pentium’s. The TriCore seems to perform very
well in algorithms with tight looping and highly
regular multiply-accumulate operations, such as
the vector dot-product and matrix-vector
multiplication algorithms.
For both one- and two-dimensional FFT, the
TriCore shows performance close to that of the
Pentium: the TriCore’s relative performance is
0.8960 for the one-dimensional FFT and 0.9755 for
the two-dimensional FFT. The speed-up of the
optimised FFT code on the TriCore is much better
than that on the Pentium: 1.88 and 1.63 on the
TriCore, and 1.33 and 1.33 on the Pentium-II, for
1-D and 2-D FFT respectively. The source codes
reveal that the higher speed-up for the FFT code on
the TriCore results not just from the use of the
SIMD instruction set: there is also a reduction in
data-register requirements that would otherwise
cause memory transfers.
In running FIR and IIR, the implementation on
the Pentium-II requires fewer cycles than the
TriCore: the TriCore relative performance is
0.6184 for the FIR and 0.874 for the IIR.
In the other applications (LMS and ADPCM), the
results also show that the Pentium-II MMX
requires fewer cycles than the TriCore: the TriCore’s
relative performance to the Pentium is 0.5816 for
LMS and 0.6849 for ADPCM. The more irregular
structure and fewer tight loops with
multiply-accumulate operations in the LMS and ADPCM
algorithms mean that they can be executed on the
Pentium in fewer cycles than on the TriCore.
LMS and ADPCM have relatively fewer
signal-processing instructions than the other kernels and
applications, which also means that only a small
speed-up could be achieved from the optimisation
performed on both processors.
On the TriCore, the FIR and IIR inner-loop kernels
were first coded to use the zero-overhead-loop
instruction, an instruction that eliminates the
overhead of the conditional jump instruction that is
normally at the end of an instruction sequence
running in a loop; this is achieved by setting the
address register to automatically point to a certain
memory location. But the implementation with
that instruction ran more slowly than the one
without. The FIR and IIR were then coded to use
loop-unrolling optimisation. It seems that for
relatively small filter lengths (fifteen, as we used
in the experiment), the zero-overhead-loop
instruction generates some overhead because of
the initialisation required. The loop-unrolling
implementation requires more static code space
than the zero-overhead-loop implementation, but
from our observation the amount of extra space was not
very significant.
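The loop-unrolling optimisation mentioned above can be illustrated with a minimal C sketch of a 16-bit multiply-accumulate loop unrolled by four. It is not the assembly code used in the experiments: the function name dot16_unrolled and the unroll factor are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two 16-bit vectors with the loop body unrolled by four.
 * Fewer loop-control branches are executed, at the cost of a larger
 * static code size. */
int64_t dot16_unrolled(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int64_t acc = 0;
    size_t i = 0;
    /* Main unrolled loop: four multiply-accumulates per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc += (int32_t)a[i]     * (int32_t)b[i];
        acc += (int32_t)a[i + 1] * (int32_t)b[i + 1];
        acc += (int32_t)a[i + 2] * (int32_t)b[i + 2];
        acc += (int32_t)a[i + 3] * (int32_t)b[i + 3];
    }
    /* Remainder loop for lengths that are not a multiple of four. */
    for (; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

For a short filter, the loop-control cost saved per iteration is a noticeable fraction of the total work, which is consistent with the observation above that unrolling paid off at small filter lengths.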
Table 4 TriCore relative performance to Pentium-II
Algorithm   Relative performance
MatVect     1.0000
VecDotP     1.0923
FIR         0.6184
IIR         0.8735
FFT 1D      0.8960
FFT 2D      0.9755
LMS         0.5816
ADPCM       0.6849
MPEG-2      0.9058
The TriCore provides a true multiply-accumulate
instruction and can also carry out such an
instruction in parallel with a 64-bit data-transfer
operation (memory-to-register or register-to-memory).
On the other hand, data packing and
unpacking operations in MMX generate
overheads in the execution of multiply-accumulate
operations, as does the
requirement for shift operations to prevent data
overflows. On the TriCore, the shift operations are not
necessary because the computations are arranged
to use the wider (64-bit) accumulator register.
The maximum throughput of the Pentium with MMX is
eight multiply-accumulate operations on 16-bit
numbers in three cycles [5], i.e. about 2.67
operations per cycle, whereas the TriCore is able
to execute at most two multiply-accumulate
operations on 16-bit numbers in one cycle.
Combined with special addressing modes,
zero-overhead loops, and the wider accumulator register,
the TriCore can achieve a clock-cycle efficiency
comparable to that of the Pentium-II but with less
code space. The packed-data arithmetic combined
with parallel data loading is a powerful DSP
feature of the TriCore.
Table 5 shows CPI information for both
processors for certain kernels. (CPI is defined as
the number of CPU clock cycles for a program divided
by its dynamic instruction count.) The CPIs on the
Pentium-II were obtained using the Intel VTune
performance analyser tool. For the TriCore, because
no such tool was available, we manually counted the
number of dynamic instructions of certain codes and
combined this with the CPU cycle information to
obtain the CPIs.
Table 5 CPI of optimised codes
           Pentium-II MMX                 TriCore TC10GP
Kernel     CPU cycles   IC       CPI     CPU cycles   IC       CPI
FIR        1269         2040     0.622   2052         2806     0.731
IIR        2583         4039     0.639   2957         4560     0.648
MatVect    729298       394482   1.849   729405       330760   2.205
VecDotP    367          631      0.582   336          651      0.516
Based on the CPIs, the Pentium-II generally has
better instruction-level parallelism than the TriCore.
Nevertheless, these figures do not indicate that
the Pentium has either a better architecture or a
better implementation. The Pentium-II
micro-architecture is highly superscalar, with much
deeper pipelines, and so naturally ought to
perform better in terms of clock cycles. A
complete comparison of the two microprocessors
should factor in the cost of realising the
Pentium’s aggressive implementation. When this
is done, the TriCore is shown to have the superior
architecture. For the vector dot-product kernel,
the TriCore shows a better CPI than
the Pentium-II. This is because the
multiply-accumulate operation can be performed very
efficiently on the TriCore. Although the TriCore
has a smaller data-register width (32-bit) than
the Pentium with MMX (64-bit), for a pure
multiply-accumulate operation the TriCore’s
performance is comparable to that of the
Pentium-II MMX. This attests to the superiority
of the TriCore’s architecture in this regard.
Cost-performance analysis
Table 6 lists the cost variables used in our
experiment. The costs are measured in three
ways: number of transistors on the core, die size,
and power dissipation. We are interested in these
parameters in order to factor out implementation
(micro-architecture) and realisation (technology)
from architecture comparisons and also in order
to evaluate the TriCore for embedded multimedia
applications.
Table 6 Cost parameters
                      Pentium-II        TriCore
Area                  202 mm^2          5 mm^2
Num. of transistors   38.5 million **   5 million *
Power dissipation     34.8 Watt         1.5 mW/MHz
Process technology    0.35 micron       0.25 micron
Clock rate            233 MHz           66 MHz
Voltage               2.8 V             2.25 V
*: approximate, **: includes L2 cache
Table 7 Scaled cost parameters
                      Pentium-II        TriCore
Area                  131 mm^2          5 mm^2
Num. of transistors   38.5 million      5 million
Power dissipation     20.988 Watt       0.3495 Watt
Table 8 Cost: performance of Pentium-II
              CPU          Cost1*C          Cost2*C             Cost3*C
              cycles (C)   (mm^2 x 10^-7)   (million x 10^-7)   (Watt x 10^-7)
FIR           1269         0.0166           0.0049              0.0027
IIR           2583         0.0338           0.0099              0.0054
MatVect       658744       8.6295           2.5362              1.3826
VecDotP       367          0.0048           0.0014              0.0008
LMS           100610       1.3180           0.3873              0.2112
ADPCM         3988417      52.2483          15.3554             8.3709
FFT 1D        595229       7.7975           2.2916              1.2493
FFT 2D        31812        0.4167           0.1225              0.0668
MPEG          83646350     1095.7672        322.0384            176.5570
Average:                   129.5814         38.0831             20.7607
Average ***:               8.8082           2.5887              1.4112
*** average without MPEG-2
Table 9 Cost: performance of TriCore TC10GP
              CPU          Cost1*C          Cost2*C             Cost3*C
              cycles (C)   (mm^2 x 10^-7)   (million x 10^-7)   (Watt x 10^-7)
FIR           2052         0.00103          0.00103             0.00007
IIR           2957         0.00148          0.00148             0.00010
MatVect       729405       0.36470          0.36470             0.02549
VecDotP       336          0.00017          0.00017             0.00001
LMS           169382       0.08469          0.08469             0.00592
ADPCM         6289027      2.91172          2.91172             0.20353
FFT 1D        664345       0.33217          0.33217             0.02322
FFT 2D        32612        0.01631          0.01631             0.00114
MPEG          92349803     46.17490         46.17490            3.22763
Average:                   5.54302          5.54302             0.38746
Average ***:               0.46403          0.46403             0.03244
*** average without MPEG-2
Because of the differences apparent in Table 6,
the raw numbers obtained cannot be used other
than to compare raw performance. To make
broader and more meaningful comparisons, we
scaled the figures so that both sets correspond to
0.25 micron, 233 MHz, and 2.25 V. The scaling is
not very precise but is adequate for broad
comparisons. The results of the scaling are shown
in Table 7. Using these, we then obtained the
cost: performance figures given in Tables 8 and
9; here, cost1, cost2, and cost3 are chip area,
number of transistors, and power, respectively.
The results show that the TriCore consistently has the
better cost: performance ratios. Put another way,
for the same cost, the TriCore has the better
performance. This suggests that the quality of the
TriCore’s architecture and implementation is
appropriate for the uses we have in mind:
embedded multimedia processing, with both
single and multiple processors. We next comment
on other features of the TriCore that fit in with
these goals.
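The cost: performance figures are products of a scaled cost and a cycle count, expressed in units of 10^-7, so a smaller value is better. A minimal C sketch of that calculation follows; the function name cost_times_cycles is hypothetical, and the example inputs are the scaled chip areas (131 mm^2 and 5 mm^2) and the optimised FIR cycle counts (1269 and 2052) reported in this paper.

```c
#include <assert.h>
#include <stdint.h>

/* Cost-performance product: scaled cost multiplied by the cycle count,
 * reported in units of 10^-7 (lower is better). */
double cost_times_cycles(double scaled_cost, uint64_t cycles)
{
    assert(scaled_cost >= 0.0);
    return scaled_cost * (double)cycles * 1e-7;
}
```

For the FIR kernel, cost_times_cycles(131.0, 1269) gives about 0.0166 for the Pentium-II and cost_times_cycles(5.0, 2052) gives about 0.00103 for the TriCore: the TriCore needs more cycles, but its much smaller area outweighs the difference.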
TriCore-based systems
The TriCore has several features to support a
multiprocessor system: a fast context-switching
capability, powerful I/O support (via a
Peripherals Control Processor), and a bus-sharing
mechanism.
The inter-node communication mechanism (blocking
or non-blocking) in a multiprocessor system
typically requires task-switching between the
data-processing task and the data-communication
handler task. In an application with frequent
inter-node data transfers, the fast context-switching
feature of the TriCore would reduce the
CPU time taken up by the inter-node
communication task. The fast context-switching
may also be used to support multithreading within
a single processor running different multimedia
applications. The PCP (Peripherals Control
Processor) performs tasks that in a traditional
computer system are normally performed by a
combination of a DMA controller and its
supporting CPU interrupt service routine. The
PCP improves the responsiveness of interrupt
service in data-transfer and data-capture
operations. The bus-sharing mechanism enables
two TriCore processors to be connected with their
external bus being shared without the need for
additional glue logic. This feature enables direct
write/read access to a peripheral, memory, or register
on the other host, which in turn allows fast data
transfer without the need to load both CPUs
with the data task. Bus sharing also means
additional overheads for the external bus unit, but,
if frequently accessed data or code are located in
the internal memory, and the external memory is
used to store only the data that will be transferred,
then the external bus overhead would be greatly
reduced.
[Figure 1: blocks 1 to n, each containing two TriCore processors (TriCore 1 and TriCore 2) with a shared bus and memory and one I/O port per processor; the I/O ports connect to Global BUS 1 and Global BUS 2; the blocks together form one group of blocks.]
Figure 1 TriCore-multiprocessor architecture
Figure 1 shows an example of a simple
organisation for a TriCore-based parallel
processor. The system has several blocks, each of
which consists of two TriCore processors.
Inter-processor communication within a block is
performed by using the memory- and bus-sharing
mechanism; and inter-block communication is
performed through the PCP-controlled I/O port.
The global bus is used for inter-block data
transfers. Since each block has two TriCore
processors, and each processor has its own I/O
port and PCP, each block can be connected to two
global buses. The regular structure of this
architecture enables it to be easily expanded to
use more blocks or groups of blocks.
The realisation of the TriCore processor has
another useful feature: it allows on-chip
peripherals, and these may be custom-designed.
As the core processor takes up a very small area,
this means that it is possible to essentially have a
custom-designed system-on-chip; for us this
means a SOC that is optimised for a variety of
multimedia functions. We are in the process of
designing several such peripherals, starting with a
highly optimised MPEG-2 encoder/decoder.
Summary
We have described some important features of the
TriCore processor and given the results of an
evaluation. Those results include a comparison
with a conventional microprocessor, and they
show the TriCore to have a highly efficient
architecture and implementation that is well
suited to embedded multimedia applications. The
next stage of our work will consist of the design
of a multiprocessor and on-chip peripherals, as
indicated above.
References:
[1] Bhargava, R., et al., “Evaluating MMX technology using DSP and multimedia applications”, MICRO-31, Proceedings of the 31st Annual ACM/IEEE International Symposium, 1998, pp. 37–46.
[2] Gaborit, L., et al., “Evaluating microprocessor multimedia extensions for the real-time simulation of RBF networks”, MicroNeuro '99, Proceedings of the Seventh International Conference on Microelectronics for Neural, Fuzzy and Bio-Inspired Systems, 1999, pp. 217–221.
[3] Pirsch, P., et al., “Implementation of Media Processors”, IEEE Signal Processing Magazine, July 1997.
[4] Embree, P.M., “C Algorithms for Real-Time DSP”, Prentice Hall, 1995.
[5] Intel, “MMX Application Notes”, http://developer.intel.com/drg/mmx/appnotes/
[6] “Pentium(r) II Processor Developer Home Page”, http://developer.intel.com/design/pentiumii/
[7] “TriCore Architecture Manual V1.2”, Infineon Technologies AG.