Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf
© 2007 Elsevier
Embedded microprocessor market.
Categories of CPUs.
RISC, DSP, and Multimedia processors.
CPU mechanisms.
High Performance Embedded Computing
Embedded processors account for
Over 97% of total processors sold
Over 60% of total sales from processors
Sales expected to increase by roughly 15% each year
Single-instruction single-data (SISD)
Single-instruction multiple-data (SIMD)
Multiple-instruction multiple-data (MIMD)
Multiple-instruction single-data (MISD)
What is an example of each?
Which would you expect to see in embedded systems?
RISC vs. CISC: instruction set style.
Instruction issue width.
Static vs. dynamic scheduling for multiple-issue machines.
Scalar vs. vector processing.
Single-threaded vs. multithreading.
A single CPU can fit into multiple categories.
Embedded vs. general-purpose processors
Embedded processors may be customized for a category of applications.
Customization may be narrow or broad.
We may judge embedded processors using different metrics:
Code size.
Energy efficiency.
Memory system performance.
Predictability.
RISC processors often have simple, highly pipelinable instructions.
Pipelines of embedded RISC processors have grown over time:
ARM7 has 3-stage pipeline.
ARM9 has 5-stage pipeline.
ARM11 has 8-stage pipeline.
ARM11 pipeline [ARM05].
ARM:
ARM7 has in-order execution and no memory management or branch prediction;
ARM9 and ARM11 have out-of-order execution, memory management, and branch prediction.
MIPS:
MIPS32 4K has 5-stage pipeline;
4KE family has DSP extension;
4KS is designed for security.
PowerPC:
PowerPC 400 series includes several embedded processors;
Motorola and IBM offer superscalar versions of the PowerPC.
Embedded DSP processors are optimized to perform DSP algorithms: speech coding, filtering, convolution, fast Fourier transforms, and discrete cosine transforms.
$y(k) = \sum_{n=0}^{N} b_n \, x(k-n)$
DSP processors feature
Deterministic execution times
Fast multiply-accumulate instructions
Multiple data accesses per cycle
Specialized addressing modes
Efficient support for loops and interrupts
Efficient processing of “streaming” data
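The multiply-accumulate pattern behind these features can be sketched with an FIR filter, y(k) = sum over n of b[n] * x(k-n). This is a minimal illustration, not a model of any particular DSP: the function name `fir_filter`, the list-based signals, and the choice to treat samples before the start of the input as zero are all assumptions made for the example.

```python
def fir_filter(b, x):
    """Compute y[k] = sum_n b[n] * x[k-n] for each input sample.

    Assumption for this sketch: samples before the start of x are zero.
    """
    y = []
    for k in range(len(x)):
        acc = 0  # plays the role of the DSP accumulator
        for n in range(len(b)):
            if k - n >= 0:
                acc += b[n] * x[k - n]  # one multiply-accumulate per filter tap
        y.append(acc)
    return y


if __name__ == "__main__":
    # Feeding an impulse returns the coefficients themselves.
    print(fir_filter([1, 2, 3], [1, 0, 0, 0]))
```

Each inner-loop step needs one coefficient, one sample, and one multiply-accumulate, which is why multiple data accesses per cycle and efficient loop support matter for this workload.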
40-bit arithmetic (32-bit values + 8 guard bits).
Barrel shifter.
17 x 17 multiplier.
Two address generators.
Many special-purpose registers and addressing modes.
Coprocessors for compute-intensive functions including pixel interpolation, motion estimation, and DCT/IDCT computations.
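The value of guard bits can be shown with a small wraparound model. The sketch below assumes a two's-complement accumulator that wraps on overflow; `wrap_signed` and `accumulate` are hypothetical helper names, and the model ignores saturation modes that real DSPs may offer. With 8 guard bits, 2^8 = 256 worst-case 32-bit values can be summed without overflow.

```python
def wrap_signed(value, bits):
    """Wrap value into a signed two's-complement range of the given width."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def accumulate(values, bits):
    """Sum values in an accumulator that is `bits` wide (assumed to wrap)."""
    acc = 0
    for v in values:
        acc = wrap_signed(acc + v, bits)
    return acc


if __name__ == "__main__":
    largest = 2**31 - 1          # largest signed 32-bit value
    samples = [largest] * 256    # worst case for 8 guard bits
    print(accumulate(samples, 40) == sum(samples))  # 40-bit accumulator holds the sum
    print(accumulate(samples, 32) == sum(samples))  # 32-bit accumulator wraps
```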
Static:
Use compiler to analyze program.
Simpler CPU.
Can’t depend on data values.
VLIW
Dynamic:
Use hardware to identify opportunities.
More complex CPU.
Can make use of data values.
Superscalar
Each very long instruction word (VLIW) performs multiple operations in parallel.
Instruction word fields: Branch | Memory | Memory | Arithmetic | Logic | Vector
Needs a good compiler that understands the architecture
Allows deterministic execution times
Code growth can be reduced by allowing
Operations within an instruction to be performed sequentially
A given field to specify different types of operations
Instruction word fields: Seq | Branch/Mem | Mem/Arith | Arith/Logic | Vector
Large register file feeds multiple function units.
Example instruction: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP
Figure: register file and E box with function units ALU, ALU, load/store, load/store, FU.
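How a compiler might fill such an instruction word can be sketched with a simple in-order packer. This is a hedged illustration only: the slot layout (two ALUs, two load/store units, one other FU) is read off the figure, while `pack_bundles`, the tuple format for operations, and the destination-first operand order are assumptions. The dependence check is deliberately conservative.

```python
SLOTS = ["alu", "alu", "mem", "mem", "fu"]  # assumed layout: one slot per function unit


def pack_bundles(ops):
    """Pack operations, in program order, into VLIW instruction words.

    ops: list of (text, unit, dest, sources); dest is None when nothing is written.
    An operation starts a new word if no slot of its unit type is free or if it
    depends on an operation already placed in the current word.
    """
    bundles = []
    bundle = ["NOP"] * len(SLOTS)
    used = [False] * len(SLOTS)
    written, read = set(), set()
    for text, unit, dest, sources in ops:
        free = [i for i, kind in enumerate(SLOTS) if kind == unit and not used[i]]
        conflict = (
            any(s in written for s in sources)  # read-after-write
            or dest in written                  # write-after-write
            or dest in read                     # write-after-read (conservative)
        )
        if not free or conflict:
            bundles.append(bundle)              # close the current word
            bundle = ["NOP"] * len(SLOTS)
            used = [False] * len(SLOTS)
            written, read = set(), set()
            free = [i for i, kind in enumerate(SLOTS) if kind == unit]
        slot = free[0]
        bundle[slot] = text
        used[slot] = True
        if dest is not None:
            written.add(dest)
        read.update(sources)
    if any(used):
        bundles.append(bundle)
    return bundles


if __name__ == "__main__":
    program = [
        ("Add r1,r2,r3", "alu", "r1", ["r2", "r3"]),
        ("Sub r4,r5,r6", "alu", "r4", ["r5", "r6"]),
        ("Ld r7,foo", "mem", "r7", []),
        ("St r8,baz", "mem", None, ["r8"]),
        ("Add r9,r1,r4", "alu", "r9", ["r1", "r4"]),
    ]
    for word in pack_bundles(program):
        print("; ".join(word))
```

The first four operations are independent and fit one word, with the unused slot filled by a NOP; the last operation needs r1 and r4 and lands in a second word that is mostly NOPs, which is one source of VLIW code growth.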
Clustered VLIW architecture
Register file, function units divided into clusters.
What are advantages/disadvantages of having clusters in VLIW architectures?
Figure: two clusters, each with its own execution units and register file, connected by a cluster bus.
VLIW with up to 8 instructions/cycle.
32 32-bit registers.
Function units:
Two multipliers.
Six ALUs.
All instructions execute conditionally.
8/16/32-bit arithmetic.
40-bit operations.
Bit manipulation operations.
C67x processors add floating-point arithmetic.
Figure: block diagram with program RAM/cache (512K bits), data RAM (512K bits), bus, DMA, execute unit with data path 1/register file 1 and data path 2/register file 2, JTAG, timers, serial, and PLL.
IEEE Signal Processing Magazine, vol. 15, no. 2, pp. 86-101, 117, 1998.
Parallelism at multiple levels
Multiple processors
System-on-a-chip designs
Multiple simultaneous tasks
Multithreaded processors
Multiple instructions per cycle
Very Long Instruction Word (VLIW) architectures
Multiple operations per instruction
Single Instruction Multiple Data (SIMD) instructions
Architecture/compiler pairs improve performance and help manage application complexity
Instructions are dynamically scheduled.
Dependencies are checked at run time in hardware.
Used to some extent in embedded processors.
Embedded Pentium is two-issue in-order.
Some PowerPCs are superscalar.
What advantages/disadvantages do VLIW processors have compared to superscalar processors?
Many special-purpose SIMD machines
All processors perform same operation on different data
Subword parallelism is widely used for video.
ALU is divided into subwords for independent operations on small operands.
Vector processing is another form of SIMD processing
These terms are often used interchangeably.
Recent multimedia processors commonly support
Single Instruction Multiple Data (SIMD) instructions
The same operation is performed on multiple data operands using a single instruction
    A3      A2      A1      A0
+   B3      B2      B1      B0
= A3+B3   A2+B2   A1+B1   A0+B0
Exploits low precision and high data parallelism of multimedia applications
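Subword addition can be imitated in software on an ordinary 32-bit word. The sketch below adds four 8-bit lanes at once and keeps carries from crossing lane boundaries; the names `packed_add8`, `pack`, and `unpack`, the 8-bit lane width, and the wraparound (non-saturating) behavior are assumptions chosen for the example, not a description of any specific instruction set.

```python
HIGH_BITS = 0x80808080  # top bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F   # low seven bits of each 8-bit lane


def packed_add8(a, b):
    """Add four 8-bit lanes packed in 32-bit words, wrapping within each lane."""
    # Adding only the low 7 bits keeps every carry inside its own lane.
    low = (a & LOW_BITS) + (b & LOW_BITS)
    # The top bit of each lane is the XOR of both top bits and the incoming carry.
    return (low ^ ((a ^ b) & HIGH_BITS)) & 0xFFFFFFFF


def pack(lanes):
    """Pack [A3, A2, A1, A0] into one 32-bit word, A3 in the top byte."""
    word = 0
    for value in lanes:
        word = (word << 8) | (value & 0xFF)
    return word


def unpack(word):
    """Split a 32-bit word back into [A3, A2, A1, A0]."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


if __name__ == "__main__":
    a = pack([1, 2, 3, 4])
    b = pack([10, 20, 30, 40])
    print(unpack(packed_add8(a, b)))
```

A hardware SIMD unit achieves the same effect by cutting the carry chain of the ALU at lane boundaries, so no masking is needed.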
Operand characteristics in MediaBench
Dynamic behavior of loops in MediaBench
In many cases, the loops of media applications are not very deep.
Path ratio = (instructions executed per iteration) / (total number of loop instructions).
What does the path ratio reveal?
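As a small worked example of the definition, the sketch below averages the per-iteration instruction counts of one loop and divides by the loop's static size. The function name `path_ratio` and the sample counts are hypothetical and are not taken from MediaBench.

```python
def path_ratio(executed_per_iteration, loop_instructions):
    """Average instructions executed per iteration divided by the loop's static size."""
    average = sum(executed_per_iteration) / len(executed_per_iteration)
    return average / loop_instructions


if __name__ == "__main__":
    # Hypothetical loop: 20 instructions in the body, 12 executed on each iteration.
    print(path_ratio([12, 12, 12, 12], 20))
```

A ratio well below 1 means that each iteration executes only part of the loop body, for instance one side of a conditional.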
Characteristics
Floating point support
Sub-word parallelism support
VLIW
Additional custom operations
Figure: block diagram with VLIW CPU, memory interface, video in, video out, audio in, audio out, I2C, serial, timers, PCI, image coprocessor, and VLD coprocessor.
Figure: register file connected through a read/write crossbar to function units FU1 ... FU27, with five slots (slot 1 through slot 5).
Low-level parallelism mechanism.
Interleaved multithreading (IMT) alternately fetches instructions from separate threads.
Often used with VLIW and vector processors
Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.
Often used with superscalar processors
What advantages/disadvantages does IMT have relative to SMT?
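The difference in fetch policy can be sketched with two toy fetch loops. This is a simplified model under stated assumptions: threads are plain lists of instruction names, IMT takes one instruction from one thread per cycle in round-robin order, and SMT takes at most one instruction from each thread per cycle up to a fetch width. The names `imt_fetch` and `smt_fetch` are hypothetical.

```python
from collections import deque


def imt_fetch(threads, cycles):
    """Interleaved multithreading: one thread is fetched per cycle, round-robin."""
    queues = [deque(t) for t in threads]
    trace = []
    for cycle in range(cycles):
        queue = queues[cycle % len(queues)]
        trace.append([queue.popleft()] if queue else [])
    return trace


def smt_fetch(threads, cycles, width):
    """Simultaneous multithreading: several threads are fetched in the same cycle."""
    queues = [deque(t) for t in threads]
    trace = []
    for _ in range(cycles):
        fetched = []
        for queue in queues:
            if queue and len(fetched) < width:
                fetched.append(queue.popleft())
        trace.append(fetched)
    return trace


if __name__ == "__main__":
    threads = [["a0", "a1"], ["b0", "b1"]]
    print(imt_fetch(threads, 4))     # one instruction per cycle, alternating threads
    print(smt_fetch(threads, 2, 2))  # both threads contribute in every cycle
```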
Power scales with V^2 while performance scales roughly as V.
Reduce operating voltage, add parallel operating units to make up for lower clock speed.
DVS doesn’t work well in processors with high leakage power.
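The trade can be checked with a little arithmetic. The sketch below assumes the usual dynamic power model P ~ C * V^2 * f with clock frequency proportional to V, and it ignores leakage and any overhead of parallelizing the work; the function names are hypothetical. Halving the voltage and doubling the number of units keeps throughput constant at one quarter of the dynamic power.

```python
def relative_power(voltage_scale, units):
    """Dynamic power relative to one unit at nominal voltage (assumes P ~ V^2 * f, f ~ V)."""
    frequency_scale = voltage_scale
    return units * voltage_scale**2 * frequency_scale


def relative_throughput(voltage_scale, units):
    """Throughput relative to one unit at nominal voltage (assumes perfect parallelism)."""
    return units * voltage_scale


if __name__ == "__main__":
    print(relative_throughput(0.5, 2))  # same throughput as the baseline
    print(relative_power(0.5, 2))       # fraction of the baseline dynamic power
```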
Dynamic voltage and frequency scaling (DVFS)
Scale both voltage and clock frequency.
Can use control algorithms to match performance to application, reduce power.
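One very simple control policy is to pick the lowest-voltage operating point that still meets a deadline. The sketch below is an illustration under assumptions: the operating points, the cycle count, and the deadline are made-up numbers, `pick_operating_point` is a hypothetical name, and the fallback to the fastest point is one design choice among several.

```python
def pick_operating_point(points, cycles_needed, deadline_s):
    """Return the (frequency_hz, voltage_v) pair with the lowest voltage that meets the deadline.

    Energy per cycle is assumed to grow with V^2, so the lowest feasible voltage is preferred.
    If no point meets the deadline, fall back to the fastest one.
    """
    feasible = [(f, v) for f, v in points if cycles_needed / f <= deadline_s]
    if not feasible:
        return max(points)  # tuples compare by frequency first
    return min(feasible, key=lambda point: point[1] ** 2)


if __name__ == "__main__":
    # Hypothetical operating points: (clock frequency in Hz, supply voltage in V).
    points = [(200e6, 0.9), (400e6, 1.1), (600e6, 1.3)]
    print(pick_operating_point(points, cycles_needed=3e6, deadline_s=0.01))
```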
Razor runs the clock faster than the worst case allows.
Uses a specialized latch to detect errors.
Recovers only on errors, gains average-case performance.
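The average-case gain can be estimated with a back-of-the-envelope model. The sketch below assumes every detected error costs a fixed number of extra cycles; the function name `razor_effective_period`, the error rate, and the recovery penalty are hypothetical values for illustration, not measurements of Razor.

```python
def razor_effective_period(fast_period, error_rate, recovery_cycles):
    """Average time per operation when errors are rare but each one costs extra cycles."""
    return fast_period * (1 + error_rate * recovery_cycles)


if __name__ == "__main__":
    worst_case_period = 1.0  # normalized clock period sized for the worst case
    # Hypothetical: clock 20% faster, 1% of operations fail, 5 cycles to recover.
    average = razor_effective_period(0.8, 0.01, 5)
    print(average < worst_case_period)
```

The scheme pays off as long as the time lost to recovery is smaller than the time gained from the shorter clock period.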