PPT - ECE 751 Embedded Computing Systems

Lecture 6: Embedded Processors

Embedded Computing Systems

Mikko Lipasti, adapted from M. Schulte

Based on slides and textbook from Wayne Wolf

© 2007 Elsevier

Topics

 Embedded microprocessor market.

 Categories of CPUs.

 RISC, DSP, and Multimedia processors.

 CPU mechanisms.

High Performance Embedded Computing


Demand for Embedded Processors

 Embedded processors account for:
    Over 97% of total processors sold
    Over 60% of total sales from processors
 Sales expected to increase by roughly 15% each year



Flynn’s taxonomy of processors

 Single-instruction single-data (SISD)

 Single-instruction multiple-data (SIMD)

 Multiple-instruction multiple-data (MIMD)

 Multiple-instruction single-data (MISD)

 What is an example of each?

 Which would you expect to see in embedded systems?



Other axes of comparison

 RISC vs. CISC: instruction set style.

 Instruction issue width.

 Static vs. dynamic scheduling for multiple-issue machines.

 Scalar vs. vector processing.

 Single-threaded vs. multithreading.

 A single CPU can fit into multiple categories.



Embedded vs. general-purpose processors

 Embedded processors may be customized for a category of applications.



Customization may be narrow or broad.

 We may judge embedded processors using different metrics:



Code size.



Energy efficiency.



Memory system performance.



Predictability.



Embedded RISC processors





 RISC processors often have simple, highly pipelinable instructions.
 Pipelines of embedded RISC processors have grown over time:
    ARM7 has a 3-stage pipeline.
    ARM9 has a 5-stage pipeline.
    ARM11 has an 8-stage pipeline.

[Figure: ARM11 pipeline [ARM05].]



RISC processor families

 ARM:





ARM7 has in-order execution and no memory management or branch prediction;

ARM9 and ARM11 have out-of-order execution, memory management, and branch prediction.

 MIPS:



MIPS32 4K has 5-stage pipeline;





4KE family has DSP extension;

4KS is designed for security.

 PowerPC:





PowerPC 400 series includes several embedded processors;

Motorola and IBM offer superscalar versions of the PowerPC.



Embedded DSP Processors



 Embedded DSP processors are optimized to perform DSP algorithms: speech coding, filtering, convolution, fast Fourier transforms, discrete cosine transforms.

$$y_k = \sum_{n=0}^{N} b_n x_{k-n}$$

DSP processors feature

Deterministic execution times











Fast multiply-accumulate instructions

Multiple data accesses per cycle

Specialized addressing modes

Efficient support for loops and interrupts

Efficient processing of “streaming” data
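
The filtering and convolution kernels named on this slide reduce to a sum of products, which is why a fast multiply-accumulate (MAC) instruction matters. A minimal Python sketch of a direct-form FIR filter follows. The function name `fir_filter` and the choice to treat samples before the start of the input as zero are assumptions made for illustration, not something the slides specify.

```python
def fir_filter(b, x):
    """Direct-form FIR filter: y[k] = sum over n of b[n] * x[k - n]."""
    y = []
    for k in range(len(x)):
        acc = 0  # plays the role of the DSP accumulator
        for n, coeff in enumerate(b):
            if k - n >= 0:  # assumption: samples before x[0] are zero
                acc += coeff * x[k - n]  # one multiply-accumulate per tap
        y.append(acc)
    return y


# Two-tap moving sum applied to a short ramp.
print(fir_filter([1, 1], [1, 2, 3, 4]))  # [1, 3, 5, 7]
```

Each output sample costs one MAC per tap and two operand reads (a coefficient and a sample), which lines up with the "multiple data accesses per cycle" feature listed above.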



Example: TI C55x/C54x DSPs

 40-bit arithmetic (32-bit values + 8 guard bits).





Barrel shifter.

17 x 17 multiplier.







Two address generators.

Many special-purpose registers and addressing modes.

Coprocessors for compute-intensive functions including pixel interpolation, motion estimation, and DCT/IDCT computations.
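
One way to see what the 8 guard bits buy: a 40-bit accumulator can absorb 2^8 = 256 worst-case 32-bit additions without overflowing. The Python sketch below checks that arithmetic. The helper `fits_signed` and the constant names are illustrative assumptions; this models only the number ranges, not the C55x hardware itself.

```python
WORD_BITS = 32
GUARD_BITS = 8
ACC_BITS = WORD_BITS + GUARD_BITS  # 40-bit accumulator


def fits_signed(value, bits):
    """True if value is representable in two's complement with the given width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


max_term = (1 << (WORD_BITS - 1)) - 1  # largest positive 32-bit value
acc = 0
for _ in range(1 << GUARD_BITS):  # 256 worst-case additions
    acc += max_term

print(fits_signed(acc, WORD_BITS))  # False: a 32-bit accumulator would overflow
print(fits_signed(acc, ACC_BITS))   # True: the guard bits absorb the growth
```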



TI C55x microarchitecture



Parallelism extraction

 Static:









Use compiler to analyze program.

Simpler CPU.

Can’t depend on data values.

VLIW

 Dynamic:









Use hardware to identify opportunities.

More complex CPU.

Can make use of data values.

Superscalar



VLIW architectures

 Each very long instruction word (VLIW) performs multiple operations in parallel.







[Instruction format: Branch | Memory | Memory | Arithmetic | Logic | Vector]

Needs a good compiler that understands the architecture

Allows deterministic execution times

Code growth can be reduced by allowing:
    Operations within an instruction to be performed sequentially.
    A given field to specify different types of operations.

[Instruction format: Seq | Branch/Mem | Mem/Arith | Arith/Logic | Vector]



Simple VLIW architecture

 Large register file feeds multiple function units.

[Figure labels: E box; Register file; function units ALU, ALU, Load/store, Load/store, FU.]

Example instruction word: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP
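
To illustrate how a compiler might fill such an instruction word, here is a toy Python packer. It is a sketch under several assumptions that are not from the slides: the slot layout `SLOTS` mirrors the five function units in the figure, each operation is described by a hypothetical tuple `(text, unit, dest, sources)`, the first register in each mnemonic is taken to be the destination, and packing is greedy in program order. A real VLIW compiler reorders operations and models latencies; this one only starts a new word on a resource conflict or a register dependence.

```python
SLOTS = ["alu", "alu", "mem", "mem", "fu"]  # assumed slot layout


def pack(ops):
    """Greedily pack ops, in program order, into instruction words padded with NOPs.

    Assumes every op names a unit that appears in SLOTS.
    """
    bundles = []
    current = [None] * len(SLOTS)
    written = set()  # registers written by ops already in the current word
    for text, unit, dest, srcs in ops:
        free = [i for i, kind in enumerate(SLOTS) if kind == unit and current[i] is None]
        hazard = bool(set(srcs) & written) or dest in written  # RAW or WAW inside one word
        if not free or hazard:
            bundles.append(current)  # close the word and start a fresh one
            current = [None] * len(SLOTS)
            written = set()
            free = [i for i, kind in enumerate(SLOTS) if kind == unit]
        current[free[0]] = text
        if dest is not None:
            written.add(dest)
    bundles.append(current)
    return [[op if op is not None else "NOP" for op in bundle] for bundle in bundles]


program = [
    ("Add r1,r2,r3", "alu", "r1", ["r2", "r3"]),
    ("Sub r4,r5,r6", "alu", "r4", ["r5", "r6"]),
    ("Ld r7,foo", "mem", "r7", []),
    ("St r8,baz", "mem", None, ["r8"]),
    ("Add r9,r1,r4", "alu", "r9", ["r1", "r4"]),  # depends on the first two ops
]
bundles = pack(program)
for bundle in bundles:
    print("; ".join(bundle))
```

The first word printed matches the slide's example, including the NOP in the unused slot. The second word is mostly NOPs, which shows the code growth that VLIW encodings try to limit.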



Clustered VLIW architecture





Register file, function units divided into clusters.

What are advantages/disadvantages of having clusters in VLIW architectures?

[Figure: two clusters, each with an execution unit and a register file, connected by a cluster bus.]



TI C62x/C67x DSPs

 VLIW with up to 8 instructions/cycle.

 32 32-bit registers.

 Function units:



Two multipliers.



Six ALUs.

 All instructions execute conditionally.



TI C6x data operations

 8/16/32-bit arithmetic.

 40-bit operations.

 Bit manipulation operations.

 C67x processors add floating-point arithmetic.



C6x block diagram

[Figure labels: Program RAM/cache (512K bits); Data RAM (512K bits); bus; Execute; DMA; Data path 1/Reg file 1; Data path 2/Reg file 2; JTAG; timers; Serial; PLL.]

Texas Instruments C62x

IEEE Signal Processing Magazine, v. 15, no. 2, pp. 86-101, 117, 1998.

Emerging DSP Architectures





 Parallelism at multiple levels:
    Multiple processors: system-on-a-chip designs
    Multiple simultaneous tasks: multithreaded processors
    Multiple instructions per cycle: Very Long Instruction Word (VLIW) architectures
    Multiple operations per instruction: Single Instruction Multiple Data (SIMD) instructions
 Architecture/compiler pairs improve performance and help manage application complexity.



Superscalar processors

 Instructions are dynamically scheduled.



Dependencies are checked at run time in hardware.

 Used to some extent in embedded processors.



Embedded Pentium is two-issue in-order.



Some PowerPCs are superscalar

 What advantages/disadvantages do VLIW processors have compared to superscalar processors?



SIMD and subword parallelism

 Many special-purpose SIMD machines



All processors perform same operation on different data

 Subword parallelism is widely used for video.



ALU is divided into subwords for independent operations on small operands.

 Vector processing is another form of SIMD processing

 These terms are often used interchangeably.



SIMD Instructions





Recent multimedia processors commonly support Single Instruction Multiple Data (SIMD) instructions.

The same operation is performed on multiple data operands using a single instruction

      A3       A2       A1       A0
  +   B3       B2       B1       B0
  =  A3+B3    A2+B2    A1+B1    A0+B0

 Exploits low precision and high data parallelism of multimedia applications
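
Subword parallelism can be imitated in software to show what the hardware does in one instruction. The Python sketch below adds four unsigned 8-bit lanes packed into a 32-bit word, keeping carries from crossing lane boundaries. The function name `add_packed_u8` and the wraparound (non-saturating) behavior are assumptions for illustration; multimedia instruction sets often offer saturating variants as well.

```python
LOW7 = 0x7F7F7F7F  # low 7 bits of every 8-bit lane
HIGH = 0x80808080  # top bit of every 8-bit lane


def add_packed_u8(a, b):
    """Add four unsigned 8-bit lanes held in 32-bit words; each lane wraps modulo 256."""
    low = (a & LOW7) + (b & LOW7)  # 7-bit sums fit in 8 bits, so no carry leaves a lane
    return (low ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF  # fold in the top bits without carrying out


# Lanes (A3..A0) = 01, FF, 7F, 10 and (B3..B0) = 01, 01, 01, 10.
print(hex(add_packed_u8(0x01FF7F10, 0x01010110)))  # 0x2008020
```

In the example, the lane holding FF + 01 wraps to 00 without disturbing its neighbor, which is the lane isolation a subword ALU provides.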



Operand characteristics in MediaBench



Dynamic behavior of loops in MediaBench







In many cases, the loops of media applications are not very deep.

Path ratio = (instructions executed per iteration) / (total number of loop instructions).

What does the path ratio reveal?
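
As a worked instance of the path ratio definition, the Python sketch below uses made-up counts: a loop body of 100 instructions of which 40 execute on a typical iteration. Both numbers are hypothetical and are not MediaBench measurements.

```python
def path_ratio(executed_per_iteration, loop_instructions):
    """Instructions executed per iteration divided by total instructions in the loop."""
    return executed_per_iteration / loop_instructions


# Hypothetical loop: 100 instructions in the body, 40 executed per iteration.
print(path_ratio(40, 100))  # 0.4
```

A ratio well below 1 suggests that much of the loop body is conditional code skipped on a typical iteration, while a ratio near 1 indicates mostly straight-line execution.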



TriMedia TM-1 characteristics

 Floating-point support
 Sub-word parallelism support
 VLIW
 Additional custom operations



TriMedia TM-1

[Figure labels: video in; audio in; I2C; timers; image co-p; PCI; memory interface; video out; audio out; serial; VLD co-p; VLIW CPU.]



TM-1 VLIW CPU

[Figure labels: register file; read/write crossbar; FU1 ... FU27; slot 1 through slot 5.]



Multithreading

 Low-level parallelism mechanism.





 Interleaved multithreading (IMT) alternately fetches instructions from separate threads.
    Often used with VLIW and vector processors.
 Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.
    Often used with superscalar processors.

 What advantages/disadvantages does IMT have relative to SMT?
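
The difference in fetch policy can be sketched in a few lines of Python. This is a deliberately simplified model under stated assumptions: `imt_fetch` and `smt_fetch` are hypothetical names, each thread is just a list of instruction labels, IMT fetches exactly one instruction per cycle in round-robin order, and SMT fetches one instruction from every thread that still has work. Real designs are bounded by fetch and issue width and by stalls, none of which is modeled here.

```python
from itertools import zip_longest


def imt_fetch(threads):
    """Interleaved multithreading: one instruction per cycle, rotating over the threads."""
    rounds = zip_longest(*threads)  # pads finished threads with None
    return [[ins] for group in rounds for ins in group if ins is not None]


def smt_fetch(threads):
    """Simultaneous multithreading: each cycle fetches from every thread that has work."""
    return [[ins for ins in group if ins is not None] for group in zip_longest(*threads)]


threads = [["a0", "a1", "a2"], ["b0", "b1"]]
print(imt_fetch(threads))  # [['a0'], ['b0'], ['a1'], ['b1'], ['a2']]
print(smt_fetch(threads))  # [['a0', 'b0'], ['a1', 'b1'], ['a2']]
```

Each inner list is one fetch cycle. In this toy model IMT needs five cycles and SMT three, at the cost of fetch hardware that handles several threads at once.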



Dynamic voltage scaling (DVS)







Power scales with $V^2$ while performance scales roughly as $V$.

Reduce operating voltage, add parallel operating units to make up for lower clock speed.

DVS doesn’t work well in processors with high leakage power.
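
A rough numeric check of the voltage-for-parallelism trade, written as a Python sketch. It assumes the common CMOS dynamic-power model P = C * V^2 * f with clock frequency proportional to V, perfectly parallel work, and zero leakage. These modeling choices and the function names are assumptions for illustration, not figures from the slides.

```python
def relative_dynamic_power(v_scale, units=1):
    """Dynamic power relative to one unit at nominal voltage (P = C * V^2 * f, f ~ V)."""
    return units * v_scale ** 2 * v_scale


def relative_throughput(v_scale, units=1):
    """Throughput relative to one unit at nominal voltage, assuming perfectly parallel work."""
    return units * v_scale


# Halve the supply voltage and double the number of units.
print(relative_throughput(0.5, units=2))     # 1.0: the same work per second
print(relative_dynamic_power(0.5, units=2))  # 0.25: a quarter of the dynamic power
```

Under these assumptions, energy per operation falls by the square of the voltage scale. Leakage power does not shrink the same way and grows with the added units, which is why the approach works poorly in high-leakage processes.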



Dynamic voltage and frequency scaling (DVFS)





Scale both voltage and clock frequency.

Can use control algorithms to match performance to application, reduce power.
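
One very simple control policy, sketched in Python: pick the slowest operating point that still meets a task's deadline. The table `OPERATING_POINTS`, its frequency and voltage values, and the function name are all hypothetical, chosen only to make the example concrete. Real DVFS governors also account for switching overhead and workload prediction.

```python
# Hypothetical operating points (frequency in MHz, supply voltage in volts), slowest first.
OPERATING_POINTS = [(100, 0.8), (200, 1.0), (400, 1.2)]


def choose_operating_point(cycles_needed, deadline_s, points=OPERATING_POINTS):
    """Return the slowest (frequency, voltage) pair that finishes the work by the deadline."""
    for freq_mhz, volts in points:
        if cycles_needed / (freq_mhz * 1e6) <= deadline_s:
            return freq_mhz, volts
    return points[-1]  # deadline cannot be met: fall back to the fastest point


# 3 million cycles due in 20 ms: 100 MHz is too slow (30 ms), 200 MHz suffices (15 ms).
print(choose_operating_point(3_000_000, 0.020))  # (200, 1.0)
```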



Razor architecture







Razor runs the clock faster than the worst case allows.

Uses a specialized latch to detect errors.

Recovers only on errors, gaining average-case performance.

