PPT - ECE 751 Embedded Computing Systems


Lecture 6: Embedded Processors

Embedded Computing Systems

Mikko Lipasti, adapted from M. Schulte

Based on slides and textbook from Wayne Wolf

© 2007 Elsevier

Topics

• Embedded microprocessor market.

• Categories of CPUs.

• RISC, DSP, and multimedia processors.

• CPU mechanisms.

High Performance Embedded Computing


Demand for Embedded Processors

• Embedded processors account for

Over 97% of total processors sold

Over 60% of total sales from processors

• Sales expected to increase by roughly 15% each year


Flynn’s taxonomy of processors

• Single-instruction single-data (SISD)

• Single-instruction multiple-data (SIMD)

• Multiple-instruction multiple-data (MIMD)

• Multiple-instruction single-data (MISD)

• What is an example of each?

• Which would you expect to see in embedded systems?


Other axes of comparison

• RISC vs. CISC: instruction set style.

• Instruction issue width.

• Static vs. dynamic scheduling for multiple-issue machines.

• Scalar vs. vector processing.

• Single-threaded vs. multithreaded.

• A single CPU can fit into multiple categories.


Embedded vs. general-purpose processors

• Embedded processors may be customized for a category of applications.

Customization may be narrow or broad.

• We may judge embedded processors using different metrics:

Code size.

Energy efficiency.

Memory system performance.

Predictability.


Embedded RISC processors

RISC processors often have simple, highly pipelinable instructions.

Pipelines of embedded RISC processors have grown over time:

ARM7 has 3-stage pipeline.

ARM9 has 5-stage pipeline.

ARM11 has 8-stage pipeline.

ARM11 pipeline [ARM05].


RISC processor families

• ARM:

ARM7 has in-order execution and no memory management or branch prediction;

ARM9 and ARM11 have out-of-order execution, memory management, and branch prediction.

• MIPS:

MIPS32 4K has 5-stage pipeline;

4KE family has DSP extension;

4KS is designed for security.

• PowerPC:

PowerPC 400 series includes several embedded processors;

Motorola and IBM offer superscalar versions of the PowerPC.


Embedded DSP Processors

• Embedded DSP processors are optimized to perform DSP algorithms: speech coding, filtering, convolution, fast Fourier transforms, and discrete cosine transforms.

$y(k) = \sum_{n=0}^{N} b_n \, x(k - n)$

DSP processors feature

Deterministic execution times

Fast multiply-accumulate instructions

Multiple data accesses per cycle

Specialized addressing modes

Efficient support for loops and interrupts

Efficient processing of “streaming” data
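
The multiply-accumulate pattern behind these features can be sketched in Python for an FIR filter. This is only an illustration of the arithmetic, not DSP code; the function name `fir_filter` and the sample coefficients are assumptions made for the example. Each pass through the inner loop is one multiply-accumulate, the operation a DSP is built to perform quickly.

```python
def fir_filter(b, x):
    """Direct-form FIR filter: y[k] = sum over n of b[n] * x[k - n].

    Samples before the start of x are treated as zero.
    """
    y = []
    for k in range(len(x)):
        acc = 0  # accumulator, as in a DSP MAC unit
        for n in range(len(b)):
            if k - n >= 0:
                acc += b[n] * x[k - n]  # one multiply-accumulate
        y.append(acc)
    return y


# Hypothetical 3-tap filter applied to a short input sequence.
coeffs = [1, 2, 3]
samples = [1, 0, 0, 4]
output = fir_filter(coeffs, samples)
```

The inner loop reads one coefficient and one sample per step, which is why multiple data accesses per cycle and efficient loop support matter for this workload.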


Example: TI C55x/C54x DSPs

• 40-bit arithmetic (32-bit values + 8 guard bits).

Barrel shifter.

17 x 17 multiplier.

Two address generators.

Lots of special purpose registers and addressing modes

Coprocessors for compute-intensive functions including pixel interpolation, motion estimation, and DCT/IDCT computations
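
Why guard bits help can be shown with a small Python sketch. It does not model the C55x datapath; the helper `wrap_signed` and the sample values are assumptions chosen for illustration. It accumulates the same 32-bit values in a 32-bit and a 40-bit two's-complement accumulator and compares both with the exact sum.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a two's-complement value of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


# Four maximal positive 32-bit values (illustrative worst case).
products = [0x7FFFFFFF] * 4

acc32 = 0  # accumulator with no guard bits
acc40 = 0  # accumulator with 8 guard bits
for p in products:
    acc32 = wrap_signed(acc32 + p, 32)
    acc40 = wrap_signed(acc40 + p, 40)

true_sum = sum(products)
```

The 32-bit accumulator wraps around and ends with a wrong, negative result, while the 40-bit accumulator holds the exact sum. Eight guard bits leave room for up to 256 worst-case 32-bit values before the accumulator can overflow.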


TI C55x microarchitecture


Parallelism extraction

• Static:

Use compiler to analyze program.

Simpler CPU.

Can’t depend on data values.

VLIW

• Dynamic:

Use hardware to identify opportunities.

More complex CPU.

Can make use of data values.

Superscalar


VLIW architectures

• Each very long instruction word (VLIW) performs multiple operations in parallel

Instruction fields: Branch | Memory | Memory | Arithmetic | Logic | Vector

Needs a good compiler that understands the architecture

Allows deterministic execution times

Code growth can be reduced by allowing:

Operations within an instruction to be performed sequentially

A given field to specify different types of operations

Instruction fields: Seq | Branch/Mem | Mem/Arith | Arith/Logic | Vector


Simple VLIW architecture

• Large register file feeds multiple function units.

[Figure: simple VLIW E box. Labels: register file; function units ALU, ALU, Load/store, Load/store, FU. Example instruction word: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP]
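
How a compiler might fill such an instruction word can be sketched as a greedy, in-order packer. This is a simplified illustration, not the algorithm of any real VLIW compiler. The slot layout `SLOTS`, the function `pack_bundles`, the convention that the first register operand is the destination, and the fifth operation in the sample program are all assumptions for the example.

```python
SLOTS = ["alu", "alu", "mem", "mem", "fu"]  # assumed slot layout


def pack_bundles(ops):
    """Greedily pack in-order operations into fixed-width bundles.

    Each op is (name, unit, dest, sources). A new bundle starts when the
    next op has no free slot of its unit type, or when it reads or writes
    a register written earlier in the current bundle.
    """
    bundles = []
    current = [None] * len(SLOTS)
    written = set()
    for name, unit, dest, sources in ops:
        free = [i for i, s in enumerate(SLOTS) if s == unit and current[i] is None]
        hazard = any(src in written for src in sources) or dest in written
        if not free or hazard:
            bundles.append(current)
            current = [None] * len(SLOTS)
            written = set()
            free = [i for i, s in enumerate(SLOTS) if s == unit]
        current[free[0]] = name
        if dest is not None:
            written.add(dest)
    if any(slot is not None for slot in current):
        bundles.append(current)
    # Unused slots are filled with explicit NOPs, which is where code growth comes from.
    return [[slot or "nop" for slot in bundle] for bundle in bundles]


program = [
    ("add r1,r2,r3", "alu", "r1", ["r2", "r3"]),
    ("sub r4,r5,r6", "alu", "r4", ["r5", "r6"]),
    ("ld r7,foo", "mem", "r7", []),
    ("st r8,baz", "mem", None, ["r8"]),
    ("add r9,r1,r4", "alu", "r9", ["r1", "r4"]),  # hypothetical dependent op
]
bundles = pack_bundles(program)
```

The first four operations share one bundle with a NOP in the last slot. The dependent add falls into a second bundle that is mostly NOPs, which shows why a VLIW machine relies on the compiler to find independent operations.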


Clustered VLIW architecture

Register file, function units divided into clusters.

What are advantages/disadvantages of having clusters in VLIW architectures?

[Figure: two clusters, each with its own execution units and register file, connected by a cluster bus.]


TI C62x/C67x DSPs

• VLIW with up to 8 instructions/cycle.

• Thirty-two 32-bit registers.

• Function units:

Two multipliers.

Six ALUs.

• All instructions execute conditionally.


TI C6x data operations

• 8/16/32-bit arithmetic.

• 40-bit operations.

• Bit manipulation operations.

• C67x processors add floating-point arithmetic.


C6x block diagram

[Figure: Texas Instruments C62x block diagram. Labels: program RAM/cache (512K bits), data RAM (512K bits), bus, execute, DMA, data path 1/reg file 1, data path 2/reg file 2, JTAG, timers, serial, PLL.]

IEEE Signal Processing Magazine, v. 15, no. 2, pp. 86-101, 117, 1998.

Emerging DSP Architectures

Parallelism at multiple levels

Multiple processors

• System-on-a-chip designs

Multiple simultaneous tasks

• Multithreaded processors

Multiple instructions per cycle

• Very Long Instruction Word (VLIW) architectures

Multiple operations per instruction

• Single Instruction Multiple Data (SIMD) instructions

Architecture/compiler pairs improve performance and help manage application complexity


Superscalar processors

• Instructions are dynamically scheduled.

Dependencies are checked at run time in hardware.

• Used to some extent in embedded processors.

Embedded Pentium is two-issue in-order.

Some PowerPCs are superscalar.

• What advantages/disadvantages do VLIW processors have compared to superscalar processors?


SIMD and subword parallelism

• Many special-purpose SIMD machines.

All processors perform the same operation on different data.

• Subword parallelism is widely used for video.

ALU is divided into subwords for independent operations on small operands.

• Vector processing is another form of SIMD processing.

• These terms are often used interchangeably.


SIMD Instructions

Recent multimedia processors commonly support Single Instruction Multiple Data (SIMD) instructions.

The same operation is performed on multiple data operands using a single instruction

  A3      A2      A1      A0
+ B3      B2      B1      B0
= A3+B3   A2+B2   A1+B1   A0+B0

• Exploits low precision and high data parallelism of multimedia applications
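
Subword parallelism can be imitated in software to show what a SIMD add does in hardware. The sketch below packs four unsigned 8-bit lanes into one 32-bit word and adds them so that carries never cross lane boundaries. The function name `packed_add_u8` and the operand values are assumptions for the example; real multimedia instruction sets provide this as a single instruction and often add saturating variants, which this sketch does not model.

```python
def packed_add_u8(a, b):
    """Add four unsigned 8-bit lanes packed in 32-bit words, wrapping per lane."""
    low7 = 0x7F7F7F7F  # low seven bits of every lane
    high = 0x80808080  # top bit of every lane
    # Add the low 7 bits of each lane; the sum fits in 8 bits, so no
    # carry can spill into the neighboring lane.
    partial = (a & low7) + (b & low7)
    # Fold in the top bits with XOR, which adds them without a carry out.
    return (partial ^ ((a ^ b) & high)) & 0xFFFFFFFF


# Lanes, most significant first: 0x01+0x01, 0xFF+0x01, 0x7F+0x7F, 0x10+0x20.
a = 0x01FF7F10
b = 0x01017F20
result = packed_add_u8(a, b)
```

Here `result` is `0x0200FE30`: the second lane wraps from `0xFF + 0x01` to `0x00` without disturbing the lane above it.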


Operand characteristics in MediaBench


Dynamic behavior of loops in MediaBench

The loops of media applications in many cases are not very deep

Path ratio = (instructions executed per iteration) / (total number of loop instructions).

What does the path ratio reveal?
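
As a starting point for that question, the ratio can be computed from a per-iteration instruction count. The function `path_ratio` and the trace values below are hypothetical and are not taken from MediaBench.

```python
def path_ratio(executed_per_iteration, total_loop_instructions):
    """Average instructions executed per iteration divided by the loop's static size."""
    average_executed = sum(executed_per_iteration) / len(executed_per_iteration)
    return average_executed / total_loop_instructions


# Hypothetical loop with 20 static instructions and three traced iterations.
ratio = path_ratio([12, 8, 10], 20)
```

In this made-up trace the ratio is 0.5, meaning that on average half of the loop body is skipped on each iteration. A ratio near 1 would indicate that almost every instruction in the loop runs every time.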


TriMedia TM-1 characteristics

• Characteristics

• Floating point support

Sub-word parallelism support

VLIW

Additional custom operations


TriMedia TM-1

[Figure: TM-1 block diagram. Labels: VLIW CPU, VLD co-processor, image co-processor, video in, video out, audio in, audio out, I2C, timers, serial, PCI, memory interface.]


TM-1 VLIW CPU

[Figure: TM-1 VLIW CPU. Labels: register file, read/write crossbar, function units FU1 through FU27, issue slots 1 through 5.]


Multithreading

• Low-level parallelism mechanism.

Interleaved multithreading (IMT) alternately fetches instructions from separate threads.

Often used with VLIW and vector processors

Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.

Often used with superscalar processors

• What advantages/disadvantages does IMT have relative to SMT?


Dynamic voltage scaling (DVS)

Power scales with $V^2$ while performance scales roughly as $V$.

Reduce operating voltage, add parallel operating units to make up for lower clock speed.

DVS doesn’t work well in processors with high leakage power.
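
The trade-off can be worked through numerically. The sketch below assumes the usual first-order model in which dynamic power is proportional to V^2 times clock frequency, clock frequency is proportional to V, and parallel units give ideal speedup. It ignores leakage power and threshold-voltage effects, and the function names are made up for the example.

```python
def relative_dynamic_power(v_scale, units=1):
    """Dynamic power relative to baseline, assuming P ~ V^2 * f and f ~ V."""
    f_scale = v_scale  # assumed: clock frequency tracks voltage linearly
    return units * v_scale ** 2 * f_scale


def relative_throughput(v_scale, units=1):
    """Throughput relative to baseline, assuming ideal parallel speedup."""
    return units * v_scale


# Halve the voltage and use two units to make up for the slower clock.
power = relative_dynamic_power(0.5, units=2)
throughput = relative_throughput(0.5, units=2)
```

Under these assumptions the two-unit design matches the baseline throughput at one quarter of the dynamic power. Leakage power does not shrink in the same way and grows with the added units, which is consistent with the caveat about high-leakage processors.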


Dynamic voltage and frequency scaling (DVFS)

Scale both voltage and clock frequency.

Can use control algorithms to match performance to application, reduce power.
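
One very simple control policy picks the slowest operating point that still meets a deadline. The sketch below is a minimal illustration of that idea, not a production governor. The table `OPERATING_POINTS`, its frequency and voltage values, and the function `choose_operating_point` are hypothetical.

```python
# (frequency in MHz, supply voltage in volts); hypothetical values.
OPERATING_POINTS = [(100, 0.8), (200, 1.0), (400, 1.2)]


def choose_operating_point(cycles_needed, deadline_s, points=OPERATING_POINTS):
    """Return the lowest-frequency point that finishes the work by the deadline."""
    for freq_mhz, volts in sorted(points):
        if cycles_needed / (freq_mhz * 1e6) <= deadline_s:
            return freq_mhz, volts
    # No point meets the deadline: fall back to the fastest one.
    return max(points)


# 1.5 million cycles of work due in 10 ms needs at least 150 MHz.
selected = choose_operating_point(1_500_000, 0.010)
```

Here `selected` is the 200 MHz point, because 100 MHz would take 15 ms. Running at the lowest adequate frequency allows the lowest adequate voltage, which is where the power saving comes from.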


Razor architecture

Razor runs the clock faster than the worst case allows.

Uses a specialized latch to detect errors.

Recovers only on errors, gaining average-case performance.
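
The average-case argument can be put into a rough back-of-the-envelope model. This is a simplification for illustration only and is not taken from the Razor design. The function name, the clock scaling factor, the error rates, and the recovery penalty are all assumed values.

```python
def razor_relative_throughput(clock_scale, error_rate, recovery_cycles):
    """Throughput relative to a design clocked for the worst case.

    clock_scale: clock frequency relative to the worst-case-safe frequency.
    error_rate: fraction of cycles in which a timing error is detected.
    recovery_cycles: extra cycles spent recovering from each error.
    """
    cycles_per_useful_cycle = 1 + error_rate * recovery_cycles
    return clock_scale / cycles_per_useful_cycle


# Hypothetical: 20% faster clock, 10-cycle recovery penalty.
rare_errors = razor_relative_throughput(1.2, 0.01, 10)
frequent_errors = razor_relative_throughput(1.2, 0.05, 10)
```

With errors in 1% of cycles the faster clock still comes out ahead of the worst-case design, while at 5% the recovery cost outweighs the gain. The approach pays off only while errors stay rare.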
