Slides 2 - Rabie A. Ramadan

Introduction to Embedded Systems
Rabie A. Ramadan
rabieramadan@gmail.com
http://www.rabieramadan.org/classes/2014/embedded/
Topics

• Embedded microprocessor market.
• Categories of CPUs.
• RISC, DSP, and multimedia processors.
• CPU mechanisms.
Demand for Embedded Processors

• Embedded processors account for over 97% of total processors sold.
• Sales are expected to increase by roughly 15% each year.
Evaluating Processors

• Performance
  - Latency: the time required to execute an instruction from start to finish.
  - Throughput: the rate at which instructions are finished.
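
For intuition, consider a worked example (the numbers are assumed for illustration; they are not from the slides): a 5-stage pipeline running at 500 MHz.

    % Assumed figures: 5-stage pipeline, 500 MHz clock (2 ns period).
    \text{Latency} = 5\ \text{stages} \times 2\,\text{ns} = 10\,\text{ns per instruction}
    \text{Throughput} = 1\ \text{instruction per cycle} = 5 \times 10^{8}\ \text{instructions/s}

An instruction takes 10 ns from start to finish, yet a new instruction finishes every 2 ns: latency and throughput are distinct measures.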
Evaluating Processors

• At the program level, computer architects also speak of average performance or peak performance.
• Peak performance is often calculated assuming that instruction throughput proceeds at its maximum rate and all processor resources are fully utilized.
Evaluating Processors

• Embedded system designers often talk about program performance in terms of worst-case (or sometimes best-case) performance.
• This is not simply a characteristic of the processor; it is determined for a particular program running on a given processor.
Evaluating Processors

• Cost
  - The purchase price of the processor.
  - In VLSI design, cost is often measured in terms of the silicon area required to implement a processor, which is closely related to chip cost.
Evaluating Processors

• Energy and power
  - In modern processors, energy and power consumption must be measured for a particular program and its data to obtain accurate results.
Evaluating Processors

• Predictability
  - An important characteristic for embedded systems.
  - When designing real-time systems, we want to be able to predict execution time.
  - Predictability is more difficult to measure than raw performance.
Evaluating Processors

• Security
  - An important characteristic of all processors, including embedded processors.
  - Security is inherently unmeasurable: the fact that we do not know of a successful attack on a system does not mean that such an attack cannot exist.
Basic Computer Architecture

[Figure: von Neumann architecture. A single memory holds both instructions and data; the processor (control unit CU, ALU, and registers) exchanges instructions and data with memory, flanked by input and output units.]
Levels of Parallelism

• Bit-level parallelism
  - Within arithmetic logic circuits.
• Instruction-level parallelism
  - Multiple instructions execute per clock cycle.
• Memory system parallelism
  - Overlap of memory operations with computation.
• Operating system parallelism
  - More than one processor.
  - Multiple jobs run in parallel.
  - Loop level and procedure level.
Levels of Parallelism

Bit-Level Parallelism
• Within arithmetic logic circuits.
Levels of Parallelism

Instruction-Level Parallelism (ILP)
• Multiple instructions execute per clock cycle.
  - Pipelining (instruction and data).
  - Multiple issue: very long instruction word (VLIW).
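
A standard textbook model (not stated on the slides, so treat the formula as an assumption) shows why pipelining pays off: a k-stage pipeline that issues one instruction per cycle finishes N instructions in about

    % k-stage pipeline, N instructions, one issue per cycle:
    T(N) = (k + N - 1)\ \text{cycles}, \qquad
    \text{speedup} = \frac{kN}{k + N - 1} \xrightarrow{\,N \to \infty\,} k

so for long instruction streams the speedup approaches the pipeline depth k.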
Levels of Parallelism

Memory System Parallelism
• Overlap of memory operations with computation.
Levels of Parallelism

Operating System Parallelism
• More than one processor.
• Multiple jobs run in parallel.
• Loop level and procedure level.
Flynn’s Taxonomy

• Single Instruction stream - Single Data stream (SISD)
• Single Instruction stream - Multiple Data stream (SIMD)
• Multiple Instruction stream - Single Data stream (MISD)
• Multiple Instruction stream - Multiple Data stream (MIMD)
Single Instruction stream - Single Data stream (SISD)

[Figure: von Neumann architecture. One control unit (CU) fetches a single instruction stream from memory and drives one ALU over a single data stream.]
Flynn’s Taxonomy

• Single Instruction stream - Single Data stream (SISD)
• Single Instruction stream - Multiple Data stream (SIMD)
• Multiple Instruction stream - Single Data stream (MISD)
• Multiple Instruction stream - Multiple Data stream (MIMD)
Single Instruction stream - Multiple Data stream (SIMD)

• Instructions of the program are broadcast to more than one processor.
• Each processor executes the same instruction synchronously, but using different data.
• Used for applications that operate upon arrays of data.

[Figure: one control unit (CU) broadcasts the instruction stream to several processing elements (PEs), each reading its own data from memory.]
Flynn’s Taxonomy

• Single Instruction stream - Single Data stream (SISD)
• Single Instruction stream - Multiple Data stream (SIMD)
• Multiple Instruction stream - Single Data stream (MISD)
• Multiple Instruction stream - Multiple Data stream (MIMD)
Multiple Instruction stream - Multiple Data stream (MIMD)

• Each processor has a separate program.
• An instruction stream is generated for each program on each processor.
• Each instruction operates upon different data.
Multiple Instruction stream - Multiple Data stream (MIMD)

• Shared memory
• Distributed memory
Shared vs Distributed Memory

[Figure: left, shared memory: processors P share a common bus to a single memory; right, distributed memory: each processor P pairs with its own memory M, and the pairs are connected by a network.]

• Shared memory
  - Single address space.
  - All processes have access to the pool of shared memory.
• Distributed memory
  - Each processor has its own local memory.
  - Message passing is used to exchange data between processors (see the sketch below).
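
The slides show no code; as an illustration, here is a minimal message-passing sketch using MPI (the library choice, ranks, and tag are assumptions for the example):

    /* Minimal distributed-memory sketch: two processes exchange data by
     * message passing rather than by reading each other's memory.
     * MPI is an assumed library choice; build with mpicc, run: mpirun -np 2 */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = 42;            /* lives in each process's local memory */
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }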
Distributed Memory

[Figure: processors P, each with a local memory M and a network interface NI, connected by a network.]

• Processors cannot directly access another processor’s memory.
• Each node has a network interface (NI) for communication and synchronization.
Distributed Memory

[Figure: each node pairs a control unit (CU) and processing element (PE) with its own memory M, holding local instruction and data streams; the nodes communicate over a network.]

• Each processor executes different instructions asynchronously, using different data.
Shared Memory

[Figure: several CU/PE pairs fetch their own instruction and data streams from a single shared memory.]

• Each processor executes different instructions asynchronously, using different data.
Shared Memory

[Figure: top, UMA: processors P on one bus to a single memory; bottom, NUMA: several bus-connected processor/memory groups joined by a network.]

• Uniform memory access (UMA)
  - Each processor has uniform access to memory (symmetric multiprocessor, SMP).
• Non-uniform memory access (NUMA)
  - Time for memory access depends on the location of the data.
  - Local access is faster than non-local access.
  - Easier to scale than SMPs.
Distributed Shared Memory

• Making the main memory of a cluster of computers look as if it were a single memory with a single address space.
• Shared-memory programming techniques can then be used.
Multicore Systems

• Many general-purpose processors.
• GPU (Graphics Processing Unit).
• GPGPU (General-Purpose GPU).
• Hybrid.

The trend is:
• A board composed of multiple many-core chips sharing memory.
• A rack composed of multiple boards.
• A room full of these racks.
Other axes of comparison

• RISC vs. CISC: instruction set style.
• Instruction issue width.
• Static vs. dynamic scheduling for multiple-issue machines.
• Scalar vs. vector processing.
• Single-threaded vs. multithreaded.
• A single CPU can fit into multiple categories.
RISC vs. CISC

• Complex Instruction Set Computer (CISC)
  - “High-level” instruction set.
  - A single instruction executes several “low-level operations”, e.g., a load, an arithmetic operation, and a memory store.
  - Examples: VAX, Intel x86, IBM 360/370.
Features of CISC

• Small number of general-purpose registers.
• Instructions take multiple clocks to execute.
• Few lines of code per operation.
RISC vs. CISC

• Reduced Instruction Set Computer (RISC)
  - RISC is a CPU design that recognizes only a limited number of instructions.
  - Simple instructions.
  - Instructions are executed quickly.
  - Examples: MIPS, DEC Alpha, Sun SPARC, IBM 801.
Features of RISC

• “Reduced” instruction set.
• Executes a series of simple instructions instead of one complex instruction (contrast the sketch below).
• Instructions are executed within one clock cycle.
• Incorporates a large number of general registers for arithmetic operations, to avoid storing variables on a stack in memory.
• Pipelining = speed.
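
To make the contrast concrete, here is a hedged sketch (the instruction sequences in the comments are illustrative pseudo-instructions, not taken from any real ISA or from the slides) of how one C statement might compile on each style of machine:

    /* One C statement, two instruction-set styles.
     * The commented instruction sequences are illustrative only. */
    int a, b, c;

    void add_example(void)
    {
        a = b + c;
        /* Hypothetical CISC encoding: one memory-to-memory instruction
         *   ADD a, b, c        ; load b and c, add, store to a
         *
         * Hypothetical RISC encoding: a series of simple register operations
         *   LW   r1, b         ; load b into register r1
         *   LW   r2, c         ; load c into register r2
         *   ADD  r3, r1, r2    ; r3 = r1 + r2
         *   SW   r3, a         ; store result to a
         */
    }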
Single issue versus Multiple issue

• Instruction issue width is an important aspect of processor performance.
• Processors that can issue more than one instruction per cycle generally execute programs faster.
• They do so at the cost of increased power consumption and higher cost.
Static versus dynamic scheduling

• Static scheduling
  - The instructions to be issued together are determined when the program is written, i.e., at compile time.
• Dynamic scheduling
  - The processor determines which instructions are issued at runtime.
  - Superscalar execution is a common technique for dynamic instruction issue (e.g., Tomasulo’s algorithm).
Embedded vs. general-purpose processors

• Embedded processors may be customized for a category of applications.
  - Customization may be narrow or broad.
• We may judge embedded processors using different metrics:
  - Code size.
  - Energy efficiency.
  - Memory system performance.
  - Predictability.
Embedded RISC processors

• RISC processors often have simple, highly pipelinable instructions.
• Pipelines of embedded RISC processors have grown over time:
  - ARM7 has a 3-stage pipeline.
  - ARM9 has a 5-stage pipeline.
  - ARM11 has an 8-stage pipeline.
RISC processor families

• ARM:
  - ARM7 has in-order execution and no memory management or branch prediction.
  - ARM9 and ARM11 add out-of-order execution, memory management, and branch prediction.
• MIPS:
  - MIPS32 4K has a 5-stage pipeline.
  - The 4KE family has a DSP extension.
  - The 4KS is designed for security.
• PowerPC:
  - The PowerPC 400 series includes several embedded processors.
  - Motorola and IBM offer superscalar versions of the PowerPC.
Embedded DSP Processors

• Embedded DSP processors are optimized to perform DSP algorithms: speech coding, filtering, convolution, fast Fourier transforms, discrete cosine transforms.
• For example, an FIR filter computes

    y(k) = \sum_{n=0}^{N} b_n \, x(k - n)
Embedded DSP Processors - example

• The AT&T DSP-16 was the first DSP.
• It had an onboard multiplier and provided a multiply-accumulate instruction: dest = src1 * src2 + src3, a common operation in digital signal processing.
• Based on a Harvard architecture with separate data and instruction memories.
• Data accesses could rely on consistent bandwidth from the memory, which is particularly important for sampled-data systems.
Parallelism extraction

• Static:
  - Use compiler to analyze program.
  - Simpler CPU.
  - Can’t depend on data values.
  - Very Long Instruction Word (VLIW).
• Dynamic:
  - Use hardware to identify opportunities.
  - More complex CPU.
  - Can make use of data values.
  - Superscalar.
Very Long Instruction Word (VLIW)

• Widespread use in embedded systems.
• VLIW processors provide instruction-level parallelism with relatively low hardware overhead.
• The execution unit includes a pool of function units connected to a large register file.
• The execution unit reads a packet of instructions; each instruction in the packet can control one of the function units in the machine.
Simple VLIW architecture

• A large register file feeds multiple function units.

[Figure: an execution box (E box) in which one register file feeds two ALUs, two load/store units, and a further function unit (FU); a sample instruction packet: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP]
Clustered VLIW architecture

• Register file and function units are divided into clusters.

[Figure: two clusters, each with its own register file and execution units, connected by a cluster bus.]
Very Long Instruction Word (VLIW)

• Example 1: the TriMedia family of processors.
  - Designed for use in video systems.
  - Video algorithms often perform similar operations on several pixels at a time.
Very Long Instruction Word (VLIW)

• Example 2: Texas Instruments C6x VLIW DSP.
Very Long Instruction Word (VLIW)

Example 2: Texas Instruments C6x VLIW DSP
• Onboard program and data RAM, as well as standard devices and DMA.
• The processor core includes two clusters, each with the same configuration.
• Each register file holds 16 words.
• Each data path has eight function units: two load units, two store units, two data address units, and two register file cross paths.
Superscalar Processors

• Issue more than one instruction per clock cycle.
• Unlike VLIW processors, they check for resource conflicts on the fly to determine which combinations of instructions can be issued at each step.
• Superscalar processors are not as common in the embedded world.
• Used to some extent in embedded processors:
  - The embedded Pentium is two-issue, in-order.
  - Some PowerPCs are superscalar.
SIMD Instructions and Parallelism

• Recent multimedia processors commonly support Single Instruction Multiple Data (SIMD) instructions.
• The same operation is performed on multiple data operands using a single instruction:

    [A3 | A2 | A1 | A0] + [B3 | B2 | B1 | B0] = [A3+B3 | A2+B2 | A1+B1 | A0+B0]

• This exploits the low precision and high data parallelism of multimedia applications.
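
As an illustration (the slides name no instruction set; x86 SSE2 is assumed here), the four-lane addition pictured above maps to a single intrinsic:

    /* Four 32-bit additions with one SIMD instruction (x86 SSE2 assumed).
     * Compile on an x86 target, e.g.: cc -msse2 simd_add.c */
    #include <emmintrin.h>
    #include <stdio.h>

    int main(void)
    {
        /* _mm_set_epi32 takes lanes high to low: A3, A2, A1, A0. */
        __m128i a = _mm_set_epi32(4, 3, 2, 1);
        __m128i b = _mm_set_epi32(40, 30, 20, 10);

        __m128i sum = _mm_add_epi32(a, b);  /* all four lanes added at once */

        int out[4];
        _mm_storeu_si128((__m128i *)out, sum);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
        return 0;
    }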
Thread-Level Parallelism

• Hardware multithreading
  - Alternately fetches instructions from separate threads.
  - On one cycle, it fetches several instructions from one thread, fetching enough instructions to be able to keep the pipelines full.
  - On the next cycle, it fetches instructions from another thread.
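
Hardware multithreading is transparent to software; it simply needs independent threads to interleave. A minimal POSIX-threads sketch (an assumed software-side illustration, not from the slides) that exposes two such instruction streams:

    /* Two independent threads whose instruction streams a multithreaded
     * core can interleave. POSIX threads assumed; build with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        long id = (long)arg;
        long sum = 0;
        for (long i = 0; i < 1000000; i++) {  /* independent work per thread */
            sum += i * (id + 1);
        }
        printf("thread %ld: sum = %ld\n", id, sum);
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, (void *)0L);
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }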
Thread-Level Parallelism

• Simultaneous multithreading (SMT)
  - Fetches instructions from several threads on each cycle rather than alternating between threads.
Better-Than-Worst-Case Design

• Digital systems are traditionally designed as synchronous systems governed by clocks.
• The clock period is determined by careful analysis so that values are stored into registers properly, with the clock period extended to cover the worst-case delay.
• But the worst-case delay is relatively rare in many circuits, so the logic sits idle for some period most of the time.
Better-Than-Worst-Case Design

• An alternative design style in which the logic detects and recovers from errors, allowing the circuit to run at a higher speed most of the time.
• The Razor architecture uses a specialized register that measures and evaluates such errors.
Better-Than-Worst-Case Design

• The system register holds the latched value and is clocked at the higher-than-worst-case clock rate.
• A separate shadow register is clocked separately and slightly behind the system register.
• If the results stored in the two registers are different, then an error occurred, probably due to timing.
• The XOR gate detects that error and causes the later value to replace the value in the system register (see the toy model below).
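
The mechanism can be mimicked in software as a toy model (purely illustrative; Razor is a circuit-level technique, and the struct and function names here are hypothetical):

    /* Toy model of Razor-style error detection. Illustrative only:
     * real Razor compares a fast-clocked system register against a
     * delayed shadow register in hardware, not in software. */
    #include <stdio.h>
    #include <stdbool.h>

    typedef struct {
        int system_reg;  /* latched at the aggressive (fast) clock edge */
        int shadow_reg;  /* latched slightly later, always correct */
    } razor_reg_t;

    /* 'early' is the value available at the fast clock edge;
     * 'late' is the value available after the worst-case delay. */
    bool razor_latch(razor_reg_t *r, int early, int late)
    {
        r->system_reg = early;
        r->shadow_reg = late;
        bool error = (r->system_reg != r->shadow_reg); /* the XOR compare */
        if (error) {
            r->system_reg = r->shadow_reg;  /* recover: shadow value wins */
        }
        return error;
    }

    int main(void)
    {
        razor_reg_t r;
        /* Logic finished before the fast clock edge: no error. */
        printf("error=%d value=%d\n", razor_latch(&r, 7, 7), r.system_reg);
        /* Logic missed the fast edge: error detected and corrected. */
        printf("error=%d value=%d\n", razor_latch(&r, 5, 7), r.system_reg);
        return 0;
    }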