General Purpose Processor

Introduction

General-Purpose Processor
• Processor designed for a variety of computation tasks
• Low unit cost, in part because the manufacturer spreads NRE over large numbers of units
  – Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
• Carefully designed, since higher NRE is acceptable
  – Can yield good performance, size, and power
• Low NRE cost for the embedded system designer, short time-to-market/prototype, high flexibility
  – User just writes software; no processor design
Basic Architecture
• Control unit and datapath
  – Similar to a single-purpose processor
• Key differences
  – Datapath is general
  – Control unit doesn't store the algorithm – the algorithm is "programmed" into the memory
[Figure: processor containing a control unit (controller, PC, IR, control/status signals) and a datapath (ALU, registers), connected to memory and I/O]
Datapath Operations
• Load: read a memory location into a register
• ALU operation: input certain registers through the ALU, store the result back in a register
• Store: write a register to a memory location
[Figure: datapath example – the value 10 is loaded from memory into a register, incremented by the ALU to 11, and the result is stored back to memory]
Control Unit
• Control unit: configures the datapath operations
  – Sequence of desired operations ("instructions") stored in memory – the "program"
• Instruction cycle – broken into several sub-operations, each taking one clock cycle (see the C sketch below), e.g.:
  – Fetch: get the next instruction into the IR
  – Decode: determine what the instruction means
  – Fetch operands: move data from memory to a datapath register
  – Execute: move data through the ALU
  – Store results: write data from a register to memory
[Figure: example program in memory – 100: load R0, M[500]; 101: inc R1, R0; 102: store M[501], R1 – with memory location 500 initially holding 10]
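To make the cycle concrete, the following C sketch steps through the same three-instruction program in software. The instruction "encoding" (an opcode plus two operand fields) and the tiny register file are invented for illustration; they are not a real processor's formats.

```c
#include <stdio.h>

/* Minimal software model of the fetch-decode-execute loop for the
   three-instruction example program above. */
enum { OP_LOAD, OP_INC, OP_STORE, OP_HALT };

typedef struct { int op, rd, src; } Instr;

int main(void) {
    int mem[600] = {0};          /* data memory; location 500 holds 10 */
    int reg[2]   = {0};          /* R0, R1 */
    mem[500] = 10;

    Instr program[] = {          /* addresses 100..102 on the slide */
        { OP_LOAD,  0, 500 },    /* load  R0, M[500] */
        { OP_INC,   1, 0   },    /* inc   R1, R0     */
        { OP_STORE, 1, 501 },    /* store M[501], R1 */
        { OP_HALT,  0, 0   },
    };

    int pc = 0;
    for (;;) {
        Instr ir = program[pc++];                         /* fetch into IR, advance PC */
        if (ir.op == OP_HALT)  break;                     /* decode                    */
        if (ir.op == OP_LOAD)  reg[ir.rd] = mem[ir.src];  /* fetch operand from memory */
        if (ir.op == OP_INC)   reg[ir.rd] = reg[ir.src] + 1; /* execute through "ALU"  */
        if (ir.op == OP_STORE) mem[ir.src] = reg[ir.rd];  /* store results             */
    }
    printf("M[501] = %d\n", mem[501]);                    /* prints 11 */
    return 0;
}
```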
Instruction Cycles
[Figure: first instruction cycle, PC=100 – fetch, decode, fetch operands, execute, store results for "load R0, M[500]"; the value 10 at memory location 500 is read into R0]
Instruction Cycles
[Figure: second instruction cycle, PC=101 – the same five sub-operations for "inc R1, R0"; the ALU adds 1 to R0's value (10) and writes 11 into R1]
Instruction Cycles
[Figure: third instruction cycle, PC=102 – the five sub-operations for "store M[501], R1"; the value 11 in R1 is written to memory location 501]
Architectural Considerations
• N-bit processor: N-bit ALU, registers, buses, and memory data interface
  – Embedded: 8-bit, 16-bit, and 32-bit are common
  – Desktop/servers: 32-bit, even 64-bit
• PC size determines the address space – e.g., a 16-bit PC can address 2^16 = 65,536 locations
Architectural Considerations
• Clock frequency: inverse of the clock period
  – The clock period must be longer than the longest register-to-register delay in the entire processor (see the example below)
  – Memory access is often the longest such delay
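As a quick illustration (the numbers are invented): if the slowest register-to-register path, say a memory access, takes 10 ns, the clock period cannot be shorter than 10 ns, so the maximum clock frequency is 1 / 10 ns = 100 MHz.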
ARM Introduction
ARM RISC Design Philosophy
• Smaller die size
• Shorter development time
• Higher performance
  – Analogy: insects flap their wings faster than small birds
  – A complex instruction may make some high-level function more efficient, but it slows down the clock for all instructions
ARM Design Philosophy
• Reduce power consumption and extend battery life
• High code density
• Low price
  – Embedded systems prefer slow, low-cost memory
  – Reduce the area of the die taken by the embedded processor
  – Leave space for specialized processors
• Hardware debug capability
• ARM is not a pure RISC architecture
  – Designed primarily for embedded systems
Instruction Set for Embedded Systems
• Variable cycle execution for certain instructions
  – Multi-register load-store instructions
  – Faster if memory access is sequential
  – Higher code density – these are common operations at the start and end of functions
• Inline barrel shifting – leads to more complex instructions
  – Improved code density
  – E.g. ADD r0, r1, r1, LSL #1 computes r0 = r1 + (r1 << 1), i.e. r0 = 3 × r1, in a single instruction
Instruction Set for Embedded Systems
• Thumb 16-bit instruction set
  – Code can mix 16-bit and 32-bit instructions
• Conditional execution
  – Improved code density
  – Reduces branch instructions
  – E.g. (see the C sketch after this list):
      CMP   r1, r2
      SUBGT r1, r1, r2
      SUBLT r2, r2, r1
• Enhanced instructions – DSP instructions
  – Use one processor instead of the traditional combination of two (a microcontroller plus a DSP)
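The conditional-execution snippet above is the classic ARM idiom for one step of Euclid's GCD: it subtracts the smaller value from the larger without any branch. A minimal C sketch of the same logic (the function name is mine, not from the slides):

```c
#include <stdio.h>

/* Same decision the CMP/SUBGT/SUBLT sequence makes in one pass:
   compare the two values and subtract the smaller from the larger.
   Repeating it until the values are equal yields the GCD. */
static unsigned gcd(unsigned r1, unsigned r2) {
    while (r1 != r2) {
        if (r1 > r2)      /* SUBGT r1, r1, r2 */
            r1 -= r2;
        else              /* SUBLT r2, r2, r1 */
            r2 -= r1;
    }
    return r1;
}

int main(void) {
    printf("%u\n", gcd(12, 18));  /* prints 6 */
    return 0;
}
```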
ARM-Based Embedded Device
Peripherals
• All ARM peripherals are memory mapped – each is programmed through registers at fixed addresses in the memory map (see the C sketch after this list)
• Interrupt controllers
  – Standard interrupt controller
    – Sends an interrupt signal to the processor core
    – Can be programmed to ignore or mask an individual device or set of devices
    – The interrupt handler reads a device bitmap register to determine which device requires servicing
  – VIC – vectored interrupt controller
    – Assigns a priority and an ISR handler to each device
    – Depending on the type, it either calls the standard interrupt handler or jumps to the specific device handler directly
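A minimal sketch of what "memory mapped" means in practice. The base address, offsets, and register layout below are invented placeholders, not those of any real interrupt controller; a driver would take them from the chip's datasheet.

```c
#include <stdint.h>

/* Hypothetical base address and register offsets -- illustration only. */
#define INTC_BASE        0x10140000u
#define INTC_STATUS      (*(volatile uint32_t *)(uintptr_t)(INTC_BASE + 0x00)) /* pending-interrupt bitmap   */
#define INTC_ENABLE_SET  (*(volatile uint32_t *)(uintptr_t)(INTC_BASE + 0x10)) /* write 1 to unmask a source */
#define INTC_ENABLE_CLR  (*(volatile uint32_t *)(uintptr_t)(INTC_BASE + 0x14)) /* write 1 to mask a source   */

/* Unmask interrupt source 'n'. */
static inline void intc_enable(unsigned n) {
    INTC_ENABLE_SET = 1u << n;
}

/* Handler loop for a standard (non-vectored) controller: read the bitmap
   register and service each device whose bit is set. */
void irq_dispatch(void (*service[])(void)) {
    uint32_t pending = INTC_STATUS;
    for (unsigned n = 0; pending != 0; ++n, pending >>= 1) {
        if (pending & 1u)
            service[n]();
    }
}
```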
ARM Datapath
• Registers
  – R0–R15 general-purpose registers
  – R13 – stack pointer
  – R14 – link register
  – R15 – program counter
  – R0–R13 are orthogonal (any instruction that can use R0 can equally use any of them)
• Two program status registers
  – CPSR
  – SPSR
ARM’s visible registers
[Figure: ARM's visible registers – in user mode, r0–r15 (r15 = PC) and the CPSR are usable; each privileged mode banks a subset of registers plus its own SPSR: fiq banks r8_fiq–r14_fiq and SPSR_fiq; svc, abort, irq, and undefined modes each bank r13 and r14 (e.g. r13_svc, r14_svc) and their own SPSR]
Banked Registers
• 37 registers in total
  – 20 are hidden from the program at any given time – these are the banked registers
  – Available only when the processor is in a certain mode
• The mode can be changed by the program or on an exception
  – Reset, interrupt request, fast interrupt request, software interrupt, data abort, prefetch abort, and undefined instruction
• No SPSR access in user mode
CPSR
• Condition flags – N, Z, C, V
• Interrupt masks – I, F
• Thumb state – T; Jazelle – J
• Mode bits 4–0 – processor mode
• Six privileged modes
  – Abort – entered after a failed attempt to access memory
  – Fast interrupt request (FIQ)
  – Interrupt request (IRQ)
  – Supervisor – entered after reset; the kernel works in this mode
  – System – a special version of user mode with full read/write access to the CPSR
  – Undefined – entered when an undefined or unsupported instruction is executed
• Plus the unprivileged User mode
• Layout: bits 31–28 hold N, Z, C, V; bits 27–8 are unused; bits 7–6 hold the I and F masks; bit 5 is T; bits 4–0 hold the mode
Instruction execution
3-Stage Pipeline ARM Organization
• Fetch
  – The instruction is fetched from memory and placed in the instruction pipeline
• Decode
  – The instruction is decoded and the datapath control signals are prepared for the next cycle; in this stage the instruction "owns" the decode logic but not the datapath
• Execute
  – The instruction "owns" the datapath: the register bank is read, an operand is shifted, and the ALU result is generated and written back into the destination register
ARM7 Core Diagram
3 stage Pipeline – Single Cycle Inst.
3 stage Pipeline – Multi-Cycle Inst.
PC Behavior
• R15 has been incremented twice before an instruction executes, due to the pipeline
  – R15 = address of the current instruction + 8
  – The offset is +4 for Thumb instructions
To Get Higher Performance
• Tprog = (Ninst × CPI) / fclk  (worked example below)
  – Ninst – number of instructions executed for the program – constant for a given program
• Increase the clock rate fclk
  – The clock rate is limited by the slowest pipeline stage
  – Decrease the logic complexity per stage
  – Increase the pipeline depth
• Improve the CPI
  – Instructions that take more than one cycle are re-implemented to occupy fewer cycles
  – Pipeline stalls are reduced
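Worked example (numbers invented for illustration): a program of Ninst = 10^6 instructions with CPI = 1.5 on a 100 MHz clock takes Tprog = (10^6 × 1.5) / 10^8 = 15 ms; raising the clock to 200 MHz cuts this to 7.5 ms, while improving the CPI to 1.0 at 100 MHz gives 10 ms.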
Typical Dynamic Instruction usage
Statistics for a print-preview program running in an ARM instruction emulator:

Instruction type        Dynamic usage
Data movement           43%
Control flow            23%
Arithmetic operations   15%
Comparisons             13%
Logical operations       5%
Other                    1%
Memory Bottleneck
• Von Neumann bottleneck
  – A single memory holds both instructions and data
  – Performance is limited by the available memory bandwidth
  – A 3-stage ARM core accesses memory on (almost) every clock cycle
• Higher-performance ARM cores use a Harvard architecture (separate instruction and data memories)
The 5-Stage Pipeline
• Fetch
  – The instruction is fetched and placed in the instruction pipeline
• Decode
  – The instruction is decoded and the register operands are read from the register file
• Execute
  – An operand is shifted and the ALU result is generated; for loads and stores the memory address is computed
• Buffer/Data
  – Data memory is accessed if required; otherwise the ALU result is simply buffered
• Write-back
  – The results are written back to the register file
Data Forwarding
• Read-after-write pipeline hazard
  – An instruction needs to use the result of one of its predecessors before that result has returned to the register file
  – E.g.  ADD r1, r2, r3
          ADD r4, r5, r1
• Data forwarding is used to eliminate the stall
• In the following case, even with forwarding, a pipeline stall cannot be avoided:
  – LDR rN, [..]      ; load rN from somewhere
    ADD r2, r1, rN    ; and use it immediately
  – The processor cannot avoid a one-cycle stall
Data Hazards
• Handling data hazards in software
  – Solution: encourage the compiler not to place a dependent instruction immediately after a load instruction
• Side effects
  – When a location other than the one explicitly named in an instruction as the destination operand is affected
• Addressing modes
  – Complex addressing modes do not necessarily lead to faster execution
  – E.g. the single instruction  Load (X(R1)), R2  versus the sequence
      Add  #X, R1, R2
      Load (R2), R2
      Load (R2), R2
Data Hazards
• Complex addressing modes
  – Require more complex hardware to decode and execute
  – Cause the pipeline to stall
• Addressing-mode features that suit pipelining
  – Access to an operand does not require more than one access to memory
  – Only load and store instructions access memory
  – The addressing modes used do not have side effects
  – Register, register-indirect, and index modes satisfy these conditions
• Condition codes
  – Flags should be modified by as few instructions as possible
  – The compiler should be able to tell in which instructions of the program they are affected and in which they are not

Complex Addressing Mode
Load (X(R1)), R2
Clock cycle 1
2
3
Load
D
X + [R1]
F
4
5
6
[X +[R1]] [[X +[R1]]]
Time
7
W
Forward
Next instruction
F
D
(a) Complex addressing mode
E
W
Simple Addressing Mode
Add  #X, R1, R2
Load (R2), R2
Load (R2), R2
[Figure (b): simple addressing mode – the same work is done by three simple instructions, each flowing through the pipeline without stalling the instructions behind it]
ARM 5 Stage Pipeline
Instruction Hazards – Overview
• Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls
  – Cache miss
  – Branch
Unconditional Branches
[Figure 8.8: an idle cycle caused by a branch instruction – I2 is a branch; the already-fetched I3 is discarded (X) and the execution unit sits idle for one cycle while the branch target Ik is fetched]
[Figure 8.9: branch timing – the penalty is reduced when the branch address is computed in the Decode stage rather than in the Execute stage]
Branch Timing
- Branch penalty
- Reducing the penalty
Instruction Queue and Prefetching
[Figure 8.10: use of an instruction queue – the fetch unit (F) places instructions into an instruction queue, from which a dispatch/decode unit (D) issues them to the execute (E) and write-results (W) stages]
Branch Timing with Instruction Queue
[Figure 8.11: branch timing in the presence of an instruction queue – the branch target address is computed in the D stage; while earlier instructions execute, the fetch unit keeps the queue filled (queue length varies between 1 and 3), so when the branch I5 is dispatched it can be folded and the fetch of the target Ik overlaps useful work instead of creating an idle cycle]
Branch Folding
• Branch folding – executing the branch instruction concurrently with the execution of other instructions
• Branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch is available in the queue
• It is therefore desirable to arrange for the queue to be full most of the time, to ensure an adequate supply of instructions for processing
• This can be achieved by increasing the rate at which the fetch unit reads instructions from the cache
• Having an instruction queue is also beneficial in dealing with cache misses
Conditional Branches
• A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction
• The decision to branch cannot be made until the execution of that instruction has been completed
• Branch instructions represent about 20% of the dynamic instruction count of most programs
Delayed Branch
• The instructions in the delay slots are always fetched, so we would like to arrange for them to be fully executed whether or not the branch is taken
• The objective is to place useful instructions in these slots
• The effectiveness of the delayed-branch approach depends on how often it is possible to reorder instructions
Delayed Branch
(a) Original program loop:
    LOOP   Shift_left  R1
           Decrement   R2
           Branch=0    LOOP
    NEXT   Add         R1, R3

(b) Reordered instructions (Figure 8.12 – the shift fills the delay slot):
    LOOP   Decrement   R2
           Branch=0    LOOP
           Shift_left  R1
    NEXT   Add         R1, R3
Delayed Branch
[Figure 8.13: execution timing showing the delay slot being filled during the last two passes through the loop of Figure 8.12 – Decrement, Branch, Shift (delay slot) repeat while the branch is taken; on the final pass the branch is not taken and Add follows the Shift in the delay slot]
Branch Prediction
• The goal is to predict whether or not a particular branch will be taken
• Simplest form: assume the branch will not be taken and continue to fetch instructions in sequential address order
• Until the branch is evaluated, instruction execution along the predicted path must be done on a speculative basis
• Speculative execution: instructions are executed before the processor is certain that they are in the correct execution sequence
• Care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed
Incorrectly Predicted Branch
[Figure 8.14: timing when a branch decision has been incorrectly predicted as not taken – I1 (Compare) and I2 (Branch>0) proceed normally; the speculatively fetched I3 and I4 are discarded (X) once the branch resolves, and fetching restarts at the branch target Ik]
Branch Prediction
• Better performance can be achieved if some branch instructions are predicted as taken and others as not taken
  – Use hardware to observe whether the target address is lower or higher than that of the branch instruction (backward branches, typically loop branches, are predicted taken)
  – Let the compiler include a branch-prediction bit in the instruction
• So far, the prediction decision is always the same every time a given instruction is executed – this is static branch prediction
Superscalar Operation
• Maximum throughput of a simple pipeline: one instruction per clock cycle
• Superscalar: multiple processing units
  – More than one instruction can complete per cycle
[Figure 8.19: a processor with two execution units – an instruction queue feeds a dispatch unit, which issues instructions to a floating-point unit and an integer unit; both share the write-results stage]
Timing
[Figure 8.20: an example of instruction execution flow in the processor of Figure 8.19, assuming no hazards – the integer instructions I2 (Add) and I4 (Sub) dispatch and complete quickly while I1 (Fadd) and I3 (Fsub) each need three execute cycles in the floating-point unit, so results are written back out of program order]
ALU
• Logic operations
  – OR, AND, XOR, NOT, NAND, NOR, etc.
  – No dependencies among bits – each result bit can be calculated in parallel
• Arithmetic operations
  – ADD, SUB, INC, DEC, MUL, DIV
  – Involve a long carry-propagation chain
  – Major source of delay
  – Require optimization
• Suitability of an algorithm is judged by
  – Resource usage – physical space on the silicon die
  – Turnaround time
The Original ARM1 Ripple-Carry Adder Circuit
[Figure: one bit of the ARM1 ripple-carry adder – inputs A, B, and Cin produce sum and Cout, and the carry ripples from bit to bit]
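To make the carry chain explicit, here is a small C sketch of ripple-carry addition done bit by bit, in the spirit of the ARM1 adder above (an illustrative model, not the actual circuit):

```c
#include <stdint.h>
#include <stdio.h>

/* Ripple-carry addition: each bit's sum depends on the carry produced by the
   bit below it, so the loop models the serial carry-propagation chain that
   makes wide additions slow in hardware. */
static uint32_t ripple_add(uint32_t a, uint32_t b) {
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; ++i) {
        uint32_t ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        uint32_t s  = ai ^ bi ^ carry;                          /* sum bit   */
        carry       = (ai & bi) | (ai & carry) | (bi & carry);  /* carry out */
        sum        |= s << i;
    }
    return sum;
}

int main(void) {
    printf("%u\n", ripple_add(1234u, 5678u));  /* prints 6912 */
    return 0;
}
```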
The ARM2 4-bit Carry Look-Ahead Scheme
[Figure: the ARM2 4-bit carry look-ahead block – 4-bit adder logic takes A[3:0], B[3:0], and Cin[0], produces sum[3:0] plus group generate (G) and propagate (P) signals, from which Cout[3] is formed without rippling through all four bits]
The ARM2 ALU Logic for One Result Bit
[Figure: the NA and NB operand buses feed gating controlled by the function-select lines fs[5:0], producing the generate (G) and propagate (P) terms and the carry logic that drive the ALU output bus]
ARM2 ALU Function Codes
[Table: the function-select inputs fs[5:0] choose the ALU operation; the selectable outputs are A and B, A and not B, A xor B, A plus not B plus carry, A plus B plus carry, not A plus B plus carry, A, A or B, B, not B, and zero]
The ARM6 Carry-Select Adder Scheme
[Figure: the operands are split into 4-, 4-, 8-, and 16-bit groups (sum[3:0], sum[7:4], sum[15:8], sum[31:16]); each higher group computes both "sum" and "sum+1" in parallel, and multiplexers pick the correct result once the carry out of the lower groups is known]
Conditional Sum Adder
• Extension of the carry-select adder (see the C sketch below)
• Carry-select adder
  – One level using k/2-bit adders
  – Two levels using k/4-bit adders
  – Three levels using k/8-bit adders
  – Etc.
• Assuming k is a power of two, the extreme case has log2(k) levels using 1-bit adders – this is the conditional-sum adder
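A software sketch of the carry-select idea underlying both adders: compute the upper part for both possible carries, then pick one. The 16/16 split and the helper name are mine, for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

/* Carry-select style 32-bit addition split into two 16-bit halves: the upper
   half is computed twice, once assuming carry-in 0 and once assuming carry-in
   1, and the real carry out of the lower half selects which result to use.
   A conditional-sum adder applies this idea recursively down to 1-bit groups. */
static uint32_t carry_select_add32(uint32_t a, uint32_t b) {
    uint32_t lo  = (a & 0xFFFFu) + (b & 0xFFFFu);  /* low 16 bits plus carry  */
    uint32_t c16 = lo >> 16;                       /* carry out of bit 15     */
    uint32_t hi0 = (a >> 16) + (b >> 16);          /* upper half, carry-in 0  */
    uint32_t hi1 = hi0 + 1u;                       /* upper half, carry-in 1  */
    uint32_t hi  = c16 ? hi1 : hi0;                /* select on the real carry */
    return ((hi & 0xFFFFu) << 16) | (lo & 0xFFFFu);
}

int main(void) {
    printf("%u\n", carry_select_add32(70000u, 80000u));  /* prints 150000 */
    return 0;
}
```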
Conditional sum - example
Conditional Sum Adder: Top-Level Block for One Bit Position
The ARM6 ALU Organization
[Figure: A and B operand latches feed XOR gates that can invert either operand, then both a logic-functions block and the adder (with C in); a logic/arithmetic result multiplexer selects the result, a zero detector produces Z, and the N, C, and V flags come from the datapath]
The Cross-Bar Switch Barrel Shifter Principle
[Figure: 4-bit cross-bar – inputs in[3:0] connect to outputs out[3:0] through a matrix of switches; turning on a single diagonal selects "no shift", "right 1/2/3", or "left 1/2/3" in one step]
Shift Implementation
• For a left or right shift, one diagonal of the cross-bar is turned on
  – The shifter operates in negative logic
  – Precharging sets all outputs to logic '0'
• For rotate right, the right-shift diagonal is enabled together with the complementary left-shift diagonal (see the C sketch below)
• Arithmetic shift right uses sign extension rather than '0' fill
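The rotate-right trick has a direct software analogue: a rotate is a right shift OR-ed with the complementary left shift. A small C sketch (standard bit-twiddling, not ARM-specific code):

```c
#include <stdint.h>
#include <stdio.h>

/* Rotate right by n: the right-shifted part combined with the bits that wrap
   around via the complementary left shift -- the same pairing of diagonals
   the barrel shifter uses. */
static uint32_t ror32(uint32_t x, unsigned n) {
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* Arithmetic shift right: the sign bit is replicated into the vacated
   positions instead of filling with zeros. */
static int32_t asr32(int32_t x, unsigned n) {
    return (int32_t)(x < 0 ? ~(~(uint32_t)x >> n) : (uint32_t)x >> n);
}

int main(void) {
    printf("%08x\n", ror32(0x80000001u, 1)); /* prints c0000000 */
    printf("%d\n",   asr32(-8, 1));          /* prints -4       */
    return 0;
}
```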
Multiplier
• ARM includes hardware support for integer multiplication
• Older ARM cores include low-cost multiplication hardware
  – Supports 32-bit-result multiply and multiply-accumulate
  – Uses the main datapath iteratively
  – The barrel shifter and ALU generate 2 bits of product in each cycle
  – Employs a modified Booth's algorithm to produce the 2-bit partial products
Multiplier
• Radix-2 multiplication
• Radix-4 multiplication
• Radix-2 Booth algorithm
• Radix-4 Booth algorithm
Modified Booth's Recoding

x(i+1)  x(i)  x(i-1) | y(i+1)  y(i) | z(i/2) | Explanation
  0      0      0    |    0     0   |    0   | No string of 1s in sight
  0      0      1    |    0     1   |    1   | End of string of 1s
  0      1      0    |    1    -1   |    1   | Isolated 1
  0      1      1    |    1     0   |    2   | End of string of 1s
  1      0      0    |   -1     0   |   -2   | Beginning of string of 1s
  1      0      1    |   -1     1   |   -1   | End a string, begin new one
  1      1      0    |    0    -1   |   -1   | Beginning of string of 1s
  1      1      1    |    0     0   |    0   | Continuation of string of 1s
Example – recoding a multiplier into radix-2 and radix-4 digits:

  Operand x          =   1  1  0  1  1  0  1  0  1  1  1  0
  Recoded version y  =   0 -1  1  0 -1  1 -1  1  0  0 -1  0
  Radix-4 version z  =     -1     2    -1    -1     0    -2
Example: Modified Booth's Recoding
================================
a         0 1 1 0            Multiplicand
x         1 0 1 0            Multiplier
z           -1    -2         Radix-4 recoding of x
================================
p(0)      0 0 0 0 0 0
+z0·a     1 1 0 1 0 0        (z0 = -2: add -2a)
–––––––––––––––––––––––––––––
4p(1)     1 1 0 1 0 0
p(1)      1 1 1 1 0 1 0 0
+z1·a     1 1 1 0 1 0        (z1 = -1: add -a)
–––––––––––––––––––––––––––––
4p(2)     1 1 0 1 1 1 0 0
p(2)      1 1 0 1 1 1 0 0    Product p (a = 6, x = -6, p = -36)
================================
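A small C sketch of radix-4 (modified Booth) multiplication that reproduces the recoding table and the example above: two multiplier bits are retired per iteration, mirroring the 2 bits of product per cycle mentioned earlier. This is an illustrative model, not the ARM circuit.

```c
#include <stdint.h>
#include <stdio.h>

/* Radix-4 Booth multiplication of two 32-bit signed integers. Each step
   examines bit pair x[i+1..i] plus the bit below it and adds -2a, -a, 0,
   +a or +2a (the z digit from the recoding table), weighted by 4^(i/2). */
static int64_t booth_mul(int32_t a, int32_t x) {
    uint32_t ux = (uint32_t)x;            /* work on the two's-complement bits */
    int64_t product = 0;
    int prev = 0;                         /* x(i-1), initially 0 */
    for (int i = 0; i < 32; i += 2) {
        int b0 = (ux >> i) & 1u;          /* x(i)   */
        int b1 = (ux >> (i + 1)) & 1u;    /* x(i+1) */
        int z  = b0 + prev - 2 * b1;      /* recoded digit in {-2,-1,0,1,2} */
        product += (int64_t)z * a * ((int64_t)1 << i);
        prev = b1;
    }
    return product;
}

int main(void) {
    printf("%lld\n", (long long)booth_mul(6, -6));        /* -36, as in the example */
    printf("%lld\n", (long long)booth_mul(-12345, 6789)); /* -83810205 */
    return 0;
}
```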
High-Speed Multiplier
• Recent cores have high-performance multiplication hardware
  – Supports 64-bit-result multiply and multiply-accumulate
Multiplier: Carry-Save Addition
• In multiplication, multiple partial products are added; doing this with 2-operand adders repeats the time-consuming carry propagation several times – k operands need k-1 propagations
• Carry-save addition lowers this penalty: carry propagates only in the last step, while the other steps generate a partial sum and a sequence of carries
• A basic CSA accepts 3 n-bit operands and generates two n-bit results: an n-bit partial sum and an n-bit carry
• A second CSA accepts those two sequences plus another input operand and generates a new partial sum and carry
• A CSA therefore reduces the number of operands to be added from 3 to 2 without any carry propagation
CSA Basic Unit – the (3,2) Counter
• Simplest implementation: a full adder (FA) with 3 inputs x, y, z
  – x + y + z = 2c + s (s and c are the sum and carry outputs)
  – The outputs are a weighted binary representation of the number of 1s among the inputs, so the FA is called a (3,2) counter
• An n-bit CSA is n (3,2) counters in parallel, with no carry links between them (see the C sketch below)
[Figure: (a) carry-propagate vs. (b) carry-save connection of full adders – in (a) each FA's Cout feeds the next FA's Cin, forming a ripple chain; in (b) the FAs work in parallel and the carries are collected as a separate output word]
Cascaded CSA for Four 4-bit Operands
• Upper 2 levels – 4-bit CSAs
• 3rd level – a 4-bit carry-propagating adder (CPA)
Wallace Tree
• A better organization of the CSAs – fewer levels and hence a faster operation time
ARM High-Speed Multiplier Organization
[Figure: the Rm and Rs registers feed partial-product generation (Rs is consumed 8 bits per cycle), a row of carry-save adders accumulates a partial sum and partial carry (rotated 8 bits per cycle), the registers are initialized for MLA, and the main ALU adds the final partials]