Chapter V Processor Architecture

PROCESSOR ARCHITECTURE
Jehan-François Pâris
jparis@uh.edu
Chapter Organization
• Logic design conventions
• Implementation of a "toy" CPU
• Pipelining
• Pipelining hazards (IMPORTANT)
– Data hazards
– Control hazards
• Exceptions
• Parallelism
LOGIC DESIGN CONVENTIONS
Combinational/state elements
• Combinational elements:
– Outputs only depend on current inputs
– Stateless
• Adders and, more generally, arithmetic
logic unit (ALU)
Combinational/state elements
• State elements:
– Have a memory holding a state
– Output depends on current inputs and state of
element
– State reflects past inputs
• Flip-flops, …
Judicial analogy
• In our legal system
– Guilty/not guilty decision is stateless
• Good reasons
– Sentencing decision is not
• "Three strikes and you are out" laws
• Good reasons
Clocking methodology
• We will assume an edge-triggered clocking
technology
– Edge is short-enough to prevent data
propagation in state elements
– Can read current state of a memory element
at the same time we update it
Clocking convention
• Omit write control signal if state element is
updated at every active clock edge
A "TOY" CPU
Motivation
• "Toy" CPU will implement a subset of MIPS
instruction set
• Subset will be
– Self-sufficient
– Simpler to implement
– Complex enough to allow a serious discussion
of CPU architecture
The subset
• Will include
– Load and store instructions:
lw (load word) and sw (store word)
– Arithmetic-logic instructions:
add, sub, and, or and slt (set less than)
– Branch instructions:
beq (branch if equal) and j (jump)
Load and store instructions
• Format I
• Three operands:
– Two registers $r1 and $r2
– One displacement d
• lw $r1, d($r2) loads into register $r1 main
memory word at address contents($r2) + d
• sw $r1, d($r2) stores contents of register $r1 into
main memory word at address contents($r2) + d
Arithmetic-logic instructions
• Format R
• Three operands:
– Three registers $r1, $r2 and $r3
• Store into register $r1 result of $r2 <op> $r3
where <op> can be add, subtract, and, or
as well as set if less than
Branch instruction
• Format I
• Three operands:
– Two registers $r1 and $r2
– One displacement d
• beq $r1, $r2, d
set value of PC to PC+4 + 4×d
iff $r1 = $r2
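The branch target formula above can be checked with a short sketch. The function name `branch_target` is hypothetical, and the displacement d is assumed to be already sign-extended (a signed count of words).

```python
def branch_target(pc: int, d: int) -> int:
    """Return the beq target PC+4 + 4*d for a signed word displacement d."""
    # Wrap the result to 32 bits, as a hardware adder would.
    return (pc + 4 + 4 * d) & 0xFFFFFFFF


print(hex(branch_target(0x00400000, 3)))   # 0x400010
print(hex(branch_target(0x00400010, -2)))  # 0x40000c
```

A negative displacement branches backward, which is what loops need.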
The simplest data path
• Assume CPU will do nothing but
– Incrementing its program counter and
– Deliver the next instruction
The simplest data path
[Figure: the PC supplies the Read address of the Instruction Memory, which delivers the Instruction; an Add unit adds 4 to the PC]
Implementing R2R instructions
• The ALU takes two 32-bit inputs
• Returns
– A 32-bit output
– A 1-bit signal if the result is zero
The register file
• Two read outputs that are always available
• One write input activated by a RegWrite signal
• Three register selectors
The register file
[Figure: register file with three 5-bit selectors (Read select 1, Read select 2, Write select), two outputs (Read data 1, Read data 2), one Write data input, and a RegWrite control that enables register writes]
Implementing R2R instructions
[Figure: register file feeding the ALU; RegWrite is enabled so that the ALU Result can be written into the register file]
Implementing load and store
• Require
– An address calculation:
• contents($r2) + d
– An access to data memory
• Before doing the address calculation, we must
transform 16-bit displacement d into a 32-bit
value using sign extension
The data memory
• One address selector
• One write data input
• One read data output
• Two controls
– MemWrite
– MemRead
Sign extension (I)
• If 16-bit number has a zero as MSB
– It is positive
– Must add 16 zero bits
0110 1010 1010 0100
0000 0000 0000 0000 0110 1010 1010 0100
Sign extension (II)
• If 16-bit number has a one as MSB
– It is negative
– Must add 16 one bits
1110 1010 1010 0100
1111 1111 1111 1111 1110 1010 1010 0100
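The two sign-extension cases can be sketched in a few lines. The function name `sign_extend16` is an assumption made for illustration; the two test values are the bit patterns shown on the slides.

```python
def sign_extend16(value: int) -> int:
    """Extend a 16-bit two's-complement value to 32 bits."""
    value &= 0xFFFF                # keep only the low 16 bits
    if value & 0x8000:             # MSB is one: the number is negative
        return value | 0xFFFF0000  # add 16 one bits
    return value                   # MSB is zero: add 16 zero bits


print(f"{sign_extend16(0b0110_1010_1010_0100):032b}")
# 00000000000000000110101010100100
print(f"{sign_extend16(0b1110_1010_1010_0100):032b}")
# 11111111111111111110101010100100
```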
The data memory
[Figure: data memory with a Memory address input, a Write data input and a Read data output; MemWrite enables memory writes and MemRead enables memory reads]
Implementing the store instruction
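[Figure: store datapath showing the register file, the sign extension unit (SE) producing the sign-extended d field, the ALU, and the data memory with its Address, Write and Read ports]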
Implementing the load instruction
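[Figure: load datapath showing the register file, the sign extension unit (SE) applied to the d field, the ALU, and the data memory with its Address, Write and Read ports]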
Implementing conditional branch
• Target Address:
– Sign-extend 16-bit immediate part of
instruction
– Shift left 2
– Add to PC
• Branch Control Logic:
– Perform test operation on two registers
– Check result
Implementing conditional branch
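[Figure: conditional branch datapath. The d field of the instruction is sign-extended (SE), shifted left 2 and added to PC+4 to give the branch destination; the register file feeds the ALU, whose result goes to the branch control logic]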
Note
• Arithmetic-logic operations only use
– Register file and ALU
• Load and store use
– ALU for computing memory address
– Data memory
Implementing other instructions
Combining everything
Left to be done
• All control signals:
– Two multiplexers: ALUSrc and MemtoReg
– RegWrite, MemRead and MemWrite switches
– ALU controls (4 bits)
ALU control signals
ALU control lines   Function
0000                and
0001                or
0010                add
0110                subtract
0111                set on less than
1100                nor (not in "toy" subset)
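The table of ALU control lines can be read as a small dispatch function. The sketch below is only a behavioral model under stated assumptions: the names `alu` and `to_signed` are hypothetical, operands are treated as 32-bit values, and set on less than is assumed to compare signed values.

```python
MASK = 0xFFFFFFFF  # 32-bit word


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu(ctl: int, a: int, b: int) -> tuple[int, int]:
    """Return (32-bit result, zero flag) for a 4-bit ALU control value."""
    a &= MASK
    b &= MASK
    if ctl == 0b0000:
        result = a & b                                   # and
    elif ctl == 0b0001:
        result = a | b                                   # or
    elif ctl == 0b0010:
        result = a + b                                   # add
    elif ctl == 0b0110:
        result = a - b                                   # subtract
    elif ctl == 0b0111:
        result = int(to_signed(a) < to_signed(b))        # set on less than
    elif ctl == 0b1100:
        result = ~(a | b)                                # nor
    else:
        raise ValueError(f"unsupported ALU control value {ctl:04b}")
    result &= MASK
    return result, int(result == 0)


print(alu(0b0010, 2, 3))  # (5, 0)
print(alu(0b0110, 7, 7))  # (0, 1)
```

The zero flag of a subtraction is what beq uses to test whether two registers are equal.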
Controlling the ALU
• Recall that all R-format instructions have
same opcode
– Operation performed by ALU is specified in the
function field (bits <0:5>)
Controlling the ALU
• ALU control inputs generated by two-step process
– Construct two ALUOp control bits from
opcode
– Construct four ALU control bits using
• Two ALUop bits
• Six bits from function field when they are
needed
Dependence table
Opcode   ALUOp   Operation   Function   Action     ALU Ctl
lw       00      lw          --         add        0010
sw       00      sw          --         add        0010
beq      01      beq         --         subtract   0110
R-type   10      add         100000     add        0010
R-type   10      subtract    100010     subtract   0110
R-type   10      and         100100     and        0000
R-type   10      or          100101     or         0001
R-type   10      slt         101010     slt        0111
Notes
• Two-step process simplifies combinational logic
• Many don't care conditions in truth table
Truth table
ALU Op1   ALU Op2   F5 F4 F3 F2 F1 F0   ALU control bits
0         0         X  X  X  X  X  X    0010
0         1         X  X  X  X  X  X    0110
1         0         X  X  0  0  0  0    0010
1         X         X  X  0  0  1  0    0110
1         0         X  X  0  1  0  0    0000
1         0         X  X  0  1  0  1    0001
1         X         X  X  1  0  1  0    0111
Note
• Bits 4 and 5 of function field are not used
• ALUOp bits only have three possible values:
00, 01 and 10
– Introduces don't care conditions
• All R instructions use same data paths
– Other control bits depend only on opcode
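The two-step generation of the ALU control bits can be sketched as follows. The function name `alu_control` is hypothetical; the values come from the dependence table and truth table above, and only the four low-order bits of the function field are examined since bits 4 and 5 are not used.

```python
def alu_control(alu_op: int, funct: int = 0) -> int:
    """Map the 2-bit ALUOp and the 6-bit function field to 4 ALU control bits."""
    if alu_op == 0b00:   # lw and sw: add to compute the address
        return 0b0010
    if alu_op == 0b01:   # beq: subtract to compare the registers
        return 0b0110
    if alu_op == 0b10:   # R-type: decode the function field
        low_bits = funct & 0b1111  # bits 4 and 5 are don't cares
        table = {
            0b0000: 0b0010,  # add
            0b0010: 0b0110,  # subtract
            0b0100: 0b0000,  # and
            0b0101: 0b0001,  # or
            0b1010: 0b0111,  # slt
        }
        return table[low_bits]
    raise ValueError("ALUOp only takes the values 00, 01 and 10")


print(f"{alu_control(0b10, 0b101010):04b}")  # 0111
print(f"{alu_control(0b00):04b}")            # 0010
```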
Control signal effects
Signal     When deasserted                       When asserted
Regdest    Destination register comes from       Destination register comes from
           rt field (bits 20:16)                 rd field (bits 15:11)
Regwrite   None                                  Enables write into destination
                                                 register
ALUSrc     Second ALU operand comes from         Second ALU operand comes from
           second register output                sign-extended displacement
                                                 (bits 15:0)
Control signal effects
Signal     When deasserted                       When asserted
PCSrc      PC is incremented by 4                PC set to branch target value
MemRead    None                                  Enables memory read output
MemWrite   None                                  Enables memory write
MemtoReg   Value fed to destination register     Value fed to destination register
           comes from ALU                        comes from memory
Note
• PCSrc is asserted when
– Instruction is a branch
and
– ALU Zero result bit is asserted
• We will introduce a Branch control line
Control line settings
Instruction   Rdest   ALUsrc   MemtoReg   RegWrite
R-format      1       0        0          1
lw            0       1        1          1
sw            X       1        X          0
beq           X       0        X          0
Control line settings
Instruction   MemRead   MemWrite   Branch   ALUOp1   ALUOp0
R-format      0         0          0        1        0
lw            1         0          0        0        0
sw            0         1          0        0        0
beq           0         0          1        0        1
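The control line settings can be collected into one lookup table. This is a minimal sketch, not a hardware description: the names `CONTROL` and `pc_src` are hypothetical, `None` stands for a don't care (X), and PCSrc is derived as Branch AND Zero, as stated in the note on PCSrc.

```python
# One entry per instruction class; None marks a don't-care (X) setting.
CONTROL = {
    "R-format": dict(RegDest=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                     MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    "lw":       dict(RegDest=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                     MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    "sw":       dict(RegDest=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                     MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    "beq":      dict(RegDest=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                     MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}


def pc_src(instruction: str, zero: int) -> int:
    """PCSrc is asserted when the instruction is a branch and Zero is set."""
    return int(CONTROL[instruction]["Branch"] == 1 and zero == 1)


print(pc_src("beq", 1))  # 1
print(pc_src("beq", 0))  # 0
print(pc_src("lw", 1))   # 0
```

Only lw asserts MemRead and only sw asserts MemWrite, so no instruction in the subset reads and writes data memory at once.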
Active datapaths for a R instruction
Active datapaths for a load instruction
Active datapaths for a beq instruction
The "weird" jump instruction
• Uses J format
– Single 26 bit operand
– Implements an unconditional jump
• New value of PC is obtained as follows
– Bits 1:0 are zero (address is multiple of 4)
– Bits 27:2 come from the 26-bit jump operand
– Bits 31:28 come from PC+4
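The construction of the jump address can be sketched with masks and shifts. The sketch assumes the standard MIPS J format, in which the 26-bit operand shifted left by two fills bits 27:2 and the four high-order bits are taken from PC+4; the function name `jump_target` is hypothetical.

```python
def jump_target(pc: int, operand: int) -> int:
    """Build the 32-bit jump address from PC+4 and the 26-bit operand."""
    upper = (pc + 4) & 0xF0000000         # four high-order bits of PC+4
    lower = (operand & 0x03FFFFFF) << 2   # 26-bit operand; two low bits are zero
    return upper | lower


print(hex(jump_target(0x00400000, 0x0100000)))  # 0x400000
print(hex(jump_target(0x90000000, 1)))          # 0x90000004
```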
Implementing the jump instruction
Limitations of single-cycle design
• If we want all instructions to be executed in one
cycle
– Clock cycle must be long enough to
accommodate instruction taking the most time
• Floating-point multiply or divide
• Does not work for CPUs that have a rich
instruction set
PIPELINING
An analogy (I)
• Washing your clothes
– Four steps:
1. Putting in the washer
2. Putting in the dryer
3. Folding/ironing
4. Putting them away
An analogy (II)
• Most people
– Start second wash load as soon as first wash
load is in dryer
– Put second wash load in dryer and start a
third wash load while they are folding/ironing
the first wash load
Purely sequential approach
Time     6pm    6:30   7pm    7:30    8pm    8:30   9pm    9:30
Load 1   Wash   Dry    Fold   Store
Load 2                                Wash   Dry    Fold   Store
Smart approach
Time     6pm    6:30   7pm    7:30    8pm     8:30    9pm     9:30
Load 1   Wash   Dry    Fold   Store
Load 2          Wash   Dry    Fold    Store
Load 3                 Wash   Dry     Fold    Store
Load 4                        Wash    Dry     Fold    Store
Solution assumes that a housemate
puts folded/ironed clothes away for us
Main advantage
• Can do much more in much less time
Limitation
• Slowed down by time taken by longest step
– Could be washing/drying/ironing
Instruction steps (I)
• Good candidates for pipelining steps
1. Fetch instruction from memory
2. Decode instruction
3. Read registers
4. Execute register to register operation or
calculate address
5. Access operand in memory
6. Write results into a register
Instruction steps (II)
• Since MIPS instruction set has fixed fields, we
can combine steps 2 and 3
1. Fetch instruction from memory
2. Read registers while decoding instruction
3. Execute register to register operation or
calculate address
4. Access operand in memory
5. Write results into a register
Sample step timings
Instruction class      Instruction   Register   ALU         Data     Register   Total
                       fetch         read       operation   access   write      time
Load word (lw)         200 ps        100 ps     200 ps      200 ps   100 ps     800 ps
Store word (sw)        200 ps        100 ps     200 ps      200 ps   --         700 ps
R format instruction   200 ps        100 ps     200 ps      --       100 ps     600 ps
Branch (beq)           200 ps        100 ps     200 ps      --       --         500 ps
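The sample timings give a feel for what pipelining buys. The sketch below is an idealized estimate under stated assumptions: the single-cycle clock must fit the slowest instruction, the pipelined clock must fit the slowest step, and there are no hazards or stalls. All names are hypothetical.

```python
# Step times in ps: fetch, register read, ALU, data access, register write.
STEP_PS = {
    "lw":       [200, 100, 200, 200, 100],
    "sw":       [200, 100, 200, 200, 0],
    "R-format": [200, 100, 200, 0, 100],
    "beq":      [200, 100, 200, 0, 0],
}

# Single-cycle design: one clock period fits the slowest instruction.
single_cycle_clock = max(sum(steps) for steps in STEP_PS.values())
# Pipelined design: one clock period fits the slowest step.
pipelined_clock = max(max(steps) for steps in STEP_PS.values())


def single_cycle_time(n: int) -> int:
    """Time in ps to run n instructions, one per long clock cycle."""
    return n * single_cycle_clock


def pipelined_time(n: int, stages: int = 5) -> int:
    """Time in ps for n instructions in an ideal pipeline without stalls."""
    return (n + stages - 1) * pipelined_clock


print(single_cycle_clock, pipelined_clock)        # 800 200
print(single_cycle_time(3), pipelined_time(3))    # 2400 1400
```

For long instruction sequences the ratio approaches 800/200 = 4 rather than 5, because the steps are not equally long.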
Step 1: Fetch and decode
Step 2: Read registers
Step 3: Use the ALU
Step 4: Access operand in memory
Step 5: Store result in register
Observations
• Most R format instructions operate on three
registers and skip step 4
• Same for most I format instructions with an
immediate operand
• Store operations skip step 5
• Load register instructions go through all five
steps
Pipelining limitations
• Some instructions that skip a step will still have
to wait until preceding instruction is done.
• Hazards:
– An instruction cannot proceed because
• Hardware cannot support the combination
of instructions (structural hazards)
• Data are not ready (data hazards)
• Control/branch hazards
Structural hazards
• Combinations of instructions that prevent
pipelining
A bad MIPS instruction (I)
• Recall that the IBM instruction set had instructions
that add to a register the contents of a
memory location
– RX format
A bad MIPS instruction (II)
• We could think of a MIPS instruction with three
registers operands
ADDX r1, r2, r3
adding to r1 the contents of the word at address
contents of r2 + contents of r3
• We would have r1 = r1 + Mem[r2+r3]
A bad MIPS instruction (III)
• It would be great for accessing arrays
– r2 will have starting address of array
– r3 would contain the array index multiplied
by 4
r3
r2
(fixed value)
(incremented after each step)
A bad MIPS instruction (IV)
• Adding this instruction would be a very bad idea
– Why?
Answer
• Instruction would require two steps using the
ALU
– Adding r2 and r3 to compute the address of
the memory operand (step 4)
– Adding the memory operand to r1
• New step would introduce a structural hazard by
preventing any other instruction from accessing the
ALU
My comment
• Careful design of the MIPS CPU and instruction
set should be noted
– Not true for older instructions sets
• IBM 360, DEC VAX, …
– Not true for X86 instruction sets
• CPU is designed to be compatible with an
existing instruction set
Designing instruction sets
for pipelining (I)
• All instructions should have the same length
– Can fetch future instructions before the
current one is decoded
• Have few instruction formats with register fields
always in the same position
– Can combine instruction decode and register
read steps
Designing instruction sets
for pipelining (II)
• Memory operands should only appear in load
and store instructions
– No instruction can use the ALU twice!
• Operands must be properly aligned in memory
– Can always access them in a single memory
cycle
Data hazards (I)
• Assume we have
add $s0, $t0, $t1
sub $t2, $s0, $t3
or
s0 = t0 + t1
t2 = s0 – t3
• Need result of add before proceeding with sub
instruction
Detail of steps
Cycle   1    2       3       4       5       6
add     IF   ID/RR   ALU     RW
sub          IF      stall   stall   ID/RR   ALU
• Second instruction must wait until first instruction
has updated $s0 in cycle 4 before reading its value in
cycle 5
Data hazards (II)
• New value of $s0 computed by the add
instruction is not stored in $s0 until its step 5 has
completed
• New instruction must wait until add instruction
has performed its step 5 before performing its
step
Data hazards (III)
[Figure: pipeline diagram of the add and sub instructions]
Data hazards (IV)
• We lose two cycles during which nothing can be
done
• Cannot trust compiler to remove all data hazards
• Observe that new value of $s0 becomes
available at the end of step 3 of add instruction
– Add special circuitry to provide this value at
the end of step 2 of sub instruction
• Forwarding or bypassing
After forwarding
[Figure: pipeline diagram of the add and sub instructions with forwarding]
Detail of steps
Cycle   1    2       3       4     5    6
add     IF   ID/RR   ALU     RW
sub          IF      ID/RR   ALU   RW
• Second instruction now gets updated value at
the end of cycle 3 just in time to use it in cycle 4
– No stall cycles
Limitations (I)
• Forwarding worked very well because output of
step 3 of add was forwarded to be input of step 3
of sub
• Would not work as well if the output of a later
step of an instruction is needed as input of an
earlier step of the next instruction
– Will still have one or more pipeline stalls
(bubbles)
Limitations (II)
• Assume we have
lw $s0, 20($t1)
sub $t2, $s0, $t3
or
s0 = Mem[t1+20]
t2 = s0 – t3
• Need new value of s0 before proceeding with
sub instruction
Limitations (III)
[Figure: pipeline diagram of the load followed by the sub instruction]
Detail of steps
Cycle   1    2       3       4       5     6
lw      IF   ID/RR   ALU     MEM     RW
sub          IF      ID/RR   stall   ALU   RW
• Even with forwarding, second instruction must
wait until completion of memory access of first
instruction in cycle 4 before performing its ALU
step in cycle 5
– One stall cycle
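The stall counts in the three cycle tables (add then sub without forwarding, add then sub with forwarding, lw then sub with forwarding) can be reproduced with a tiny model. It is a sketch under stated assumptions that follow those tables: the producer starts in cycle 1, the dependent instruction starts in cycle 2, a register can only be read in a cycle after the one that writes it, and a forwarded value can only be used in the cycle after it is produced. The names `STEPS` and `stalls` are hypothetical.

```python
# Steps of the producing instruction, as drawn in the cycle tables.
STEPS = {
    "add": ["IF", "ID/RR", "ALU", "RW"],
    "lw":  ["IF", "ID/RR", "ALU", "MEM", "RW"],
}


def stalls(producer: str, forwarding: bool) -> int:
    """Stall cycles of a dependent instruction issued right after producer."""
    cycle = {step: i + 1 for i, step in enumerate(STEPS[producer])}
    if forwarding:
        # Value is forwarded from the step that produces it
        # to the ALU step of the next instruction.
        ready = cycle["MEM"] if "MEM" in cycle else cycle["ALU"]
        unstalled = 4   # cycle of the consumer's ALU step without stalls
    else:
        # Value must first be written into the register file,
        # then read by the ID/RR step of the next instruction.
        ready = cycle["RW"]
        unstalled = 3   # cycle of the consumer's ID/RR step without stalls
    return max(0, ready + 1 - unstalled)


print(stalls("add", forwarding=False))  # 2
print(stalls("add", forwarding=True))   # 0
print(stalls("lw", forwarding=True))    # 1
```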
A last word
• In many architectures, the floating point unit is a
significant source of structural hazards
– Less well adapted to pipelining
• The MIPS architecture assumes that we have
separate memories for instructions and data
– Having a single memory for both would result
in many more hazards
Control / jump hazards
• Happen whenever we have a conditional jump
• Consider the instructions
add $4, $5,$6
beq $1,$2, 40
or $7, $8, $9
• Need result of conditional branch (beq) before
deciding whether to execute next instruction (or)
Control hazards (II)
[Figure: pipeline diagram of the beq and or instructions]
Pipelined datapath
Datapaths for pipelined organization
• Define five steps
1. Fetch instruction from memory (IF)
2. Instruction decode and register reads (ID)
3. Execute arithmetic-logic operation on ALU (EX)
4. Access operand in memory (MEM)
5. Write back results into a register (WB)
Datapaths for pipelined organization
• Insert registers to save outputs of each step
before they get updated by the next step
1. IF/ID registers
2. ID/EX registers
3. EX/MEM registers
4. MEM/WB registers
A first try
[Figure: first attempt at a pipelined datapath; the new pipeline registers (IF/ID and the others) are marked "New"]
Comments
• This first try is not correct
– Load instruction will not be implemented
correctly
• Address of destination register will be lost
as soon as new instruction will be fetched
• Must save it at each step
The almost correct datapaths
[Figure: pipelined datapath in which the register address follows the instruction]
More problems
• Address of destination register is not always at
the same place in all instructions
– Could be instruction bits (20-16)
• For all I-format instructions that write into a
register
– Could be instruction bits (15-11)
• In R format instructions
Why?
• In R format instructions
opcode source source dest shamt funct
• In I format instructions
opcode source s/d constant/address
The solution
• Add a multiplexer at stage EX
More about data hazards
• Consider
sub $2,$1,$3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
• Last four instructions depend on result of sub
More about data hazards
• $2 is updated at the end of last cycle of sub
• First instruction that would get the correct value
of $2 would be the add
More about data hazards
Cycle   1    2        3        4        5        6
sub     IF   ID+Reg   EX       MEM      WB
and          IF       ID+Reg   EX       MEM      WB
or                    IF       ID+Reg   EX       MEM
add                            IF       ID+Reg   EX
sw                                      IF       ID+Reg
Adding a forwarding unit
More data hazards
• We can forward the results of sub instruction at
the end of its EX step
– In time for all four following instructions
• To do that we need special forwarding unit
• Not all data hazards can be avoided
– lw followed by any instruction accessing the
loaded word
Why?
• lw loads word from RAM into a register
– Goes through IF, ID+Reg, EX, MEM and
WB steps
– Register value is updated at the end of WB
step
• Must delay any following instruction that wants to
access the contents of the register
Data hazard detection unit
• Detects hazards that cannot be avoided
• Inserts no operation instructions (nop)
– They do nothing!
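The unavoidable case, a lw followed by an instruction that uses the loaded register, can be detected with a simple comparison of register numbers. The sketch below assumes the test is made while the load sits in the ID/EX registers and the following instruction sits in the IF/ID registers; the function and parameter names are hypothetical.

```python
def load_use_hazard(id_ex_mem_read: bool, id_ex_rt: int,
                    if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction being decoded reads the register that
    the load just ahead of it has not yet fetched from memory."""
    # MemRead is only asserted for lw, and lw writes its rt register.
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)


# lw $s0, 20($t1) followed by sub $t2, $s0, $t3
# Register numbers: $s0 = 16, $t3 = 11.
print(load_use_hazard(True, 16, 16, 11))   # True: insert a nop
# add $s0, $t0, $t1 followed by the same sub: forwarding suffices.
print(load_use_hazard(False, 16, 16, 11))  # False
```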
More about control hazards
• Outcome of conditional branch is not known until
end of step EX
– beq and bne use arithmetic unit to evaluate
the branch condition
– If branch is taken, we must abort the two
following instructions
• Easy because they have not yet updated
anything
More about control hazards
Cycle   1    2        3        4       5        6
beq     IF   ID+Reg   EX       MEM     WB
next         IF       ID+Reg   ABORT
next                  IF       ABORT
dest                           IF      ID+Reg   EX
More about control hazards
Cycle   1    2        3       4        5     6
beq     IF   ID+Reg   EX      MEM      WB
next         IF       ABORT
dest                  IF      ID+Reg   EX    MEM
Better implementation of beq/bne
MIPS Optimization
• Move comparison ahead to reduce the number
of aborted instructions
– Add a simple EQUAL/NOT EQUAL
comparison hardware that tests outputs of
register file
• Bitwise XOR then ORing the results
– Will return zero if the register contents
are identical
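The comparator described above is small enough to model bit by bit. This is only an illustrative sketch with a hypothetical function name; real hardware would compute the OR of all 32 XOR outputs in parallel rather than in a loop.

```python
def registers_equal(a: int, b: int) -> bool:
    """EQUAL/NOT EQUAL test: bitwise XOR, then OR of the 32 result bits."""
    diff = (a ^ b) & 0xFFFFFFFF   # a bit is 1 wherever the inputs differ
    any_difference = 0
    for i in range(32):           # OR-reduce the 32 XOR outputs
        any_difference |= (diff >> i) & 1
    return any_difference == 0    # zero means the contents are identical


print(registers_equal(0x1234, 0x1234))  # True
print(registers_equal(0x1234, 0x1235))  # False
```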
Explanations
• Moving the jump address calculation one step
ahead means that we will always do the
calculation even when it is not needed.
• Simple comparator duplicates one ALU function
New problem
• We need now the correct values of the input
registers in step ID
– More data hazards
add $t0, $t2, $t3
beq $t0, $s0, 400
– Data forwarding can reduce the number of
nops but not eliminate them
New data hazards
Cycle   1    2        3     4     5        6    7
add     IF   ID+Reg   EX    MEM   WB
beq          IF       nop   nop   ID+Reg   EX   MEM
EXCEPTIONS AND INTERRUPTS
Interrupts (I)
• Request to interrupt the flow of execution of the
CPU
• Detected by the CPU hardware
– After it has executed the current instruction
– Before it starts the next instruction.
Interrupts (II)
• When an interrupt occurs:
a) The current state of the CPU (program
counter, program status word, contents of
registers, and so forth) is saved, normally
on the top of a stack
b) A new CPU state is fetched
Interrupts (III)
• New state includes a new hardware-defined
value for the program counter
– Cannot “hijack” interrupts
• Process is totally transparent to the task being
interrupted
– A process never knows whether it has been
interrupted or not
Types of interrupts (I)
• I/O completion interrupts
– Notify the OS that an I/O operation has
completed,
• Timer interrupts
– Notify the OS that a task has exceeded its
quantum of CPU time,
Types of interrupts (II)
• Traps
– Notify the OS of a program error (division
by zero, illegal op code, illegal operand
address, ...) or a hardware failure
• System calls
– Notify OS that the running task wants to
submit a request to the OS
• Notification of another event
A surprising discovery
• Programs do interrupt themselves!
Context switches
• Each interrupt will result in
two context switches:
– One when the running task is interrupted
– Another when it regains the CPU
• Context switches are not cheap
• The overhead of any simple system call is
two context switches
Remember that for 4330!
Prioritizing interrupts (I)
• Interrupt requests may occur while the system
is processing another interrupt
• Not all interrupts are equally urgent
– Some are more urgent than others
– Also true in real life
Prioritizing interrupts (II)
• The best solution is to prioritize interrupts
and assign to each source of interrupts a
priority level
– New interrupt requests will be allowed to
interrupt lower-priority interrupts but will
have to wait for the completion of all other
interrupts
• Solution is known as vectorized interrupts.
Example from real life
• Let us try to prioritize
– Phone is ringing
– Washer signals end of cycle
– Dark smoke is coming out of the kitchen
– …
• With vectorized interrupts, a phone call will never
interrupt another phone call
The solution
Smoke in the kitchen
Phone is ringing
End of washer cycle
More low-priority stuff
MIPS Implementation (I)
• Interrupts are a special case of a branch
– Use same techniques for handling control
hazards
• Almost all MIPS interrupts jump to the same
hardware address (0x80000180)
– MIPS uses a special register to pass along the
type of interrupt to the interrupt handler
• The Cause register
MIPS Implementation (II)
• MIPS also saves the address + 4 of the affected
instruction in a special register
– EPC register
• A STATUS register allows selective disabling of
interrupts
– Useful for handling short critical sections in
single-threaded kernel
Issues (I)
• Interrupted instruction may have to be restarted
– Typical for I/O completion interrupts
• Must then maintain precise exceptions that
accurately identify the instruction being
interrupted
– Not true for hardware interrupts
Issues (II)
• Must be able to restart instruction at the exact
point it was interrupted
– Not always easy on many architectures
• MIPS solution is to roll back everything and
restart instruction as if nothing had happened
– Easier on MIPS since register/memory update
is always the last step of any instruction
– Must still ensure that we can restore the
original values of all registers
Branch prediction
• CPU will try to predict whether a branch will be
taken or not
• Important for loops
– Branch is taken at every iteration
but last one
See speculative execution
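The loop observation can be turned into a one-line estimate. The sketch assumes the simplest possible predictor, one that always predicts "taken" for the loop branch; the function name is hypothetical and real predictors are more elaborate.

```python
def predict_taken_accuracy(iterations: int) -> float:
    """Fraction of correct predictions when a loop branch is always
    predicted taken: right at every iteration but the last one."""
    if iterations < 1:
        raise ValueError("a loop executes at least one iteration")
    return (iterations - 1) / iterations


print(predict_taken_accuracy(10))   # 0.9
print(predict_taken_accuracy(100))  # 0.99
```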
Parallelism
• Instruction-level parallelism (ILP)
• Two ways:
– Increasing the depth of the pipeline:
• More steps can be executed in parallel
– Multiple issue:
• We duplicate some units (ALU)
– Two or more units can be at the same
pipeline stage
An example
• Could modify the toy MIPS architecture by
adding a second ALU:
– Would allow RR instructions to be executed in
parallel with load and store instructions
– Would also need extra ports in the register
bank
• Faster but much more complex
Hazards
• Become an even bigger issue
• Some architectures assume that the compiler
will take care of all data hazards
– Will never issue sequence of instructions with
unsatisfied dependencies
• Other architectures check for problems
Speculative execution (I)
• Can speculate that
– A branch will be taken (think of loops)
– A store that precedes a load will not store at
the address the load will use
and execute the instruction(s) hoping for the best
• If speculation is wrong, we must undo what we
have done
Speculative execution (II)
• Any speculation mechanism must include
methods to
– Check if the speculation was correct
– Undo the effect of the speculated instructions
• Quite complex
• Can be done by the compiler or the hardware
Fallacies
• Pipelining is easy
• Pipelining ideas can be implemented
independently of technologies
Pitfalls
• Instruction set must be pipelining friendly