Datorarkitektur och operativsystem Lecture 5 1

advertisement
Datorarkitektur och operativsystem
Lecture 5
1
Components of a Computer
Processor
 Datapath
 Component of the
p
processor
that
performs arithmetic
operations
p
 Control
 Component of the
p
processor
that
commands the
datapath,
p
memory,
y
I/O devices according
to the instructions of
the memory
Goal: We can correctly interpret a datapath and control from a schematic diagram
Datapath With Control
Chapter 4 — The Processor —
4
Problem: Consider the instruction: AND Rd, Rt, Rs. What are the control signals generated by the control unit in this figure ?
Chapter 4 — The Processor — 5
 ANSWER :
 Regwrite asserted to write
 Mux (before ALUs) signaled to read from
registers and not immediate
 Mux
M (before re
registers)
isters) si
signaled
naled to use
se ALU and
not data memory
 ALU is signaled
l d to perform
f
AND
 Branch not asserted
 Memwrite not asserted
 Memread not asserted
For some instructions, some control signals do not matter. E.g, Mux
i
i
l i
l d
(before registers) in a sw instruction.
6
Problem: If the only thing we need to do in a processor is fetch consecutive instructions (see figure), what would the cycle time be ?
Consider that the logic blocks have the following latencies. Other units have negligible latencies
units have negligible latencies. I ‐Mem
ADD
Mux
ALU
Regs
D‐Mem
Sign‐
Extend
Shift‐left‐
2
200 ps
70ps
20ps
90ps
90ps
250ps
15ps
10ps
 ANSWER:
 200ps because
 I-Mem has larger latency than the Add unit
 I-Mem and Add are in parallel paths
8
A Recap: Combinational Elements
 AND-gate
AND gate

 Y =A & B
A
B

Multiplexer

A
+
Y = A + B
B
Y


Adder
Y = S ? I1 : I0
I0
I1
M
u
x
S
Chapter 4 — The Processor —
9
Arithmetic/Logic Unit
/

Y = F(A, B)
( , )
A
ALU
Y
B
F
Y
Y
A Recap: State Elements
 Registers
 Data Memory
 Instruction Memory
 Clocks are needed to decide when an element
that contains state should be updated
10
Clocking Methodology
 We study
 Ed
Edge triggered
i
d methodology
h d l
• Because it is simple
 Edge triggered methodology:
 All state changes occur on a clock edge
Chapter 4 — The Processor —
11
Clocking Methodology
 Longest delay determines clock period
Chapter 4 — The Processor —
12
Single Cycle: Performance
 Assume time for stages is
 100ps for register read or write
 200ps
p for other stages
g
Instr
Instr fetch Register
read
ALU op
Memory
access
Register
write
Total time
lw
200ps
100 ps
200ps
200ps
100 ps
800ps
sw
200ps
100 ps
200ps
200ps
R-format
200ps
100 ps
200ps
beq
200ps
100 ps
200ps
700ps
100 ps
600ps
500ps
200 ps latency 100 ps latency Single Cycle: Performance
Chapter 4 — The
Processor — 14
 Pipelining = Overlap the stages
Performance Issues
 LLongest delay
d l d
determines
i
clock
l k period
i d
 Critical path: load instruction
 Instruction memory  register file  ALU 
data memoryy  register
g
file
Performance Issues
 Violates design principle
 Making the common case fast
• (read text in p.329-330)
 Improve performance by pipelining!
Recap: the stages of the datapath

We can look
W
l k at the
h datapath
d
h as five
fi stages, one step
per stage
1. IF: Instruction fetch from memory
g
read
2. ID: Instruction decode & register
3. EX: Execute operation or calculate address
4 MEM: Access memory operand
4.
5. WB: Write result back to register
The 5 Stages
g
200 ps latency 100 ps latency  Pipelining = Overlap the stages
Chapter 4 — The
Processor — 19
Pipeline
Chapter 4 — The
Processor — 20
Clock Cycle
Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
Chapter 4 — The
Processor — 21
 Even if some stages take only 100ps instead of 200ps,
the pipelined execution clock cycle must have the
worst case clock cycle time of 200ps
22
Speedup from Pipelining
 If all stages are balanced
 i.e., all take the same time
 Time between instructionspipelined
= Time between instructionsnonpipelined
i li d
Number of stages
Chapter 4 — The Processor — 23
Speedup from Pipelining
 If the stages are not balanced (not equal), speedup is
less . Look at our example:
 Time between instructions (non-pipelined) = 800ps
 Number of stages = 5
 Time between instructions (pipelined) = 800/5 =160 ps
 But, what did we get ?
 Also read text in p.334
p
Chapter 4 — The Processor — 24
Speedup from Pipelining
 Recall from Chapter 1: Throughput versus latency
p
p in ppipelining
p
g is due to increased throughput
g p
 Speedup
 Latency (time for each instruction) does not
decrease
Chapter 4 — The Processor — 25
Hazards
 Situations that prevent starting the next instruction
in the next cycle are called hazards
 There are 3 kinds of hazards
 Structure hazards
 Data hazards
 Control hazards
Chapter 4 — The Processor — 26
Structure Hazards
 When an instruction cannot execute in proper cycle
due to a conflict for use of a resource
Chapter 4 — The
Processor — 27
Example of structural hazard
 Imagine
 a MIPS p
pipeline
p
with a single
g data and instruction memoryy
 a fourth additional instruction in the pipeline; when this instruction is
trying to fetch from instruction memory, the first instruction is
fetching of data memory – leading to a hazard
 Hence, pipelined datapaths require separate instruction/data memories
 Or separate instruction/data caches
Data Hazards
 An instruction depends on completion of data access
b a previous
by
i
i
instruction
i
Chapter 4 — The
Processor — 29
Data Hazards
 add
$s0, $t0, $t1
sub
b $t2,
$ 2 $s0,
$ 0 $t3
$ 3
 For the above code, when can we start the second
instruction ?
Data Hazards
 add
sub
Chapter 4 — The
Processor — 31
$s0,
, $t0,
, $t1
$t2, $s0, $t3
 Bubble or the ppipeline
p
stall is used to resolve a hazard
 BUT it leads to wastage of cycles = performance
deterioration
 Solve the performance problem by ‘forwarding’ the
required data
32
Forwarding (aka Bypassing)
 Use result when it is computed : don
don’tt wait for it to
be stored in a register
 Requires extra connections in the datapath
Forwarding (aka Bypassing)
 add
$s0, $t0, $t1
sub $t2, $s0, $t3
 For the above code, draw the datapath
p
with
forwarding
Forwarding (aka Bypassing)
Chapter 4 — The
Processor — 35
Load Use Data Hazard
Load-Use
 Can’t always avoid stalls by forwarding
• If value not computed when needed
Chapter 4 — The
Processor — 36
Code Scheduling
g to Avoid Stalls


Reorder code to avoid use of load result in the next
i
instruction
i
C code for A = B + E;
; C = B + F;
;
lw
lw
add
sw
lw
add
sw
$t1,
$t1
$t2,
$t3,
$t3,
$t4,
$
$t5,
$t5,
0($t0)
4($t0)
$t1, $t2
12($t0)
8($t0)
$t1,
$
$t4
$
16($t0)
13 cycles
l
Chapter 4 — The Processor — 37
Code Scheduling
g to Avoid Stalls


Reorder code to avoid use of load result in the next
i
instruction
i
C code for A = B + E;
; C = B + F;
;
stall
stall
lw
lw
add
sw
lw
add
sw
$t1,
$t1
$t2,
$t3,
$t3,
$t4,
$
$t5,
$t5,
0($t0)
4($t0)
$t1, $t2
12($t0)
8($t0)
$t1,
$
$
$t4
16($t0)
13 cycles
l
Chapter 4 — The Processor — 38
Code Scheduling
g to Avoid Stalls


Reorder code to avoid use of load result in the next
i
instruction
i
C code for A = B + E;
; C = B + F;
;
stall
stall
lw
lw
add
sw
lw
add
sw
$t1,
$t1
$t2,
$t3,
$t3,
$t4,
$
$t5,
$t5,
0($t0)
4($t0)
$t1, $t2
12($t0)
8($t0)
$t1,
$
$
$t4
16($t0)
13 cycles
l
lw
lw
lw
add
sw
add
sw
$t1,
$t1
$t2,
$t4,
$t3,
$t3,
$
$t5,
$t5,
0($t0)
4($t0)
8($t0)
$t1, $t2
12($t0)
$t1,
$
$
$t4
16($t0)
11 cycles
l
Chapter 4 — The Processor — 39
Control Hazards
 When the proper instruction cannot execute
in the proper pipeline clock cycle because the
instruction that was fetched is not the one
that is need
Chapter 4 — The
Processor — 40
Control Hazards
 Branch determines flow of control
 Fetching next instruction depends on branch
outcome
 Naïve/Simple
p Solution: Stall if we fetch a branch
instruction
 Drawback: Several cycles
y
lost
 Improve by adding hardware so that already in
second stage
g we can know the address of the next
instruction
Chapter 4 — The
Processor — 41
Stall on Branch

Wait until branch outcome determined before
fetching next instruction

but with extra hardware to minimize the stalled cycles
y
Chapter 4 — The
Processor — 42
Branch Prediction
 Longer
g pipelines
pp
can’t readilyy determine
branch outcome early (in second stage)
 Stall penalty becomes unacceptable
 Solution: predict outcome of branch
 Only stall if prediction is wrong
 We will not study
y the details
Chapter 4 — The
Processor — 43
MIPS with Predict Not Taken
Prediction
correct
Prediction
incorrect
Chapter 4 — The
Processor — 44
Recap: Hazards
 Situations that prevent starting the next instruction
in the next cycle
 Structure
S
hhazards
d
 A required resource is busy
 Data hazard
 Need to wait for previous instruction to complete
its data read/write
 Control hazard
 Deciding on control action depends on previous
i t ti
instruction
Chapter 4 — The Processor — 45
Pipeline Summary
 Pi
Pipelining
li i improves
i
performance
f
bby iincreasing
i
instruction throughput
 Executes multiple instructions in parallel
 Each instruction has the same latency
 Subject to hazards
 Structure, data, control
 Instruction set design affects complexity of
pipeline implementation (p.335)
Chapter 4 — The
Processor — 46
Administrative Detail
 Homework 2 is coming up very soon
47
Administrative Detail:
A Rule for Homework
 The deadline may NOT be extended. No
exceptions
p
for sickness,, forgetfulness,
g
,… 
 Why?
48
Administrative Detail:
A Rule for Homework
 The deadline may NOT be extended. No
exceptions
p
for sickness,, forgetfulness,
g
,… 
 Why ?
 To
T ensure fairness
f i
to your classmates
l
who
h
worked hard and could submit only half of the
questions.
i
If someone submits
b i llater and
d completes
l
all questions, he/she should NOT be rewarded!
 It is just like an exam; so if you miss it, it is over.
Only that, for this special exam, you do it at home!
49
From the Textbook
 Last week: 4.1, 4.2, 4.3
 Today: 44.55
50
50
Download