CSCE 430/830 Computer Architecture
Reviews of Quantitative Principles, Memory
Hierarchy & Pipeline Design Basics
Adopted from
Professor David Patterson
Electrical Engineering and Computer Sciences
University of California, Berkeley
Instruction Set Architecture: Critical Interface

[Figure: the instruction set sits between software (above) and hardware (below).]

• Properties of a good abstraction
– Lasts through many generations (portability)
– Used in many different ways (generality)
– Provides convenient functionality to higher levels
– Permits an efficient implementation at lower levels
7/27/2016
CSCE 430/830, Review of Pipeline Design & Basics
2
Example: MIPS

Programmable storage:
• 2^32 bytes of memory
• 31 x 32-bit GPRs (r1..r31; r0 = 0)
• 32 x 32-bit FP regs (paired for DP)
• HI, LO, PC

Questions an ISA must answer: Data types? Format? Addressing modes? Operations?

Arithmetic logical:
Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU,
AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI,
SLL, SRL, SRA, SLLV, SRLV, SRAV

Memory access:
LB, LBU, LH, LHU, LW, LWL, LWR
SB, SH, SW, SWL, SWR

Control (32-bit instructions on word boundary):
J, JAL, JR, JALR
BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL

See p. 13 of the textbook for MIPS64.
5) Processor performance equation

CPU time = Seconds/Program
         = (Instructions/Program) x (Clock Cycles/Instruction) x (Seconds/Clock Cycle)

Which factors affect each term:

                Inst Count   CPI   Clock Rate
Program             X
Compiler            X        (X)
Inst. Set.          X         X
Organization                  X        X
Technology                             X
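The equation above can be checked with a quick calculation. This is a sketch; the workload numbers (instruction count, CPI, clock rate) are made-up illustrative values, not from the slides:

```python
def cpu_time(inst_count, cpi, clock_rate_hz):
    """CPU time = Instructions x Cycles/Instruction x Seconds/Cycle."""
    return inst_count * cpi * (1.0 / clock_rate_hz)

# 1 billion instructions at CPI 2.0 on a 1 GHz clock -> about 2 seconds
t = cpu_time(1e9, 2.0, 1e9)
print(t)
```

Note how the three terms multiply: improving any one of them (compiler lowers instruction count, organization lowers CPI, technology raises clock rate) scales CPU time directly.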
Define and quantify power (1/2)

• For CMOS chips, the traditionally dominant energy consumption has been in switching transistors, called dynamic power:

  Power_dynamic ≈ 1/2 x CapacitiveLoad x Voltage^2 x FrequencySwitched

• For mobile devices, energy is the better metric:

  Energy_dynamic ≈ CapacitiveLoad x Voltage^2

• For a fixed task, slowing the clock rate (frequency switched) reduces power, but not energy
• Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors
• Dropping voltage helps both, so supply voltages went from 5V to 1V
• To save energy and dynamic power, most CPUs now turn off the clock of inactive modules (e.g., the floating-point unit)
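The two relations above can be sketched numerically. The capacitance, voltage, and frequency values below are illustrative assumptions, chosen only to show the scaling:

```python
def dynamic_power(cap_load, voltage, freq_switched):
    """Power_dynamic ~ 1/2 x C x V^2 x f."""
    return 0.5 * cap_load * voltage**2 * freq_switched

def dynamic_energy(cap_load, voltage):
    """Energy_dynamic ~ C x V^2 (per switching event; no frequency term)."""
    return cap_load * voltage**2

# Halving the clock halves power, but per-task energy is unchanged
# because energy has no frequency term:
p_full = dynamic_power(1e-9, 1.0, 2e9)
p_half = dynamic_power(1e-9, 1.0, 1e9)
print(p_half / p_full)  # 0.5
```

This is why voltage scaling is so attractive: the quadratic V^2 term cuts both power and energy, while frequency scaling alone only cuts power.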
Define and quantify power (2/2)

• Because leakage current flows even when a transistor is off, static power is now important too:

  Power_static ≈ Current_static x Voltage

• Leakage current increases in processors with smaller transistor sizes
• Increasing the number of transistors increases power even if they are turned off
• In 2006, the goal for leakage was 25% of total power consumption; high-performance designs reached 40%
• Very low power systems even gate the voltage to inactive modules to control loss due to leakage
Define and quantify dependability (1/3)

• How do we decide when a system is operating properly?
• Infrastructure providers now offer Service Level Agreements (SLAs) to guarantee that their networking or power service will be dependable
• Systems alternate between two states of service with respect to an SLA:
  1. Service accomplishment, where the service is delivered as specified in the SLA
  2. Service interruption, where the delivered service is different from the SLA
• Failure = transition from state 1 to state 2
• Restoration = transition from state 2 to state 1
Define and quantify dependability (2/3)

• Module reliability = measure of continuous service accomplishment (or time to failure). Two metrics:
  1. Mean Time To Failure (MTTF) measures reliability
  2. Failures In Time (FIT) = 1/MTTF, the rate of failures
     – Traditionally reported as failures per billion hours of operation
• Mean Time To Repair (MTTR) measures service interruption
  – Mean Time Between Failures (MTBF) = MTTF + MTTR
• Module availability measures service as it alternates between the two states of accomplishment and interruption (a number between 0 and 1, e.g., 0.9)
• Module availability = MTTF / (MTTF + MTTR)
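The availability formula is easy to exercise directly. A minimal sketch; the MTTF/MTTR values are made-up examples:

```python
def availability(mttf_hours, mttr_hours):
    """Module availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# MTTF = 9000 hours, MTTR = 1000 hours -> availability of 0.9,
# i.e. the 0.9 example from the slide
print(availability(9000, 1000))  # 0.9
```

Note that availability improves equally well by raising MTTF (better reliability) or lowering MTTR (faster repair).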
The Principle of Locality

• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
  – Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 15 years, HW has relied on locality for speed
• Locality is a property of programs which is exploited in machine design.
CSCE430/830
Review of Mem. Hierarchy
Amdahl’s Law

ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new
                = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Best you could ever hope to do:

Speedup_maximum = 1 / (1 - Fraction_enhanced)
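Amdahl’s Law can be evaluated with a couple of one-line functions. A sketch; the 50%/10x example values are illustrative, not from the slides:

```python
def speedup_overall(frac_enhanced, speedup_enhanced):
    """Amdahl's Law: overall speedup from enhancing a fraction of execution."""
    return 1.0 / ((1.0 - frac_enhanced) + frac_enhanced / speedup_enhanced)

def speedup_maximum(frac_enhanced):
    """Limit as the enhancement's speedup goes to infinity."""
    return 1.0 / (1.0 - frac_enhanced)

# Enhancing 50% of execution time by 10x gives well under 2x overall:
print(round(speedup_overall(0.5, 10.0), 3))  # 1.818
print(speedup_maximum(0.5))                  # 2.0
```

The unenhanced fraction dominates quickly, which is the law's main lesson: make the common case fast.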
Memory Hierarchy - the Big Picture

• Problem: memory is too slow and too small
• Solution: memory hierarchy

Level                         Speed (ns)        Size (bytes)
Registers (in datapath)       0.25-0.5          <1K
L1 on-chip / L2 off-chip      0.5-25            <16M
  cache
Main memory (DRAM)            80-250            <16G
Secondary storage (disk)      5,000,000 (5 ms)  >100G
Fundamental Cache Questions
• Q1: Where can a block be placed in the upper level?
(Block placement)
• Q2: How is a block found if it is in the upper level?
(Block identification)
• Q3: Which block should be replaced on a miss?
(Block replacement)
• Q4: What happens on a write?
(Write strategy)
Q1: Where can a block be placed in the upper level?

• Block 12 placed in an 8-block cache:
  – Fully associative, direct mapped, or 2-way set associative
  – Set-associative mapping: set = (block number) modulo (number of sets)
  – Direct mapped: block 12 can go only in cache block (12 mod 8) = 4
  – 2-way set associative (4 sets): block 12 can go anywhere in set (12 mod 4) = 0
  – Fully associative: block 12 can go in any of the 8 cache blocks

[Figure: the 8-block cache (blocks 0-7) under each placement policy, with memory blocks 0-31 shown below.]
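The mapping rule above can be written out directly. A minimal sketch of set = block number mod number of sets, using the slide's block-12 example:

```python
def cache_set(block_number, num_blocks, ways):
    """Which set a memory block maps to in a set-associative cache."""
    num_sets = num_blocks // ways
    return block_number % num_sets

# Block 12 in an 8-block cache:
print(cache_set(12, 8, 1))  # direct mapped (8 sets of 1): set 4
print(cache_set(12, 8, 2))  # 2-way (4 sets): set 0
print(cache_set(12, 8, 8))  # fully associative (1 set): set 0, the only set
```

Direct mapped and fully associative are just the two extremes of the same formula: 1 way per set versus all blocks in one set.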
Q2: How is a block found if it is in the upper level?

• Block offset selects the desired data from the block
  – Len(block offset) = log2(cache block size)
• Index selects the set
  – Number of sets = number of cache blocks / number of ways
  – Len(index) = log2(number of sets)
• Tag is compared against the stored tag for a hit
  – Len(tag) = Len(memory address) - Len(index) - Len(block offset)

Block address = | Tag | Index | Block Offset |
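The field-length formulas above can be sketched as an address splitter. The cache parameters (64-byte blocks, 1024 blocks, direct mapped) and the example address are assumptions for illustration:

```python
import math

def split_address(addr, block_size, num_blocks, ways):
    """Split an address into (tag, index, block offset) per the formulas above."""
    offset_bits = int(math.log2(block_size))
    num_sets = num_blocks // ways
    index_bits = int(math.log2(num_sets))
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# 64-byte blocks -> 6 offset bits; 1024 sets -> 10 index bits; rest is tag.
print(split_address(0x12345678, 64, 1024, 1))  # (4660, 345, 56), i.e. tag 0x1234
```

Doubling the associativity halves the number of sets, which drops one index bit and grows the tag by one bit, matching the next slide's observation.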
Q2: How is a block found if it is in the upper level?

• Tag on each block
  – No need to check index or block offset
• Increasing associativity shrinks the index and expands the tag

Block address = | Tag | Index | Block Offset |
Q3: Which block should be replaced on a miss?

• Easy for direct mapped
• Set associative or fully associative:
  – Random
  – LRU (Least Recently Used)

Miss rates by cache size and associativity:

Size     2-way LRU  2-way Ran  4-way LRU  4-way Ran  8-way LRU  8-way Ran
16 KB    5.2%       5.7%       4.7%       5.3%       4.4%       5.0%
64 KB    1.9%       2.0%       1.5%       1.7%       1.4%       1.5%
256 KB   1.15%      1.17%      1.13%      1.13%      1.12%      1.12%
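The LRU policy in the table can be sketched for a single cache set. A toy model, not a full cache simulator; the access sequence is a made-up example:

```python
from collections import OrderedDict

def simulate_lru_set(ways, accesses):
    """Count misses for one set under LRU replacement.

    set_state keeps resident tags ordered oldest-first, so the least
    recently used tag is always at the front.
    """
    set_state = OrderedDict()
    misses = 0
    for tag in accesses:
        if tag in set_state:
            set_state.move_to_end(tag)          # hit: mark most recently used
        else:
            misses += 1
            if len(set_state) == ways:
                set_state.popitem(last=False)   # evict least recently used
            set_state[tag] = None
    return misses

# 2-way set, accesses 1,2,1,3,2: the re-reference of 1 hits, then 3
# evicts 2 and 2 evicts 1, giving 4 misses total.
print(simulate_lru_set(2, [1, 2, 1, 3, 2]))  # 4
```

Random replacement would just pick an arbitrary resident tag to evict; as the table shows, LRU's advantage shrinks as caches get larger.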
Q4: What happens on a write?

Policy                          Write-Through                Write-Back
Data written to cache block     is also written to           Write data only to the cache;
                                lower-level memory           update the lower level when a
                                                             block falls out of the cache
Debug                           Easy                         Hard
Do read misses produce writes?  No                           Yes
Do repeated writes make it to   Yes                          No
the lower level?

Additional option (on a write miss): let writes to an un-cached address allocate a new cache line ("write-allocate").
Cache Performance Measures

• Hit rate: fraction of accesses found in the cache
  – Usually so high that we talk instead about the Miss rate = 1 - Hit rate
• Hit time: time to access the cache
• Miss penalty: time to replace a block from the lower level, including the time to deliver the block to the CPU
  – Access time: time to access the lower level
  – Transfer time: time to transfer the block
• Average memory-access time (AMAT)
  = Hit time + Miss rate x Miss penalty (ns or clocks)
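The AMAT formula is a one-liner worth exercising. The hit time, miss rate, and miss penalty below are illustrative assumptions:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = Hit time + Miss rate x Miss penalty."""
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit, 5% miss rate, 100-cycle miss penalty -> 6 cycles on average
print(amat(1.0, 0.05, 100.0))
```

The formula also nests naturally: for a two-level cache, the L1 miss penalty is itself the AMAT of the L2 cache.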
Memory Hierarchy Basics

Six basic cache optimizations:
• Larger block size
  – Reduces compulsory misses
  – Increases capacity and conflict misses, increases miss penalty
• Larger total cache capacity to reduce miss rate
  – Increases hit time, increases power consumption
• Higher associativity
  – Reduces conflict misses
  – Increases hit time, increases power consumption
• Higher number of cache levels
  – Reduces overall memory access time
• Giving priority to read misses over writes
  – Reduces miss penalty
• Avoiding address translation in cache indexing
  – Reduces hit time

Copyright © 2012, Elsevier Inc. All rights reserved.
Memory Hierarchy Basics

Ten advanced cache optimizations:
• Small and simple first-level caches
• Way prediction
• Pipelined cache access
• Nonblocking caches
• Multibanked caches
• Critical word first, early restart
• Merging write buffer
• Compiler optimizations
• Hardware prefetching
• Compiler prefetching

See p. 96 of the textbook for a summary.
Details of Page Table

[Figure: a virtual address = virtual page number + 12-bit offset. The Page Table Base Register plus the virtual page number index into the page table, which is itself located in physical memory; each entry holds a valid bit (V), access rights, and a physical frame number. Frame number + offset form the physical address.]

• The page table maps virtual page numbers to physical frames ("PTE" = Page Table Entry)
• Virtual memory => treat main memory as a cache for disk
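The translation in the figure can be sketched as a toy page-table walk. The table contents and the 3-entry table are made-up assumptions; only the mechanism (VPN indexes the table, offset passes through) follows the slide:

```python
PAGE_OFFSET_BITS = 12  # 4 KB pages, as in the figure

# vpn -> (valid bit, physical frame); frame is None when V=0
page_table = {0: (1, 2), 1: (0, None), 2: (1, 5)}

def translate(vaddr):
    """Map a virtual address to a physical address via the page table."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    valid, frame = page_table[vpn]
    if not valid:
        # V=0: page is on disk or not yet allocated -> the OS takes over
        raise RuntimeError("page fault")
    return (frame << PAGE_OFFSET_BITS) | offset

print(hex(translate(0x0ABC)))  # vpn 0 -> frame 2: 0x2abc
```

A real walk would also check the access-rights field from the PTE; the TLB on the next slide simply caches these (vpn, frame) entries so most translations skip the memory access.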
The TLB Caches Page Table Entries

[Figure: a virtual address (page number + offset) is looked up in the TLB, which caches page table entries (page/frame pairs, tagged with an ASID); on a hit, the physical frame number replaces the page number to form the physical address.]

• Physical and virtual pages must be the same size!
• MIPS handles TLB misses in software (random replacement). Other machines use hardware.
• V=0 pages either reside on disk or have not yet been allocated; the OS handles V=0: a "page fault".
Summary of Virtual Machine Monitor

• Virtual machine revival
  – Overcome security flaws of modern OSes
  – Processor performance no longer the highest priority
  – Manage software, manage hardware
• "... VMMs give OS developers another opportunity to develop functionality no longer practical in today's complex and ossified operating systems, where innovation moves at geologic pace."
  [Rosenblum and Garfinkel, 2005]
• Virtualization challenges for the processor, virtual memory, and I/O
  – Paravirtualization, ISA upgrades to cope with those difficulties
• Xen as an example VMM using paravirtualization
  – 2005 performance on non-I/O-bound, I/O-intensive apps: 80% of native Linux without driver VM, 34% with driver VM
• Opteron memory hierarchy still critical to performance
Visualizing Pipelining
Figure A.2, Page A-8

[Figure: five instructions in flight over clock cycles 1-7; each instruction flows through Ifetch, Reg, ALU, DMem, Reg, one stage per cycle, with a new instruction entering the pipeline every cycle.]
Classic RISC Pipeline
Source: http://en.wikipedia.org/wiki/Classic_RISC_pipeline
Pipelining is not quite that easy!

• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
    – Root cause: resource contention
  – Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
    – Root cause: data dependence
  – Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
    – Root cause: control transfer
One Memory Port / Structural Hazards
Figure A.4, Page A-14

[Figure: a Load followed by Instr 1-4 over clock cycles 1-7; with a single memory port, Instr 3's instruction fetch in cycle 4 collides with the Load's data-memory access.]
One Memory Port / Structural Hazards
(Similar to Figure A.5, Page A-15)

[Figure: the same sequence with the conflict resolved by a stall; Instr 3's fetch is delayed one cycle, propagating a bubble through the pipeline.]
Data Dependence and Hazards

• InstrJ is data dependent (aka true dependence) on InstrI if:
  1. InstrJ tries to read an operand before InstrI writes it
       I: add r1,r2,r3
       J: sub r4,r1,r3
  2. or InstrJ is data dependent on InstrK, which is dependent on InstrI
• If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped
• Data dependence in an instruction sequence => data dependence in the source code => the effect of the original data dependence must be preserved
• If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard
Name Dependence #1: Anti-dependence

• Name dependence: when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; there are two versions of name dependence
• InstrJ writes an operand before InstrI reads it:
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
  Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
• If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard
Name Dependence #2: Output Dependence

• InstrJ writes an operand before InstrI writes it:
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
• If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
• Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
  – Register renaming resolves name dependence for registers
  – Either by the compiler or by HW
RAW, WAR, WAW

• Read After Write
• Write After Read
• Write After Write
• Condition
  – At least one register is used by both instructions
  – At least one of the two instructions writes that register
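The three hazard conditions above can be sketched as set operations on each instruction's read and written registers. A toy classifier, not tied to any real ISA decoder; the register-set encoding is an assumption for illustration:

```python
def classify_hazards(first, second):
    """Classify hazards between two instructions in program order.

    Each instruction is given as (reads, writes): sets of register names.
    """
    r1, w1 = first
    r2, w2 = second
    hazards = set()
    if w1 & r2:
        hazards.add("RAW")  # second reads what first writes (true dependence)
    if r1 & w2:
        hazards.add("WAR")  # second writes what first reads (anti-dependence)
    if w1 & w2:
        hazards.add("WAW")  # both write the same register (output dependence)
    return hazards

# add r1,r2,r3  followed by  sub r4,r1,r3 -> true dependence on r1
print(classify_hazards(({"r2", "r3"}, {"r1"}),
                       ({"r1", "r3"}, {"r4"})))  # {'RAW'}
```

Note this captures exactly the slide's condition: a register common to both instructions, with at least one of the two writing it.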
Forwarding to Avoid Data Hazard
Figure A.7, Page A-19

[Figure: the sequence
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
executes without stalls: the ALU result for r1 is forwarded from the EX/MEM and MEM/WB pipeline registers to the ALU inputs of the following instructions.]
Forwarding to Avoid LW-SW Data Hazard
Figure A.8, Page A-20

[Figure: the sequence
    add r1,r2,r3
    lw  r4, 0(r1)
    sw  r4,12(r1)
    or  r8,r6,r9
    xor r10,r9,r11
executes without stalls: the loaded value of r4 is forwarded from the MEM/WB register to the store's memory-access stage.]
Forwarding Schemes

• Goal: start the work earlier
• Forwarding 1:
  – EX/MEM pipeline register => input of ALU
• Forwarding 2:
  – MEM/WB pipeline register => input of ALU
• Forwarding 3: special case
  – Register file => register file
  – Works because writing a register is done in the first half of the clock cycle, and reading the same register is done in the second half of the clock cycle
• Forwarding 4: for the LW-SW data hazard
  – MEM/WB pipeline register => input of memory access
  – Loading data from memory => writing data to memory
HW Change for Forwarding
Figure A.23, Page A-37

[Figure: the datapath with forwarding paths added: muxes at the ALU inputs select among the ID/EX register values, the EX/MEM result, and the MEM/WB result.]

What circuit detects and resolves this hazard?
Data Hazard Even with Forwarding
Figure A.9, Page A-21

[Figure: the sequence
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
shows that the load's data is not available until the end of its MEM cycle, too late to forward to the sub's EX cycle.]
Data Hazard Even with Forwarding
(Similar to Figure A.10, Page A-21)

[Figure: the same sequence with a one-cycle bubble inserted after the lw, after which the loaded r1 can be forwarded to the sub.]

How is this detected?
Software Scheduling to Avoid Load Hazards

Try producing fast code for

    a = b + c;
    d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:              Fast code:
    LW   Rb,b               LW   Rb,b
    LW   Rc,c               LW   Rc,c
    ADD  Ra,Rb,Rc           LW   Re,e
    SW   a,Ra               ADD  Ra,Rb,Rc
    LW   Re,e               LW   Rf,f
    LW   Rf,f               SW   a,Ra
    SUB  Rd,Re,Rf           SUB  Rd,Re,Rf
    SW   d,Rd               SW   d,Rd

Compiler optimizes for performance. Hardware checks for safety.
Control Hazard on Branches: Three-Stage Stall

[Figure: the sequence
    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11
where the branch outcome is not known until late in the pipeline, so the three fetches after the beq are in question.]

What do you do with the 3 instructions in between?
How do you do it?
Where is the "commit"?
5 Steps of MIPS Datapath
Figure A.3, Page A-9

[Figure: the datapath with stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]

• Data stationary control
  – local decode for each instruction phase / pipeline stage
Pipelined MIPS Datapath
Figure A.24, page A-38

[Figure: the pipelined version of the five-stage datapath, with the pipeline registers shown between stages.]

• Interplay of instruction set design and cycle time.
Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – "Squash" instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% of MIPS branches not taken on average
  – PC+4 already calculated, so use it to get the next instruction

#3: Predict Branch Taken
  – 53% of MIPS branches taken on average
  – But haven't calculated the branch target address in MIPS
    » MIPS still incurs a 1-cycle branch penalty
    » Other machines: branch target known before outcome
Four Branch Hazard Alternatives

#4: Delayed Branch
  – Define the branch to take place AFTER a following instruction:

      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n     <- branch delay of length n
      branch target if taken

  – A 1-slot delay allows proper decision and branch target address in a 5-stage pipeline
  – MIPS uses this
Scheduling Branch Delay Slots (Fig A.14)

A. From before the branch:

    add $1,$2,$3        becomes      if $2=0 then
    if $2=0 then                         add $1,$2,$3
      delay slot

B. From the branch target:

    sub $4,$5,$6        becomes      add $1,$2,$3
    add $1,$2,$3                     if $1=0 then
    if $1=0 then                         sub $4,$5,$6
      delay slot

C. From the fall-through:

    add $1,$2,$3        becomes      add $1,$2,$3
    if $1=0 then                     if $1=0 then
      delay slot                         sub $4,$5,$6
    sub $4,$5,$6                     OR  $7,$8,$9
    OR  $7,$8,$9

• A is the best choice, fills delay slot
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails
Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Average stall cycles per instruction

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)
          x (Cycle Time_unpipelined / Cycle Time_pipelined)

For a simple RISC pipeline, Ideal CPI = 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI)
          x (Cycle Time_unpipelined / Cycle Time_pipelined)
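The simplified speedup equation can be exercised directly. A sketch; the 5-stage / 0.5-stall example values are assumptions chosen for illustration:

```python
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, cycle_ratio=1.0):
    """Pipeline speedup over the unpipelined machine.

    cycle_ratio = Cycle Time_unpipelined / Cycle Time_pipelined;
    it is 1.0 if pipelining leaves the per-stage cycle time unchanged
    relative to depth.
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_ratio

# 5-stage pipeline with 0.5 stall cycles per instruction:
print(round(pipeline_speedup(5, 0.5), 2))  # 3.33
```

With zero stalls the speedup equals the pipeline depth, which is the ideal case the next slide's branch penalties eat into.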
Evaluating Branch Alternatives

Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)

Assume 4% unconditional branches, 6% conditional branches untaken, 10% conditional branches taken.

Scheduling scheme   Branch penalty   CPI    speedup v. unpipelined   speedup v. stall
Stall pipeline      3                1.60   3.1                      1.0
Predict taken       1                1.20   4.2                      1.33
Predict not taken   1                1.14   4.4                      1.40
Delayed branch      0.5              1.10   4.5                      1.45
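The CPI column above follows from CPI = 1 + sum of (branch-class frequency x penalty), using the slide's branch mix. A sketch that reproduces the table's values:

```python
def branch_cpi(penalty_uncond, penalty_untaken, penalty_taken):
    """CPI given a per-class branch penalty, with the slide's branch mix:
    4% unconditional, 6% conditional-untaken, 10% conditional-taken."""
    return 1 + 0.04 * penalty_uncond + 0.06 * penalty_untaken + 0.10 * penalty_taken

print(branch_cpi(3, 3, 3))        # stall pipeline: every branch costs 3 -> 1.60
print(branch_cpi(1, 1, 1))        # predict taken: 1-cycle penalty always -> 1.20
print(branch_cpi(1, 0, 1))        # predict not taken: untaken are free -> 1.14
print(branch_cpi(0.5, 0.5, 0.5))  # delayed branch: 0.5 cycles average -> 1.10
```

Predict-not-taken wins over predict-taken here only because untaken conditional branches cost nothing, while MIPS's late target calculation makes even correctly predicted taken branches pay a cycle.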