Lecture 1: Course Introduction and Overview

advertisement
Lecture 1:
Cost/Performance, DLX, Pipelining
Prof. Fred Chong
ECS 250A Computer Architecture
UC Davis Dept of Computer Science
Winter 1999
(Adapted from Patterson CS252 Copyright 1998 UCB)
FTC.W99 1
Computer Architecture Is …
the attributes of a [computing] system as seen
by the programmer, i.e., the conceptual
structure and functional behavior, as distinct
from the organization of the data flows and
controls the logic design, and the physical
implementation.
Amdahl, Blaaw, and Brooks, 1964
SOFTWARE
FTC.W99 2
Computer Architecture’s
Changing Definition
• 1950s to 1960s: Computer Architecture Course
Computer Arithmetic
• 1970s to mid 1980s: Computer Architecture Course
Instruction Set Design, especially ISA appropriate
for compilers
• 1990s: Computer Architecture Course
Design of CPU, memory system, I/O system,
Multiprocessors
FTC.W99 3
Computer Architecture Topics
Input/Output and Storage
Disks, WORM, Tape
Emerging Technologies
Interleaving
Bus protocols
DRAM
Memory
Hierarchy
Coherence,
Bandwidth,
Latency
L2 Cache
L1 Cache
VLSI
Instruction Set Architecture
RAID
Addressing,
Protection,
Exception Handling
Pipelining, Hazard Resolution,
Superscalar, Reordering,
Prediction, Speculation,
Vector, DSP
Pipelining and Instruction
Level Parallelism
FTC.W99 4
Computer Architecture Topics
P M
P M
S
°°°
P M
P M
Interconnection Network
Processor-Memory-Switch
Multiprocessors
Networks and Interconnections
Shared Memory,
Message Passing,
Data Parallelism
Network Interfaces
Topologies,
Routing,
Bandwidth,
Latency,
Reliability
FTC.W99 5
ECS 250A Course Focus
Understanding the design techniques, machine
structures, technology factors, evaluation
methods that will determine the form of
computers in 21st Century
Technology
Parallelism
Programming
Languages
Applications
Computer Architecture:
• Instruction Set Design
• Organization
• Hardware
Operating
Systems
Measurement &
Evaluation
Interface Design
(ISA)
History
FTC.W99 6
Topic Coverage
Textbook: Hennessy and Patterson, Computer
Architecture: A Quantitative Approach, 2nd Ed., 1996.
• Performance/Cost, DLX, Pipelining, Caches, Branch Prediction
• ILP, Loop Unrolling, Scoreboarding, Tomasulo, Dynamic Branch
Prediction
• Trace Scheduling, Speculation
• Vector Processors, DSPs
• Memory Hierarchy
• I/O
• Interconnection Networks
• Multiprocessors
FTC.W99 7
ECS250A: Staff
Instructor: Fred Chong
Office: EUII-3031 chong@cs
Office Hours: Mon 4-6pm or by appt.
T. A:
Diana Keen
Office: EUII-2239 keend@cs
TA Office Hours: Fri 1-3pm
Class:
Mon 6:10-9pm
Text:
Computer Architecture: A Quantitative Approach,
Second Edition (1996)
Web page: http://arch.cs.ucdavis.edu/~chong/250A/
Lectures available online before 1PM day of lecture
Newsgroup: ucd.class.cs250a{.d}
FTC.W99 8
Grading
•
•
•
•
•
Problem Sets 35%
1 In-class exam (prelim simulation) 20%
Project Proposals and Drafts 10%
Project Final Report 25%
Project Poster Session (CS colloquium) 10%
FTC.W99 9
Assignments
• Read Ch 1-3
• Problem Set 1 - due Mon 1/25/99
– alone or in pairs
• Project Proposals - due Mon 1/25/99
–
–
–
–
groups of 2 or 3
see web page and links
e-mail to me and cc:diana about ideas - due Mon 1/18/99
pick 3 research papers
FTC.W99 10
VLSI Transistors
A
A
G
B
G
B
FTC.W99 11
CMOS Inverter
In
Out
In
Out
FTC.W99 12
CMOS NAND Gate
B
A
C
A
C
B
FTC.W99 13
Integrated Circuits Costs
IC cost = Die cost + Testing cost + Packaging cost
Final test yield
Die cost =
Wafer cost
Dies per Wafer * Die yield
Dies per wafer = š * ( Wafer_diam / 2)2 – š * Wafer_diam – Test dies
Die Area
¦ 2 * Die Area
{
Die Yield = Wafer yield * 1 +
Defects_per_unit_area * Die_Area

Die Cost goes roughly with die area4

}
FTC.W99 14
Real World Examples
Chip
Metal Line Wafer Defect Area Dies/ Yield Die Cost
layers width cost
/cm2 mm2 wafer
386DX
2 0.90 $900
1.0
43 360 71%
$4
486DX2
3 0.80 $1200
1.0
81 181 54%
$12
PowerPC 601 4 0.80 $1700
1.3 121 115 28%
$53
HP PA 7100 3 0.80 $1300
1.0 196
66 27%
$73
DEC Alpha
3 0.70 $1500
1.2 234
53 19%
$149
SuperSPARC 3 0.70 $1700
1.6 256
48 13%
$272
Pentium
3 0.80 $1500
1.5 296
40 9%
$417
– From "Estimating IC Manufacturing Costs,” by Linley Gwennap,
Microprocessor Report, August 2, 1993, p. 15
FTC.W99 15
Cost/Performance
What is Relationship of Cost to Price?
• Component Costs
• Direct Costs (add 25% to 40%) recurring costs: labor,
purchasing, scrap, warranty
• Gross Margin (add 82% to 186%) nonrecurring costs:
R&D, marketing, sales, equipment maintenance, rental, financing
cost, pretax profits, taxes
• Average Discount to get List Price (add 33% to 66%): volume
discounts and/or retailer markup
List Price
Average
Discount
Avg. Selling Price
Gross
Margin
Direct Cost
Component
Cost
25% to 40%
34% to 39%
6% to 8%
15% to 33%
FTC.W99 16
Chip Prices (August 1993)
• Assume purchase 10,000 units
Chip
Area
mm2
386DX
Mfg. Price Multi- Comment
cost
plier
43
$9
$31
486DX2
81
PowerPC 601 121
$35
$77
$245
$280
3.4 Intense Competition
7.0 No Competition
3.6
DEC Alpha
234 $202 $1231
6.1 Recoup R&D?
Pentium
296 $473
2.0 Early in shipments
$965
FTC.W99 17
Summary: Price vs. Cost
100%
80%
Average Discount
60%
Gross Margin
40%
Direct Costs
20%
Component Costs
0%
Mini
5
4
W/S
PC
4.7
3.5
3.8
Average Discount
2.5
3
Gross Margin
1.8
2
Direct Costs
1.5
1
Component Costs
0
Mini
W/S
PC
FTC.W99 18
Technology Trends:
Microprocessor Capacity
100000000
Alpha 21264: 15 million
Pentium Pro: 5.5 million
PowerPC 620: 6.9 million
Alpha 21164: 9.3 million
Sparc Ultra: 5.2 million
10000000
Moore’s Law
Pentium
i80486
Transistors
1000000
i80386
i80286
100000
CMOS improvements:
• Die size: 2X every 3 yrs
• Line width: halve / 7 yrs
i8086
10000
i8080
i4004
1000
1970
1975
1980
1985
1990
1995
2000
Year
FTC.W99 19
Memory Capacity
(Single Chip DRAM)
size
1000000000
100000000
Bits
10000000
1000000
100000
10000
1000
1970
1975
1980
1985
1990
1995
year
1980
1983
1986
1989
1992
1996
2000
2000
size(Mb)
cyc time
0.0625 250 ns
0.25
220 ns
1
190 ns
4
165 ns
16
145 ns
64
120 ns
256
100 ns
Year
FTC.W99 20
Technology Trends
(Summary)
Capacity
Speed (latency)
Logic
2x in 3 years
2x in 3 years
DRAM
4x in 3 years
2x in 10 years
Disk
4x in 3 years
2x in 10 years
FTC.W99 21
Processor Performance
Trends
1000
Supercomputers
100
Mainframes
10
Minicomputers
Microprocessors
1
0.1
1965
1970
1975
1980
1985
1990
1995
2000
Year
FTC.W99 22
Processor Performance
(1.35X before, 1.55X now)
1200
1000
DEC Alpha 21264/600
1.54X/yr
800
600
DEC Alpha 5/500
400
200
0
DEC Alpha 5/300
DEC
HP
IBM
AXP/
SunMIPSMIPS
9000/
DEC Alpha 4/266
-4/ M M/ RS/ 750 500
IBM POWER 100
260 2000 1206000
87 88 89 90 91 92 93 94 95 96 97
FTC.W99 23
Performance Trends
(Summary)
• Workstation performance (measured in Spec
Marks) improves roughly 50% per year
(2X every 18 months)
• Improvement in cost performance estimated
at 70% per year
FTC.W99 24
Computer Engineering
Methodology
Technology
Trends
FTC.W99 25
Computer Engineering
Methodology
Evaluate Existing
Systems for
Bottlenecks
Benchmarks
Technology
Trends
FTC.W99 26
Computer Engineering
Methodology
Evaluate Existing
Systems for
Bottlenecks
Benchmarks
Technology
Trends
Simulate New
Designs and
Organizations
Workloads
FTC.W99 27
Computer Engineering
Methodology
Implementation
Complexity
Evaluate Existing
Systems for
Bottlenecks
Benchmarks
Technology
Trends
Implement Next
Generation System
Simulate New
Designs and
Organizations
Workloads
FTC.W99 28
Measurement Tools
• Benchmarks, Traces, Mixes
• Hardware: Cost, delay, area, power estimation
• Simulation (many levels)
– ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles
FTC.W99 29
The Bottom Line:
Performance (and Cost)
Plane
DC to Paris
Speed
Passengers
Throughput
(pmph)
Boeing 747
6.5 hours
610 mph
470
286,700
BAD/Sud
Concodre
3 hours
1350 mph
132
178,200
• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth
FTC.W99 30
The Bottom Line:
Performance (and Cost)
"X is n times faster than Y" means
ExTime(Y)
--------ExTime(X)
=
Performance(X)
--------------Performance(Y)
• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde
FTC.W99 31
Amdahl's Law
Speedup due to enhancement E:
ExTime w/o E
Speedup(E) = ------------ExTime w/ E
=
Performance w/ E
------------------Performance w/o E
Suppose that enhancement E accelerates a fraction F
of the task by a factor S, and the remainder of the
task is unaffected
FTC.W99 32
Amdahl’s Law
ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced
Speedupenhanced
Speedupoverall =
ExTimeold
ExTimenew
1
=
(1 - Fractionenhanced) + Fractionenhanced
Speedupenhanced
FTC.W99 33
Amdahl’s Law
• Floating point instructions improved to run 2X;
but only 10% of actual instructions are FP
ExTimenew =
Speedupoverall =
FTC.W99 34
Amdahl’s Law
• Floating point instructions improved to run 2X;
but only 10% of actual instructions are FP
ExTimenew = ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold
Speedupoverall =
1
0.95
=
1.053
FTC.W99 35
Metrics of Performance
Application
Answers per month
Operations per second
Programming
Language
Compiler
ISA
(millions) of Instructions per second: MIPS
(millions) of (FP) operations per second: MFLOP/s
Datapath
Control
Function Units
Transistors Wires Pins
Megabytes per second
Cycles per second (clock rate)
FTC.W99 36
Aspects of CPU Performance
CPU time
= Seconds
= Instructions x
Program
Program
CPI
Program
Compiler
X
(X)
Inst. Set.
X
X
Technology
x Seconds
Instruction
Inst Count
X
Organization
Cycles
X
Cycle
Clock Rate
X
X
FTC.W99 37
Cycles Per Instruction
“Average Cycles per Instruction”
CPI = (CPU Time * Clock Rate) / Instruction Count
= Cycles / Instruction Count
n
S
CPU time = CycleTime *
i =1
CPI
i
* I
i
“Instruction Frequency”
n
CPI =
S
i =1
CPIi *
F
i
where Fi =
Ii
Instruction Count
Invest Resources where time is Spent!
FTC.W99 38
Example: Calculating CPI
Base Machine (Reg / Reg)
Op
Freq Cycles CPI(i)
ALU
50%
1
.5
Load
20%
2
.4
Store
10%
2
.2
Branch
20%
2
.4
1.5
(% Time)
(33%)
(27%)
(13%)
(27%)
Typical Mix
FTC.W99 39
SPEC: System Performance
Evaluation Cooperative
• First Round 1989
– 10 programs yielding a single number (“SPECmarks”)
• Second Round 1992
– SPECInt92 (6 integer programs) and SPECfp92 (14 floating point
programs)
» Compiler Flags unlimited. March 93 of DEC 4000 Model 610:
spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=
memcpy(b,a,c)”
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
– new set of programs: SPECint95 (8 integer programs) and
SPECfp95 (10 floating point)
– “benchmarks useful for 3 years”
– Single flag setting for all programs: SPECint_base95,
SPECfp_base95
FTC.W99 40
How to Summarize Performance
• Arithmetic mean (weighted arithmetic mean)
tracks execution time: S(Ti)/n or S(Wi*Ti)
• Harmonic mean (weighted harmonic mean) of
rates (e.g., MFLOPS) tracks execution time:
n/ S(1/Ri) or n/ S(Wi/Ri)
• Normalized execution time is handy for scaling
performance (e.g., X times faster than
SPARCstation 10)
• But do not take the arithmetic mean of
normalized execution time,
use the geometric: (Pxi)^1/n
FTC.W99 41
SPEC First Round
• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically
800
700
500
400
300
200
100
tomcatv
fpppp
matrix300
eqntott
li
nasa7
doduc
spice
epresso
0
gcc
SPEC Perf
600
Benchmark
FTC.W99 42
Impact of Means on
SPECmark89 for IBM 550
Ratio to VAX:
Program
gcc
espresso
spice
doduc
nasa7
li
eqntott
matrix300
fpppp
tomcatv
Mean
Time:
Before After Before After
30
29
49
51
35
34
65
67
47
47
510 510
46
49
41
38
78 144
258 140
34
34
183 183
40
40
28
28
78 730
58
6
90
87
34
35
33 138
20
19
54
72
124 108
Geometric
Ratio
1.33
Ratio
1.16
Weighted Time:
Before After
8.91
9.22
7.64
7.86
5.69
5.69
5.81
5.45
3.43
1.86
7.86
7.86
6.68
6.68
3.43
0.37
2.97
3.07
2.01
1.94
54.42 49.99
Arithmetic
Weighted
Arith.
Ratio
1.09
FTC.W99 43
Performance Evaluation
• “For better or worse, benchmarks shape a field”
• Good products created when have:
– Good benchmarks
– Good ways to summarize performance
• Given sales is a function in part of performance
relative to competition, investment in improving
product as reported by performance summary
• If benchmarks/summary inadequate, then choose
between improving product for real programs vs.
improving product to get more sales;
Sales almost always wins!
• Execution time is the measure of computer
performance!
FTC.W99 44
Instruction Set Architecture (ISA)
software
instruction set
hardware
FTC.W99 45
Interface Design
A good interface:
• Lasts through many implementations (portability,
compatibility)
• Is used in many differeny ways (generality)
• Provides convenient functionality to higher levels
• Permits an efficient implementation at lower levels
use
Interface
use
use
imp 1
time
imp 2
imp 3
FTC.W99 46
Evolution of Instruction Sets
Single Accumulator (EDSAC 1950)
Accumulator + Index Registers
(Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model
from Implementation
High-level Language Based
(B5000 1963)
Concept of a Family
(IBM 360 1964)
General Purpose Register Machines
Complex Instruction Sets
(Vax, Intel 432 1977-80)
Load/Store Architecture
(CDC 6600, Cray 1 1963-76)
RISC
(Mips,Sparc,HP-PA,IBM RS6000, . . .1987)
FTC.W99 47
Evolution of Instruction Sets
• Major advances in computer architecture are
typically associated with landmark instruction
set designs
– Ex: Stack vs GPR (System 360)
• Design decisions must take into account:
–
–
–
–
–
technology
machine organization
programming languages
compiler technology
operating systems
• And they in turn influence these
FTC.W99 48
A "Typical" RISC
•
•
•
•
32-bit fixed format instruction (3 formats)
32 32-bit GPR (R0 contains zero, DP take pair)
3-address, reg-reg arithmetic instruction
Single address mode for load/store:
base + displacement
– no indirection
• Simple branch conditions
• Delayed branch
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC,
CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
FTC.W99 49
Example: MIPS
Register-Register
31
26 25
Op
21 20
Rs1
16 15
Rs2
11 10
6 5
Rd
0
Opx
Register-Immediate
31
26 25
Op
21 20
Rs1
16 15
0
immediate
Rd
Branch
31
26 25
Op
Rs1
21 20
16 15
Rs2/Opx
0
immediate
Jump / Call
31
26 25
Op
0
target
FTC.W99 50
Summary, #1
• Designing to Last through Trends
Capacity
•
Speed
Logic
2x in 3 years
2x in 3 years
DRAM
4x in 3 years
2x in 10 years
Disk
4x in 3 years
2x in 10 years
6 yrs to graduate => 16X CPU speed, DRAM/Disk size
• Time to run the task
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …
– Throughput, bandwidth
• “X is n times faster than Y” means
ExTime(Y)
--------ExTime(X)
=
Performance(X)
-------------Performance(Y)
FTC.W99 51
Summary, #2
• Amdahl’s Law:
Speedupoverall =
ExTimeold
ExTimenew
1
=
(1 - Fractionenhanced) + Fractionenhanced
Speedupenhanced
• CPI Law:
CPU time
= Seconds
Program
= Instructions x
Program
Cycles
x Seconds
Instruction
Cycle
• Execution time is the REAL measure of computer
performance!
• Good products created when have:
– Good benchmarks, good ways to summarize performance
• Die Cost goes roughly with die area4
• Can PC industry support engineering/research
FTC.W99 52
investment?
Pipelining: Its Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
• Washer takes 30 minutes
A
B
C
D
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
FTC.W99 53
Sequential Laundry
6 PM
7
8
9
10
11
Midnight
Time
30 40 20 30 40 20 30 40 20 30 40 20
T
a
s
k
A
B
O
r
d
e
r
C
D
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
FTC.W99 54
Pipelined Laundry
Start work ASAP
6 PM
7
8
9
10
11
Midnight
Time
30 40
T
a
s
k
40
40
40 20
A
B
O
r
d
e
r
C
D
• Pipelined laundry takes 3.5 hours for 4 loads
FTC.W99 55
Pipelining Lessons
6 PM
7
8
9
Time
T
a
s
k
O
r
d
e
r
30 40
A
B
C
D
40
40
40 20
• Pipelining doesn’t help
latency of single task, it
helps throughput of
entire workload
• Pipeline rate limited by
slowest pipeline stage
• Multiple tasks operating
simultaneously
• Potential speedup =
Number pipe stages
• Unbalanced lengths of
pipe stages reduces
speedup
• Time to “fill” pipeline and
time to “drain” it reduces
speedup
FTC.W99 56
Computer Pipelines
• Execute billions of instructions, so
throughput is what matters
• DLX desirable features: all instructions same
length, registers located in same place in
instruction format, memory operands only in
loads or stores
FTC.W99 57
5 Steps of DLX Datapath
Figure 3.1, Page 130
Instruction
Fetch
Instr. Decode
Reg. Fetch
IR
Execute
Addr. Calc
Memory
Access
Write
Back
L
M
D
FTC.W99 58
Pipelined DLX Datapath
Figure 3.4, page 137
Instruction
Fetch
Instr. Decode
Reg. Fetch
Execute
Addr. Calc.
Write
Back
Memory
Access
• Data stationary control
– local decode for each instruction phase / pipeline stage
FTC.W99 59
Visualizing Pipelining
Figure 3.3, Page 133
Time (clock cycles)
I
n
s
t
r.
O
r
d
e
r
FTC.W99 60
Its Not That Easy for
Computers
• Limits to pipelining: Hazards prevent next
instruction from executing during its designated
clock cycle
– Structural hazards: HW cannot support this combination of
instructions (single person to fold and put clothes away)
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline (missing sock)
– Control hazards: Pipelining of branches & other
instructionsstall the pipeline until the hazardbubbles” in the
pipeline
FTC.W99 61
One Memory Port/Structural Hazards
Figure 3.6, Page 142
Time (clock cycles)
I
n
s
t
r.
O
r
d
e
r
Load
Instr 1
Instr 2
Instr 3
Instr 4
FTC.W99 62
One Memory Port/Structural Hazards
Figure 3.7, Page 143
Time (clock cycles)
Load
I
n
s
t
r.
O
r
d
e
r
Instr 1
Instr 2
stall
Instr 3
FTC.W99 63
Speed Up Equation for
Pipelining
CPIpipelined = Ideal CPI
+ Pipeline stall clock cycles per instr
Speedup = Ideal CPI x Pipeline depth
x
Ideal CPI + Pipeline stall CPI
Speedup =
Pipeline depth
x
1 + Pipeline stall CPI
Clock Cycleunpipelined
Clock Cyclepipelined
Clock Cycleunpipelined
Clock Cyclepipelined
FTC.W99 64
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory
• Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)
= Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1)
x (clockunpipe/(clockunpipe / 1.05)
= (Pipeline Depth/1.4) x 1.05
= 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
FTC.W99 65
Data Hazard on R1
Figure 3.9, page 147
Time (clock cycles)
IF
I
n
s
t
r.
add r1,r2,r3
O
r
d
e
r
and r6,r1,r7
ID/RF
EX
MEM
WB
sub r4,r1,r3
or r8,r1,r9
xor r10,r1,r11
FTC.W99 66
Three Generic Data Hazards
InstrI followed by InstrJ
• Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
FTC.W99 67
Three Generic Data Hazards
InstrI followed by InstrJ
• Write After Read (WAR)
InstrJ tries to write operand before InstrI reads i
– Gets wrong operand
• Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
FTC.W99 68
Three Generic Data Hazards
InstrI followed by InstrJ
• Write After Write (WAW)
InstrJ tries to write operand before InstrI writes it
– Leaves wrong result ( InstrI not InstrJ )
• Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later more complicated
pipes
FTC.W99 69
Forwarding to Avoid Data Hazard
Figure 3.10, Page 149
Time (clock cycles)
I
n
s
t
r.
O
r
d
e
r
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
FTC.W99 70
HW Change for Forwarding
Figure 3.20, Page 161
FTC.W99 71
Data Hazard Even with Forwarding
Figure 3.12, Page 153
Time (clock cycles)
I
n
s
t
r.
lw r1, 0(r2)
O
r
d
e
r
and r6,r1,r7
sub r4,r1,r6
or r8,r1,r9
FTC.W99 72
Data Hazard Even with Forwarding
Figure 3.13, Page 154
Time (clock cycles)
I
n
s
t
r.
O
r
d
e
r
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
FTC.W99 73
Software Scheduling to Avoid
Load Hazards
Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d ,e, and f in memory.
Slow code:
LW
LW
ADD
SW
LW
LW
SUB
SW
Rb,b
Rc,c
Ra,Rb,Rc
a,Ra
Re,e
Rf,f
Rd,Re,Rf
d,Rd
Fast code:
LW
LW
LW
ADD
LW
SW
SUB
SW
Rb,b
Rc,c
Re,e
Ra,Rb,Rc
Rf,f
a,Ra
Rd,Re,Rf
d,Rd
FTC.W99 74
Control Hazard on Branches
Three Stage Stall
FTC.W99 75
Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
• Two part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• DLX branch tests if register = 0 or ° 0
• DLX Solution:
– Move Zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
FTC.W99 76
Pipelined DLX Datapath
Figure 3.22, page 163
Instruction
Fetch
Instr. Decode
Reg. Fetch
Execute
Addr. Calc.
Memory
Access
Write
Back
This is the correct 1 cycle
latency implementation!
FTC.W99 77
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
–
–
–
–
–
Execute successor instructions in sequence
“Squash” instructions in pipeline if branch actually taken
Advantage of late pipeline state update
47% DLX branches not taken on average
PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% DLX branches taken on average
– But haven’t calculated branch target address in DLX
» DLX still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
FTC.W99 78
Four Branch Hazard Alternatives
#4: Delayed Branch
– Define branch to take place AFTER a following instruction
branch instruction
sequential successor1
sequential successor2
........
sequential successorn
branch target if taken
Branch delay of length n
– 1 slot delay allows proper decision and branch target
address in 5 stage pipeline
– DLX uses this
FTC.W99 79
Delayed Branch
• Where to get instructions to fill branch delay slot?
–
–
–
–
Before branch instruction
From the target address: only valuable when branch taken
From fall through: only valuable when branch not taken
Cancelling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful
in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: 7-8 stage pipelines,
multiple instructions issued per clock (superscalar)
FTC.W99 80
Evaluating Branch Alternatives
Pipeline speedup =
Scheduling
Branch
scheme
penalty
Stall pipeline
3
Predict taken
1
Predict not taken
1
Delayed branch
0.5
Pipeline depth
1 +Branch frequency Branch penalty
CPI
1.42
1.14
1.09
1.07
speedup v.
unpipelined
3.5
4.4
4.5
4.6
speedup v.
stall
1.0
1.26
1.29
1.31
Conditional & Unconditional = 14%, 65% change PC
FTC.W99 81
Pipelining Summary
• Just overlap tasks, and easy if tasks are independent
• Speed Up Š Pipeline Depth; if ideal CPI is 1, then:
Speedup =
Pipeline Depth
Clock Cycle Unpipelined
X
1 + Pipeline stall CPI
Clock Cycle Pipelined
• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW,WAR,WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction
FTC.W99 82
Download