Lec3-perf - ECE Users Pages - Georgia Institute of Technology

ECE 4100/6100
Advanced Computer Architecture
Lecture 3 Performance
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
Performance
• Execution/Response time (Latency)
– Elapsed time between start and completion of an
event
– How long does my job take?
• Throughput (Bandwidth)
– Total amount of work done within a given period
of time
– How many jobs are done per unit time on a system?
CPU Performance
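• Execution Time = Seconds / Program
\[ \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
• Instructions / Program is affected by
– Programmer
– Algorithms
– ISA
– Compilers
• Cycles / Instruction is affected by
– Microarchitecture
– System architecture
• Seconds / Cycle is affected by
– Microarchitecture, pipeline depth
– Circuit design
– Technology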
[Figure: a pipeline stage, drawn as combinational logic between two flip-flops (F/F); labels: Pipeline Stage, 1 FO4]
• P4 pipe stage ~ 16 FO4
• Optimal FO4 per pipe stage
– 6 to 8 [UT/Compaq, ISCA-29]
– 18 (15+3 latch) [IBM, MICRO-35]
Slide from Lecture 1 Pipelining
Architecture Comparison
• Many architecture research studies simply make the following
assumptions
• Instructions / program is fixed
– Same binary
– Same compiler
– Same benchmark
• Seconds per cycle is constant
– Same frequency
– Same pipeline depth
– Typically a bad assumption today
• Focus on IPC or CPI
• It is more complicated for today’s architects!
Example: Calculating CPI
Run benchmark and collect workload characterization
(simulate, machine counters, or sampling)
Base Machine (Reg / Reg)
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total                    1.5
Freq: typical mix of instruction types in program
Design guideline: Make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1%
performance improvement on reasonable benchmarks.
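The CPI table above can be reproduced with a few lines of Python. This is a minimal sketch: the name `cpi_breakdown` and the dictionary layout are illustrative choices, not part of the lecture. Only the frequencies and cycle counts come from the table.

```python
# Instruction mix from the table above: op -> (frequency, cycles per instruction).
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

def cpi_breakdown(mix):
    """Return (total CPI, per-op CPI contribution, per-op share of execution time)."""
    contrib = {op: freq * cycles for op, (freq, cycles) in mix.items()}
    total = sum(contrib.values())
    share = {op: c / total for op, c in contrib.items()}
    return total, contrib, share

total_cpi, contrib, share = cpi_breakdown(MIX)
print(f"CPI = {total_cpi:.2f}")
for op in MIX:
    print(f"{op:<6} CPI(i) = {contrib[op]:.2f}  time = {share[op]:.0%}")
```

The time share of each class is its CPI contribution divided by the total CPI, which is why ALU ops are 50% of the instructions but only about a third of the execution time.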
Performance Comparison
• For some program running on machine X,
Performance_X = 1 / Execution time_X
• "X is n times faster than Y"
Performance_X / Performance_Y = n = speedup of X over Y
• Problem:
– machine A runs a program in 20 seconds
– machine B runs the same program in 25 seconds
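The problem leaves its question implicit. Assuming it asks how many times faster machine A is than machine B, the definitions above give the answer directly. The function name `speedup` is a hypothetical helper for illustration.

```python
def speedup(time_x, time_y):
    """How many times faster X is than Y: Performance_X / Performance_Y."""
    perf_x = 1.0 / time_x  # Performance_X = 1 / Execution time_X
    perf_y = 1.0 / time_y
    return perf_x / perf_y

n = speedup(20.0, 25.0)  # machine A: 20 s, machine B: 25 s
print(f"A is {n:.2f} times faster than B")
```

Because performance is the reciprocal of time, the ratio reduces to Execution time_B / Execution time_A = 25 / 20.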
Performance Evaluation: Benchmark
• (Real) Programs
– In the form of a collection of programs
– E.g., SPEC, Winstone, SYSMARK, 3D Winbench, EEMBC
• Kernels:
– Small key pieces of real programs
– E.g., Livermore Fortran Loops Kernels (LFK), Linpack
• Modified (or scripted)
– To focus on some particular aspects (e.g. remove I/O, focus on CPU)
• (Toy) Benchmarks
– Produce expected results
• Synthetic Benchmarks:
– Representative instruction mix
– E.g., Dhrystone, Whetstone
• Important for
– Architectural and microarchitectural design trade-offs
– Competitive analysis of real products
Performance Summary Measurement
• Average of total execution time
\[ \frac{1}{n}\sum_{i=1}^{n} \text{Time}_i \quad \text{or} \quad \frac{1}{n}\sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i \]
• This is Arithmetic Mean (Weighted Arithmetic
Mean)
Performance Summary Measurement
\[ \frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}} \quad \text{or} \quad \frac{n}{\sum_{i=1}^{n} \frac{\text{Weight}_i}{\text{Rate}_i}} \]
• Rate_i is a function of 1/Time_i
• Used to represent the average “rate” such as
instructions per cycle (IPC)
Why Harmonic Mean?
• 30 mph for the first 10 miles
• 90 mph for the next 10 miles
• Average speed? (30+90)/2 = 60 mph??
• Wrong!
• Average speed = total distance / total time
• (10+10)/(10/30 + 10/90) = 45 mph
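The driving example can be checked numerically. This is a small sketch with illustrative function names. It shows that total distance over total time equals the harmonic mean of the two speeds when both legs cover the same distance.

```python
def average_speed(legs):
    """legs: list of (distance_miles, speed_mph). Returns total distance / total time."""
    total_distance = sum(d for d, _ in legs)
    total_time = sum(d / v for d, v in legs)
    return total_distance / total_time

def harmonic_mean(rates):
    """n divided by the sum of reciprocals of the rates."""
    return len(rates) / sum(1.0 / r for r in rates)

legs = [(10, 30), (10, 90)]          # 10 miles at 30 mph, then 10 miles at 90 mph
print(average_speed(legs))           # total distance / total time
print(harmonic_mean([30, 90]))       # same value, since the legs are equally long
print((30 + 90) / 2)                 # the arithmetic mean overstates the speed
```

The arithmetic mean would be correct only if the two legs took equal time rather than covering equal distance.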
New Breed of Metrics
• Performance / Watt
– Performance achievable at the same cooling
capacity
• Performance / Joule (Energy)
– Performance achievable over the lifetime of the
same energy source (i.e., battery = energy)
– Equivalent to reciprocal of energy-delay product
(ED product)
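The equivalence can be checked in one line, assuming performance is defined as the reciprocal of execution time, as on the Performance Comparison slide. The symbols E (energy consumed) and D (delay, i.e. execution time) are introduced here for illustration.

\[ \frac{\text{Performance}}{\text{Energy}} = \frac{1/D}{E} = \frac{1}{E \times D} \]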
Amdahl’s Law (Law of Diminishing Returns)
• Make the common case faster
• Speedup
\[ \text{Speedup} = \frac{\text{Perf}_{new}}{\text{Perf}_{old}} = \frac{T_{old}}{T_{new}} = \frac{1}{(1 - f) + \frac{f}{P}} \]
• Performance improvement from using faster mode
is limited by the fraction the faster mode can be
applied.
[Figure: T_old is split into f and (1 - f); T_new is split into (1 - f) and f/P]
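The formula is easy to explore numerically. A minimal sketch, where `amdahl_speedup` is a hypothetical helper name, f is the fraction of the old execution time that can use the faster mode, and p is the speedup of that mode:

```python
def amdahl_speedup(f, p):
    """Speedup when a fraction f of the old execution time is made p times faster."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if p <= 0:
        raise ValueError("p must be positive")
    return 1.0 / ((1.0 - f) + f / p)

for p in (2, 8, 64):
    row = ", ".join(f"f={f:.2f}: {amdahl_speedup(f, p):5.2f}x" for f in (0.1, 0.5, 0.9))
    print(f"P={p:<2} {row}")

# As p grows without bound, the speedup approaches the limit 1 / (1 - f).
print(amdahl_speedup(0.9, 1e9))
```

Even with an arbitrarily large p, the untouched (1 - f) portion bounds the speedup, which is the sense in which the law describes diminishing returns.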
Amdahl’s Law Analogy
• Driving from Orlando to Atlanta
– 60 miles/hr from Orlando to Macon
– 120 miles/hr from Macon to Atlanta
– How much time can you save
compared with driving all the way
at 60 miles/hr from Orlando to
Atlanta?
• 6hr 45min vs. 7hr 30min = ~11%
speedup
• Key is to speed up the biggest portion, i.e.
speed up frequently executed blocks
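The 11% figure follows from Amdahl's Law using only the two trip times above. Doubling the speed on the Macon to Atlanta leg saves 45 min, so that leg takes 1.5 hr at 60 miles/hr. This gives f = 1.5 / 7.5 = 0.2 with P = 2:

\[ \text{Speedup} = \frac{1}{(1 - 0.2) + \frac{0.2}{2}} = \frac{1}{0.9} \approx 1.11 = \frac{7.5\ \text{hr}}{6.75\ \text{hr}} \]

Because the accelerated leg is only 20% of the trip, even an infinitely fast second leg would give at most 1 / 0.8 = 1.25x.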
Parallelism vs. Speedup
[Figure: Amdahl's Law speed-up as a function of parallelism. Y-axis: Speed-up (log scale, 1 to 100); X-axis: code portion in faster mode (f), 0 to 1; one curve each for P = 1, 2, 4, 8, 16, 32, 64; annotated points: 1.11x, 1.33x, 1.97x]
Gustafson’s Law
• Amdahl’s Law killed massively parallel processing (MPP)
• Gustafson came to the rescue
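[Figure: T_new consists of Seq and Parallel; T_old consists of Seq and p * Parallel time]
Assume: Seq + Parallel = 1 (T_new)
\[ \text{Speedup} = \text{Seq} + p \times (1 - \text{Seq}), \quad \text{where } p = \text{parallel factor} \]
If Seq diminishes with increased problem size, Speedup → p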
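A short sketch of the scaled-speedup formula, with `gustafson_speedup` as an illustrative name. Here seq is the sequential fraction of the time measured on the parallel machine (Seq + Parallel = 1), which is the key difference from the f used in Amdahl's Law.

```python
def gustafson_speedup(seq, p):
    """Scaled speedup: Seq + p * (1 - Seq), with Seq measured on the parallel run."""
    if not 0.0 <= seq <= 1.0:
        raise ValueError("seq must be in [0, 1]")
    return seq + p * (1.0 - seq)

p = 64
for seq in (0.5, 0.1, 0.01):
    # As the sequential share shrinks (larger problems), speedup approaches p.
    print(f"Seq={seq:<5} speedup={gustafson_speedup(seq, p):.2f}")
```

The speedup grows linearly in p for a fixed Seq, because the problem size is assumed to scale with the machine rather than stay fixed.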
Amdahl versus Gustafson
Who is right?
The Principle of Locality
• Knuth made the original observation about program locality
in 1971.
– … less than 4 percent of a program generally accounts for
more than half of its running time.
• 90/10 rule: a program spends 90% of its execution time in
only 10% of the code
• Two types of locality
– Temporal locality (locality in time)
– Spatial locality (locality in space)
• Memory subsystem design heavily leverages the locality
concept for better performance
Example of Performance Evaluation (I)
Operation            Frequency   Clock cycle count
ALU Ops (reg-reg)    43%         1
Loads                21%         2
Stores               12%         2
Branches             24%         2
Assume 25% of the ALU ops directly use a loaded operand that is not used again.
We propose adding ALU instructions that have one source operand in memory.
These new reg-mem instructions take 2 clock cycles. Also assume that the
extended instruction set increases the branch cycle count by 1, with no impact on cycle time.
Would this change improve performance?
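\[ \text{Cycles}_{old} = 0.43 \times 1 + 0.21 \times 2 + 0.12 \times 2 + 0.24 \times 2 = 1.57 \]
\[ \text{Cycles}_{new} = 0.25 \times 0.43 \times 2 + (0.43 - 0.25 \times 0.43) \times 1 + (0.21 - 0.25 \times 0.43) \times 2 + 0.12 \times 2 + 0.24 \times 3 = 1.703 \]
Both totals are cycles per original instruction. Since 1.703 > 1.57 and the cycle time is unchanged, the change does not improve performance.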
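The two cycle counts can be recomputed step by step. This sketch uses illustrative variable names. It also derives the new CPI, which is not on the slide: it follows from the assumption that each absorbed load and its ALU op collapse into a single reg-mem instruction.

```python
# Base mix from the table: fraction of instructions per class.
alu, load, store, branch = 0.43, 0.21, 0.12, 0.24
fused = 0.25 * alu  # ALU ops that absorb their load as reg-mem instructions

# Cycles per original instruction on the base machine.
cycles_old = alu * 1 + load * 2 + store * 2 + branch * 2

# Cycles per original instruction with the extended instruction set.
cycles_new = (
    fused * 2               # new reg-mem ALU instructions, 2 cycles each
    + (alu - fused) * 1     # remaining reg-reg ALU ops
    + (load - fused) * 2    # loads that were not absorbed
    + store * 2
    + branch * 3            # branches now take one more cycle
)

# Instruction count shrinks because each fused pair becomes one instruction.
inst_ratio = 1.0 - fused
cpi_new = cycles_new / inst_ratio

# Same cycle time, so the execution-time ratio equals the cycle ratio.
speedup = cycles_old / cycles_new
print(f"old={cycles_old:.3f} new={cycles_new:.3f} speedup={speedup:.3f}")
```

A speedup below 1 means the extra branch cycle costs more than the removed loads save.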
Example of Performance Evaluation (II)
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2% of all instructions, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2
• Design Option 2: decrease the average CPI of all FP instructions to 2.5
Original CPI = 0.25*4 + 1.33*(1-0.25) ≈ 2.0
Option 1 CPI = 2.0 – 2%*(20-2) = 1.64
Option 2 CPI = 0.25*2.5 + 1.33*(1-0.25) ≈ 1.625
Speedup of Option 1 = 2/1.64 = 1.2195
Speedup of Option 2 = 2/1.625 = 1.2308
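The comparison can be reproduced without intermediate rounding. This is a sketch with illustrative names. The slide rounds 1.33 * 0.75 to 1.0, so the unrounded values printed here differ from the slide in the third decimal, but the ranking of the two options is the same.

```python
FP_FRAC, FP_CPI = 0.25, 4.0      # FP instructions: 25% of the mix, average CPI 4.0
OTHER_CPI = 1.33                 # average CPI of all other instructions
SQRT_FRAC, SQRT_CPI = 0.02, 20.0 # FPSQRT: 2% of all instructions, CPI 20

def avg_cpi(fp_cpi):
    """Overall CPI for a given average CPI of the FP instructions."""
    return FP_FRAC * fp_cpi + (1.0 - FP_FRAC) * OTHER_CPI

cpi_orig = avg_cpi(FP_CPI)
cpi_opt1 = cpi_orig - SQRT_FRAC * (SQRT_CPI - 2.0)  # FPSQRT CPI drops from 20 to 2
cpi_opt2 = avg_cpi(2.5)                             # all FP instructions average 2.5

speedup_opt1 = cpi_orig / cpi_opt1
speedup_opt2 = cpi_orig / cpi_opt2
print(f"Option 1: CPI={cpi_opt1:.4f} speedup={speedup_opt1:.4f}")
print(f"Option 2: CPI={cpi_opt2:.4f} speedup={speedup_opt2:.4f}")
```

With the instruction count and clock unchanged, the CPI ratio is the speedup, so Option 2 comes out slightly ahead.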
Example of Performance Evaluation (III)
Clock freq = 1.4 GHz
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2, clock freq = 1.2 GHz
• Design Option 2: decrease the average CPI of all FP instructions to 2.5,
clock freq = 1.1 GHz
Original: CPI = 2.0, IPC = 1/2, Inst/Sec = 1/2 * 1.4G = 0.7G inst/s
Option 1: CPI = 1.64, IPC = 1/1.64, Inst/Sec = 1/1.64 * 1.2G = 0.73G inst/s
Option 2: CPI = 1.625, IPC = 1/1.625, Inst/Sec = 1/1.625 * 1.1G = 0.68G inst/s
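Once the clock frequency differs between designs, CPI alone no longer ranks them. A minimal sketch using the CPI values from the slide, with an illustrative `inst_per_sec` helper, makes the comparison explicit:

```python
def inst_per_sec(cpi, freq_hz):
    """Instruction throughput: IPC * frequency = frequency / CPI."""
    return freq_hz / cpi

# design -> (CPI, clock frequency in Hz), values as given on the slide
designs = {
    "Original": (2.0, 1.4e9),
    "Option 1": (1.64, 1.2e9),
    "Option 2": (1.625, 1.1e9),
}

rates = {name: inst_per_sec(cpi, f) for name, (cpi, f) in designs.items()}
for name, r in rates.items():
    print(f"{name}: {r / 1e9:.2f} G inst/s")

best = max(rates, key=rates.get)
print("Highest throughput:", best)
```

Option 2 has the lowest CPI but falls behind the original design because of its slower clock, while Option 1 is the only one that beats the original.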