Lecture 2a - Department of Computer Engineering

advertisement
Computer Performance and
Cost
Pradondet Nilagupta
Fall 2005
(original notes from Randy Katz, & Prof. Jan M.
Rabaey , UC Berkeley
Prof. Shaaban, RIT)
204521 Digital System Architecture
March 14, 2016
API
ISA
Link
I/O Chan
Review: What is Computer
Architecture?
Interfaces
Technology
IR
Regs
Machine Organization
Applications
Computer
Architect
Measurement &
Evaluation
204521 Digital System Architecture
March 14, 2016
2
Review: Computer System Components
CPU Core
1 GHz - 3.8 GHz
4-way Superscaler
RISC or RISC-core (x86):
Deep Instruction Pipelines
Dynamic scheduling
Multiple FP, integer FUs
Dynamic branch prediction
Hardware speculation
SDRAM
PC100/PC133
100-133MHZ
64-128 bits wide
2-way inteleaved
~ 900 MBYTES/SEC )64bit)
Current Standard
Double Date
Rate (DDR) SDRAM
PC3200
200 MHZ DDR
64-128 bits wide
4-way interleaved
~3.2 GBYTES/SEC
(one 64bit channel)
~6.4 GBYTES/SEC
(two 64bit channels)
L1
CPU
All Non-blocking caches
L1 16-128K
1-2 way set associative (on chip), separate or unified
L2 256K- 2M 4-32 way set associative (on chip) unified
L3 2-16M
8-32 way set associative (off or on chip) unified
L2
Examples: Alpha, AMD K7: EV6, 200-400 MHz
Intel PII, PIII: GTL+ 133 MHz
Intel P4
800 MHz
L3
Caches
Front Side Bus (FSB)
Off or On-chip
adapters
Memory
Controller
Memory Bus
I/O Buses
NICs
Controllers
Example: PCI, 33-66MHz
32-64 bits wide
133-528 MBYTES/SEC
PCI-X 133MHz 64 bit
1024 MBYTES/SEC
Memory
RAMbus DRAM (RDRAM)
400MHZ DDR
16 bits wide (32 banks)
~ 1.6 GBYTES/SEC
204521 Digital System Architecture
Disks
Displays
Keyboards
Networks
I/O Devices:
North
Bridge
South
Bridge
I/O Subsystem
Chipset
March 14, 2016
3
The Architecture Process
Estimate
Cost &
Performance
Sort
New concepts
created
Bad ideas
204521 Digital System Architecture
March 14, 2016
Mediocre
ideas
Good
ideas
4
Performance Measurement and
Evaluation
Many dimensions to
computer performance
– CPU execution time
P
• by instruction or sequence
– floating point
– integer
– branch performance
– Cache bandwidth
– Main memory bandwidth
– I/O performance
C
• bandwidth
• seeks
• pixels or polygons per
second
M
Relative importance depends
on applications
204521 Digital System Architecture
March 14, 2016
5
Evaluation Tools
Benchmarks, traces, & mixes
– macrobenchmarks & suites
• application execution time
– microbenchmarks
• measure one aspect of
performance
– traces
• replay recorded accesses
– cache, branch, register
Simulation at many levels
– ISA, cycle accurate, RTL, gate,
circuit
• trade fidelity for simulation rate
Area and delay estimation
Analysis
– e.g., queuing theory
204521 Digital System Architecture
March 14, 2016
MOVE
BR
LOAD
STORE
ALU
39%
20%
20%
10%
11%
LD 5EA3
ST 31FF
….
LD 1EA2
….
6
Benchmarks
Microbenchmarks
Macrobenchmarks
– measure one
performance dimension
– application execution
time
• cache bandwidth
• main memory
bandwidth
• procedure call
overhead
• FP performance
• measures overall
performance, but on
just one application
204521 Digital System Architecture
March 14, 2016
Macro
Micro
– weighted combination
of microbenchmark
performance is a good
predictor of application
performance
– gives insight into the
cause of performance
bottlenecks
Applications
Perf. Dimensions
7
Some Warnings about
Benchmarks
Benchmarks measure the
whole system
– application
– compiler
– operating system
– architecture
– implementation
Popular benchmarks
typically reflect
yesterday’s programs
– computers need to be
designed for
tomorrow’s programs
204521 Digital System Architecture
Benchmark timings often
very sensitive to
– alignment in cache
– location of data on disk
– values of data
Benchmarks can lead to
inbreeding or positive
feedback
– if you make an
operation fast (slow) it
will be used more (less)
often
March 14, 2016
• so you make it faster
(slower)
– and it gets used
even more (less)
» and so on…
8
Choosing Programs To Evaluate
Performance
Levels of programs or benchmarks that could be used to evaluate
performance:
– Actual Target Workload: Full applications that run on the target
machine.
– Real Full Program-based Benchmarks:
• Select a specific mix or suite of programs that are typical
of targeted applications or workload (e.g SPEC95, SPEC
CPU2000).
– Small “Kernel” Benchmarks:
• Key computationally-intensive pieces extracted from real
programs.
– Examples: Matrix factorization, FFT, tree search, etc.
• Best used to test specific aspects of the machine.
– Microbenchmarks:
• Small, specially written programs to isolate a specific
aspect of performance characteristics: Processing:
integer, floating point, local memory, input/output, etc.
204521 Digital System Architecture
March 14, 2016
9
Types of Benchmarks
Cons
Pros
• Representative
• Portable.
• Widely used.
• Measurements
useful in reality.
Actual Target Workload
Full Application Benchmarks
• Easy to run, early in
the design cycle.
• Identify peak
performance and
potential bottlenecks.
204521 Digital System Architecture
Small “Kernel”
Benchmarks
Microbenchmarks
March 14, 2016
• Very specific.
• Non-portable.
• Complex: Difficult
to run, or measure.
• Less representative
than actual workload.
• Easy to “fool” by
designing hardware
to run them well.
• Peak performance
results may be a long
way from real application
performance
10
SPEC: System Performance
Evaluation Cooperative
The most popular and industry-standard set of CPU benchmarks.
SPECmarks, 1989:
– 10 programs yielding a single number (“SPECmarks”).
SPEC92, 1992:
– SPECInt92 (6 integer programs) and SPECfp92 (14 floating point
programs).
SPEC95, 1995:
– SPECint95 (8 integer programs):
• go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
– SPECfp95 (10 floating-point intensive programs):
• tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fppp,
wave5
– Performance relative to a Sun SuperSpark I (50 MHz) which is given a
score of SPECint95 = SPECfp95 = 1
SPEC CPU2000, 1999:
– CINT2000 (11 integer programs). CFP2000 (14 floating-point intensive
programs)
– Performance relative to a Sun Ultra5_10 (300 MHz) which is given a score
of SPECint2000 = SPECfp2000 = 100
204521 Digital System Architecture
March 14, 2016
11
SPEC First Round
800
700
600
SPEC Perf
One program:
99% of time in
single line of
code
New front-end
compiler could
improve
dramatically
500
400
300
200
100
tomcatv
fpppp
matrix300
eqntott
li
nasa7
doduc
spice
epresso
gcc
0
Benchmark
204521 Digital System Architecture
March 14, 2016
12
Impact of Means
on SPECmark89 for IBM 550
(without and with special compiler option)
Ratio to VAX:
Program
gcc
espresso 35
spice
doduc
nasa7
li
eqntott
matrix300
fpppp
tomcatv
Mean
Before
30
34
47
46
78
34
40
78
90
33
54
After
29
65
47
49
144
34
40
730
87
138
72
Geometric
Ratio
1.33
204521 Digital System Architecture
Time:
Before
49
67
510
41
258
183
28
58
34
20
124
After
51
7.64
510
38
140
183
28
6
35
19
108
Arithmetic
Ratio
1.16
March 14, 2016
Weighted Time:
Before
8.91
7.86
5.69
5.81
3.43
7.86
6.68
3.43
2.97
2.01
54.42
After
9.22
5.69
5.45
1.86
7.86
6.68
0.37
3.07
1.94
49.99
Weighted Arith.
Ratio
1.09
13
SPEC CPU2000 Programs
CINT2000
(Integer)
CFP2000
(Floating
Point)
Benchmark Language
164.gzip
175.vpr
176.gcc
181.mcf
186.crafty
197.parser
252.eon
253.perlbmk
254.gap
255.vortex
256.bzip2
300.twolf
168.wupwise
171.swim
172.mgrid
173.applu
177.mesa
178.galgel
179.art
183.equake
187.facerec
188.ammp
189.lucas
191.fma3d
200.sixtrack
301.apsi
Descriptions
C
C
C
C
C
C
C++
C
C
C
C
C
Fortran 77
Fortran 77
Fortran 77
Fortran 77
C
Fortran 90
C
C
Fortran 90
C
Fortran 90
Fortran 90
Fortran 77
Fortran 77
Programs application domain: Engineering and scientific computation
204521 Digital System Architecture
March 14, 2016
Compression
FPGA Circuit Placement and Routing
C Programming Language Compiler
Combinatorial Optimization
Game Playing: Chess
Word Processing
Computer Visualization
PERL Programming Language
Group Theory, Interpreter
Object-oriented Database
Compression
Place and Route Simulator
Physics / Quantum Chromodynamics
Shallow Water Modeling
Multi-grid Solver: 3D Potential Field
Parabolic / Elliptic Partial Differential Equations
3-D Graphics Library
Computational Fluid Dynamics
Image Recognition / Neural Networks
Seismic Wave Propagation Simulation
Image Processing: Face Recognition
Computational Chemistry
Number Theory / Primality Testing
Finite-element Crash Simulation
High Energy Nuclear Physics Accelerator Design
Meteorology: Pollutant Distribution
Source: http://www.spec.org/osg/cpu2000/
14
Top 20 SPEC CPU2000 Results (As of March 2002)
Top 20 SPECint2000
Top 20 SPECfp2000
#
MHz
Processor
int peak
int base
MHz
Processor
fp peak
1
1300
POWER4
814
790
1300
POWER4
1169
2
2200
Pentium 4
811
790
1000
Alpha 21264C
960
3
2200
Pentium 4 Xeon
810
788
1050
UltraSPARC-III Cu
827
4
1667
Athlon XP
724
697
2200
Pentium 4 Xeon
802
779
5
1000
Alpha 21264C
679
621
2200
Pentium 4
801
779
6
1400
Pentium III
664
648
833
Alpha 21264B
7
1050
UltraSPARC-III Cu
610
537
800
Itanium
701
701
8
1533
Athlon MP
609
587
833
Alpha 21264A
644
571
9
750
PA-RISC 8700
604
568
1667
Athlon XP
642
596
10
833
Alpha 21264B
571
497
750
PA-RISC 8700
581
526
11
1400
Athlon
554
495
1533
Athlon MP
547
504
12
833
Alpha 21264A
533
511
600
MIPS R14000
529
499
13
600
MIPS R14000
500
483
675
SPARC64 GP
509
371
14
675
SPARC64 GP
478
449
900
UltraSPARC-III
15
900
UltraSPARC-III
467
438
1400
Athlon
458
426
16
552
PA-RISC 8600
441
417
1400
Pentium III
456
437
17
750
POWER RS64-IV
439
409
500
PA-RISC 8600
440
397
18
700
Pentium III Xeon
438
431
450
POWER3-II
433
426
19
800
Itanium
365
358
500
Alpha 21264
422
383
20
400
MIPS R12000
353
328
400
MIPS R12000
407
382
784
482
fp base
1098
776
701
643
427
Source: http://www.aceshardware.com/SPECmine/top.jsp
204521 Digital System Architecture
March 14, 2016
15
Common Benchmarking Mistakes
(1/2)
Only average behavior represented in test
workload
Skewness of device demands ignored
Loading level controlled inappropriately
Caching effects ignored
Buffer sizes not appropriate
Inaccuracies due to sampling ignored
204521 Digital System Architecture
March 14, 2016
16
Common Benchmarking Mistakes
(2/2)
Ignoring monitoring overhead
Not validating measurements
Not ensuring same initial conditions
Not measuring transient (cold start)
performance
Using device utilizations for
performance comparisons
Collecting too much data but doing
too little analysis
204521 Digital System Architecture
March 14, 2016
17
Architectural Performance
Laws and Rules of Thumb
204521 Digital System Architecture
March 14, 2016
Measurement and Evaluation
Architecture is an iterative process:
– Searching the space of possible
designs
– Make selections
– Evaluate the selections made
Good measurement tools are
required to accurately evaluate the
selection.
204521 Digital System Architecture
March 14, 2016
19
Measurement Tools
Benchmarks, Traces, Mixes
Cost, delay, area, power estimation
Simulation (many levels)
– ISA, RTL, Gate, Circuit
Queuing Theory
Rules of Thumb
Fundamental Laws
204521 Digital System Architecture
March 14, 2016
20
Measuring and Reporting
Performance
What do we mean by one Computer is faster
than another?
– program runs less time
Response time or execution time
– time that users see the output
Elapsed time
– A latency to complete a task including disk accesses,
memory accesses, I/O activities, operating system overhead
Throughput
– total amount of work done in a given time
204521 Digital System Architecture
March 14, 2016
21
Performance
“Increasing and decreasing” ?????
We use the term “improve performance” or “
improve execution time” When we mean
increase performance and decrease
execution time .
improve performance = increase performance
improve execution time = decrease execution time
204521 Digital System Architecture
March 14, 2016
22
Metrics of Performance
Application
Answers per month
Operations per second
Programming
Language
Compiler
ISA
(millions) of Instructions per second: MIPS
(millions) of (FP) operations per second: MFLOP/s
Datapath
Control
Function Units
Transistors Wires Pins
204521 Digital System Architecture
Megabytes per second
Cycles per second (clock rate)
March 14, 2016
Does Anybody Really Know
What Time it is?
UNIX Time Command: 90.7u 12.9s 2:39 65%
User CPU Time (Time spent in program)
System CPU Time (Time spent in OS)
Elapsed Time (Response Time = 159 Sec.)
(90.7+12.9)/159 * 100 = 65%, % of lapsed time that is
CPU time. 45% of the time spent in I/O or running
other programs
204521 Digital System Architecture
March 14, 2016
Example
UNIX time command
90.7u 12.95 2:39 65%
user CPU time is 90.7 sec
system CPU time is 12.9 sec
elapsed time is 2 min 39 sec. (159 sec)
% of elapsed time that is CPU time is 90.7 + 12.9 = 65%
159
204521 Digital System Architecture
March 14, 2016
25
Time
CPU time
– time the CPU is computing
– not including the time waiting for I/O or
running other program
User CPU time
– CPU time spent in the program
System CPU time
– CPU time spent in the operating system
performing task requested by the program
decrease execution time
CPU time = User CPU time + System CPU
time
204521 Digital System Architecture
March 14, 2016
26
Performance
System Performance
– elapsed time on unloaded system
CPU performance
– user CPU time on an unloaded system
204521 Digital System Architecture
March 14, 2016
27
Two notions of “performance”
Plane
DC to Paris
Speed
Passengers
Throughput
Boeing 747
6.5 hours
610 mph
470
286,700
BAD/Sud
Concorde
3 hours
1350 mph
132
178,200
Which has higher performance?
° Time to do the task (Execution Time)
– execution time, response time, latency
° Tasks per day, hour, week, sec, ns. ..
– throughput, bandwidth
Response time and throughput often are in opposition
204521 Digital System Architecture
March 14, 2016
28
Example
• Time of Concorde vs. Boeing 747?
• Concord is 1350 mph / 610 mph = 2.2 times faster
= 6.5 hours / 3 hours
• Throughput of Concorde vs. Boeing 747 ?
• Concord is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
• Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster”
• Boeing is 1.6 times (“60%”)faster in terms of throughput
• Concord is 2.2 times (“120%”) faster in terms of flying time
We will focus primarily on execution time for a single job
204521 Digital System Architecture
March 14, 2016
29
Computer Performance Measures:
Program Execution Time (1/2)
For a specific program compiled to
run on a specific machine (CPU) “A”,
the following parameters are
provided:
– The total instruction count of the
program.
– The average number of cycles per
instruction (average CPI).
– Clock cycle of machine “A”
204521 Digital System Architecture
March 14, 2016
30
Computer Performance Measures:
Program Execution Time (2/2)
How can one measure the performance
of this machine running this program?
– Intuitively the machine is said to be faster or has better
performance running this program if the total execution
time is shorter.
– Thus the inverse of the total measured program execution
time is a possible performance measure or metric:
PerformanceA = 1 / Execution TimeA
How to compare performance of different machines?
What factors affect performance? How to improve
performance?!!!!
204521 Digital System Architecture
March 14, 2016
31
Comparing Computer Performance
Using Execution Time
To compare the performance of two machines (or
CPUs) “A”, “B” running a given specific program:
PerformanceA = 1 / Execution TimeA
PerformanceB = 1 / Execution TimeB
Machine A is n times faster than machine B means
(or slower? if n < 1) :
PerformanceA
Speedup = n =
PerformanceB
204521 Digital System Architecture
=
March 14, 2016
Execution TimeB
Execution TimeA
32
Example
For a given program:
Execution time on machine A: ExecutionA = 1 second
Execution time on machine B: ExecutionB = 10 seconds
Performanc e
Execution Time
A
B
Speedup 

Performanc e
Execution Time
B
A
10
  10
1
The performance of machine A is 10 times the performance of
machine B when running this program, or: Machine A is said to
be 10 times faster than machine B when running this
program.
The two CPUs may target different ISAs provided
the program is written in a high level language (HLL)
204521 Digital System Architecture
March 14, 2016
33
CPU Execution Time: The CPU
Equation
A program is comprised of a number of instructions executed , I
– Measured in:
instructions/program
The average instruction executed takes a number of cycles per
instruction (CPI) to be completed.
– Measured in: cycles/instruction, CPI
CPU has a fixed clock cycle time C = 1/clock rate
– Measured in:
seconds/cycle
CPU execution time is the product of the above three parameters as
follows:
CPU time = Seconds = Instructions x Cycles
Program
T =
execution Time
per program in seconds
204521 Digital System Architecture
Program
I
x Seconds
Instruction
x
CPI x
Number of
instructions executed
Average CPI for program
March 14, 2016
Cycle
C
CPU Clock Cycle
34
CPU Execution Time: Example
A Program is running on a specific machine with
the following parameters:
– Total executed instruction count: 10,000,000 instructions Average
CPI for the program: 2.5 cycles/instruction.
– CPU clock rate: 200 MHz. (clock cycle = 5x10-9 seconds)
What is the execution time for this program:
CPU time
= Seconds
Program
= Instructions x Cycles
Program
Instruction
x Seconds
Cycle
CPU time = Instruction count x CPI x Clock cycle
= 10,000,000
x 2.5 x 1 / clock
rate
= 10,000,000
x 2.5 x 5x10-9
= .125 seconds
204521 Digital System Architecture
March 14, 2016
35
Aspects of CPU Execution Time
CPU Time = Instruction count x CPI x Clock cycle
Depends on:
T = I x CPI x C
Program Used
Compiler
ISA
Instruction Count I
(executed)
Depends on:
Program Used
Compiler
ISA
CPU Organization
204521 Digital System Architecture
CPI
(Average
CPI)
March 14, 2016
Clock
Cycle
C
Depends on:
CPU Organization
Technology (VLSI)
36
Factors Affecting CPU Performance
CPU time
= Seconds
= Instructions x Cycles
Program
Program
Instruction
Instruction
Count I
CPI
Program
X
X
Compiler
X
X
Instruction Set
Architecture (ISA)
X
X
X
Organization
(CPU Design)
Technology
Cycle
Clock Cycle C
X
X
(VLSI)
204521 Digital System Architecture
x Seconds
March 14, 2016
37
Performance Comparison:
Example
From the previous example: A Program is running on a
specific machine with the following parameters:
– Total executed instruction count, I: 10,000,000 instructions
– Average CPI for the program: 2.5 cycles/instruction.
– CPU clock rate: 200 MHz.
Using the same program with these changes:
– A new compiler used: New instruction count 9,500,000
New CPI: 3.0
– Faster CPU implementation: New clock rate = 300 MHZ
What is the speedup with the changes?
Speedup
= Old Execution Time = Iold x
New Execution Time Inew x
CPIold
x Clock cycleold
CPInew
x Clock Cyclenew
Speedup = (10,000,000 x 2.5 x 5x10-9) / (9,500,000 x 3 x
3.33x10-9 )
= .125 / .095 = 1.32
or 32 % faster after changes.
204521 Digital System Architecture
March 14, 2016
38
Instruction Types & CPI
Given a program with n types or classes of instructions executed
on
a given CPU with the following characteristics:
Ci = Count of instructions of typei
CPIi = Cycles per instruction for typei
i = 1, 2, …. n
Then:
CPI = CPU Clock Cycles / Instruction
Count I
Where:
n
CPU clock cycles  
i 1
CPI  C 
i
i
Instruction Count I = S Ci
204521 Digital System Architecture
March 14, 2016
39
Instruction Types & CPI: An Example
An instruction set has three instruction classes:
Instruction class
A
B
C
CPI
1
2
3
For a specific
CPU design
Two code sequences have the following instruction counts:
Code Sequence
1
2
Instruction counts for instruction class
A
B
C
2
1
2
4
1
1
CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles
CPI for sequence 1 = clock cycles / instruction count
= 10 /5 = 2
CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
CPI for sequence 2 = 9 / 6 = 1.5
204521 Digital System Architecture
March 14, 2016
40
Instruction Frequency & CPI
Given a program with n types or classes of
instructions with the following characteristics:
Ci
= Count of instructions of typei
CPIi = Average cycles per instruction of typei
Fi
= Frequency or fraction of instruction typei executed
= Ci/ total executed instruction count = Ci/ I
Then:
CPI 
 CPI
n
i 1
i
 F i
Fraction of total execution time for instructions of type i =
204521 Digital System Architecture
March 14, 2016
CPIi x Fi
CPI
41
Instruction Type Frequency & CPI:
A RISC Example
CPIi x Fi
Program Profile or Executed Instructions Mix
Base Machine (Reg / Reg)
Op
Freq, Fi CPIi
ALU
50%
1
Load
20%
5
Store
10%
3
Branch
20%
2
Typical Mix
n
CPI
CPIi x Fi
.5
1.0
.3
.4
% Time
23% = .5/2.2
45% = 1/2.2
14% = .3/2.2
18% = .4/2.2
Sum = 2.2
CPI   CPI i  F i 
i 1
204521 Digital System Architecture
March 14, 2016
42
Performance Terminology
“X is n% faster than Y” means:
ExTime(Y)
Performance(X)
--------- =
ExTime(X)
-------------= 1 + ----Performance(Y)
n
100
n = 100(Performance(X) - Performance(Y))
Performance(Y)
n = 100(ExTime(Y) - ExTime(X))
ExTime(X)
204521 Digital System Architecture
March 14, 2016
43
Example
Example: Y takes 15 seconds to complete a task,
X takes 10 seconds. What % faster is X?
n = 100(ExTime(Y) - ExTime(X))
ExTime(X)
n = 100(15 - 10)
10
n = 50%
204521 Digital System Architecture
March 14, 2016
Speedup
Speedup due to enhancement E:
ExTime w/o E
Speedup(E) = ------------ExTime w/ E
=
Performance w/ E
------------------Performance w/o E
Suppose that enhancement E accelerates a
fractionenhanced of
the task by a factor Speedupenhanced , and the
remainder of
the task is unaffected, then what is
ExTime(E) = ?
Speedup(E) = ?
204521 Digital System Architecture
March 14, 2016
45
Amdahl’s Law
States that the performance improvement to
be gained from using some faster mode of
execution is limited by the fraction of the time
faster mode can be used
Speedup =
Performance for entire task using the enhancement
Performance for the entire task without using the enhancement
or Speedup =
Execution time without the enhancement
Execution time for entire task using the enhancement
204521 Digital System Architecture
March 14, 2016
46
Amdahl’s Law
ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced
Speedupenhanced
Speedupoverall =
ExTimeold
ExTimenew
1
=
(1 - Fractionenhanced) + Fractionenhanced
Speedupenhanced
204521 Digital System Architecture
March 14, 2016
Example of Amdahl’s Law
Floating point instructions improved to
run 2X; but only 10% of actual
instructions are FP
ExTimenew =
Speedupoverall =
204521 Digital System Architecture
March 14, 2016
Example of Amdahl’s Law
Floating point instructions improved to
run 2X; but only 10% of actual
instructions are FP
ExTimenew = ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold
Speedupoverall =
204521 Digital System Architecture
1
0.95
March 14, 2016
=
1.053
Performance Enhancement
Calculations: Amdahl's Law
The performance enhancement possible due to a given design
improvement is limited by the amount that the improved feature is used
Amdahl’s Law:
Performance improvement or speedup due to enhancement E:
Execution Time without E
Speedup(E) = -------------------------------------Execution Time with E
Performance with E
= --------------------------------Performance without E
– Suppose that enhancement E accelerates a fraction F of the execution time
by a factor S and the remainder of the time is unaffected then:
Execution Time with E = ((1-F) + F/S) X Execution Time without E
Hence speedup is given by:
Execution Time without E
1
Speedup(E) = --------------------------------------------------------- = -------------------((1 - F) + F/S) X Execution Time without E
(1 - F) + F/S
204521 Digital System Architecture
March 14, 2016
50
Pictorial Depiction of Amdahl’s Law
Enhancement E accelerates fraction F of original execution time by a factor of S
Before:
Execution Time without enhancement E: (Before enhancement is applied)
• shown normalized to 1 = (1-F) + F =1
Unaffected fraction: (1- F)
Affected fraction: F
Unchanged
Unaffected fraction: (1- F)
F/S
After:
Execution Time with enhancement E:
Execution Time without enhancement E
1
Speedup(E) = ------------------------------------------------------ = -----------------Execution Time with enhancement E
(1 - F) + F/S
204521 Digital System Architecture
March 14, 2016
51
Performance Enhancement Example
For the RISC machine with the following instruction mix given earlier:
Op
Freq
Cycles
CPI(i)
% Time
ALU
50%
1
.5
23%
Load
20%
5
1.0
45%
Store
10%
3
.3
14%
Branch
20%
2
.4
18%
If a CPU design enhancement improves the CPI of load instructions from 5
to 2, what is the resulting performance improvement from this
enhancement:
Fraction enhanced = F = 45% or .45
Unaffected fraction = 100% - 45% = 55% or .55
Factor of enhancement = 5/2 = 2.5
Using Amdahl’s Law:
1
1
Speedup(E) = ------------------ = --------------------- =
(1 - F) + F/S
.55 + .45/2.5
204521 Digital System Architecture
March 14, 2016
1.37
52
An Alternative Solution Using CPU
Equation
Op
Freq
Cycles
CPI(i)
% Time
ALU
50%
1
.5
23%
Load
20%
5
1.0
45%
Store
10%
3
.3
14%
Branch
20%
2
.4
18%
If a CPU design enhancement improves the CPI of load instructions from 5 to 2,
what is the resulting performance improvement from this enhancement:
Old CPI = 2.2
New CPI = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6
Original Execution Time
Speedup(E) = ----------------------------------New Execution Time
Instruction count x old CPI x clock cycle
= ---------------------------------------------------------------Instruction count x new CPI x clock cycle
old CPI
= ------------ =
new CPI
2.2
--------1.6
= 1.37
Which is the same speedup obtained from Amdahl’s Law in the first solution.
204521 Digital System Architecture
March 14, 2016
53
Performance Enhancement Example
(1/2)
A program runs in 100 seconds on a machine with multiply
operations responsible for 80 seconds of this time. By how much
must the speed of multiplication be improved to make the program
four times faster?
Desired speedup = 4 =
100
----------------------------------------------------Execution Time with enhancement
 Execution time with enhancement = 25 seconds


25 seconds = (100 - 80 seconds) + 80 seconds / n
25 seconds =
20 seconds
+ 80 seconds / n
5 = 80 seconds / n
n = 80/5 = 16
Hence multiplication should be 16 times faster to get a speedup of 4.
204521 Digital System Architecture
March 14, 2016
54
Performance Enhancement Example
(2/2)
For the previous example with a program running in 100
seconds on a machine with multiply operations responsible for
80 seconds of this time. By how much must the speed of
multiplication be improved to make the program five times
faster?
Desired speedup = 5 =


100
----------------------------------------------------Execution Time with enhancement
Execution time with enhancement = 20 seconds
20 seconds = (100 - 80 seconds) + 80 seconds / n
20 seconds =
20 seconds
+ 80 seconds / n
0 = 80 seconds / n
No amount of multiplication speed improvement can
achieve this.
204521 Digital System Architecture
March 14, 2016
55
Extending Amdahl's Law To Multiple
Enhancements
Suppose that enhancement Ei accelerates a
fraction Fi of the execution time by a factor Si
and the remainder of the time is unaffected then:
Original Execution Time
Speedup 
(1  i F i )  i F i XOriginal Execution Time
(
Speedup 
S
)
i
1
((1   F )   F )
i
i
i
i
S
i
Note: All fractions Fi refer to original execution time before the
enhancements are applied.
.
204521 Digital System Architecture
March 14, 2016
56
Amdahl's Law With Multiple
Enhancements: Example
Three CPU performance enhancements are proposed with the following
speedups and percentage of the code execution time affected:
Speedup1 = S1 = 10
Percentage1 = F1 = 20%
Speedup2 = S2 = 15
Percentage1 = F2 = 15%
Speedup3 = S3 = 30
Percentage1 = F3 = 10%
While all three enhancements are in place in the new design, each
enhancement affects a different portion of the code and only one enhancement
can be used at a time.
What is the resulting overall speedup?
Speedup 
1
((1   F )   F )
i
i
i
i
S
i
Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30)]
= 1/ [
.55
+
.0333
]
= 1 / .5833 = 1.71
204521 Digital System Architecture
March 14, 2016
57
Pictorial Depiction of Example
Before:
Execution Time with no enhancements: 1
Unaffected, fraction: .55
S1 = 10
F1 = .2
/ 10
S2 = 15
S3 = 30
F2 = .15
F3 = .1
/ 15
/ 30
Unchanged
Unaffected, fraction: .55
After:
Execution Time with enhancements: .55 + .02 + .01 + .00333 = .5833
Speedup = 1 / .5833 = 1.71
Note: All fractions (Fi , i = 1, 2, 3) refer to original execution time.
204521 Digital System Architecture
March 14, 2016
58
Means
Arithmetic
mean
1 n
Ti

n i 1
Harmonic
mean
n
n
1

i 1 R i
Can be
weighted.
Represents
total execution
time
1
n
Geometric
mean
1 n
 n Ti 
 Ti  

    exp   log  
 i 1 Tri 
 Tri  
 n i 1
204521 Digital System Architecture
March 14, 2016
Consistent
independent
of reference
61
How to Summarize Performance
(1/2)
Arithmetic mean (weighted arithmetic
mean) tracks execution time:
 T or W *T
i
i
n
i
Harmonic mean (weighted harmonic
mean) of rates (e.g., MFLOPS) tracks
execution time:
n
1

204521 Digital System Architecture
or
Ri

n
Wi
March 14, 2016
Ri
62
How to Summarize Performance
(2/2)
Normalized execution time is handy for
scaling performance (e.g., X times faster
than SPARCstation 10)
– Arithmetic mean impacted by choice of
reference machine
Use the geometric mean for comparison:

n
Ti
– Independent of chosen machine
– but not good metric for total execution time
204521 Digital System Architecture
March 14, 2016
63
Summarizing Performance
Arithmetic mean
– Average execution time
– Gives more weight to longer-running programs
Weighted arithmetic mean
– More important programs can be emphasized
– But what do we use as weights?
– Different weight will make different machines look better
• (see Figure 1.16 in your text book – Hennessey/Patterson
3rd Edition)
• (see board for extension of previous example)
204521 Digital System Architecture
March 14, 2016
64
Normalizing & the Geometric Mean
Normalized execution time
– Take a reference machine R
– Time to run A on X / Time to run A on R
Geometric mean
n
n
 Normalized
execution time on i
i 1
Neat property of the geometric mean:
Consistent whatever the reference machine
Do not use the arithmetic mean for normalized
execution times
204521 Digital System Architecture
March 14, 2016
65
Which Machine is “Better”?
Computer A
Program P1(sec) 1
Program P2
1000
Total Time
1001
204521 Digital System Architecture
Computer B
10
100
110
March 14, 2016
Computer C
20
20
40
Weighted Arithmetic Mean
Assume three weighting schemes
P1/P2
.5/.5
.909/.091
.999/.001
Comp A
500.5
91.82
2
204521 Digital System Architecture
Comp B
55.0
18.8
10.09
March 14, 2016
Comp C
20
20
20
Performance Evaluation
Given sales is a function of performance relative to
the competition, big investment in improving
product as reported by performance summary
Good products created then have:
– Good benchmarks
– Good ways to summarize performance
If benchmarks/summary inadequate, then choose
between improving product for real programs vs.
improving product to get more sales;
Sales almost always wins!
Execution time is the measure of performance
204521 Digital System Architecture
March 14, 2016
Computer Performance Measures :
MIPS Rating (1/3)
For a specific program running on a specific CPU the MIPS rating
is a measure of how many millions of instructions are executed
per second:
MIPS Rating = Instruction count / (Execution Time x 106)
= Instruction count / (CPU clocks x Cycle time x 106)
= (Instruction count x Clock rate) / (Instruction
count x CPI x 106)
= Clock rate / (CPI x 106)
Major problem with MIPS rating: As shown above the MIPS rating
does not account for the count of instructions executed (I).
– A higher MIPS rating in many cases may not mean higher
performance or better execution time. i.e. due to compiler design
variations.
204521 Digital System Architecture
March 14, 2016
69
Computer Performance Measures :
MIPS Rating (2/3)
In addition the MIPS rating:
– Does not account for the instruction set architecture
(ISA) used.
• Thus it cannot be used to compare computers/CPUs with
different instruction sets.
– Easy to abuse: Program used to get the MIPS rating
is often omitted.
• Often the Peak MIPS rating is provided for a given CPU
which is obtained using a program comprised entirely of
instructions with the lowest CPI for the given CPU design
which does not represent real programs.
204521 Digital System Architecture
March 14, 2016
70
Computer Performance Measures :
MIPS Rating (3/3)
Under what conditions can the MIPS rating be used to compare
performance of different CPUs?
The MIPS rating is only valid to compare the performance of
different CPUs provided that the following conditions are
satisfied:
1 The same program is used
(actually this applies to all performance metrics)
2 The same ISA is used
3 The same compiler is used
 (Thus the resulting programs used to run on the CPUs
and obtain the MIPS rating are identical at the machine
code level including the same instruction count)
204521 Digital System Architecture
March 14, 2016
71
Compiler Variations, MIPS,
Performance: An Example (1/2)
For the machine with instruction classes:
Instruction class
CPI
A
1
B
2
C
3
For a given program two compilers produced the following
instruction counts:
Code from:
Compiler 1
Compiler 2
Instruction counts (in millions)
for each instruction class
A
B
C
5
1
1
10
1
1
The machine is assumed to run at a clock rate of 100 MHz
204521 Digital System Architecture
March 14, 2016
72
Compiler Variations, MIPS,
Performance: An Example (2/2)
MIPS = Clock rate / (CPI x 106) = 100 MHz / (CPI x 106)
CPI = CPU execution cycles / Instructions count
n
CPU clock cycles  
i 1
CPI  C 
i
i
CPU time = Instruction count x CPI / Clock rate
For compiler 1:
– CPI1 = (5 x 1 + 1 x 2 + 1 x 3) / (5 + 1 + 1) = 10 / 7 = 1.43
– MIP Rating1 = 100 / (1.428 x 106) = 70.0
– CPU time1 = ((5 + 1 + 1) x 106 x 1.43) / (100 x 106) = 0.10 seconds
For compiler 2:
– CPI2 = (10 x 1 + 1 x 2 + 1 x 3) / (10 + 1 + 1) = 15 / 12 = 1.25
– MIPS Rating2 = 100 / (1.25 x 106) = 80.0
– CPU time2 = ((10 + 1 + 1) x 106 x 1.25) / (100 x 106) = 0.15 seconds
204521 Digital System Architecture
March 14, 2016
73
Computer Performance
Measures :MFLOPS (1/2)
A floating-point operation is an addition,
subtraction, multiplication, or division operation
applied to numbers represented by a single or a
double precision floating-point representation.
MFLOPS, for a specific program running on a
specific computer, is a measure of millions of
floating point-operation (megaflops) per second:
MFLOPS = Number of floating-point operations /
(Execution time x 106 )
MFLOPS rating is a better comparison measure
between different machines (applies even if ISAs
are different) than the MIPS rating.
– Applicable even if ISAs are different
204521 Digital System Architecture
March 14, 2016
74
Computer Performance
Measures :MFLOPS (2/2)
Program-dependent: Different programs
have different percentages of floatingpoint operations present. i.e compilers
have no floating- point operations and
yield a MFLOPS rating of zero.
Dependent on the type of floating-point
operations present in the program.
– Peak MFLOPS rating for a CPU: Obtained using a program
comprised entirely of the simplest floating point instructions
(with the lowest CPI) for the given CPU design which does
not represent real floating point programs.
204521 Digital System Architecture
March 14, 2016
75
Other ways to measure
performance
(1) Use MIPS (millions of instructions/second)
MIPS =
Instruction Count
Exec. Time x
=
106
Clock Rate
CPI x 106
MIPS is a rate of operations/unit time.
Performance can be specified as the inverse of
execution time so faster machines have a higher
MIPS rating
So, bigger MIPS = faster machine. Right?
204521 Digital System Architecture
March 14, 2016
76
Wrong!!!
3 significant problems with using MIPS:
– Problem 1:
• MIPS is instruction set dependent.
• (And different computer brands usually have different
instruction sets)
– Problem 2:
• MIPS varies between programs on the same computer
– Problem 3:
• MIPS can vary inversely to performance!
Let’s look at an example of why MIPS doesn’t
work…
204521 Digital System Architecture
March 14, 2016
77
A MIPS Example (1)
Consider the following computer:
Instruction counts (in millions) for each
instruction class
Code from:
A
B
C
Compiler 1
5
1
1
Compiler 2
10
1
1
The machine runs at 100MHz.
Instruction A requires 1 clock cycle, Instruction B requires 2
clock cycles, Instruction C requires 3 clock cycles.
n
!
Note
important CPI =
formula!
S
CPU Clock Cycles
CPIi x Ci
i =1
=
Instruction Count
204521 Digital System Architecture
March 14, 2016
Instruction Count
78
A MIPS Example (2)
count
CPI1 =
[(5x1) + (1x2) + (1x3)] x 106
(5 + 1 + 1) x 106
MIPS1 =
CPI2 =
cycles
100 MHz
1.43
cycles
= 69.9
[(10x1) + (1x2) + (1x3)] x 106
MIPS2 =
(10 + 1 + 1) x 106
100 MHz
1.25
204521 Digital System Architecture
= 10/7 = 1.43
= 80.0
= 15/12 = 1.25
So, compiler 2 has a higher
MIPS rating and should be
faster?
March 14, 2016
79
A MIPS Example (3)
Now let’s compare CPU time:
!
Note
important
formula!
CPU Time =
CPU Time1 =
CPU Time2 =
Instruction Count x CPI
Clock Rate
7 x 106 x 1.43
100 x 106
12 x 106 x 1.25
100 x 106
= 0.10 seconds
= 0.15 seconds
Therefore program 1 is faster despite a lower MIPS!
204521 Digital System Architecture
March 14, 2016
80
Download