ppt - ECE Users Pages

Performance
Lecture notes from MKP, H. H. Lee and S. Yalamanchili
Reading
• Section 1.6
• Practice Problems: Module 5 – 20, 21, 27
(2)
Understanding Performance
• Algorithm
 Determines number of operations executed
• Programming language, compiler, architecture
 Determine number of machine instructions executed per operation
Instruction Set Architecture
• Processor and memory system
 Determine how fast instructions are executed
• I/O system (including OS)
 Determines how fast I/O operations are executed
(3)
Defining Performance
• Which airplane has the best performance?
[Figure: four bar charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 by Passenger Capacity, Cruising Range (miles), Cruising Speed (mph), and Passengers x mph]
(4)
Metrics
• Response time (latency)
 How long it takes to do a task
• Throughput
 Total work done per unit time
o e.g., tasks/transactions/… per hour
 Trading throughput vs. latency
• Energy/Power
 Measure of work being performed
 Increases with clock frequency/voltage
 Determines temperature
 Is affected by temperature
(5)
Relative Performance
• "X is n times faster than Y"
$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

• Example: time taken to run a program
 10s on A, 15s on B
 Execution Time_B / Execution Time_A = 15s / 10s = 1.5
 So A is 1.5 times faster than B
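The ratio above can be checked with a few lines of Python. This is only a minimal sketch; the function name `relative_performance` is a hypothetical helper, not something from the lecture.

```python
def relative_performance(time_x: float, time_y: float) -> float:
    """Return n such that "X is n times faster than Y".

    Performance is the inverse of execution time, so
    Performance_X / Performance_Y = time_y / time_x.
    """
    if time_x <= 0 or time_y <= 0:
        raise ValueError("execution times must be positive")
    return time_y / time_x


# Slide example: 10 s on A, 15 s on B.
n = relative_performance(time_x=10.0, time_y=15.0)
print(f"A is {n} times faster than B")
```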
(6)
CPU Clocking
• Operation of digital hardware governed by a constant-rate clock
[Figure: clock waveform marking the clock period (cycle time); within each clock cycle, data transfer and computation are followed by a state update]

• Clock period: duration of a clock cycle
 e.g., 250ps = 0.25ns = $250 \times 10^{-12}$ s
• Clock frequency (rate): cycles per second
 e.g., 4.0GHz = 4000MHz = $4.0 \times 10^{9}$ Hz
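Clock period and clock rate are reciprocals, which is why the 250ps period and the 4.0GHz rate above describe the same clock. A small sketch of the conversion follows; the helper names are illustrative assumptions.

```python
def clock_rate_hz(period_s: float) -> float:
    """Clock rate in Hz for a given clock period in seconds."""
    return 1.0 / period_s


def clock_period_s(rate_hz: float) -> float:
    """Clock period in seconds for a given clock rate in Hz."""
    return 1.0 / rate_hz


# 250 ps period corresponds to a 4.0 GHz clock, and vice versa.
print(clock_rate_hz(250e-12) / 1e9, "GHz")
print(clock_period_s(4.0e9) * 1e12, "ps")
```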
(7)
CPU Time
$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$
• Performance improved by
 Reducing number of clock cycles
 Increasing clock rate
 Hardware designer must often trade off clock rate
against cycle count
(8)
CPU Time Example
• Computer A: 2GHz clock, 10s CPU time
• Designing Computer B
 Aim for 6s CPU time
 Can do faster clock, but causes 1.2 × clock cycles
• How fast must Computer B clock be?
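$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$$
$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^{9}$$
$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^{9}}{6\,\text{s}} = \frac{24 \times 10^{9}}{6\,\text{s}} = 4\,\text{GHz}$$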
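The same calculation can be sketched in Python, using only the numbers given for Computers A and B. The function names are hypothetical helpers chosen for this illustration.

```python
def clock_cycles(cpu_time_s: float, clock_rate_hz: float) -> float:
    """Clock cycles = CPU time x clock rate."""
    return cpu_time_s * clock_rate_hz


def required_clock_rate(cycles: float, target_time_s: float) -> float:
    """Clock rate needed to finish `cycles` within `target_time_s`."""
    return cycles / target_time_s


cycles_a = clock_cycles(cpu_time_s=10.0, clock_rate_hz=2.0e9)  # Computer A
cycles_b = 1.2 * cycles_a  # the faster clock costs 1.2x the cycles
rate_b = required_clock_rate(cycles_b, target_time_s=6.0)
print(f"Computer B needs about {rate_b / 1e9:.1f} GHz")
```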
(9)
Instruction Count and CPI
$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$$
• Instruction Count for a program
 Determined by program, ISA and compiler
• Average cycles per instruction
 Determined by CPU hardware
 If different instructions have different CPI
o Average CPI affected by instruction mix
(10)
Cycles and Instructions
• Multiplication takes more time than addition
• Floating point operations take longer than integer ones
• Accessing memory takes (in general) more time than accessing registers
• Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)
(11)
Program Execution time
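$$\text{ExecutionTime} = \left[\sum_{i=1}^{n} C_i \times \text{CPI}_i\right] \times \text{cycle\_time}$$
where $n$ is the number of instruction classes and $C_i$ is the instruction count for class $i$.
$$\text{ExecutionTime} \approx \text{Instruction\_count} \times \text{CPI}_{avg} \times \text{clock\_cycle\_time}$$
• Instruction_count: algorithms/compiler
• CPI_avg: architecture
• clock_cycle_time: technology
$$\text{CPI}_{avg} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n}\left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$
where $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of class $i$.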
(12)
CPI Example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?
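$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$
A is faster…
$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$
…by this much:
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$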
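A short sketch of the comparison above. Because both machines run the same ISA, the instruction count I cancels in the ratio; the value used for it here is an arbitrary assumption for illustration.

```python
def cpu_time(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


I = 1_000_000  # assumed instruction count; any positive value gives the same ratio

time_a = cpu_time(I, cpi=2.0, cycle_time_s=250e-12)  # Computer A
time_b = cpu_time(I, cpi=1.2, cycle_time_s=500e-12)  # Computer B
ratio = time_b / time_a
print(f"A is about {ratio:.2f} times faster than B")
```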
(13)
CPI Example
• Alternative compiled code sequences using instructions in classes A, B, C

Class             A  B  C
CPI for class     1  2  3
IC in sequence 1  2  1  2
IC in sequence 2  4  1  1

• Sequence 1: IC = 5
 Clock Cycles = 2×1 + 1×2 + 2×3 = 10
 Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
 Clock Cycles = 4×1 + 1×2 + 1×3 = 9
 Avg. CPI = 9/6 = 1.5
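The two sequences can be compared programmatically. This is a minimal sketch using the per-class CPI values and instruction counts from the example; the dictionary layout and function names are assumptions made for illustration.

```python
# CPI of each instruction class, taken from the example.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}


def total_cycles(counts: dict) -> int:
    """Sum of (instruction count x CPI) over all classes."""
    return sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())


def average_cpi(counts: dict) -> float:
    """Total clock cycles divided by total instruction count."""
    return total_cycles(counts) / sum(counts.values())


seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}

for name, seq in (("Sequence 1", seq1), ("Sequence 2", seq2)):
    print(name, "cycles:", total_cycles(seq), "avg CPI:", average_cpi(seq))
```

Sequence 2 executes more instructions yet needs fewer cycles, because its mix leans toward the cheap class A.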
(14)
Performance Summary
The BIG Picture
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
• Performance depends on
 Algorithm: affects IC, possibly CPI
 Programming language: affects IC, CPI
 Compiler: affects IC, CPI
 Instruction set architecture: affects IC, CPI, Tc
(15)
SPEC CPU Benchmark
• Programs used to measure performance
 Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
 Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
 Elapsed time to execute a selection of programs
o Negligible I/O, so focuses on CPU performance
 Normalize relative to reference machine
 Summarize as geometric mean of performance ratios
o CINT2006 (integer) and CFP2006 (floating-point)
$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
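The geometric mean above can be sketched as follows. The ratios in the usage lines are made-up values for illustration only, not SPEC results.

```python
import math


def geometric_mean(ratios: list) -> float:
    """n-th root of the product of n positive ratios.

    Computed via logarithms so that long lists do not overflow.
    """
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical execution time ratios (reference time / measured time).
example_ratios = [2.0, 8.0]
print(geometric_mean(example_ratios))
```

Python 3.8 and later also provide `statistics.geometric_mean`, which could be used instead of the hand-written version.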
(16)
Other Benchmark Suites
• Report performance metrics for execution on
target platforms
 Designed to assess how well the platforms function in
specific domains
• Examples
 MediaBench: multimedia
 EEMBC: embedded systems
 Rodinia, Parboil: GPU systems
 SPECweb, SPECjbb: enterprise systems
 Many more…
(17)
Pitfall: Amdahl’s Law
• Improving an aspect of a computer and expecting a proportional improvement in overall performance
$$T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}$$
• Example: multiply accounts for 80s/100s
 How much improvement in multiply performance to get 5× overall?
$$20 = \frac{80}{n} + 20$$
 Can't be done!
• Corollary: make the common case fast
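The multiply example can be checked numerically. The sketch below solves the formula for the improvement factor n and returns `None` when no finite factor reaches the target; both function names are hypothetical.

```python
def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """T_improved = T_affected / improvement factor + T_unaffected."""
    return t_affected / factor + t_unaffected


def required_factor(t_affected: float, t_unaffected: float, target_time: float):
    """Improvement factor needed to reach target_time, or None if unreachable."""
    remaining = target_time - t_unaffected  # time budget left for the affected part
    if remaining <= 0:
        return None  # the unaffected part alone already uses the whole budget
    return t_affected / remaining


# Multiply takes 80 s of 100 s; a 5x overall speedup means a 20 s target.
print(required_factor(80.0, 20.0, target_time=20.0))
# A 2x overall speedup (50 s target) is reachable.
print(required_factor(80.0, 20.0, target_time=50.0))
```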
(18)
Amdahl’s Law
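• Speed-up = Exec_time_old / Exec_time_new:
$$\text{Speed-up} = \frac{\text{Exec\_time}_{old}}{\text{Exec\_time}_{new}} = \frac{1}{(1 - f) + \dfrac{f}{P}}$$
where $f$ is the affected fraction of the original execution time and $P$ is the improvement factor of the faster mode.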
• Performance improvement from using faster mode is limited by the fraction the faster mode can be applied.
[Figure: T_old divided into f and (1 - f); T_new divided into (1 - f) and f/P]
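A brief sketch of the speed-up formula, showing how the result saturates as P grows. The function name and the sample values of P are illustrative assumptions; f = 0.8 matches the earlier multiply example.

```python
def amdahl_speedup(f: float, p: float) -> float:
    """Overall speed-up when a fraction f of the time is sped up by factor p."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if p <= 0:
        raise ValueError("p must be positive")
    return 1.0 / ((1.0 - f) + f / p)


# With f = 0.8 the speed-up approaches, but never reaches, 1 / (1 - f) = 5.
for p in (2, 4, 10, 100, 1_000_000):
    print(f"P = {p:>9}: speed-up = {amdahl_speedup(0.8, p):.4f}")
```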
(19)
Concluding Remarks
• Cost/performance is improving
 Due to underlying technology development
• Hierarchical layers of abstraction
 In both hardware and software
• Instruction set architecture
 The hardware/software interface
• Execution time: the best performance measure
• Power is a limiting factor
 Use parallelism to improve performance
(20)
Study Guide
• Practice problems provided on the class
website
(21)
Glossary
• Amdahl’s Law
• Latency
• Benchmarks
• SPEC CPU
• CPI (cycles per instruction)
• Throughput
• CPU Time
(22)