Lecture 1: Performance
EEN 312: Processors:
Hardware, Software, and
Interfacing
Department of Electrical and Computer Engineering
Spring 2013, Dr. Rozier (UM)
PERFORMANCE TRENDS
Growth in Processor Performance
since 1978.
Logarithmic Scale!
Moore’s Law
• Gordon Moore – One of the founders of Intel
– Famously predicted in 1965 that the transistor
capacity of integrated circuits would double at a
regular interval, now usually quoted as every
18-24 months.
– Not really a law, but has largely held true.
– Generally translates into increased performance,
and decreased cost.
Moore’s Law
Exponential Growth
How do we get to Performance?
• Do more transistors really mean more
performance?
• Is it a one-to-one correlation?
• How might transistors NOT correlate to
increased performance?
MEASURING PERFORMANCE
A simple example
• Say we have two computers. You know one is
rated at 1GHz and another is rated at 800MHz.
• Which computer has a higher performance?
A simple example?
• What do GHz and MHz even mean?
• What else could differ about the machines?
• What else could differ about the context of
performance?
THE SITUATION IS A COMPLEX ONE!
First, Some Measure Theory
• What is a measure? Formally?
– A way of assigning numbers to the subsets of
some set, which can be said (intuitively) to be the
size of the set.
– Measures require measurable spaces, and
measurable sets.
– Not all sets are measurable!
Measurable Sets/Spaces
• One reason a space or set may be
unmeasurable is if it is ill-defined.
Which Plane has a Higher
Performance?
[Figure: four bar charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 by Passenger Capacity, Cruising Range (miles), Cruising Speed (mph), and Passengers x mph]
Defining Performance
• We can define performance in several ways.
• Response time
– How long does it take to accomplish a task?
– We send input to a black box, and measure how
long it takes to get the output back.
Defining Performance
• We can define performance in several ways.
• Throughput
– How much work gets done during a certain
amount of time?
– Watch a system, count the number of jobs
finished during a certain amount of time.
Throughput Example
• What is the fastest way you can think to
deliver a large amount of data?
• Never underestimate the throughput of a
Mack Truck loaded with hard drives!
What’s the Response time of our
Truck?
Response time as Execution Time
• Start a program, wait for it to return results.
Comparing Performance
• Given the performance or execution time of a
computer (A) and a different computer (B)
running the same program, we can compare
performance.
Comparing Performance
• Relative performance
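A minimal sketch of the comparison, assuming the usual convention that performance is the reciprocal of execution time, so that A is n times as fast as B when the execution time of B divided by the execution time of A equals n. The function names and the timings are illustrative assumptions, not values from the lecture.

```python
def performance(execution_time_s: float) -> float:
    """Performance taken as the reciprocal of execution time (assumed convention)."""
    return 1.0 / execution_time_s


def relative_performance(time_a_s: float, time_b_s: float) -> float:
    """Return n such that computer A is n times as fast as computer B."""
    return performance(time_a_s) / performance(time_b_s)


# Hypothetical timings: the same program takes 10 s on A and 15 s on B.
n = relative_performance(10.0, 15.0)
print(f"A is {n:.2f} times as fast as B")
```

Because the ratio is taken on the same program, the amount of work cancels out and only the two execution times matter.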
Why is Relative Performance
Important?
So How Do We Measure
Performance?
• First let’s define performance:
– Execution time
• What is our measurable space?
• What is our measurable set?
Measuring Execution Time
• CPU execution time
• Wall clock time
• How might these differ?
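One way to see the difference is to time a task that both computes and waits. The sketch below uses Python's standard `time.perf_counter()` for wall clock time and `time.process_time()` for CPU time of the current process; the workload (`busy_work`) and the 0.2 s sleep are illustrative assumptions.

```python
import time


def busy_work(iterations: int) -> int:
    """A small CPU-bound loop (hypothetical workload)."""
    total = 0
    for i in range(iterations):
        total += i * i
    return total


wall_start = time.perf_counter()   # wall clock timer
cpu_start = time.process_time()    # CPU time used by this process

busy_work(200_000)   # uses CPU time
time.sleep(0.2)      # waiting: adds wall clock time but almost no CPU time

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall clock time: {wall_elapsed:.3f} s")
print(f"CPU time:        {cpu_elapsed:.3f} s")
```

Waiting on I/O, sleeping, or sharing the processor with other programs all add to wall clock time without adding to the program's CPU time.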
Measuring Execution Time
• Clock cycles
• Instruction count
Clock Cycles
• Clock period – duration of a clock cycle
• Clock frequency – number of cycles per
second
[Figure: clock timing diagram showing the clock period, with data transfer and computation during each cycle followed by a state update]
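Clock period and clock frequency are reciprocals of each other. As a small sketch, using the 1GHz and 800MHz ratings from the earlier example (the function name is an assumption):

```python
def clock_period_s(frequency_hz: float) -> float:
    """Duration of one clock cycle, in seconds, for a given clock frequency."""
    return 1.0 / frequency_hz


for label, frequency_hz in (("1 GHz", 1e9), ("800 MHz", 800e6)):
    period_ps = clock_period_s(frequency_hz) * 1e12  # seconds to picoseconds
    print(f"{label}: clock period = {period_ps:.0f} ps")
```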
CPU Time
• We can improve performance by
– Reducing the number of clock cycles
– Increasing clock rate
– Often there is a trade-off
$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$
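The two improvement routes can be tried out with a short sketch of the CPU time relation. The cycle counts and clock rates below are hypothetical values chosen for illustration.

```python
def cpu_time_s(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


# Hypothetical baseline: 4e9 cycles on a 2 GHz clock.
baseline = cpu_time_s(4e9, 2e9)
fewer_cycles = cpu_time_s(3e9, 2e9)   # reduce the number of clock cycles
faster_clock = cpu_time_s(4e9, 4e9)   # increase the clock rate

print(f"baseline:     {baseline:.2f} s")
print(f"fewer cycles: {fewer_cycles:.2f} s")
print(f"faster clock: {faster_clock:.2f} s")
```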
CPU Example
• Computer A: 2 GHz clock, 10s CPU time
• Computer B
– Aim for 6s CPU time. A faster clock is possible,
but it increases the number of cycles by 1.2×.
Break Into Groups
Find the necessary clock rate for Computer B
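$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$$
$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^9$$
$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^9}{6\,\text{s}} = \frac{24 \times 10^9}{6\,\text{s}} = 4\,\text{GHz}$$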
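The same calculation as a short script, using only the numbers given for Computers A and B; the variable names are assumptions made for this sketch.

```python
cpu_time_a_s = 10.0       # Computer A: CPU time
clock_rate_a_hz = 2e9     # Computer A: 2 GHz clock
target_time_b_s = 6.0     # Computer B: desired CPU time
cycle_growth = 1.2        # B needs 1.2x as many cycles as A

# Clock cycles = CPU time x clock rate
clock_cycles_a = cpu_time_a_s * clock_rate_a_hz
clock_cycles_b = cycle_growth * clock_cycles_a

# Clock rate = clock cycles / CPU time
clock_rate_b_hz = clock_cycles_b / target_time_b_s
print(f"Clock rate B = {clock_rate_b_hz / 1e9:.1f} GHz")
```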
Instruction Count and CPI
• Instruction count
– How many instructions the program executes
• Depends on the ISA and compiler
• CPI
– Cycles per instruction
• Determined by hardware
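$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$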
CPI Example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster? By how much?
Break Into Groups
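$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$
A is faster…
$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$
…by this much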
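The comparison can be checked numerically. This sketch plugs the cycle times and CPIs of Computers A and B into the CPU time relation; the instruction count is a hypothetical placeholder, since both machines share an ISA and the count cancels in the ratio.

```python
def cpu_time_s(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


instruction_count = 1_000_000  # hypothetical; same program and ISA on both machines

time_a = cpu_time_s(instruction_count, cpi=2.0, cycle_time_s=250e-12)
time_b = cpu_time_s(instruction_count, cpi=1.2, cycle_time_s=500e-12)

ratio = time_b / time_a
print(f"CPU Time B / CPU Time A = {ratio:.2f}")
```

Computer B has the lower CPI, but its longer cycle time more than cancels that advantage.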
CPI Detail
• Sometimes different instructions take differing
amounts of time.
$$\text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right)$$
• Often we will want to weight by instruction
proportion in a program.
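$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$
The ratio of Instruction Count_i to Instruction Count is the relative frequency of instruction class i.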
CPI Example
• Have instruction classes A, B, and C. Two ways
to compile our code:
Class              A   B   C
CPI for class      1   2   3
IC in sequence 1   2   1   2
IC in sequence 2   4   1   1
Give the average CPI for each program

Sequence 1: IC = 5
Clock Cycles = 2×1 + 1×2 + 2×3 = 10
Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6
Clock Cycles = 4×1 + 1×2 + 1×3 = 9
Avg. CPI = 9/6 = 1.5
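A short sketch of the same computation, using the per-class CPI and instruction counts from the table. It computes the total clock cycles directly and the average CPI as a sum weighted by relative frequency; the function and variable names are assumptions.

```python
cpi_per_class = {"A": 1, "B": 2, "C": 3}

instruction_counts = {
    "sequence 1": {"A": 2, "B": 1, "C": 2},
    "sequence 2": {"A": 4, "B": 1, "C": 1},
}


def clock_cycles(counts: dict) -> int:
    """Sum of CPI_i x Instruction Count_i over all instruction classes."""
    return sum(cpi_per_class[cls] * n for cls, n in counts.items())


def average_cpi(counts: dict) -> float:
    """CPI_i weighted by the relative frequency of each instruction class."""
    total = sum(counts.values())
    return sum(cpi_per_class[cls] * (n / total) for cls, n in counts.items())


for name, counts in instruction_counts.items():
    print(f"{name}: cycles = {clock_cycles(counts)}, "
          f"avg CPI = {average_cpi(counts):.1f}")
```

Sequence 2 executes more instructions yet needs fewer clock cycles, because it leans on the cheaper class A instructions.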
Performance Summary
• Performance depends on
– Algorithm: affects IC, possibly CPI
– Programming language: affects IC, CPI
– Compiler: affects IC, CPI
– Instruction set architecture: affects IC, CPI, Tc (clock cycle time)
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
So Why Don’t We Have 1THz
Computers?
The Power Wall
• In CMOS IC technology
$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$
Slide annotations: Power ×30, Voltage 5V → 1V, Frequency ×1000
The Power Wall
• Suppose a new CPU has
– 85% of capacitive load of old CPU
– 15% voltage and 15% frequency reduction
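$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^2 \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 = 0.52$$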
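The ratio can be checked with a few lines of arithmetic. This sketch applies the power relation (capacitive load × voltage² × frequency) with every factor scaled to 85% of its old value; the variable names are assumptions.

```python
capacitive_scale = 0.85   # new CPU has 85% of the old capacitive load
voltage_scale = 0.85      # 15% voltage reduction
frequency_scale = 0.85    # 15% frequency reduction

# The old values cancel, leaving only the scale factors; voltage is squared.
power_ratio = capacitive_scale * voltage_scale**2 * frequency_scale
print(f"P_new / P_old = {power_ratio:.2f}")
```

Voltage contributes twice because it enters the relation squared, which is why lowering it was such an effective lever.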

The power wall
• We can’t reduce voltage further
• We can’t remove more heat
How else can we improve performance?
Multiprocessors
• Multicore microprocessors
– More than one processor per chip
• Requires explicitly parallel programming
– Compare with instruction level parallelism
  • Hardware executes multiple instructions at once
  • Hidden from the programmer
– Hard to do
  • Programming for performance
  • Load balancing
  • Optimizing communication and synchronization
Amdahl’s Law
• Pitfall: improving one aspect of a computer and expecting a
proportional improvement in overall performance
$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$
Example: multiply accounts for 80s/100s

How much improvement in multiply performance to
get 5× overall?
Break into Groups!
$$20 = \frac{80}{n} + 20$$
Can’t be done!
Corollary: make the common case fast
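A small sketch makes the limit visible: with multiply taking 80s of a 100s run, the improved time approaches 20s as the improvement factor grows but never reaches it, and 20s is exactly what a 5× overall speedup would require. The function name and the sampled improvement factors are assumptions.

```python
def improved_time_s(affected_s: float, unaffected_s: float,
                    improvement_factor: float) -> float:
    """T_improved = T_affected / improvement factor + T_unaffected."""
    return affected_s / improvement_factor + unaffected_s


affected_s, unaffected_s = 80.0, 20.0      # multiply accounts for 80s of 100s
total_s = affected_s + unaffected_s
target_s = total_s / 5                     # a 5x overall speedup needs 20 s

for n in (2, 10, 100, 1_000_000):
    t = improved_time_s(affected_s, unaffected_s, n)
    print(f"n = {n:>9}: {t:.4f} s, overall speedup {total_s / t:.2f}x")
```

However large the improvement factor, the 20s of unaffected work remains, so the overall speedup stays below 5×.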
PROBLEM SETS
Consider the following processors, P1, P2, and
P3, executing the same instruction set with clock
rates and CPI as indicated:
Processor   Clock Rate   CPI
P1          3 GHz        1.5
P2          2.5 GHz      1.0
P3          4 GHz        2.2
1. Which processor has the highest performance in terms of instructions
per second?
2. If the processors each execute a program in 10s, find the number of
cycles and the number of instructions.
3. We are trying to reduce the execution time by 30% but this leads to an
increase in CPI of 20%. What clock rate should we have to get this
reduction?
Consider a computer running code with four
main routines, A, B, C, and D.
Routine A   Routine B   Routine C   Routine D   Total Time
40s         90s         60s         20s         210s
1. How much is the total time reduced if the time for Routine A is reduced
by 20%?
2. How much is the time for Routine B reduced if the total time is reduced
by 20%?
3. Can the total time be reduced by 20% by only reducing the time for
Routine D?
Consider a computer running code with four
main routines, A, B, C, and D.
               Routine A   Routine B   Routine C   Routine D   Total Time
Exec Time      40s         90s         60s         20s         210s
Instructions   50x10^6     110x10^6    80x10^6     16x10^6     -
Avg CPI        1           1           4           2           -
1. How much is the total time reduced if the time for Routine A is reduced
by 20%?
2. How much is the time for Routine B reduced if the total time is reduced
by 20%?
3. Can the total time be reduced by 20% by only reducing the time for
Routine D?
Consider a computer running code with four
main routines, A, B, C, and D.
               Routine A   Routine B   Routine C   Routine D   Total Time
Exec Time      40s         90s         60s         20s         210s
Instructions   50x10^6     110x10^6    80x10^6     16x10^6     -
Avg CPI        1           1           4           2           -
1. How much must we improve the CPI of Routine A if we want the
program to run twice as fast?
2. How much must we improve the CPI of Routine C if we want the
program to run twice as fast?
3. How much is the execution time improved if the CPIs of routines A and B
are reduced by 40%, and the CPIs of routines C and D are reduced by
30%?
WRAP UP
For next time
• Read Chapter 2, Sections 2.1 – 2.3
• Finish Lab 0 by next lab session.