Fundamentals of Design

Computer Architecture
Part I-C: Performance
What does faster mean?

Response time



The time spent to complete an event
Also referred to as execution time or
latency
Throughput



Amount of work done in a given time
Also referred to as bandwidth
In general, a shorter response time also
improves throughput
Execution Time and Performance

Quantitatively, execution time is
inversely proportional to performance.



improve performance = increase
performance
improve execution time = decrease
execution time
X is n times faster than Y means
n = P_X / P_Y = t_Y / t_X
(P = performance, t = execution time)
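As a small illustration of the definition above, the ratio can be computed directly from two measured execution times. The function name and the sample times are made up for this sketch.

```python
def times_faster(t_x: float, t_y: float) -> float:
    """Return n such that X is n times faster than Y.

    t_x and t_y are execution times of the same task on X and Y,
    in the same unit. Performance is the inverse of execution time,
    so n = P_X / P_Y = t_Y / t_X.
    """
    if t_x <= 0 or t_y <= 0:
        raise ValueError("execution times must be positive")
    return t_y / t_x


# Hypothetical measurements: Y needs 15 s, X needs 10 s for the same task.
n = times_faster(t_x=10.0, t_y=15.0)
print(f"X is {n:.2f} times faster than Y")  # X is 1.50 times faster than Y
```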
Make the Common Case Fast

A rule of thumb in computer design is
to make the event that occurs more
frequently, faster


In making a design trade-off, favor the
frequent case over the infrequent case
In general, this move should increase
overall performance
Amdahl’s Law


The performance improvement to be
gained from using some faster mode
of operations is limited by the fraction
of time that faster mode can be used.
Speedup due to enhancement E (for an entire task):
Speedup(E) = ExTime w/o E / ExTime w/ E
           = Performance w/ E / Performance w/o E
Factors Affecting the Speedup

The fraction of computation time in the
original machine that can be converted to
take advantage of the enhancement
Fraction_enhanced ≤ 1

The improvement gained by the enhanced
execution mode, i.e. how much faster the
task would run if the enhanced mode were
used for the entire program.
Speedup_enhanced > 1
Applying Amdahl’s Law
ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new
                = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Using Amdahl’s Law: An Example
Suppose that we are considering an enhancement
that runs 10 times faster than the original machine
but is only usable 40% of the time. What is the
overall speedup gained by incorporating the
enhancement?
Answer:
Fraction_enhanced = 0.4
Speedup_enhanced = 10
Speedup_overall = 1 / [0.6 + (0.4/10)] = 1/0.64 = 1.56
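The same calculation can be sketched in a few lines of Python. The function name and argument checks are choices made for this illustration. The last two lines show a consequence of the formula: however fast the enhanced mode gets, the overall speedup here cannot exceed 1 / (1 - 0.4).

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup predicted by Amdahl's Law.

    fraction_enhanced: fraction of the original execution time that can
                       use the enhancement (between 0 and 1).
    speedup_enhanced:  how much faster the enhanced portion runs.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# The worked example: usable 40% of the time, 10 times faster.
print(f"{amdahl_speedup(0.4, 10):.2f}")  # 1.56

# Upper bound when the enhanced portion takes (almost) no time at all.
limit = 1.0 / (1.0 - 0.4)
print(f"{limit:.2f}")  # 1.67
```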
Measuring CPU Processing Speed: The Clock


A circuit which generates a signal that
defines regular time intervals or cycles
during which basic CPU steps are
performed
Provides control as to when each step
of the instruction cycle takes place
Clock Cycles





One clock pulse is the burst
of current when the clock
output is equal to 1
A clock cycle is the interval
from the beginning of one pulse
to the beginning of the next
Clock rate is measured in hertz (Hz),
the unit of frequency
1 Hz = 1 cycle/second
Basic unit of CPU speed = 1
million Hz or 1 MHz
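Since 1 Hz is one cycle per second, the length of one clock cycle is the reciprocal of the clock rate. A minimal sketch of that conversion follows. The 200 MHz figure is only an illustrative value, not one taken from these slides.

```python
def cycle_time_ns(clock_rate_hz: float) -> float:
    """Length of one clock cycle in nanoseconds (1 s = 1e9 ns)."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return 1e9 / clock_rate_hz


def cycles_elapsed(clock_rate_hz: float, seconds: float) -> float:
    """Number of clock cycles that occur in the given time span."""
    return clock_rate_hz * seconds


rate = 200e6  # hypothetical 200 MHz clock
print(f"{cycle_time_ns(rate):.1f} ns per cycle")        # 5.0 ns per cycle
print(f"{cycles_elapsed(rate, 1.0):.0f} cycles in 1 s")  # 200000000 cycles in 1 s
```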
Locality of Reference


Programs tend to reuse data and
instructions they have used recently.
A program may spend 90% of its
execution time in only 10% of the
code.

Based on a program’s recent past, one
can predict with reasonable accuracy
what instructions and data it will use in the
near future.
Two Types of Locality

Temporal Locality


recently accessed items are likely to be
accessed in the near future
Spatial Locality

items whose addresses (or location) are
near one another tend to be referenced
close together in time
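One way to see both kinds of locality is to replay address traces through a tiny simulated cache and compare hit rates. The sketch below assumes a fully associative cache with least-recently-used replacement. The cache size, block size, and traces are all invented for the illustration and do not model any particular machine.

```python
from collections import OrderedDict
import random


def hit_rate(addresses, num_blocks=8, block_size=16):
    """Fraction of accesses that hit in a small fully associative LRU cache."""
    cache = OrderedDict()  # keys are block numbers, least recently used first
    hits = 0
    for addr in addresses:
        block = addr // block_size  # neighbouring addresses share a block
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            cache[block] = None
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(addresses)


# Spatial locality: walk 4096 consecutive addresses once.
sequential = list(range(4096))
# Temporal locality: loop over the same 64 addresses many times.
looping = [i % 64 for i in range(4096)]
# Little locality: addresses scattered over a large range (fixed seed).
rng = random.Random(0)
scattered = [rng.randrange(1_000_000) for _ in range(4096)]

for name, trace in [("sequential", sequential),
                    ("looping", looping),
                    ("scattered", scattered)]:
    print(f"{name:10s} hit rate = {hit_rate(trace):.4f}")
```

In the sequential trace, each 16-address block misses only on its first access, so 15 of every 16 accesses hit. In the looping trace, the 64 addresses fit in the cache and only the first touch of each block misses. The scattered trace almost never hits.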
Metrics of Performance
System levels, top to bottom, with the metrics used at each:
Application: answers per month, operations per second
Programming Language
Compiler
ISA: (millions) of instructions per second: MIPS;
     (millions) of (FP) operations per second: MFLOP/s
Datapath, Control: megabytes per second
Function Units: cycles per second (clock rate)
Transistors, Wires, Pins
MIPS Benchmark





Millions of Instructions Per Second
Easy to understand and
straightforward
Dependent on instruction set
Varies between programs on the same
computer
MIPS can vary inversely with
performance!
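The rating itself is instruction count divided by execution time, scaled to millions. The sketch below uses invented numbers for the same program built two ways, to show how the build that finishes later can still post the higher MIPS figure. The scenario (hardware versus software floating point) and all counts and times are assumptions for illustration only.

```python
def mips(instruction_count: float, execution_time_s: float) -> float:
    """MIPS = instruction count / (execution time in seconds x 10^6)."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return instruction_count / (execution_time_s * 1e6)


# Hypothetical build A: hardware floating point.
# Fewer, more complex instructions; finishes in 0.5 s.
mips_a = mips(instruction_count=50e6, execution_time_s=0.5)

# Hypothetical build B: floating point emulated in software.
# Many simple instructions; finishes in 2.0 s.
mips_b = mips(instruction_count=300e6, execution_time_s=2.0)

print(f"A: {mips_a:.0f} MIPS, 0.5 s")  # A: 100 MIPS, 0.5 s
print(f"B: {mips_b:.0f} MIPS, 2.0 s")  # B: 150 MIPS, 2.0 s
# B has the higher MIPS rating but takes four times as long to finish.
```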
MFLOPS Benchmark




Millions of Floating-point Operations Per
Second (MegaFLOPS)
Intended to measure floating-point
operations but some programs don’t use
any
Floating-point operations are not consistent
across machines
MFLOPS ratings for the same machine may
differ depending on instruction mix
Programs as Evaluators

Four types (in decreasing order of
accuracy):




Real programs
Kernels
Toy Benchmarks
Synthetic Benchmarks
Synthetic Benchmarks



Programs which try to match the
average number and frequency of
operations of a typical workload, e.g.
dhrystone, whetstone, etc.
Not real programs, may not reflect
program behavior for factors not
measured
Compilers and hardware optimizations
can artificially inflate results
Toy Benchmarks


Small, simple programs
Produce a result the user already
knows

Example: quicksort, Sieve of
Eratosthenes, etc.
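A toy benchmark of this kind might look like the following sketch: a Sieve of Eratosthenes timed with Python's time.perf_counter and checked against a result known in advance (there are 9592 primes up to 100,000). The limit, the repeat count, and the function names are arbitrary choices for the illustration.

```python
import time

LIMIT = 100_000
KNOWN_PRIME_COUNT = 9592  # number of primes <= 100,000


def sieve(limit):
    """Return all primes <= limit using the Sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # Cross out every multiple of p, starting at p squared.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
        p += 1
    return [n for n, flag in enumerate(is_prime) if flag]


def run_benchmark(limit=LIMIT, repeats=5):
    """Run the sieve several times; return (prime count, best time in s)."""
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        count = len(sieve(limit))
        best = min(best, time.perf_counter() - start)
    return count, best


count, best = run_benchmark()
assert count == KNOWN_PRIME_COUNT  # the result the user already knows
print(f"{count} primes found, best of 5 runs: {best * 1000:.2f} ms")
```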
Kernel Benchmarks

Small, key pieces from real programs put
together to evaluate machine performance



Examples: Linpack, Livermore Loops, etc.
No user would run kernel programs because
they exist solely for performance evaluation
Best used to isolate performance of
individual features of machines to explain
the reasons for differences in real programs
Real Programs


Common programs like compilers (e.g.
a C compiler), word processors (e.g. TeX, MS
Word), computer-aided design tools
(e.g. Spice), etc.
Real programs have the input, output,
and options that a user can select.
When Benchmarks Disagree
[Figure: bar chart, "What is MMX’s real speed?" Vertical axis: Improvement (0.00 to 1.80).
Benchmarks shown: CPUmark32, SPECfp95, SPECint95, Norton SI32, Norton Multimedia,
Intel Media, ZDBOp. Source: adapted from Byte, April 1998]
Popular Benchmarks








Bapco SYSmark - application, tests system
BYTEmark - synthetic, tests processor
Intel Media - synthetic, tests processor
(multimedia, uses MMX instructions)
CaffeineMark - synthetic, tests JVM
SPEC CPU95 - synthetic, tests processor
(two suites: integer and floating-point)
SPEC Glperf - synthetic, tests 3-D graphics
SPEC Viewperf - application, 3-D graphics
Norton Multimedia - synthetic, tests system
(multimedia, uses MMX instructions)
Popular Benchmarks (continued)



TPC-C (Transaction Processing Performance Council) - database
application, tests transaction-processing
performance
TPC-D - database application, tests decision
support and data-warehousing performance
ZDBOp (Ziff-Davis Benchmark Operation):







BrowserComp - application, tests browsers
CPUmark32 - synthetic, tests processor
NetBench - application, tests network performance
ServerBench - application, tests server performance
WebBench - application, tests web server
WinBench - application, tests component subsystems
Winstone - application, tests system
Programs as Evaluators


Companies may design features that would
make their machines run faster on the
benchmarks than on real programs
A standard set of programs is hard to obtain
because each program runs differently on
each machine, and companies want to
use programs that run fast on their own
machines