Week 2, Lecture 1
Chapter 2: Performance Issues
Dr. Qurban Ali, EE Department
Diminishing Returns
o Internal organization of processors is complex
n Can get a great deal of parallelism
n Further significant increases likely to be
relatively modest
o Benefits from cache are reaching limit
o Increasing clock rate runs into power dissipation
problem
n Some fundamental physical limits are being
reached
Microprocessor Speedup Techniques
Techniques built into contemporary processors include:
o Pipelining
n The processor moves data or instructions into a conceptual pipe,
with all stages of the pipe processing simultaneously
o Branch prediction
n The processor looks ahead in the instruction code fetched from
memory and predicts which branches, or groups of instructions, are
likely to be processed next
o Data flow analysis
n The processor analyzes which instructions are dependent on each
other's results, or data, to create an optimized schedule of
instructions
o Speculative execution
n Using branch prediction and data flow analysis, some processors
speculatively execute instructions ahead of their actual appearance
in the program execution, holding the results in temporary
locations, keeping execution engines as busy as possible
Performance Improvement Example:
Increased Cache Capacity
o Typically two or three levels of cache between
processor and main memory
o Chip density increased
n More cache memory on chip
o Faster data access
o Pentium chip devoted about 10% of chip area to
cache
o Pentium 4 devotes about 50%
Intel Microprocessor Performance
Summary - Improvements in Chip Organization and Architecture
o Increase hardware speed of processor
n Fundamentally due to shrinking logic gate size
o More gates, packed more tightly, increasing
clock rate
o Propagation time for signals reduced
o Increase size and speed of caches
n Dedicating part of processor chip
o Cache access times drop significantly
o Change processor organization and architecture
n Increase effective speed of execution
n Core technology, Parallelism
Performance Assessment - Clock Speed?
o Key parameters
n Performance, cost, size, security, reliability, power
consumption
o System clock speed in Hz or multiples of
n Clock rate, clock cycle, clock tick, cycle time
n Signals in CPU take time to settle down to 1 or 0
o Operations need to be synchronised
n Instruction execution in discrete steps - usually
require multiple clock cycles per instruction
n So, clock speed is not the whole story
Performance Assessment - Instruction Execution Rate
o Millions of instructions per second (MIPS)
o It is expressed as:

    MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6)

where Ic is the instruction count, T is the execution time, f is the
clock frequency, and CPI is the average number of cycles per
instruction, defined as:

    CPI = ( Σ_{i=1}^{n} CPI_i x I_i ) / Ic

where CPI_i is the number of cycles per instruction of type i and I_i
is the number of executed instructions of type i.
o Millions of floating point operations per second (MFLOPS)
n Heavily dependent on instruction set, compiler design,
processor implementation, cache & memory hierarchy

    MFLOPS rate = (number of executed floating point operations
                   in a program) / (Execution time x 10^6)
Dr. Qurban Ali , EE Department
CPI Example
o A compiler designer is trying to decide between two code
sequences for a particular machine. Based on the
hardware implementation, there are three different
classes of instructions: Class A, Class B, and Class C,
and they require one, two, and three cycles
(respectively).
The first code sequence has 5 instructions: 2 of A, 1 of
B, and 2 of C
The second sequence has 6 instructions: 4 of A, 1 of B,
and 1 of C.
What is the CPI for each sequence?
Which sequence will be faster, and by how much?
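A small Python sketch of the CPI definition applied to this example (Class A, B, and C take 1, 2, and 3 cycles, as stated above):

```python
# Worked solution to the CPI example: CPI = (sum of CPI_i x I_i) / Ic.
cycles_per_class = {"A": 1, "B": 2, "C": 3}

def sequence_stats(counts):
    # Total cycles and average CPI for a mix of instruction classes.
    total_cycles = sum(cycles_per_class[c] * n for c, n in counts.items())
    total_insts = sum(counts.values())
    return total_cycles, total_cycles / total_insts

cycles1, cpi1 = sequence_stats({"A": 2, "B": 1, "C": 2})  # 10 cycles, CPI 2.0
cycles2, cpi2 = sequence_stats({"A": 4, "B": 1, "C": 1})  # 9 cycles, CPI 1.5
speedup = cycles1 / cycles2  # sequence 2 is faster by about 1.11x
print(cpi1, cpi2, speedup)
```

Sequence 2 executes more instructions but fewer total cycles, so it finishes first.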
MIPS Example
o Two different compilers are being tested on a 100 MHz
machine with three different classes of instructions: Class
A, Class B, and Class C, which require one, two, and three
cycles (respectively). Both compilers are used to produce
code for a large piece of software.
The first compiler's code uses 5 million Class A instructions,
1 million Class B instructions, and 1 million Class C
instructions.
The second compiler's code uses 10 million Class A
instructions, 1 million Class B instructions, and 1 million
Class C instructions.
Which compiler's code will be faster according to CPI and MIPS?
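A Python sketch of the solution, reusing the same cycle counts per class; it also shows why MIPS alone can mislead:

```python
# Worked solution to the MIPS example on a 100 MHz machine.
f = 100e6
cycles_per_class = {"A": 1, "B": 2, "C": 3}

def compiler_stats(counts):
    Ic = sum(counts.values())  # instruction count
    cycles = sum(cycles_per_class[c] * n for c, n in counts.items())
    cpi = cycles / Ic
    T = cycles / f             # execution time in seconds
    mips = Ic / (T * 1e6)      # MIPS rate
    return cpi, T, mips

cpi1, T1, mips1 = compiler_stats({"A": 5e6, "B": 1e6, "C": 1e6})
cpi2, T2, mips2 = compiler_stats({"A": 10e6, "B": 1e6, "C": 1e6})
print(cpi1, T1, mips1)  # compiler 1
print(cpi2, T2, mips2)  # compiler 2
```

Compiler 2 has the lower CPI and the higher MIPS rating, yet compiler 1's code runs in less time, because MIPS rewards the larger instruction count.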
Another Example
o Our favorite program runs in 10 seconds on computer A,
which has a 400 MHz clock. We are trying to help a
computer designer build a new machine B, that will run this
program in 6 seconds. The designer can use new (or
perhaps more expensive) technology to substantially
increase the clock rate, but has informed us that this
increase will affect the rest of the CPU design, causing
machine B to require 1.2 times as many clock cycles as
machine A for the same program. What clock rate should
we tell the designer to target?
o “Think” Questions
o Does doubling the clock rate double the performance?
o Can a machine with a slower clock have better performance?
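The worked example above reduces to counting clock cycles, as this Python sketch shows:

```python
# Worked solution: find the clock rate machine B needs.
T_A = 10.0       # seconds on machine A
f_A = 400e6      # machine A clock: 400 MHz
T_B = 6.0        # target time on machine B

cycles_A = T_A * f_A       # 4 x 10^9 clock cycles for the program
cycles_B = 1.2 * cycles_A  # machine B needs 1.2x as many cycles
f_B = cycles_B / T_B       # required clock rate
print(f_B / 1e6)           # in MHz
```

The answer also speaks to the "think" questions: doubling the clock rate here (to 800 MHz) does not halve the runtime, because the cycle count grew as well.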
Performance Factors and System Attributes

Table 2.9
                                Ic    p    m    k    τ
Instruction set architecture     X    X
Compiler technology              X    X    X
Processor implementation               X              X
Cache and memory hierarchy                       X    X

Table 2.9 is a matrix in which one dimension shows the five performance
factors and the other dimension shows the four system attributes. An X in a
cell indicates a system attribute that affects a performance factor.
p: number of cycles to decode and execute an instruction; m: number of
memory references per instruction; Ic: instruction count; k: ratio between
memory cycle time and processor cycle time; τ: processor cycle time (1/f)
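The five factors in Table 2.9 combine into the standard execution-time equation T = Ic x [p + (m x k)] x τ. A Python sketch with made-up values (every number below is hypothetical, chosen only to exercise the formula):

```python
# T = Ic x [p + (m x k)] x tau, with illustrative (made-up) values.
Ic = 1_000_000   # instruction count
p = 2            # processor cycles to decode and execute one instruction
m = 0.5          # memory references per instruction, on average
k = 4            # memory cycle time / processor cycle time
tau = 1 / 500e6  # processor cycle time: 2 ns at 500 MHz

T = Ic * (p + m * k) * tau   # total execution time in seconds
print(T)
```

The grouping makes the table's point concrete: the compiler influences Ic, p, and m, while the cache and memory hierarchy influence k and τ, and all five together determine T.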
Desirable Benchmark Characteristics
o Written in a high-level language, making it portable across
different machines
o Representative of a particular kind of programming style, such as
system programming, numerical programming, or commercial programming
o Can be measured easily
o Has wide distribution
System Performance Evaluation Corporation
(SPEC)
o Benchmark suite
n A collection of programs, defined in a high-level
language
n Attempts to provide a representative test of a
computer in a particular application or system
programming area
o SPEC
n An industry consortium
n Defines and maintains the best known collection
of benchmark suites
n Performance measurements are widely used for
comparison and research purposes
SPEC CPU2006
o Best known SPEC benchmark suite
o Industry standard suite for processor
intensive applications
o Appropriate for measuring performance
for applications that spend most of their
time doing computation rather than I/O
o Consists of 17 floating point programs
written in C, C++, and Fortran and 12
integer programs written in C and C++
o Suite contains over 3 million lines of
code
n Fifth generation of processor intensive suite
from SPEC
n Speed and Rate Metric
o Single task and throughput
SPEC Speed Metric
o Single task
o Base runtime defined for each benchmark using
reference machine
o Results are reported as the ratio of reference time to system
run time: r_i = Tref_i / Tsut_i
n Tref_i: execution time for benchmark i on the reference machine
n Tsut_i: execution time of benchmark i on the test system
o Overall performance calculated by averaging ratios
for all 12 integer benchmarks
n Use geometric mean
n Appropriate for normalized numbers such as ratios
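A minimal Python sketch of the speed metric's geometric mean over the runtime ratios r_i = Tref_i / Tsut_i; the three ratios below are made up for illustration:

```python
import math

# SPEC speed metric sketch: geometric mean of per-benchmark ratios.
def geometric_mean(values):
    # nth root of the product of n values
    return math.prod(values) ** (1 / len(values))

ratios = [2.0, 8.0, 4.0]          # hypothetical ratios for three benchmarks
metric = geometric_mean(ratios)   # approximately 4.0 for these numbers
print(metric)
```

The geometric mean is used rather than the arithmetic mean because it treats normalized ratios consistently: the result does not depend on which machine is chosen as the reference.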
SPEC Rate Metric
o Measures throughput or rate of a machine carrying out
a number of tasks
o Multiple copies of benchmarks run simultaneously
n Typically, same as number of processors
o The ratio is calculated as: rate_i = (N x Tref_i) / Tsut_i
n Tref_i: reference execution time for benchmark i
n N: number of copies run simultaneously
n Tsut_i: elapsed time from the start of execution of the program on
all N processors until completion of all copies of the program
n Again, a geometric mean is calculated
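The rate metric can be sketched the same way; the reference and elapsed times below are hypothetical values for two benchmarks run as four simultaneous copies:

```python
import math

# SPEC rate metric sketch with made-up times, N = 4 copies.
def rate_ratio(Tref, N, Tsut):
    # rate_i = (N x Tref_i) / Tsut_i
    return N * Tref / Tsut

r1 = rate_ratio(Tref=1000.0, N=4, Tsut=500.0)    # benchmark 1
r2 = rate_ratio(Tref=2000.0, N=4, Tsut=4000.0)   # benchmark 2
overall = math.prod([r1, r2]) ** (1 / 2)         # geometric mean
print(r1, r2, overall)
```

Multiplying by N credits the system for throughput: finishing N copies in the reference time scores N times higher than the single-copy speed ratio would.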
Amdahl’s Law
o Gene Amdahl
o Potential speed up of program using multiple
processors
o Concluded that:
n Code needs to be parallelizable
n Speed up is bound, giving diminishing returns for
more processors
o Task dependent
n Servers gain by maintaining multiple connections
on multiple processors
n Databases can be split into parallel tasks
Amdahl’s Law Formula
o For program running on single processor
n Fraction f of code infinitely parallelizable with no
scheduling overhead -- (1-f) of code inherently serial
n T is total execution time for program on single
processor
n N is number of processors exploiting parallel code
Speedup = (Execution time without enhancement) /
          (Execution time with enhancement)
        = T / ((1-f)T + fT/N) = 1 / ((1-f) + f/N)
o Conclusions
n f small, parallel processors have little effect
n N ->∞, speedup bound by 1/(1 – f)
o Diminishing returns for using more processors
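Both conclusions fall out of the formula directly, as this Python sketch shows (f = 0.9 is an arbitrary illustrative fraction):

```python
# Amdahl's law: Speedup = 1 / ((1 - f) + f / N)
def amdahl_speedup(f, N):
    return 1.0 / ((1.0 - f) + f / N)

s8 = amdahl_speedup(0.9, 8)     # 8 processors: well short of 8x
s_big = amdahl_speedup(0.9, 10**9)  # huge N: approaches 1/(1-f) = 10
limit = 1.0 / (1.0 - 0.9)       # the bound as N -> infinity
print(s8, s_big, limit)
```

Even with 90% parallelizable code, eight processors give under 5x, and no number of processors can exceed 10x: the serial fraction dominates.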
Amdahl’s Law - An Example
o Suppose a program runs in 100 seconds on a machine,
with multiply responsible for 80 seconds of this time. How
much do we have to improve the speed of multiplication
if we want the program to run 4 times faster?
o How about making it five times faster?
Execution time after improvement =
    Execution time unaffected +
    (Execution time affected / Amount of improvement)
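Applying that relation to the numbers above gives the answer, sketched in Python:

```python
# Worked solution to the multiply example.
total = 100.0     # original runtime in seconds
affected = 80.0   # time spent in multiply
unaffected = total - affected   # 20 s that improvement cannot touch

# 4x faster: new total must be 25 s, so 80/n + 20 = 25  ->  n = 16
n = affected / (total / 4 - unaffected)
print(n)

# 5x faster: new total would be 20 s, but the unaffected 20 s alone
# already fills that budget, so no finite multiply speedup suffices.
budget_for_multiply = total / 5 - unaffected
print(budget_for_multiply)
```

Multiply must be made 16 times faster for a 4x overall gain, and a 5x gain is impossible: the 20 s of unimproved code is itself a fifth of the original runtime.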
Little’s Law
o A fundamental and simple relation with broad
applications is Little’s law
n can be applied to almost any system that is
statistically in steady state, and in which there is no
leakage
o It applies to queuing systems
n If the server is idle, an arriving item is served immediately;
otherwise the arriving item joins a queue
n There can be a single queue for a single server or for multiple
servers, or multiple queues, one for each of multiple servers
o Average number of items in a queuing system equals
the average rate at which items arrive multiplied by
the time that an item spends in the system
n Relationship requires very few assumptions
n Because of its simplicity and generality it is extremely
useful
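The relationship stated above is usually written L = λW. A minimal Python sketch with hypothetical values:

```python
# Little's law: L = lambda x W (all numbers below are made up).
arrival_rate = 10.0    # lambda: average arrivals per second
time_in_system = 0.5   # W: average time an item spends in the system, s

L = arrival_rate * time_in_system   # average number of items in the system
print(L)
```

For example, a server receiving 10 requests per second, each taking half a second to complete, holds 5 requests in flight on average, regardless of the arrival pattern or service discipline.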