Chapter 6 - Measurement Tools and Techniques

Outline
- Fundamental strategies
- Interval timers
- Program profiling
- Tracing
- Indirect measurement
Events

- Most measurement tools are based on events
  - Some predefined change to system state
  - Definition depends on the metric being measured
- Examples:
  - Memory reference
  - Disk access
  - Change in a register's state
  - Network message
  - Processor interrupt
Event Classification

- Count metrics
  - The number of times event X occurs
  - Number of cache misses
  - Number of I/O operations
Event Classification

- Secondary-event metrics
  - Record a value when triggered by some event
  - Example: record the block size for each I/O operation
  - Count the number of operations
  - Find the average I/O transfer size (see the sketch below)
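As a concrete sketch of a secondary-event metric (hypothetical hook and names, not from the original slides), the handler for each I/O-completion event records the transfer size alongside a simple count, and a post-processing step computes the average transfer size:

    /* Hypothetical secondary-event metric: on each I/O completion,
     * record the transfer size and bump the operation count. */
    #include <stdio.h>

    static unsigned long long total_bytes = 0;   /* sum of recorded block sizes */
    static unsigned long long num_ops     = 0;   /* count metric: number of I/O operations */

    /* Called from the (assumed) I/O-completion event hook. */
    void on_io_complete(unsigned long block_size)
    {
        total_bytes += block_size;   /* secondary event: value recorded per event */
        num_ops++;                   /* primary event: simple count */
    }

    /* Post-processing: average I/O transfer size. */
    double average_transfer_size(void)
    {
        return num_ops ? (double)total_bytes / (double)num_ops : 0.0;
    }

    int main(void)
    {
        on_io_complete(4096);
        on_io_complete(8192);
        printf("avg transfer size = %.1f bytes over %llu ops\n",
               average_transfer_size(), num_ops);
        return 0;
    }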
Event Classification

- Profiles
  - Characterization of overall behavior
  - Aggregate/big-picture view of an application program
  - Example: time spent in each function
Event-Driven Strategies

- Record the necessary information only when the selected event occurs
- Modify the system to record the event
- Dump the data when the program terminates
  - May also need intermediate dumps
  - E.g., a simple counter in the page fault routine (sketched below)
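A minimal sketch of the page-fault-counter idea (the handler name is hypothetical; a real OS hook would differ): increment a counter inside the fault routine and dump it at termination.

    /* Hypothetical event-driven instrumentation: count page faults.
     * page_fault_handler() stands in for the real OS fault routine. */
    #include <stdio.h>

    static unsigned long page_fault_count = 0;

    void page_fault_handler(void *faulting_addr)
    {
        page_fault_count++;          /* record the event: one increment, low overhead */
        /* ... normal page-fault processing would go here ... */
        (void)faulting_addr;
    }

    void dump_counters(void)         /* called when the program/system terminates */
    {
        printf("page faults: %lu\n", page_fault_count);
    }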
Event-Driven Strategies

- System overhead
  - Incurred only when the event of interest actually occurs
  - Infrequent events → little perturbation
  - Frequent events → high perturbation
    - No longer "typical" behavior?
- Perturbation changes the system being measured
Event-Driven Strategies

- Inter-event time is unpredictable
  - Depends on when events actually occur
  - Makes it hard to estimate perturbation
  - How long to measure?
- Event-driven measurement tools
  → Good for low-frequency events
Event-Driven Strategies

[Figure: each occurrence of the event increments a counter (+1); counts 8 events exactly]
Tracing

- Similar to event-driven
- But record additional system state
  - Event has occurred – count it
  - Plus additional information to uniquely identify the event
  - E.g., the addresses that cause page faults
- Overhead
  - Additional memory or disk storage
  - Time to save the state
  - Relatively large system perturbation
Tracing

[Figure: each event increments a counter (+1) and records an address; counts 8 events plus extra data]
Sampling

- Record the necessary state at fixed time intervals
- Overhead
  - Independent of specific event frequency
  - Depends on the sampling frequency
- Misses some events
- Produces a statistical summary
  - May miss infrequent events
  - Each replication will produce different results
Sampling

[Figure: state sampled at fixed intervals; counts 3 events out of 5 samples]
Comparisons

               Event count   Tracing         Sampling
Resolution     Exact count   Detailed info   Statistical summary
Overhead       Low           High            Constant
Perturbation   ~ #events     High            Fixed
Comparison

- Event counting
  - Best for low-frequency events
  - Required if exact counts are needed
- Sampling
  - Best for high-frequency events
  - If a statistical summary is adequate
- Tracing
  - When additional detail is required
Indirect Measurements

- Used when the desired metric is not directly accessible
- Measure one thing directly
  - Derive or deduce the desired metric
- Highly dependent on the creativity of the performance analyst
Interval Timers

- Fundamental tool of performance measurement
- Measure the execution time of any portion of a program
- Provide the time basis for sampling
Interval Timers

- Actually count clock pulses between two events
- Read the counter at Event 1 (x1) and again at Event 2 (x2); with clock period Tc, the elapsed time is Te = (x2 – x1)Tc
Using an Interval Timer

- Within an application program:

    start_count = read_timer();
    /* portion of program to be measured */
    stop_count = read_timer();
    elapsed_time = (stop_count - start_count) * clock_period;
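As a concrete sketch of the same pattern (using POSIX clock_gettime in place of the slides' generic read_timer(); the measured function is a placeholder):

    /* Interval-timer usage sketch with POSIX clock_gettime (assumes a POSIX system). */
    #include <stdio.h>
    #include <time.h>

    static void work_to_measure(void)          /* placeholder for the code under test */
    {
        volatile double x = 0.0;
        for (long i = 0; i < 1000000; i++)
            x += i * 0.5;
    }

    int main(void)
    {
        struct timespec start, stop;

        clock_gettime(CLOCK_MONOTONIC, &start);   /* "read_timer()" at the start event */
        work_to_measure();
        clock_gettime(CLOCK_MONOTONIC, &stop);    /* "read_timer()" at the stop event */

        double elapsed = (stop.tv_sec - start.tv_sec)
                       + (stop.tv_nsec - start.tv_nsec) * 1e-9;
        printf("elapsed time = %.6f s\n", elapsed);
        return 0;
    }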
Hardware Timer

- A free-running clock with period Tc drives an n-bit counter
- The counter value is read through a CPU input port
- Te = (x2 – x1)Tc
Software Timer

- The clock (period Tc) feeds a prescaler (divide-by-n), producing pulses with period T'c
- Each prescaled pulse drives a CPU interrupt input; the interrupt handler increments a software counter
- Te = (x2 – x1)T'c
Quantization Error

- Timer resolution → quantization error
- Repeated measurements
  - nTc < Te < (n+1)Tc
  - Te is rounded to within ± one clock tick
  - Completely unpredictable rounding
  - → Want Tc to be as small as possible
Timer Rollover

- n-bit counter
  - count range = [0, 2^n – 1]
  - Rollover = transition from (2^n – 1) → 0
- If rollover occurs between the start/stop events
  - Then count = (x2 – x1) < 0
- Check for count < 0
  - Measure again, or
  - Add 2^n to the count (sketched below)
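A minimal sketch of the rollover check (the counter width below is an assumption for illustration; it corrects for at most one wrap between readings):

    #include <stdint.h>
    #include <stdio.h>

    #define COUNTER_BITS 32   /* assumed width of the hardware counter */

    /* Return x2 - x1, correcting for a single rollover of an n-bit counter. */
    uint64_t elapsed_ticks(uint64_t x1, uint64_t x2)
    {
        int64_t diff = (int64_t)x2 - (int64_t)x1;   /* negative if the counter wrapped */
        if (diff < 0)
            diff += (int64_t)1 << COUNTER_BITS;     /* add 2^n (valid for one wrap only) */
        return (uint64_t)diff;
    }

    int main(void)
    {
        /* start near the top of the range, stop just after the wrap back to 0 */
        printf("%llu ticks\n",
               (unsigned long long)elapsed_ticks(0xFFFFFFF0ULL, 0x10ULL));   /* prints 32 */
        return 0;
    }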
Timer Rollover

Time to rollover for an n-bit counter:

Resolution (Tc)   n = 16     n = 32    n = 64
10 ns             655 us     43 s      58.5 centuries
1 us              65.5 ms    1.2 h     5,850 centuries
100 us            6.55 s     5 days    585,000 centuries
1 ms              1.1 min    50 days   5,850,000 centuries
Timer Overhead

    start_count = read_timer();
    /* portion of program to be measured */
    stop_count = read_timer();
    elapsed_time = (stop_count - start_count) * clock_period;

- Cost to access the timer
  - At minimum 1 memory read, possibly a full subroutine call
  - At minimum 1 memory write, possibly a full subroutine call
  - Incurred once at the start and again at the stop
Timer Overhead

[Timeline figure: initiate read_timer(); current time actually read; event being measured begins and runs; event ends; initiate read_timer(); current time actually read. The segments of this timeline are labeled T1 through T4, defined on the next slide.]
Timer Overhead

- T1 = time to read the counter value
- T2 = time to store the counter value
- T3 = time of the event we are measuring
- T4 = time to read the counter value
  - T4 = T1
Timer Overhead

- Te = event time = T3
- But what is actually measured is
  - Tm = T2 + T3 + T4
  - Te = Tm – (T2 + T4) = Tm – (T1 + T2)
- Timer overhead: Tovhd = T1 + T2
Timer Overhead

- If Te >> Tovhd
  - Ignore the timer overhead
- If Te ≈ Tovhd
  - Measurements will be highly suspect
  - Potentially large variations in Tovhd
- Good rule of thumb
  - Te should be 100–1000 times larger than Tovhd (see the sketch below)
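One practical way to estimate Tovhd (a sketch, assuming clock_gettime as the timer; back-to-back reads with nothing in between approximate T1 + T2):

    /* Estimate timer overhead by timing back-to-back timer reads:
     * the average gap between consecutive readings approximates Tovhd. */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        enum { N = 100000 };
        struct timespec a, b;
        double total = 0.0;

        for (int i = 0; i < N; i++) {
            clock_gettime(CLOCK_MONOTONIC, &a);
            clock_gettime(CLOCK_MONOTONIC, &b);   /* nothing measured between the reads */
            total += (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
        }
        printf("estimated timer overhead: %.1f ns\n", 1e9 * total / N);
        return 0;
    }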
Approximate Measures of Short Intervals

- How do we measure an event that is shorter than the resolution of the clock?
- Cannot directly measure events with Te < Tc
- Overhead makes it hard to measure even when Te > nTc, for small integer n
Approximate Measures of Short Intervals

[Figure: an event of duration Te relative to clock ticks of period Tc. Case 1: the event spans a clock edge → count + 1. Case 2: the event falls between edges → count + 0]
Approximate Measures of Short Intervals

- Bernoulli experiment
  - Outcome = +1 with probability p
  - Outcome = +0 with probability (1 – p)
  - Equivalent to flipping a biased coin
- Repeat n times
  - Approximates a binomial distribution
  - Only approximate, since each measurement cannot be guaranteed to be independent
    - Usually close enough in practice
Approximate Measures of Short Intervals

- m = number of times Case 1 (count + 1) occurs
- n = total number of measurements
- Average duration is given by the ratio m/n:
  Te ≈ Tc × (m/n)
- Use a confidence interval for proportions
Example

- Clock resolution Tc = 10 us
- n = 8764 measurements
- m = 467 clock ticks counted (Case 1); Case 2 occurred 8297 times
- Find the 95% confidence interval for Te
Example

(c1, c2) = 467/8764 ± 1.96 × sqrt[ (467/8764)(1 – 467/8764) / 8764 ]
         = (0.0486, 0.0580)

- Scale by the clock period (10 us)
- 95% chance that the measured event lies in (0.49, 0.58) us
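A sketch of this calculation in code (the 1.96 quantile is hard-coded for a 95% interval; the numbers match the example above):

    /* Short-interval estimation: m "count+1" outcomes out of n trials,
     * estimated duration Te ~= Tc * (m/n), with a normal-approximation
     * confidence interval for the proportion. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double Tc = 10e-6;        /* clock resolution: 10 us */
        const double m  = 467.0;        /* Case 1 (count+1) outcomes */
        const double n  = 8764.0;       /* total measurements */
        const double z  = 1.96;         /* 95% confidence level */

        double p  = m / n;
        double hw = z * sqrt(p * (1.0 - p) / n);     /* half-width of the interval */

        printf("Te  ~= %.3f us\n", 1e6 * Tc * p);
        printf("95%% CI: (%.2f, %.2f) us\n", 1e6 * Tc * (p - hw), 1e6 * Tc * (p + hw));
        return 0;
    }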
Profiling

- Overall view of a program's execution-time behavior
- Fraction of total time spent in specific states
  - Fraction of time in each subroutine
  - Fraction of time in the OS kernel
  - Fraction of time doing I/O
- Find bottlenecks and code hot spots
  - Optimize those sections first
Statistical Sampling

- Select a random subset of a population
- Gather information on only this subset
- Extrapolate this information to the overall population
- Results are a statistical summary with corresponding error probabilities
PC Sampling

- Periodically interrupt the program at fixed intervals
- Record the appropriate state information in the interrupt service routine
- Post-process to obtain the overall profile
PC Sampling

- At each interrupt
  - Examine the PC on the return-address stack
  - Use an address map to translate this PC to subroutine i
  - Increment histogram counter H[i] (see the sketch below)
- Example: PC = 4582

  Address map:
  0–1298:    Subr 1
  1299–3455: Subr 2
  3456–5567: Subr 3
  5568–9943: Subr 4

  → H[3] = H[3] + 1
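A minimal sketch of that bookkeeping (the address ranges are the hypothetical ones from the slide; a real profiler would hook the timer interrupt and read the actual return-address stack):

    /* PC-sampling sketch: map a sampled PC to a subroutine index using an
     * address map and increment the histogram counter. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SUBR 4

    static const uintptr_t subr_start[NUM_SUBR] = {    0, 1299, 3456, 5568 };
    static const uintptr_t subr_end[NUM_SUBR]   = { 1298, 3455, 5567, 9943 };
    static unsigned long H[NUM_SUBR + 1];              /* H[1..NUM_SUBR], 1-based as on the slide */

    /* Called from the (assumed) timer-interrupt service routine with the sampled PC. */
    void sample_pc(uintptr_t pc)
    {
        for (int i = 0; i < NUM_SUBR; i++) {
            if (pc >= subr_start[i] && pc <= subr_end[i]) {
                H[i + 1]++;                             /* one more sample in subroutine i+1 */
                return;
            }
        }
    }

    int main(void)
    {
        sample_pc(4582);                                /* the slide's example: falls in Subr 3 */
        printf("H[3] = %lu\n", H[3]);
        return 0;
    }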
PC Sampling

[Figure: histogram of sample counts H[1] through H[12]]
PC Sampling

- n = total number of interrupts
- Post-processing step
  - H[i]/n = fraction of time executing in subroutine i
  - (H[i]/n) × (total measurement time) = H[i] × (interrupt period) = time in subroutine i
PC Sampling

- This is a statistical process
  - Different counts each time the experiment is performed
  - Infer the behavior of the entire program from a small sample
  - Apply confidence intervals to quantify the precision of the results
Example

- 40 us interrupt period
- 36,128 interrupts in subroutine A
- Program runs for 10 seconds
- Time in this subroutine? (90% confidence interval)
  - m = 36,128
  - n = 10 s / 40 us = 250,000
  - p = m/n = 0.144
Example

(c1, c2) = 0.144512 ± 1.645 × sqrt[ 0.144512 × (1 – 0.144512) / 250000 ]
         = (0.144, 0.146)

- 90% chance that the program spent 14.4–14.6% of its time in subroutine A
Example

- 10 ms interrupt period
- 12 interrupts in subroutine A
- n = 800 samples (8 seconds total execution time)
- Time in this subroutine? (99% confidence interval)
  - p = m/n = 12/800 = 0.015
Example

(c1, c2) = 0.015 ± 2.576 × sqrt[ 0.015 × (1 – 0.015) / 800 ]
         = (0.0039, 0.0261)

- 99% chance that the program spent 31–210 ms in subroutine A
- A pretty wide range!
- But subroutine A is less than 3% of the total execution time
  - Start optimizing somewhere else first
Reducing the Interval Size

- Use a lower confidence level
- Obtain more samples
  - Run the program longer
    - May not be possible
  - Increase the sample rate
    - May be fixed by the system
    - Will increase overhead and perturbation
  - Run multiple times and add the samples from each run
PC Sampling

- Interrupts must occur asynchronously with respect to any program events
  - Samples must be independent of each other
  - Otherwise events synchronous with the interrupt are over- or under-sampled
- Periodic versus random sampling
Basic Block Counting

- Basic block
  - A sequence of instructions with no branches into or out of the block
  - When the first instruction is executed, it is guaranteed that all instructions in the block will be executed
  - Single entry, single exit
Basic Block Counting

- Generate a program profile by inserting additional instructions in each block
  - Increment a unique counter each time a block is entered (see the sketch below)
  - Produces a histogram of program execution
  - Can post-process to find instruction execution frequencies
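A small sketch of what the inserted instrumentation amounts to (the increments are written by hand here for illustration; a real tool inserts them automatically at compile time or by rewriting the binary):

    /* Basic-block counting sketch: each block gets a unique counter that the
     * instrumentation increments on entry. */
    #include <stdio.h>

    static unsigned long block_count[4];          /* one counter per basic block */

    static int classify(int x)
    {
        block_count[0]++;                         /* block 0: function entry */
        if (x > 5) {
            block_count[1]++;                     /* block 1: taken branch */
            x = x * 2;
        } else {
            block_count[2]++;                     /* block 2: not-taken branch */
            x = x + 1;
        }
        block_count[3]++;                         /* block 3: join point / return */
        return x;
    }

    int main(void)
    {
        for (int i = 0; i < 10; i++)
            classify(i);
        for (int b = 0; b < 4; b++)
            printf("block %d executed %lu times\n", b, block_count[b]);
        return 0;
    }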
Comparison

                PC sampling                   Basic block counting
Output          Statistical estimate          Exact count
Overhead        Interrupt service routine     Extra instructions per block
Perturbation    Randomly distributed          High
Repeatability   Within statistical variance   Perfect
Event Tracing

- A profile shows overall frequency-of-execution behavior
  - Ignores the time-ordering of events
- Program trace
  - Dynamic list of events generated by the program
  - Events = anything you want to instrument
    - Sequence of memory addresses
    - I/O blocks accessed
  - Typically used to drive a simulator
Trace Generation

[Figure: the application program is modified to generate a trace; the trace is compressed and stored, then uncompressed and fed to the trace consumer]
Trace Generation

[Figure: same flow, with an alternative path for online trace consumption – the trace consumer reads events directly as they are generated, bypassing the compress/store/uncompress steps]
Trace Generation

- Source-code modification
  - Allows precise control of what events are traced and what data is recorded
  - Typically a manual process

[Tool-chain figure: source code → compiler → object code → processor → trace]
Trace Generation

- Software exceptions
  - Hardware forces an exception before each instruction
  - The exception routine decodes the instruction
    - Stores instruction type, PC, operand addresses, etc.
  - "Trace" bit in many processors
  - Tremendous slowdown
Trace Generation

- Emulation
  - Make a system appear to be something else
  - Modify the emulator to generate the trace
  - E.g., a Java Virtual Machine
Trace Generation

- Microcode modification
  - Modify instruction execution directly
  - Allows tracing of all instructions
    - Including the operating system
  - Depends on access to the lower levels of the processor
  - E.g., the Transmeta Crusoe processor
Trace Generation

- Compiler modification
  - Insert trace code directly in the object file
  - Requires access to the compiler itself
Trace Generation

- Compiler modification
  - Insert trace code directly in the object file
  - Requires access to the compiler itself
  - Alternative: write a post-compilation binary editor/rewriting tool
Trace Data

- Tracing generates a tremendous volume of data
  - Trace 100,000,000 instructions/sec
  - 16 bits of data per event
  - → 190 Mbytes of data per second
  - → 11 Gbytes per minute
- Huge perturbations
  - Due to the tracing code itself
  - Time to store the trace data
Trace Data Compression

- Apply standard compression algorithms as the trace is written to disk
- Uncompress when reading it back
- Typical reduction: 20–70%
- Tradeoff is the compress/uncompress time
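One common way to do this in practice (a sketch assuming a POSIX system with gzip installed: the trace writer pipes records through gzip; the reader would do the reverse with gzip -dc):

    /* Trace-compression sketch: write trace records through a gzip pipe so the
     * trace is compressed as it is generated. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *trace = popen("gzip -c > trace.gz", "w");   /* compress on the way to disk */
        if (!trace) {
            perror("popen");
            return EXIT_FAILURE;
        }

        for (unsigned long addr = 0x1000; addr < 0x1000 + 1000; addr += 4)
            fprintf(trace, "R %lx\n", addr);              /* e.g. a memory-reference record */

        pclose(trace);
        /* The consumer would read it back with: popen("gzip -dc trace.gz", "r") */
        return 0;
    }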
Online Trace Consumption

- Use the trace data as it is generated
  - Never stored on disk
- Multitasking may lead to non-deterministic behavior
  - Repeatability issue
  - Before-and-after comparison tests: is the difference due to a change in the system or a change in the trace?
  - Becomes a statistical comparison over n runs
Abstract Execution

- Use higher-level information to intelligently compress the trace
- Two-step process
  - Compiler-style analysis to find a critical subset of the trace
    - Store only control-flow information sufficient to reconstruct the trace later
  - Produce trace-regeneration code for subsequent use of the trace
Abstract Execution

    1. if (i > 5)
    2.    then a = a + i;
    3.    else b = b + i;
    4. i = i + 1;

- The trace will be either
  - 1-2-4, or
  - 1-3-4
- Store only 2 or 3
- Combine with a compiler-generated control-flow graph to regenerate the full trace (sketched below)
- Slowdown = 2–10x
- Compression = 10–100x
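A toy sketch of the regeneration step for this fragment (the stored trace holds only the branch decisions, 2 or 3; the control-flow structure of the four statements above is hard-coded):

    /* Abstract-execution sketch: regenerate the full block trace 1-x-4 from a
     * compressed trace that records only which branch (block 2 or 3) was taken. */
    #include <stdio.h>

    int main(void)
    {
        int stored[] = { 2, 3, 3, 2, 2 };                 /* compressed trace: decisions only */
        int n = sizeof stored / sizeof stored[0];

        for (int i = 0; i < n; i++)                       /* regenerate the full trace */
            printf("1-%d-4\n", stored[i]);                /* block 1, chosen block, block 4 */
        return 0;
    }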
Trace Sampling

- Save only subsequences of the overall trace
- Drive the simulator with these samples
- Results should be statistically similar to driving it with the complete trace
- One sample = k consecutive events
- Sampling interval = P (the period between successive samples)

[Figure: samples of length k taken once every P events]
SimPoint

- Find "representative" program samples
  - Match basic-block execution frequencies
  - Clustering tool to automate the process
- Perform detailed timing simulation on only these samples
- Fast-forward (functional simulation) between samples
- [Sherwood et al., ASPLOS 2002]
SimPoint

- Weight each sample's result by its execution frequency to produce the overall result
- A relatively small number (tens) of SimPoints produced ≈3% error in IPC on SPEC
SMARTS

- Uses systematic sampling
  - Fixed sampling interval
- Apply statistical sampling techniques to determine j, k, and P

[Figure: alternating functional simulation and detailed simulation; intervals of length j and k repeat with period P]
Indirect Ad Hoc Techniques

- Sometimes the desired metric cannot be measured directly
- Use your creativity to measure one thing and then derive/infer the desired value
Example – System Load

- What is system load?
  - Number of jobs in the run queue?
  - Number of jobs actively time-sharing?
  - Fraction of time the processor is not in the idle loop?
  - Others?
- How to measure it?
  - Modify the OS
  - PC sampling
  - Indirectly?
Example

- Run a monitor program that simply increments a counter in a loop
- Let the system run for a fixed time T
- Note the value of the counter: n

[Figure: Monitor → Count = n over interval T]
Example

- Let the system run for the same fixed time T with an application loaded
- Compare the loaded system's monitor count to the unloaded system's count

[Figure: Monitor alone → count n; Monitor + App 1 → count ≈ n/2]
Example

- Let the system run for the same fixed time T
- Compare the loaded system's monitor count to the unloaded system's count (a sketch of such a monitor follows)

[Figure: Monitor alone → count n; Monitor + App 1 → count ≈ n/2; Monitor + App 1 + App 2 → count ≈ n/3]
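A sketch of such a monitor (assuming a POSIX system; run it on an idle machine to obtain the baseline count n, then again under load and compare the counts):

    /* Indirect system-load monitor sketch: count loop iterations completed in a
     * fixed wall-clock interval T. Fewer iterations than the idle baseline
     * indicate competing load. */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        const double T = 5.0;                      /* measurement interval in seconds */
        unsigned long long count = 0;
        struct timespec start, now;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
            count++;                               /* the "useless" work being counted */
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec)
                 + (now.tv_nsec - start.tv_nsec) * 1e-9 < T);

        printf("iterations in %.0f s: %llu\n", T, count);
        return 0;
    }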
Perturbation

- To obtain more information (higher resolution)
  → Use more instrumentation points
- More instrumentation points
  → Greater perturbation
Perturbation

- Computer performance measurement uncertainty principle:
  - Accuracy is inversely proportional to resolution

[Figure: accuracy falls from high to low as resolution increases from low to high]
Perturbation

- Superposition does not work here
  - Double instrumentation ≠ double impact on performance
  - Non-linear, non-additive
  - Some instrumentation cancels out
  - Some multiplies the impact
  - No way to predict!
Instrumentation Code

- Changes memory access patterns
  - More frequent cache flushes and replacements
  - But may reduce set-associativity conflicts
  - Affects memory banking optimizations
  - Alters virtual memory paging behavior
- Generates additional load/store instructions
- Generates more I/O operations
- Will increase overall execution time
  - More time-sharing context switches
Important Points

- Event types
  - Simple counts of a primary event
  - Secondary events triggered by some primary event
  - Overall profiles
Important Points

- Measurement strategies
  - Event-driven
  - Tracing
  - Sampling
  - Indirect approaches
Important Points

- Interval timers
  - Stopwatch functionality
  - Rollover problem
  - Overhead
  - Quantization errors
  - Statistical measures of short intervals
Important Points

- Profiling
  - PC sampling
    - Statistical view
  - Basic block counting
    - Exact behavior
    - High overhead and perturbation
Important Points

- Trace generation
  - Source-code modification
  - Forced exceptions
  - Emulation
  - Microcode modification
  - Compiler modification
  - Object-code editing
  - Online trace consumption
  - Trace sampling
Important Points

- Indirect measurements when all else fails
  - System load example
- Perturbations
  - Nobody likes them
  - Have to learn to live with them