EFFECTIVENESS OF SAMPLING BASED SOFTWARE PROFILERS IC. J. INDIA

advertisement
EFFECTIVENESS OF SAMPLING BASED SOFTWARE PROFILERS
IC. V. Subramaniam and Matthew J. Thazliuthaveetil
Computer Science and Automation,
Inclian Institute of Science,
Bangalore 560012, INDIA
Email: (l;vs,mjt}Ocsa.iisc.ernet.in
been compiled for profiling. Tlie counters are later used to
estimate how the total execution time of the program was
spent in its various subprograms [3].
Weinberger [3] suggests the following sources of inaccuracies in the profiles generated by prof: (i) One or more of
the counters corresponding to (8 byte) chudis of code may
straddle two subprograms, in which case the counts may
get attributed to the wrong subprogram. (ii) If event lifetimes are much smaller than the sampling period, a short
lived subprogram may never be sampled. (iii) A resonance
Abstract
Profilers are a class of program monitoring tools which
aid in tuning performance. Profiling tools which employ PC
(Program Counter) sampling to monitor program dynamics
are popular owing to their low overhead. However, the effectiveness of such profilers depends on the accuracy and coinpletenes of the measurements made. We study the methodology aiid effectiveness of PC sampling based software profilCl3
I. INTRODUCTION
or beating effect could occur when the sampling period and
the time taken to execute a loop are I lentical. Ponder and
Fateman [4]also identify the resonance effect as a potential
source of inaccuracies in prof in their analysis of inaccuracies in gprof, but argue that such instances of resonance
are uncommon in real programs. The other two problems
cannot be ignored as easily. In the presence of these and
other possible sources of error, a profile generated by prof
only a rough estimate of program behaviour. T
racy of the measurement could
possibly be increased by doing more profiling runs. This
prompts the following questions: (i) flow approximate are
the measurements made b y p r o f ? ( z i ) Can these approximate measurements also be inaccurate? (iii) How many
profiling runs are required to arrive at an accurate estimate
of program behaviour?
To illustrate the degree of inaccuracy, we conducted the
following experiment: Five profiling runs of dhrystone’ were
done on the same machine under identical workload conditions. The results of the experiments are shown in Table I.
Observe that even after 5 profiling runs, tkemost frequently
executed subprograms cannot be positively identified. We
conclude that measurements made by prof can be quite inaccurate, and question the rationale of performing multiple
profiling runs - with each run contributing a 100% profiling
overhead: just to arrive at a “rough” estimate of program
locality. There is clearly a need to better understand the
causes of these inaccuracies.
Profilers arc. software tools that provide the programmer
wit.li estimates of how much execution time is spent in the
various components of his program. Commonly used profilers are known to suffer from deficiencies. We evaluate the
effectiveness of profilers based on their Accuracy (reflecting the c1osenc:ss of the measurement made by the profiler
to the actual case), and Variation (across profiling runs;
a good profiler should show little or no variation in measurements) [l]. In this paper we study profilers based on
PC (Program Counter) sampling, examples of which include
lJnix prof and Borland Turbo Profiler [2] operating in the
passive mode. In the remainder of this paper, the term prof
is used for any profiler that uses this profiling technique. In
t.ltis paper, we study prof inaccuracy and its causes, and
suggest a teclii?iqrieto enhance accuracy.
11. MOTIVATION
The accuracy of the profile generated by a profiler is often
important. To a programmer using a profiler as a decision
making tool i n optimizing portions of a program, wasted effort could result from such inaccuracies. Execution profilers
such as prof are expected to provide accurate measurements
at a low time overhead. prof associates an array of countcm with the program to be profiled. The program itself
is vic-i:cd as a sequence of equal sized chunks (typically 8
bytes of profiled rode), and one counter is associated with
each such chunk. O n every clock tick Jiiiiiig the execution
of the program being profiled. the counter associated with
thr. sampled PC value is incremented. This activity is initiated by the timer interrupt handler, for programs that have
0-7803-2608-3
__
-----
‘?O,OOO iterations of the synthetic berichmark Dhrystone
1
[5]
E
TABLE I
prof
Proc
main
I
I I 1:%i
results of top 6 procedures on 5 successive runs of
dhrystone - 20,000 iterations
I
%t Proc
I
I
I
%t Proc
%t Proc
I 16.6 I strcpy I 14.0 Procl I 21.1 I main
7.5
12.8
11.0
9.9
8.7
7.6
15.4
11.4
10.3
F u n d 7.4
strcmp 6.9
main
ProcJ
strcpy
I %t
I 16.6
main
Procl
strcpy
strcpy
strcmp 11.8
strcmp
Proc-8
-write
6.5
111. CAUSES O F INACCURACY A N D VARIATION
We differentiate between profile inaccuracy and variation
The inaccuracy of any one profiling
run is the difference between the profile it generates and
the actual distribution of cycles among the subprograms of
the program. For a given input data set, a program should
follow the same execution path in different runs. Ideally,
we would expect neither the total number of cycles taken
by the program, nor the distribution of these cycles across
subprograms, to vary across these runs. Therefore, given
that the the prof sampling interrupt occurs at fixed, periodic intervals during the execution of a profiled program,
profiles should not vary across profiling runs.
We identify the following as causes of profile inaccuracy:
process is rescheduled, its cache context must be reconstructed. The time taken for the profiled process to rebuild
it's cache context will depend on (a) how much of the cache
is occupied by a process during one quantum, and (b) the
background load, which characterizes the number of quanta
after which the profiled process is scheduled again.
Having identified the possible causes of error and variation
in results reported by prof, we studied the effect of each
through trace driven simulation.
acmss profiing runs.
r
t t t t t t t
Beating Effect:- As mentioned earlier, it is possible that the
sampling interval and the time taken to execute a loop are
identical. As a result, an inordinate proportion of the PC
samples taken might be to a particular subprogram.
procedure 1(PI)
Gmnularity of Sampling:. Consider a system operating at 12
Mhz, with the clock tick (timer interrupt) occurring every 10
ms. For a processor that can execute one instruction per cycle, about 120,000 instructions would be executed between
two PC samples. Many subprograms might have single invocation instruction path lengths that are much shorter than
this. Such short lived events occurring between sampling
interrupts might escape capture altogether. Figure 1 illustrates this problem.
Profile variations across multiple runs can be attributed to
external events -events caused by other activities executing
on the system, resulting in changes to the effective sampling
interval. This background load on the system can affect the
sampling interval in the following ways:
t
Fig. 1.
Processor Pre-emption:. 1/0 activity on the system gener-
#samples actual time
profiling interrupt
Error in prof results due to large sampling interval
t
t
ates interrupts that are serviced at high processor priority,
pre-empting the execution of the profiled process. From
the perspective of the profiled process, this has the effect of
changing the sampling interval, as illustrated in Figure 2.
The ftuctuation in sampling interval is a function of the interrupt frequency' which is related to the load on the system.
Quantum Size = 24
Interrupt Service = 6
Cache Effect:. At the end of a CPU quantum, when another
process gets scheduled instead of the profiled process, the
cache context built up by tlie profiled process is replaced by
that of newly scheduled process. Thus, when the profiled
Effective Quantum Size = 18
interrupt Service
t
Timer tick
Reduction in the sampling interval due to interrupts
generated by background load
Fig. 2.
2
ing load as follows: The number of lines lost by the profiled
process on a context switch is assumed to increase exponentially with the increase in background load. As both the
Instruction Cache and the Data Cache are assumed to be
small, we assume that the entire cache context is lost on
a fully loaded machine [9]- However, on a system with no
load, the profiled process is the only process on the system,
and so the loss in cache context is only due to the execution
of the profile interrupt handler and the context switching
code. We consider only the effects of the profile interrupt
handler. The loss in cache context due to the execution of
thekterrupt handler is viewed .as a loss of a fixed set of
lines in the I Cache (due to execution of the handler code)
and a loss of a set of lines from tbe D Cache. The address
of the lost lines in the D Cache depends on the address of
the currently executing instruction and the loss can be attributed to the incrementing of the profile countefs by the
handler.
All of our simulation assumptions were validated through
experiments performed under controlled conditions [lo].
IV. SIMULATION
In our simulations, we assumed a similar computing environment to that on which the experirent described earlier
was conducted - a Unix workstation based on the htIPS
2000 processor. The simulations were driven by traces generated on that system using p i x i e [SIwith the -idtrace option. y e used the traces of 20,000 iterations of the synthetic
benchmark dhrystone2 (referred to as dhry), the SPICE
benchmxk from the Perfect benchmark suite (referred to as
as) [7],and the Monte Carlo simulation benchmark QCD2
from the Perfect benchmark suite (referr-4 to as lgs). Each
trace file was approximately 40h4B in size, correspondingto
15-20 million inskructions executed.
Simulation Model
System modelling:. We assume that the processor operates
with a 12MHz clock, with an SKB direct mapped data cache
and a 16KB direct mapped instruction cache. The line size
of the data cache is 4 bytes, and there is a cache miss penalty
of 5 cycles. We assume that the same parameters hold for
the instruction cache a.. well. There is a 4 cycle stall between
successive memory store instructions. A context switch is
associated with every profiling interrupt. This assumption
was made after we obsecved that the getrusage0 report
for the number of context switches during a profiling run is
identical to the number of samples collected by prof. The
prof sampling period is 10 ms. We assume an overhead
of 4500 cycles for every sampling period. This overhead
can be attributed to the execution of the profiling interrupt
handler and the context switch that occurs at the end of the
sampling period.
v.
RESULTS
We quantify the accuracy of a profile relative to the distribution of cycles among the subprograms of the program
under certain ideal con'ditions, which we refer to as the true
histogram. The true histogram is defined as the normalized number of cycles per subprograms of the profiled program executing under no load. Under taese conditions, the
profiled process is the only process in zxecution. The true
histogram is estimated in the absence of operating system
overheads due to context switching and profiling. We computed the true histograms for each of our benchmark program traces through simulation. The inaccuracy of a measured profile was then quantified as its deviation from the
true histogram. We use the chi square measure to quantify
the difference between two histograms.
Simulations of profiled execution were carried out under
no l,oad, half load and full load conditions, with load characterized by interrupt activity and/or loss in cache context.
Figure 3 compares the true histogram of css and the simulation results under no load conditions. The chi square values
showing the error in the results reported by prof under no
load conditions for each of the application traces is shown in
column 1 of Table 111. Table I1 compares the measurements
made by prof under varying load conditions using 4he chi
square measure.
From Figure 3, it is clear that the error is not due to
sampling errors in just one or two subprograms, but across
many subprograms, eliminating the resonance phenomenon
as source of error in this case. It appears that the likely
cause of the errors is the fact that the sampling interval is
too large to capture all events. To study this hypothesis,
we reran the simulations with an initial skew of a few cycles, i.e., by shifting all the samples by a few cycles. W e
would expect this shift to have little effect on the measurements made, but the simulation results indicate otherwise.
Table IV shows the chi-square measure between the skewed
System workload modelling:. The number of processes contending for the CPU has a significant impact on the measurements made by prof. Two factors that are effected by
increases in load are (i) increased loss of cache context, and
(ii) increased interrupt handling overheads. Other events,
such as page fault handling, are assumed not to affect the
effective sampling period; we assume that they are not accounted against the profiled process even if they occur in its
CPU quanta. *Systemworkload is therefore modelled using
two parameters:
1. Amount of interrupt activity: The number of interrupts
occurring in a quantum is assumed to increase linearly with
background load, reducing the effective length of the CPU
quantum proportionately. We assume a loss of 10% of quantum size on a fully loaded system due to the interrupt activity caused by the background load. We assume that the
load conditions are maintained at a constant level through
the profiling run.
2. Loss in Cache context: The effett of context switching
on cache context has been studied by Mogul et a1 [SI and
others. We quantify the loss in cache context with increas'AU printf 0 statements are removed from the dhrystone code to eliminate terminal 110
3
0.4
TABLE I1
chi square measure showing variation on the measurements
made by prof under varying loads
I
only cache
effects
only i n t e r r u p t s
effects
b o t h cache a n d
interrupt effects
I
1
I
half 1.79 e6
full 8183 e6
half 5.48 e6
full 1.30 e8
half 2.21 e6
full 16.90 e6
I
I
7.81
5.02
4.93
18.14
I 6.27
7.39
I
css
ks
e6
e6
e6
e6
e6
e6
6.13 e5
8.85 e5
7.63 e5
1.61 e6
16.81 e5
1.92 e6
1
I
0.3
U
0
0.2
I
TABLE 111
Chi square measure between the true histogram and
simulation results for the prof interval of 10ms and the
program dependent profiling frequency
I
I
Application
dhry
css
Igs
Benchmark
Name
dhry
CSS
Igs
Skew
(in cycles)
1
Interval
lams
I CModified
4.50 e06
9.01 e05
5.00 e06
2.61 e05
2.30 e06
I 5.72 e06 1
No Load
0
0.0
10 11.2 e06
0.0 I
01
10 I 3.1 e06 I
0.0 1
0 )
10 I 0.2 e06 I
0.1
I
a
Fig . 3 . Error in prof measurements with a sampling interval of
120,000 cycles
10% Loss
Full Load
10% Loss
2.2 e06
3.1 e06
6.2 e06
6.7 e06
0.6 e06
0.6 e06
6.9 e06
7.3 e06
7.3 e06
7.3 e06
1.9 e06
1.9 e06
Half Load
I
I
1
I
Profiling at this frequency should guarantee accurate results. However, for all three of our traces, the minimum
number of cycles between subprogram changes is Zcycles,
yielding an ideal sampling interval of 1 cycle, which is equivalent to single stepping through the program! If this is
caused by a routine with a low execution count, we could
use a larger profiling interval without much loss in accuracy. We arrive at such a practical sampling interval using
the average number of cycles between subprogram changes,
calculated as
c A 4 o d i f zed
= cAvg/'s
where:
C h f o d t f t e d is
the modified profiling interval and
is the Average number of cycles between
subprogram changes.
The modified sampling intervals were found to be 12 cycles for dhry, 53 cycles for css and 7s cycles for lgs. Simulations of profiling under these sampling intervals were then
conducted. Since the new sampling interval is orders of
magnitude smaller than the default value used earlier, we
modified our simulation assumptions as follows: (i) We no
longer assume that a context switch will occur at the end of
every sampling interval. Under such an assumption, with a
small sampling interval, many cycles will be wasted in undoing the effects of the context switch. We therefore assume
that a contest switch occurs every 30 ms, the default CPU
quantum for a process on our Unix system. (ii) We assume
that the overhead associated with profiling is 50 cycles. Earlier, both the profiling handler and the context switch code
were executed simultaneously, so it is possible that the profile interrupt handler was executing some extra code. Based
on an analysis of the activity involved in profiling,'we estimate that a carefully hand coded profiling interrupt handler
should be able execute in 50 cycles.
trace under different load characterizations and prof measurements under no load conditions.
We conclude that to improve the accuracy of measurements made by prof, it is necessary to use a sampling
interval that is smaller than the default interval of 10ms.
Further, the choice of profiling interval should be program
dependent.
CA,
VI. MODIFICATION
TO prof
In this section we study the efficacy of of a program dependent profiling interval as a means of improving the accuracy of the measurements made by prof. Let us view the
program in execution in terms of the variation in PC value
with time. We assume that the program unit of interest is
the subprogram, and represent each subprogram by it entry PC value. Using considerations similar to the Sampling
Theorem, the ideal sampling interval would then be :
Cldeal = cM,n/2
where
C I d c a l is the ideal profiling interval and
C A ~ ,is, the minimum number of cycles between
whprogram changes.
4
Results with modified sampling interval
Figure 4 compares the true histogram of css with the measurements made by prof under no load with the modified
sampling interval. Unfortunately, the behaviour illustrated
in this figures was shown only by the css benchmark. A
comparison of the chi square measure for modified sampling
interval against the fixed i n t e r d under no load conditions
is shown in Table 111. While there is a significant reduction in the error for css, and a marginal reduction for dhry,
the error for the lgs prograni actually increased with higher
sampling frequency. A similar comparison of the variation
in prof results with load (see Table V) indicates a drop in
the variation in the case of lgs, but wide variation for css
and dhry.
We also studied the affect of initial skew in the variation
in measurements, under different load conditions The chisquare measures for the three benchmarks are tabulated in
Table VI. We conclude that the modified sampling interval
is capable of capturing most events, but does not guarantee niore accurate measurements than those obtained with
the default sampling interval. Hence, to guarantee accurate measurements using p r o f , we have no choice but to
use the ideal sampling interval - even if it implies interrupting program execution every cycle, which amounts to single
stepping through the execution of the program.
0.3
U
.
I
0
0.2
0.1
Zig. 4. True histogram VIS
prof a t
modified profiling interval
Rb load results for css with
TABLE VI
Chi square measure showing variation in prof with load and
skew using a program dependent profiling interval
VII. CONCLUSION
While program profilers like Unix prof are widely used as
program tuning tools, their effectiveness in capturing actual
program behaviour is not well understood. We have studied
the interaction between the profiling process and the processor state for a typical RISC based Unix system. We conclude that PC sampling profilers which use a fixed sampling
interval, like Unix p r o f , can not be relied on to produce accurate measurements. Although this makes them unsuitable
as architecture analysis tools, if used with multiple profiling
runs, they can provide useful information to programmers
interested in identifying program bottlenecks. PC sample
based profiling using program dependent sampling intervals
maybe capable of providing accurate measurements, provided the ideal sampling interval is used. This carries a
high time overhead with it; in the cases studied, accurate
profiling would involve single stepping through program execution. A less restrictive sampling interval was also studied,
but could not guarantee accurate measurements.
REFERENCES
[l] C.B. Stunkel, 3. Janssens, and W.K. Fuchs. Address Tracing for Parallel Machines. IEEE Computer, pp 31-38, Jan
1991.
[2] Borland International Inc. Turbo Profirer Version 1.0 User's
Guide, 1990.
[3] P.J. Weinberger. Cheap Dynamic Instruction Counting.
ATBT Bell Lab Tach J"a1,
63(8):1815-1826, Oct 1984.
141 C. Ponder and R. J. Eateman. Inaccuracies in Program
ProNers. Softzoane - P w t i c e and Experience, 18(5):459[5] R.P. Weidker. An Overview of Common Benchmarks. ZEEE
Computer, 1990.
[GI M.D. Smith. Tracing with pixie. Tech Report CSL-TR91-497, Computer Systems Lab, Stanford University, Nov
1991.
[7] G. Cybenko, L. Kipp, L. Pointer, and D. Kuck. Supercom-
TABLE V
chi square measure showing variation in results reported by
prof with load using a program dependent sampling interval
load characterization
only cache
effects
load
noload
half load
dhry
. 0.0
227.5
uter Performance Evaluation and the Perfect Benchmarks.
fn Proc Inntl Conf on Supercomputing, pp 254-266, 1990.
[8] J.C. Mogul and A. Borg. The Effect of Context Switches on
Cache Performaace. In PTOC4th ACM/IEEE Conf on Arch
Su port fur Prog Lung and Operating Systems, pp 75-84,
19JL
[9] S. Laha, J.H. Patel, and R.K. Iyer. Accurate Low Cost
css
Igs
0.0
0.0
301.0 440 0
Methods for Performance Evaluation of Cache Memory SysIEEE Trans Fomp, pp 1325-1336, Nov 1986.
[lo] K. Y. Subramdam. Momtoring Program Dynamics: An
Evaluation of Software ProNer Effectiveness. Master's thesis, Com uter Sdence and Automation, Indian Institute of
Science,
1993.
tems.
interrupt effects
I full load I 640.2 I 462.0 I 348.2
kov
5
Download