Abstract - International Association for Computer and Information

advertisement
1 inch (“)
This Sample shows the 1st two & Last two pages of a paper
No page numbers
A Comparison of Path Profiling and Edge Profiling
In C++ Applications
Brian A. Malloya
Clemson University,U.S.A
0.75’’
Abstract
Font 12 pt & bold
Body Text font size 10 single
spacing
0.25’’
0.75’’
We investigate the notion of phases as they occur in
object-oriented programs. The focus of our investigation is
on C++ programs and our test suite includes various kinds
of object-oriented programs including scientific and
general-purpose applications. We focus on individual
phased branch behavior and attempt to capture information
and generalize about the frequency of phased behavior. We
provide profiling in determining phase change in various
program executions.
Keywords: Profiling, edge profiling, path profiling,
performance, control flow graph.
1. Introduction
The tremendous interest and growth in the internet has
brought with it a demand for mobile code that is compact
and efficient. The idea behind mobile code is that
applications can be distributed quickly across computer
networks and automatically executed upon arrival. For these
applications to run efficiently, they must adapt to the runtime environment of the target architecture. In view of the
demand for speed, there is evidence that traditional static
optimizations may not provide the efficiency required for
modern mobile applications. Moreover, there is increased
use of object-technology in these mobile applications with a
concurrent belief that object-oriented code, with its frequent
use of dynamic binding, is less amenable to traditional,
static optimizations.
One optimization in common use today is feedbackdirected optimization, FDO, which uses program
characteristics, obtained at runtime, to attempt to improve
the performance of the application. Most FDO approaches
use profile guided compilation, which facilitates
determination of those parts of a program to modify in order
to maximize performance [26].
Profiling entails instrumenting an application to obtain a
cost metric such as time spent in a method or the number of
times a method executes. This metric can indicate those
parts of a program where optimization is likely to achieve a
parts of a program where optimization is likely to achieve a
a
Computer Science Department
Clemson, SC 29634, USA
malloy@cs.clemson.edu
If the input to a program changes, the optimizations
performed on a previous execution may no longer be
beneficial [27]. In fact, the performance of a statically
optimized program may degrade appreciably, so that
performance is worse than the un-optimized version of the
program [26].
A second drawback with FDO relates to the
determination of the instrumentation strategy that is likely to
provide the most information at a tolerable cost. Traditional
approaches to profile guided optimization construct a
graphical representation of the program that describes the
flow of control, and then place counters at vertices or edges
in the graph. Edge profiles record an aggregate count of the
frequency of edge executions that provide general
information about the behavior of a program.
However, reference [26] argues for the collection of
more detailed information than aggregate edge information
to describe a program's run-time characteristics, introducing
the notion that the run-time behavior of a program cycles
through a series of phases. For example the branch
prediction and cache miss rate might be more accurately
described over the execution of a program using phases
[23]. Phase profiling might address the shortcomings of
edge profiling if one could detect changes in phase and
apply an optimization more suited for a particular phase and
later apply a different one for a different phase.
Phase profiling might address the shortcomings of edge
profiling if one could detect changes in phase and apply an
optimization more suited for a particular phase and later
apply a different one for a different phase.
Phase profiling might address the shortcomings of edge
profiling if one could detect changes in phase and apply an
optimization more suited for a particular phase and later
apply a different one for a different phase.
profiling if one could detect changes in phase and apply .
apply a different one for a different phase.
profiling if one could detect changes in phase and apply .
apply a different one for a different phase.
optimization more suited for a particular phase and later
apply a different one for a different phase.
This metric can indicate those parts of a program where
optimization is likely to achieve
Profiling entails instrumenting an application to obtain a
cost metric such
Should be an exact 0.75’’ from the above paragraph line and the end of the page
No page numbers
Please put
title at the
bottom of the
figure
area of program optimizations, phases and the use of
profiling. Finally, in Section 8 we summarize and conclude.
2. Background
In this section, we provide definitions of terms and
background about profiling. We discuss both edge and path
profiling including a greedy algorithm for computing edge
profile information. We conclude this section with a
discussion of phases and phased behavior, including the
impact of phased behavior on superblock scheduling.
Should have
figure no. like
say figure 1
Phased Behavior. This is a branch taken from the lcom
benchmark which exhibits phased behavior. The pink line at y =
55% represents the average bias for the branch. The blue line
represents the sampled taken bias for the branch. The sampled bias
seems to interleave between periods of being strongly taken and
being weakly taken. An optimization based purely on the
aggregate bias would adversely affect performance during the
period of very weak taken bias.
To illustrate the impact that phased behavior might have
on optimization, consider the graph in Figure [1], which
models the run time branching behavior of an application.
The branch shown in the figure is biased towards “taken”
55% of the time; that is, aggregate information would lead
the profiler to conclude that the branch is likely to be
“taken” more often than not. However, the sampled taken
bias interleaves between periods of strong bias towards
“taken” and periods of “not taken”. An optimization based
strictly on the 55.5% average bias would perform poorly
during the phases with significant bias towards ``not taken''.
In this paper, we investigate the notion of phases as they
might occur in object-oriented programs. The focus of our
investigation is on C++ programs and our test suite includes
various kinds of object-oriented programs including
scientific and general-purpose applications. We focus on
individual phased branch behavior and attempt to capture
information and generalize about the frequency of phased
behavior. We provide guidance about the advantages of
using path profiling over edge profiling for improving the
performance of programs that exhibit phased branch
behavior.
The remainder of this paper is organized as follows. In
the next section we provide background about various
approaches to program profiling, including the notion of
superblock scheduling. In Section 3 we describe the
simulation tool that we exploit to investigate phased
behavior and in Section 4 we describe our case study that
forms the basis of our conclusions about phased behavior. In
Section 5 we present the results obtained from analyzing
phased behavior for individual branches and discuss these
results in Section 6. Section 7 overviews related work in the
2.1 Profiling
One of the common uses of profiles is to derive paths
taken during program execution for use in path based
optimizations [7]. Edge profiling records the execution
count of transitions between basic blocks [5] or edges.
Since edge profiles summarize path execution using
aggregate edge counts [26, 6], programs paths must be
derived from a profile. Generally path based optimizations
are performed on hot paths, heavily executed paths, in order
to gain the best performance gain with limited resources.
One technique for deriving heavily executed paths from
an edge profile is to use a greedy algorithm that follows the
edge with the maximum execution frequency out of a basic
block[6]. This greedy algorithm does not always produce
accurate results; more specifically, the greedy algorithm
may misidentify which path significantly contributes to
overall control flow of a program. For example, given an
edge profile for a CFG such as Figure 2a, a greedy
algorithm would incorrectly identify ACDEF instead of
ABDEF as the most frequently executing path.
Path profiling addresses the shortcomings of edge
profiles by recording the actual paths taken within an
application[6].
Looking at Figure 2a, path profiling can disambiguate
between the two paths and determine which path actually
contributed significantly to the control flow of the program.
Since the frequency counts for actual paths are recorded
instead of aggregate edge counts, the correct hot path
ABDEF was found for the CFG in Figure 2a.
The additional accuracy provided by path profiling can
be beneficial for certain optimizations based on run-time
profile information.
Young and Smith demonstrated
improvements that can be made to superblock schedulers
using path profiling [29, 26]. A superblock scheduler
attempts to improve program performance by increasing the
amount of instruction-level parallelism (ILP) extractable
7. Related Work
0.75’’
No page numbers
In this section, we overview the research that relates to
program optimizations, phases and the use of profiling.
7.1 The Deco Project at Harvard and Hewlett Packard’s
Dynamo
The Deco project [14] at Harvard University and Hewlett
Packard's Dynamo[4] are feedback-direct optimization
systems which attempt to create executables which can
adapt to their run-time environment. Each system uses a
different mechanism for detecting changes in phase. The
current prototype of Deco records Ball and Larus acyclic
paths[6] and stores them in a hash table(profile table). An
execution count is maintained for each entry in the hash
table. Deco uses an interrupt mechanism to examine the
hash table and determine which path to optimize based on a
threshold execution count. Each optimized piece of code is
put into a fixed sized optimization cache. After the hash
table is examined, each entry's execution count is zeroed. If
the cache is full, a random entry is evicted in order to deal
with program phases.
Dynamo uses a heuristic called MRET (Most Recently
Executed Tail)[4] for identifying hot traces. Within Dynamo
backward taken branches represent the end of a trace. The
target addresses of these branches identify the start of new
traces. An execution count is associated with each of these
target addresses. If this execution count exceeds some
threshold value, dynamo begins to record the instruction
stream until another backward taken branch is encountered.
The trace is optimized and placed into a cache indexed by
the target address that starts the trace.
Subsequent
encounters of the target address result in hits to the cache.
The target address is replaced by the address of the
optimized trace. Once the trace ends control is returned to
the Dynamo interpreter which starts the tracing process
again. To adapt to changes in phase, Dynamo flushes the
trace cache whenever the rate of trace creation surpasses
some threshold value. This flushing strategy attempts to
readjust optimizations to changes in branch phase.
7.3 Adaptive Parallelism
Even though adaptive loop transformations for parallel
computations can provide significant performance speedups,
existing adaptive techniques waste processor resources. For
a particular parallel computation, speedups gained by
adding more processors could level off or possibly decrease.
Hall and Martonosi showed that the behavior of some loops
from the Specfp95 and NAS benchmark suites may go
through phases where there may be insufficient levels of
parallelism so additional processors may not help improve
performance[15]. In some cases a serialized version of a
loop might perform better.
8. Conclusions
We investigated the impact of phased behavior on
various C++ benchmarks. We showed that the conditional
branches within the C++ benchmarks exhibit a significant
amount of phased behavior.
We observed a mean
percentage of phased behavior of 13% at a granularity of
1000, 13.7% at a granularity of 500, and 17% at a
granularity of 100. We believe this indicates a great deal of
opportunity for an optimizer that can exploit path profiling.
We have described techniques to improve the accuracy of
our work, as well as approaches to extend the work for
analyzing phased behavior in conditional branches.
References
[1]
[2]
[3]
7.2 Time Varying Behavior
[4]
Sherwood and Calder looked at phases of program
behavior that vary over time[23]. They looked at how the
instructions executed per cycle, branch prediction rates,
address prediction rates, cache miss rates, and value
prediction rates influence each other and change over the
lifetime of the SPEC95 benchmark suite. They used the
SimpleScalar [9] simulator to record statistics for every 100
million committed instructions. Each point (for each 100
million instructions committed) obtained was graphed to
view trends over time and to look for cyclic behavior. The
cyclic behavior of the SPEC95 benchmarks was used to
reduce the amount of time needed to get an accurate picture
of program behavior. The graph was used to determine the
length of a program cycle.
[5]
[6]
[7]
0.75’’
G. Aigner and U Holzle, “Eliminating Virtual
Function Calls in C++ Programs”, in Proc. of
the European Conference on Object Oriented
Programming, 1996.
G.
Albert, “A Transparent Method for
Correlating Profiles with Source Programs”, in
Proc. of the Second Workshop on FeedbackDirected Optimization(FDO), Nov. 1999.
J. Anderson, “Continuous Profiling: Where
Have All the Cycles Gone”, in Proc. of the
Sixteenth ACM Symposium on Operating System
Principles, Oct. 1997, pp. 1- 14.
V. Bala, E. Duesterwald and S. Banerjia,
“Dynamo: A Transparent Dynamic Optimization
System”, in Proc. of the ACM SIGPLAN
Conference on Programming Language Design
and Implementation(PLDI), June, 2000.
T. Ball and J. R. Larus, “Optimally Profiling and
Tracing Programs”, ACM Trans. on Programming Languages and Systems, July 1994.
T. Ball and J. R. Larus, “Efficient Path
Profiling”, in Proc. of MICRO-29, Dec. 1996.
T. Ball, P. Mataga and M. Sagiv, “Edge
Profiling versus Path Profiling: The Showdown”, In Symposium on Principles of Programming Languages, Jan. 1998.
No page numbers
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
F.-B. Bjorn and J. Maloney, “The deltablue
algorithm: An incremental constraint hierarchy
solver”, in Proc. of the Eighth Annual IEEE
Phoenix Conference on Computers and Communications, 1989.
D. Burger and T. M. Austin, “The SimpleScalar
Toolset, Version 2.0”, in University of
Wisconsin-Madison Technical Report, CS-TR1997-1342, June 1997, pp. 128-137.
B. Calder, D. Grunwald and B. Zorn,
“Quantifying Behavioral Differences Between C
and C++ Programs”, in Journal of Programming
Languages, 1994.
R. F. Cmelik and D. Keppel, “Shade: A Fast
Instruction-Set Simulator for Execution Profiling”, in Proc. ACM SIGMETRICS Conference on the Measurement and Modeling of
Computer Systems, May 1994, pp. 128-137
E. Cox, “Fuzzy Fundamentals”, IEEE Spectrum,
Oct. 1992, pp. 58-61.
E. Duesterwald and V. Bala, “Software Profiling
for Hot Path Prediction: Less is More”, in Proc.
Ninth International Conference on Architectural
Support for Programming Languages and
Operating Systems, Nov. 2000.
E. Feigin, “A Case for Automatic Run-time
Code Optimization”, A Case for Automatic Runtime Code Optimization. Bachelor of Arts
Thesis, Harvard College, April 1999, 1999.
M. W. Hall and M. Martonosi, “Adaptive
Parallelism in Compiler Parallelized Code”, in
Concurrency: Practice and Experience, vol. 10,
no. 14, pp. 1235-1250, 1998.
W. Hwu, “The Superblock: An Effective
Technique for VLIW and Superscalar Compilation”, in Journal of Supercomputing, vol. 7, pp.
229-248, Jan. 1993.
G. J. Klir and B. Yuan, “Fuzzy Sets and Fuzzy
Logic - Theory and Applications”, Upper Saddle
River, NJ: Prentice Hall PTR, 1995.
J. R. Larus and E. Schnarr, “EEL: MachineIndependent Executable Editing”, in Proc. of the
SIGPLAN Conference on PLDI, vol. 30, no. 6,
pp. 291-300, 1995.
D. C. Lee, P. J. Crowley, J.-L. Baer, T. E.
Anderson and B. N. Bershad, “Execution
Characteristics of Desktop Applications on
Windows NT”, in ISCA, 1998 pp. 27—38.
M. A. Linton, J. M. Vlissides and P. R. Calder},
“Composing User Interfaces with InterViews”,
IEEE Computer, vol. 22, no. 2, pp. 8-22, 1989.
T. Romer, G. Voelkner, D. Lee, A. Wolman, W.
Wong, H. Levy and B. Bershard, “Instrumentation and Optimization of Win32/Intel
Executables Using Etch.”, USENIX Windows
NT Workshop, Seatle, WA, August 1997.
S. Savari and C. Young, “Comparing and
[23]
[24]
[25]
[26]
[27]
[28]
[29]
Combining Profiles”, in Proc. Second Workshop
on Feedback-Directed Optimization (FDO),
Nov. 1999.
T. Sherwood and B. Calder, “Time Varying
Behavior of Programs”, in UC San Diego
Technical Report UCSD-CS99-630, Aug. 1999.
L. Smith and C. Laird, “Android, Open Source
Scripting for Testing & Automation”, in Dr.
Dobbs Journal, no. 326, pp. 58-61, July 2001.
M. D. Smith, “Extending SUIF for Machinedependent Optimizations”, in Proc. of the First
SUIF Compiler Workshop, 1996, pp. 14-25.
M. D. Smith, “Overcoming the Challenges to
Feedback-Directed Optimization”, in Proc. of
the ACM SIGPLAN Workshop on Dynamic and
Adaptive Compilation and Optimization
(Dynamo'00), Jan. 2000.
D. W. Wall, “Predicting Program Behavior
Using Real or Estimated Profiles”, in Proc. of
the ACM SIGPLAN '91 Conference on PLDI,
vol. 26, pp. 59-70, June 1991.
Z. Wang, K. Pierce and S. McFarling, “BMAT - A Binary Matching Tool”, in Proc. of the
Seond Workshop on Feedback-Directed
Optimization, 1999.
C. Young and M. D. Smith, “Better Global
Scheduling Using Path Profiles”, in Proc. 30th
Annual IEEE/ACM Intl. Symp. on Microarchitecture, Nov. 1998.
Brian A. Malloy is an associate
professor in the department of
Computer Science at Clemson
University.
Dr. Malloy's research
focus is software engineering and
compiler technology.
He has
investigated issues in software
validation, testing and program
representations to facilitate validation and testing. He has
applied software engineering to parser development,
especially applied to the construction of a parser front-end
for C++. Dr. Malloy has given presentations at national and
international conferences and workshops. He is an active
member of the Association for Computing Machinery
(ACM) and the IEEE Computer Society. Dr. Malloy has
reviewed papers for IEEE Transactions for Parallel and
Distributed Systems, Journal of Software, Practice and
Experience, Journal of Parallelism and IEEE Transactions
on Software Engineering.
Black &
White picture
Download