Architectural Support for Online Program Instrumentation

Kapil Vaswani, Y. N. Srikant and Matthew J. Thazhuthaveetil
Department of Computer Science and Automation
Indian Institute of Science
Bangalore, India
{kapil,srikant,mjt}@csa.iisc.ernet.in
Abstract— For advanced profile-guided optimizations to be effective in online environments, fine-grained and accurate profile information must be available at low costs. In this paper, we propose a generic framework that makes it possible for instrumentation based profilers to collect profile data efficiently, a task that has traditionally been associated with high overheads. The essence of the scheme is to make the underlying hardware aware of instrumentation using a special set of profile instructions and a tuned microarchitecture. We propose selective instruction dispatch as a mechanism that can be used by the runtime to manage the execution of profile instructions by switching profiling on and off. We propose profile flag prediction, a hardware optimization that complements selective instruction dispatch by not fetching profile instructions when the runtime has turned profiling off. The framework is light-weight and flexible. It eliminates the need for expensive book-keeping, additional recompilation or code duplication. Our simulations with benchmarks from the SPEC CPU2000 suite show that overheads for call-graph and basic block profiling can be reduced by 72.7% and 52.4% respectively with a negligible loss in accuracy.

Keywords—profiling, instrumentation, profile-guided optimization

I. Introduction

OVERCOMING the performance penalties associated with the dynamic compilation model in general and runtime systems in particular has been a target of much research recently. It is in this context that profile-guided optimizations have emerged as important tools in our bid to achieve performance comparable to that under the traditional compilation model. In online environments such as Java runtimes, profile-guided optimizations can be used to improve program performance based on profile information available during the same run of the program. These optimizations can adapt to changes in program behavior detected by monitoring the program throughout the length of its execution. However, the applicability of profile-guided optimizations is governed by a complex cost-benefit trade-off. Collecting fine-grained profile information such as call-graph, edge and path profiles using instrumentation, for even a subset of methods, involves substantial costs, both in terms of execution time and profiler design complexity. This factor has precluded the use of advanced profile-guided optimizations such as profile-guided basic block scheduling, loop unrolling and method inlining that rely on fine-grained profile information.

In this paper, we propose a generic framework that reduces the overheads associated with online instrumentation through a higher level of interaction between the compiler, the runtime and the underlying hardware. The framework incorporates instruction set support for online instrumentation in the form of profile instructions. These instructions allow the processor to explicitly identify instrumented code, provide the runtime with mechanisms to control the profiling activity, and optimize the process of profiling in a manner transparent to the runtime.

We propose dedicated architectural support for online profiling in the form of selective instruction dispatch and profile flag prediction mechanisms. With support from the compiler, selective instruction dispatch allows the runtime to dynamically control the execution of profile instructions by switching between profiling and non-profiling modes of execution, and hence regulate the profiling activity in the program. Profile flag prediction is a hardware optimization that fetches or skips a group of profile instructions based on the likelihood of their execution. Our framework (1) can be easily incorporated into existing architectures without substantial hardware costs, (2) supports all types of instrumentation based profiles and (3) involves no code duplication or code management overheads, thus making advanced profile-guided optimizations feasible in cost-sensitive dynamic recompilation environments.

II. Background and Motivation

Although most traditional profiling techniques are based on software mechanisms and instrumentation, there has been a flurry of activity over recent years on hardware support for profiling. Conte et al. [1] propose an approach that uses kernel mode instructions to sample the branch handling hardware present in most commercial processors at every kernel entry point to construct weights of a Weighted Control Flow Graph. Several mechanisms that collect profile data by monitoring events at various stages of the processor
pipeline have also been proposed [2][3][4]. Zilles and Sohi [5] suggest a solution that advocates dedicated hardware for profiling in the form of a programmable profiling co-processor that receives profile samples from the main processor, processes them and feeds them back to the main processor.

In general, hardware profilers have the advantages of low overheads and non-intrusiveness, but they are restricted by the profiling capabilities built into the processor hardware and cannot cater to the range of profiling requirements raised by dynamic environments of the present and future. Moreover, these techniques are usually platform dependent, which makes it difficult to develop profilers for runtime systems that support multiple architectures.

We therefore believe that software techniques such as instrumentation will continue to play an important role in the design of profilers for dynamic environments. It has been recognized that instrumentation based profilers must be complemented with simple mechanisms that allow runtimes to control instrumentation overheads [6]. These overheads can be broken down into:

• Overheads associated with determining where, what and how to instrument: Instrumentation procedures for most types of profiles do not require significant time and effort. However, in cases such as edge and path profiles, increasing the efficiency of the instrumented code might involve complex analysis [7][8].

• Recompilation space and time overheads: Some instrumentation strategies create duplicate instrumented copies of code and mechanisms for transferring control to the instrumented copies when profiling is to be performed.

• Overheads due to instruction and data cache pollution: The presence of the instrumentation causes code expansion and could degrade instruction cache performance. Also, the instrumented instructions' data accesses could adversely affect the data cache hit ratio.

• Overhead of running the instrumented code: This is usually the main source of profiling overheads. We further break this overhead into (i) the overhead associated with fetching profile instructions and (ii) the overhead of executing them.

• If the environment provides a mechanism to dynamically control instrumentation, its overheads and complexity add to those of the profiler.

Several existing techniques attempt to reduce the execution overheads of instrumentation by stopping instrumentation code from executing. Traub et al. [9] suggest a mechanism for dynamically hooking and unhooking instrumentation by code patching. In Convergent Profiling [10], the profiling code is conditioned on a boolean test before proceeding with data collection. During
execution, the boolean is set based on the convergence of the profile. Arnold and Ryder [6] propose a software based framework for instrumentation sampling that uses code duplication and counter based sampling to switch between instrumented and non-instrumented code. This approach requires no hardware support but involves significant recompilation space and time overheads. A drawback common to these techniques is the overhead of the instrumentation management mechanism itself. While the approach in [9] involves book keeping and code patching, [10] incurs overheads in maintaining individual booleans for every instruction being profiled. The instrumentation sampling framework [6] reports a framework overhead of 4.9% due to the presence of checks on method entries and back edges, a 34% increase in recompilation time, and a requirement of 285KB of additional code space.

In our view, runtime environments should be able to concentrate on the eventual use of profile information rather than having to spend precious cycles dealing with the management of profiling overheads. In this paper we propose a scheme where instrumentation overhead management is a service implemented in the hardware. The runtime environment utilizes this service depending on its requirements. As our results show, this facility does not come at the expense of profiling accuracy or flexibility.

III. Reducing profile execution overheads using selective instruction dispatch

It is well-recognized that most programs show a well-defined phased or even periodic behavior [11]. So, instrumented code need not run over the entire duration of the profiling interval to gather data that is adequately representative of the actual program profile. Since most profile-guided optimizations use profile information to drive heuristics and decision making, they can function as effectively with such representative profiles. This observation has been the basis for sampling based profiling. In a similar vein, we propose a selective instruction dispatch infrastructure that provides the runtime environment with the means to sample the execution of instrumented code by periodically switching profiling on or off.

In selective instruction dispatch, the switching between profiling and not profiling happens in hardware under software control. Central to the implementation of selective instruction dispatch is the profile flag, which is supported by the processor as part of its dynamic state. The state of this flag determines the current profiling mode of the processor. When the software sets the profile flag, the processor runs in the profiling mode, during which it fetches and executes instrumented code as normal instructions. When the profile flag is reset, profiling is switched off and the processor shifts to the non-profiling mode. During this mode, instrumented code is ignored and dropped from the dynamic instruction stream.
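The mode switch just described can be sketched in simulator-style C. This is a minimal model under assumed field names, not the authors' implementation: a decode-time filter consults the profile flag and either dispatches or drops an instruction carrying the profile annotation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical decoded-instruction record for a simulator model. */
typedef struct {
    bool is_profile;   /* carries the profile annotation */
} Insn;

static bool profile_flag = false;  /* the processor's profile flag */

/* Returns true if the instruction is dispatched to the reorder
 * buffer, false if it is dropped from the instruction stream. */
bool selective_dispatch(const Insn *in) {
    if (in->is_profile && !profile_flag)
        return false;  /* non-profiling mode: ignore instrumented code */
    return true;       /* ordinary code, or profiling mode is on */
}
```

Ordinary instructions always pass the filter; only annotated instructions are conditional on the flag.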
The rest of this section describes the instruction set and architectural extensions required to support selective instruction dispatch. We also discuss modifications in the instrumentation procedures that allow the runtime environment to control and tune profiling activity.

A. Instruction Set Extensions

A.1 Profile Instructions (PI):

In order to selectively dispatch instrumented code, the processor must be able to distinguish between instrumentation code and ordinary code. This distinction can be made explicit by extending the ISA with a set of instructions referred to as profile instructions. These instructions are variants of the subset of the original instruction set that is generally used in profiling. The difference between the original instruction opcode and its profile version could be a single bit, an annotation or a prefix. The processor might even support a completely new set of opcodes, depending on the instruction set and decode space available. Using annotations or a bit in the opcode would simplify the identification of these instructions. The instrumentation system uses these special instructions instead of ordinary instructions to instrument code. In this study, we use annotations for the lui, lw, sw, add, jal and mov instructions; these are sufficient for the kinds of profiling we studied.

A.2 Profile Control Instructions (PCI):

These instructions operate on the profile flag and thus control the profiling activity during execution. The setpf instruction sets the profile flag while resetpf resets the flag. The pushpf operation pushes the value of the profile flag onto the runtime stack. poppf pops a value from the top of the stack and sets the profile flag accordingly.

B. Microarchitectural Extensions

We propose the following extensions to the base architecture in order to support selective instruction dispatch.

B.1 Extensions to decode/dispatch logic:

The decode stage of the processor pipeline is where most of the selective dispatch logic is implemented. The decode logic is extended to support profile instructions and profile control instructions in the following manner. The setpf and resetpf instructions set and reset the profile flag during instruction decode. These instructions consume decode bandwidth but are not dispatched and hence have minimal overhead. Furthermore, the decode stage identifies profile instructions and processes them according to the value of the profile flag.
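The semantics of the four PCIs described in A.2 can be modeled in a few lines of C. This is only a sketch; the stack representation and depth are arbitrary choices standing in for the runtime stack slots that pushpf and poppf actually use.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the profile control instructions. */
static bool pf = false;      /* the profile flag */
static bool pf_stack[64];    /* stands in for runtime stack slots */
static int  pf_sp = 0;

void setpf(void)   { pf = true;  }
void resetpf(void) { pf = false; }
void pushpf(void)  { pf_stack[pf_sp++] = pf; }   /* save the flag */
void poppf(void)   { pf = pf_stack[--pf_sp]; }   /* restore the flag */
```

A pushpf at method entry paired with a poppf at exit restores the caller's profiling context, as the instrumentation samples in Section VI illustrate.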
If the decode stage comes across a profile instruction and finds the profile flag set, the instruction is decoded and moved to the reorder buffer like any ordinary instruction. However, if the profile flag is off, the profile instruction is not dispatched. This action of filtering profile instructions from the dynamic instruction stream at the decode stage is referred to as selective instruction dispatch. Note that a small overhead is incurred even when a profile instruction is ignored in this way, since the instruction has already consumed fetch and decode bandwidth.

Independent of whether the profile instruction has been dispatched or ignored, the decode stage checks for a possible profile mis-fetch. A profile mis-fetch is said to have occurred if the profile instruction satisfies the following three conditions:
1. The profile instruction is also a control transfer instruction.
2. It was predicted taken by the branch predictor.
3. It is being dropped by the decode stage due to the value of the profile flag.
In this situation, the fetch engine would already have speculatively fetched instructions from the predicted address, whereas the instruction that should be processed next lies in the fall-through path. The mis-fetch is rectified by squashing the fetch queue and re-initializing the fetch engine to start fetching from the fall-through address.

B.2 Branch prediction for profile instructions

The branch prediction mechanism must be modified to minimize the number of profile mis-fetches. The branch predictor must not only accurately predict the direction a profiling branch instruction might take, but also whether it is likely to be ignored by the selective instruction dispatch hardware. We evaluated three strategies that make the branch predictor aware of profile instructions, namely branch prediction with ignored instruction updates, profile flag based branch prediction, and prediction bit based branch prediction. We describe prediction bit based branch prediction here; a description of the other approaches can be found in [12].

In prediction bit based branch prediction, the fetch stage maintains a prediction bit that is set and reset by setpf and resetpf instructions when these instructions are fetched. The bit is different from the profile flag, since the setpf and resetpf instructions that modify it might have been fetched speculatively, only to be squashed later. If the prediction bit is off, a profile instruction is predicted not taken. However, if the bit is set, we fall back on the normal branch prediction hardware. This approach requires additional hardware at the fetch stage but, unlike the other two strategies, it has a high prediction accuracy for both profile instructions and ordinary control instructions.
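This prediction-bit policy amounts to a two-level decision, sketched below in C. The function normal_predict is an assumed name standing in for the ordinary branch prediction hardware, stubbed out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Set/reset by setpf and resetpf as they are fetched. */
static bool prediction_bit = false;

/* Stub for the ordinary branch prediction hardware. */
bool normal_predict(unsigned pc) { (void)pc; return true; }

/* Direction prediction for a profile branch instruction. */
bool predict_profile_branch(unsigned pc) {
    if (!prediction_bit)
        return false;           /* block likely ignored: not taken */
    return normal_predict(pc);  /* fall back on normal prediction */
}
```

When profiling is off, profile branches never redirect fetch, which is exactly the behavior selective dispatch will enforce at decode.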
We use this strategy in the simulation studies reported in Section VII.

B.3 Dependency Logic

The profile flag creates additional dependencies between PIs and PCIs. The dependency analysis logic must be extended to take these additional dependencies into account. A case that needs special consideration is the RAW dependency between a poppf and any following instruction that reads the profile flag. Since the profile flag acts at the decode stage and the result of the pop will be available at best in the next cycle, maintaining this dependency might involve stalling the decode. Other profile flag dependencies are taken care of by the serial order of decode.

B.4 Branch mis-prediction recovery

Similar to other components of processor state, the profile flag is saved at points where the processor moves into speculative mode. The recovery mechanism restores the profile flag to the saved value when a mis-speculation is detected.

C. Compiler and runtime extensions

The instrumentation system and the runtime environment must be modified to exploit selective instruction dispatch. Recall that the instrumentation system must use profile instructions instead of ordinary instructions to instrument code. The instrumentation system must also insert PCIs in a manner dictated by the granularity of control required over profiling. A setpf/resetpf instruction typically precedes a unit of code (a function, a loop, a basic block or an individual instruction) and controls the profiling activity within the unit. The exact manner of placing the setpf/resetpf instruction depends on the mechanism used to trigger the switching of profiling modes, which we discuss later in this section.

The pushpf and poppf instructions save and restore the state of the profile flag. Proper placement of these instructions is required to ensure that the scope of setpf/resetpf instructions is restricted to their intended areas. For example, a pushpf-poppf pair at the callee's method entry and exit ensures that a setpf/resetpf instruction in the callee does not interfere with the profiling context in the caller. Sample methods that illustrate these modifications for basic-block and call-graph profiling are included later in Section VI.

The mechanism that triggers the switching of profiling modes can either be built into the instrumented methods or implemented externally in the runtime. Arnold and Ryder [6] propose a scheme that involves placing checks at method entries and back-edges that transfer control to the instrumented code based on a condition in the check. A similar check implemented using setpf and resetpf instructions would resemble the code
shown in Figure 1.

    if (condition code check)
        setpf
    else
        resetpf

Fig. 1. Sample implementation of a check-based trigger using PCIs

To implement the switching mechanism externally, the instrumentation system places a single setpf/resetpf instruction before units of code. The runtime maintains, for each instrumented method, a set of addresses where setpf/resetpf instructions have been placed. To switch profiling off for a unit of code, the runtime dynamically swaps the associated setpf instruction with a resetpf, and vice versa. The swapping routine could be invoked either by signals, hardware based interrupts or a dedicated thread. Independent of the trigger mechanism used, the setpf/resetpf instruction executed on the next entry into the unit of code dictates the manner in which profile instructions are handled by the processor.

This amalgam of minor extensions provides the runtime with fine-grained control over the execution of profile instructions. To reduce overheads, the runtime uses a triggering mechanism to periodically switch profiling on and off. When profiling has been turned off, profile instructions will not contribute to the execution overheads. As we shall see, the savings can be even more significant if the profile instructions involve calls to external routines.

As with any sampling technique, selective instruction dispatch involves a trade-off with the precision of the profile data. The runtime must ensure that profiling is switched on long enough for a representative and meaningful profile to be collected. Some salient features of selective instruction dispatch are: (1) the extensions required to support it do not affect the processor's critical path; (2) it decreases resource contention, since profile instructions that might have prevented other instructions from using processor resources such as memory ports, registers and functional units are no longer dispatched; and (3) unlike the schemes surveyed in Section II, the technique involves no additional recompilation or code duplication and runs in a manner transparent to the compiler.

IV. Reducing fetch overheads using profile flag prediction

Selective instruction dispatch of profile instructions helps in reducing the overheads of profile instruction execution. We next address the problem of avoiding fetching profile instructions that are going to be ignored due to selective instruction dispatch. These unnecessary instruction fetches can be a significant overhead for certain kinds of profiles. Basic block profiling is one such case.
Here the profile instructions are largely independent of the rest of the code and hence have low execution overheads, particularly in out-of-order processors where they add to the available instruction level parallelism. It is the fetching of these instructions, usually large in number, that is expensive. Fetching and ignoring instructions has the effect of shrinking the instruction window seen by the decode stage. It also leads to an increase in pressure on the i-cache. We propose profile flag prediction as a solution to this problem. This mechanism predicts the value of the profile flag that a profile instruction will see during decode. This prediction is used to skip over groups of profile instructions that are unlikely to be dispatched by the selective instruction dispatch hardware, thereby eliminating unnecessary fetches.

We define a profile block as a sequence of profile instructions that appear statically together. The leader of a block is defined as the first instruction in that profile block. The leader "dominates" all other instructions in its profile block. The block follow is the instruction that immediately follows the profile block in static code. A profile block is uniquely defined by its leader and the block follow.

A. The Profile Flag Predictor

As shown in Figure 2, the profile flag predictor uses a hardware block descriptor table. The table has one entry per profile block, consisting of the following four fields:
1. A tag that consists of higher order bits from the address of the leader instruction.
2. The address of its block follow.
3. A prediction bit that indicates the value of the profile flag the profile block is likely to see when fetched next.
4. A lock bit initialized to 1, which disallows reads from this table entry until it is reset to 0.
The block descriptor table is indexed using lower order bits from the address of the leader instruction. The table maintains descriptions of profile blocks seen during decode. It records the past history of whether a profile block was dispatched or ignored when previously encountered.

In order to detect profile blocks in the instruction stream without compiler support, the decode stage maintains a special internal register called the leader address register that keeps track of the leader address of the profile block being processed. Ordinary instructions reset the register. When the decode comes across a profile instruction and finds the leader address register reset, it identifies the instruction as a leader and updates the register with its address. Profile instructions that follow the leader do not alter the register.

The profile flag predictor interfaces with the pipeline stages in a fashion similar to the branch predictor.
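In C terms, one block descriptor table entry and the indexing scheme might look like the following sketch. The field widths, the table size, and the assumption of 4-byte aligned instruction addresses are all illustrative, not taken from the paper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BDT_SIZE 256  /* number of table entries; illustrative */

/* One block descriptor table entry, with the four fields above. */
typedef struct {
    uint32_t tag;          /* high-order bits of the leader address */
    uint32_t block_follow; /* address of the instruction after the block */
    bool     prediction;   /* profile-flag value the block should see */
    bool     lock;         /* 1 = entry not readable yet */
} BDTEntry;

/* Low-order bits of the leader address select the entry;
 * the >> 2 assumes word-aligned instruction addresses. */
uint32_t bdt_index(uint32_t leader) {
    return (leader >> 2) & (BDT_SIZE - 1);
}

/* The remaining high-order bits form the tag. */
uint32_t bdt_tag(uint32_t leader) {
    return leader >> 10;
}
```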
For every profile instruction encountered, the decode updates the profile flag predictor with the current instruction address, the profile flag value and the contents of the leader address register. The leader instruction address is used to index into the block descriptor table. If no entry for this address is found, a new table entry is created only if the profile flag is off. This ensures that control will flow through the current profile block even in the presence of profile branches, and that the predictor will eventually be updated with the last instruction of the block. The block follow address is initialized to the address of the instruction following the leader (subsequent profile instructions update this field). The prediction bit is initialized to predicted reset and the lock bit is set to disallow any reads for this entry until the predictor has made sure of having seen the entire profile block.

Fig. 2. Profile flag predictor lookup: For a given leader instruction, the lookup returns the corresponding block follow address.

If an entry for the leader instruction address is found in the descriptor table, the predictor updates the block follow address if the current instruction address is greater than the existing block follow and the lock bit is set. The lock bit is reset to 0 if the current instruction address is the same as the leader instruction address. These operations ensure that the entry for a profile block is not read until we have seen the entire profile block and correctly determined the block follow. The prediction bit is always updated to the current value of the profile flag.

During instruction fetch, the profile flag predictor is consulted to check if the fetched instruction is the leader of a block of profile instructions. If so, the predictor uses the prediction bit corresponding to the block to predict whether the block is likely to be executed or not. If it predicts that the profile block is likely to be ignored and finds the lock bit 0, the predictor returns the block follow address corresponding to the leader address. The current fetch is discontinued and the fetch engine is re-directed to the follow address, thus skipping over the block of profile instructions. The profile flag prediction overrides branch prediction if the leader happens to be a control flow instruction.
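The decode-side update rules above can be summarized in C. This is a single-entry sketch of our reading of those rules, assuming 4-byte fixed-width instructions; the field and function names are ours, not the hardware's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    uint32_t leader, block_follow;
    bool     prediction, lock;
} Entry;

/* Called for every profile instruction seen at decode. */
void bdt_update(Entry *e, uint32_t pc, uint32_t leader, bool pflag) {
    if (!e->valid) {
        if (!pflag) {                      /* create only when flag is off */
            e->valid        = true;
            e->leader       = leader;
            e->block_follow = leader + 4;  /* instruction after the leader */
            e->prediction   = false;       /* initialized to predicted reset */
            e->lock         = true;        /* no reads until block fully seen */
        }
        return;
    }
    if (e->lock && pc + 4 > e->block_follow)
        e->block_follow = pc + 4;          /* grow follow past this insn */
    if (pc == e->leader)
        e->lock = false;                   /* block re-entered: fully seen */
    e->prediction = pflag;                 /* always track the current flag */
}
```

On the second encounter of the leader, the whole block has been walked once, so the lock is cleared and fetch may start consulting the entry.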
The leader instruction is moved to the fetch-decode queue in all cases.

A.1 Profile flag mis-prediction and recovery

The decode stage detects a profile flag mis-prediction if it comes across a leader instruction whose profile block has been predicted ignored, but the profile flag is currently on. The mis-prediction is corrected by re-initializing the fetch engine to fetch the remaining instructions in the profile block.

A.2 Profile flag prediction in dynamic environments

Profile flag prediction in its current form does not validate a correct prediction that involves skipping instructions. It assumes that profile blocks do not change over time. This assumption does not hold in dynamic recompilation environments (where it is not uncommon to have a memory management policy for compiled code) or for self modifying code. Here, a profile block could be replaced by another block of a different length but at exactly the same location. The predictor might cause the processor to skip genuine instructions or execute profile instructions it is not supposed to. This problem can be corrected if the compiler passes information about the length of every profile block to the processor, either as part of the profile instructions or through a special PCI instruction that precedes every profile block. The profile flag predictor could then detect a change in the profile block and remove stale entries from the block descriptor table.

V. Support for simultaneous profiling

The selective instruction dispatch logic can easily be scaled to allow the runtime to instrument a method for different types of profiles and still control each type of profiling independently of the others. To support n simultaneous profiles, the processor must maintain n profile flags, possibly in a dedicated profile flag register where each flag controls one type of profiling. The instruction set would be similarly extended to support n different annotations, one for each profile type. The pushpf, poppf, setpf and resetpf instructions take operands specifying the profile flag they need to operate on. Similarly, the fetch stage maintains n prediction bits and the branch predictor uses the prediction bit corresponding to the profile instruction for which a prediction is being made. For the purposes of this study, we restrict our attention to a single type of instrumentation at a time.

VI. Simulation Methodology

We use the SimpleScalar [13] simulator with SPEC CPU2000 benchmarks [14]. The base machine configuration (Table I) represents a typical 4-way out-of-order superscalar processor [15]. The benchmarks were compiled using the gcc cross compiler targeted to the SimpleScalar architecture, with the compiler flags -O4 -ffast-math -funroll-loops
Fetch queue size      32
Branch predictor      Bimodal and two-level combined, with speculative
                      update in dispatch, 3 cycle penalty for branch
                      misprediction
BTB config            256 entries, 2-way
Decode width          4
Issue width           4
Commit width          4
RUU size              64
LSQ size              64
Functional units      4 integer ALUs, 2 integer multipliers,
                      2 FP adders, 2 FP dividers
L1 d-cache            64KB, 2-way
L1 d-cache latency    2 cycles
L1 i-cache            64KB, 2-way
L1 i-cache latency    1 cycle
L2 unified cache      2 MB, 4-way
L2 cache latency      10 cycles
Memory latency        100 cycles
Memory bandwidth      32 bytes
TLB                   128 entry, 4-way associative, 8KB page size,
                      50 cycle penalty

TABLE I
Base processor configuration
for SPEC integer benchmarks and -O4 for SPEC floating-point benchmarks. The compiler was extended with implementations for basic block and call-graph profiling. We used the SPEC test inputs for our simulations, except for mcf where we used the train input, as the test input is too short. Detailed simulations were conducted for windows of 100 million instructions at various points spanning the entire length of program execution.

A. Case Studies: Basic Block and Call-Graph Profiling

Basic block and call-graph profiles [16] are representative of classes of instrumentation based profiles that are widely used as inputs to adaptive optimizations. These profiles find application in optimizations such as profile-guided basic block placement and scheduling [17], dead code elimination and method inlining [18][19].

Figure 3 shows the layout of a function instrumented for basic block profiling under our framework. Each basic block is instrumented at entry with a set of 5 instructions (the load and store are expanded into two instructions by the assembler) to increment a counter corresponding to the block. Since all computation related to profiling happens inline, without control being transferred outside the function, these instructions form an example of what is known as inline instrumentation. The pushpf and poppf instructions ensure that the state of the profile flag is restored to its original value in the caller. The presence of the resetpf instruction after the prologue allows the runtime to control profiling at the method level. The resetpf instruction is followed by the block profile initialization code. This piece of code is responsible for registering the function's basic block counters with a routine that prints out the block profile on program exit.
function entry:
    <method prologue>
    // push the profile flag, save caller context
    pushpf
    // reset the profile flag
    // runtime keeps track of this address, patches it
    // to setpf to turn profiling on and vice versa
    resetpf
    <block profile init code>
    ...
LPBX1:
    // basic block instrumentation instructions
    // (the /b annotation indicates profile instructions)
    lw/b    $26, ...
    add/b   $26, $26, 1
    sw/b    $26, ...
    ...
    // pop profile flag and restore caller context
    poppf
    <method epilogue>

Fig. 3. Method instrumentation for basic block profiling
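In effect, the lw/b, add/b, sw/b triple at LPBX1 performs an inline counter increment. In C terms it amounts to the following (the counter array name and size are illustrative, not part of the framework):

```c
#include <assert.h>
#include <stdint.h>

/* Per-function basic block counters (illustrative). */
static uint64_t block_counters[16];

/* What the instrumentation at block i computes:
 * load the counter, add 1, store it back. */
static inline void count_block(int i) {
    block_counters[i] += 1;
}
```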
tialization code from overhead computations since
mechanisms.
a dynamic environment is unlikely to instrument
B. Simulating a runtime environment
methods with such code.
We extend SimpleScalar with a simulated runThe instrumentation for call-graph profiling
time environment. Our objective is to simulate a
shown in Figure 4 involves a call to an external
runtime that identifies hot methods, i.e. meth-
routine. The routine uses the two most recent re-
ods that consume most of the execution time,
turn addresses to update the call-graph with an
and then instruments and collects profiles only for
arc that corresponds to the current caller-callee
these methods. A complete description of how
pair. This is an example of out-of-line instrumen-
this was achieved and other details such as the
tation[20].
modified simulator instruction counting strategy
and the profile collection mechanism can be found
As we shall shortly see, not only do inline
in [12].
and out-of-line instrumentation have a different
component-wise distribution of overheads, but
The simulated runtime environment uses an
they respond differently to our overhead reducing
external trigger mechanism based on dynamic
function entry:
    <method prologue>
    // push the profile flag, save caller context
    pushpf
    // reset the profile flag
    resetpf
    // save caller return address
    mov/b  $1, $31
    // call external routine to update call-graph
    jal/b  mcount
    // restore caller return address
    mov/b  $31, $1
    ...
    // pop profile flag and restore caller context
    poppf
    <method epilogue>

Fig. 4. Method instrumentation for call graph profiling
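The difference between the two instrumentation styles can be sketched in C. This is only an illustrative model of what the instrumentation in Figures 3 and 4 computes; the names (profile_block, mcount, arc_count) and the open-addressing table are our own assumptions, not the framework's actual implementation.

```c
#include <stdint.h>

#define MAX_BLOCKS 1024
#define MAX_ARCS   1024

/* Inline instrumentation (cf. Figure 3): each basic block executes
 * the equivalent of the lw/add/sw triple, incrementing a counter
 * private to the block; control never leaves the function. */
static uint64_t block_counter[MAX_BLOCKS];

static inline void profile_block(int block_id)
{
    block_counter[block_id] += 1;
}

/* Out-of-line instrumentation (cf. Figure 4): an mcount-style
 * routine records an arc (caller, callee) in a call-graph table,
 * modeled here as a linearly probed open-addressing hash table. */
struct arc { uint64_t caller, callee, count; };
static struct arc arcs[MAX_ARCS];

static void mcount(uint64_t caller, uint64_t callee)
{
    uint64_t h = (caller * 31 + callee) % MAX_ARCS;
    for (;;) {
        struct arc *a = &arcs[h];
        if (a->count == 0) {                    /* empty slot: new arc */
            a->caller = caller;
            a->callee = callee;
            a->count  = 1;
            return;
        }
        if (a->caller == caller && a->callee == callee) {
            a->count += 1;                      /* existing arc */
            return;
        }
        h = (h + 1) % MAX_ARCS;                 /* probe next slot */
    }
}

/* Helper to read an arc's weight back out of the table. */
static uint64_t arc_count(uint64_t caller, uint64_t callee)
{
    for (int i = 0; i < MAX_ARCS; i++)
        if (arcs[i].count && arcs[i].caller == caller && arcs[i].callee == callee)
            return arcs[i].count;
    return 0;
}
```

The sketch makes the cost asymmetry visible: the inline variant is a handful of straight-line instructions, while the out-of-line variant pays for a call, a hash computation and probing on every method entry.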
patching to toggle profiling. To enable profiling during detailed simulation, the runtime patches the resetpf instructions in the 10 hottest methods to setpf. Simulations are performed for different sampling window sizes by switching profiling on and off at specific instruction count intervals. Profiling is switched on after every 20 million instructions for a duration of 2, 5 or 10 million instructions, which results in profiling being enabled for 10, 25 and 50% of the execution window respectively. Execution time overheads are calculated relative to the execution time of the binary with no instrumentation (the No Profiling configuration). The overheads in the 100% profiling configuration are essentially the overheads that a program would incur in the absence of our framework.

C. Comparing profiles

Sherwood et al., in their work on characterizing program behavior [21], use the Manhattan distance to compare basic block vectors. We use the same distance measure to quantify the similarity of profiles. Profile vectors are treated as points in n-dimensional space. Each vector is normalized by dividing each of its elements by the sum of all elements. The Manhattan distance in n-dimensional space is the shortest distance between two points when the only paths taken are parallel to the axes. Given two normalized n-dimensional vectors a and b, the Manhattan distance MD between the vectors is given by

MD = \sum_{i=1}^{n} |a_i − b_i|
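The distance measure is straightforward to compute; the C sketch below folds the normalization step into the comparison of two raw count vectors. The function names are ours, chosen for illustration.

```c
#include <math.h>
#include <stddef.h>

/* Normalized Manhattan distance between two profile vectors: each
 * vector is first scaled so that its elements sum to 1, then
 * MD = sum_i |a_i - b_i|, which lies in [0, 2]. */
static double manhattan_distance(const double *a, const double *b, size_t n)
{
    double sa = 0.0, sb = 0.0, md = 0.0;
    for (size_t i = 0; i < n; i++) {
        sa += a[i];
        sb += b[i];
    }
    for (size_t i = 0; i < n; i++)
        md += fabs(a[i] / sa - b[i] / sb);
    return md;
}

/* Relative accuracy (%) as defined in the text:
 * identical profiles give 100, completely disjoint profiles give 0. */
static double accuracy(double md)
{
    return (1.0 - md / 2.0) * 100.0;
}
```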
The Manhattan distance is a number between 0 and 2, where 0 represents identical vectors. Given the Manhattan distance between two profiles, the relative accuracy is computed as

% Accuracy = (1 − MD/2) × 100

VII. Performance Results

In this section, we first present the overheads due to the extra instrumentation (PCIs) required to control profiling activity. We then evaluate the impact of the selective instruction dispatch mechanism on instrumentation overheads and profile accuracy. Finally, we present results on the effectiveness of the profile flag prediction mechanism.

A. Framework overheads

Table II shows the execution time overheads incurred by selected programs from the SPEC2000 benchmark suite due to the presence of PCIs in method prologues and epilogues, along with the average number of PCIs executed in a 100 million instruction window. This overhead is computed relative to the execution time of the binary without PCIs. The overheads for most benchmarks are extremely low in spite of the large number of PCIs encountered, partially indicative of the nature of these instructions and the way they are processed. The average execution time overhead is also a negligible 0.19%. Detailed analysis [12] reveals that the overheads are also influenced by the redistribution of branch instruction addresses caused by the instrumentation. For benchmarks other than ammp and equake, the redistribution has a positive impact on branch predictor performance and leads to fewer mispredictions. We found that the number of mispredictions for gzip reduces from 939334 to 936988, which actually translates into a small improvement in performance. Results also show that the impact of branch address redistribution on branch predictor accuracy varies with the type of instrumentation.

B. Selective instruction dispatch

Table IV shows the execution time overheads and profile accuracy for call-graph profiling with various sampling window sizes. We observe that as the sampling window size is reduced, profiling overheads decrease proportionally. For instance, overheads drop by 72.7%, from 41.81% to 11.41%, when profiling at 25%.

We can draw several other insights from this table. The column for 100% profiling indicates that overheads for call-graph profiling are quite significant, especially for integer benchmarks, where profiling can result in severely degraded performance. This is despite the fact that we only profile hot methods. The actual overhead depends on the number (Table III) and locality of method calls, apart from the nature of the application itself. parser, gzip and bzip2 incur substantial overheads because of a high number of calls and the significant amount of work done to maintain the call-graph.

¹ Averages over several 100 million instruction windows.
Benchmark     Number of PCIs executed   Execution time overhead (%)
197.parser    3125956                    0.87
164.gzip      3305409                   -0.02
181.mcf        723526                    0.09
255.bzip2     3241298                    0.33
179.art           168                    0.02
188.ammp       190535                    0.01
183.equake     673270                    0.06
Average       1608595                    0.19

TABLE II
Number of PCIs executed¹ and framework execution time overheads
Benchmark     Number of basic blocks executed   Number of calls to hot methods
197.parser     5921520                           1041985
164.gzip       8402845                           1101803
181.mcf       26967411                            241175
255.bzip2      6640489                           1080432
179.art       16633082                                56
188.ammp      34608741                             63511
183.equake     1941593                            224423

TABLE III
Number of basic blocks executed¹ and calls to hot methods¹
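The runtime's trigger mechanism of Section B can be caricatured as a per-method call counter. This sketch is a simplifying assumption in two ways: it uses a fixed threshold rather than the top-10 selection the simulated runtime actually performs, and the names (on_method_entry, HOT_THRESHOLD, is_hot) are hypothetical.

```c
#include <stdint.h>

#define NUM_METHODS   8
#define HOT_THRESHOLD 1000   /* hypothetical hotness threshold */

static uint64_t call_count[NUM_METHODS];
static int      is_hot[NUM_METHODS];   /* 1 once the method is selected */

/* Invoked on every method entry by the simulated runtime.  Once a
 * method's call count crosses the threshold, the runtime would patch
 * that method's resetpf PCI to setpf, so that profiling is enabled
 * for hot methods only; the flag below stands in for the patch. */
static void on_method_entry(int m)
{
    call_count[m] += 1;
    if (!is_hot[m] && call_count[m] >= HOT_THRESHOLD)
        is_hot[m] = 1;   /* stands in for patching resetpf -> setpf */
}
```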
On the other hand, art has very few calls to hot methods and does not incur any profiling overheads. The low average execution time overhead of 0.48% when profiling is completely disabled is partially due to the small number of profile instructions encountered during call-graph profiling. As shown in Figure 4, a method instrumented for call-graph profiling has only three additional profile instructions, executed once per method call. Also, fewer mispredictions due to the redistribution of branch instruction addresses contribute to the low overhead, even improving the overall performance in art and ammp.

Figure 5 illustrates the variation in instruction and data cache accesses as a result of sampling for call-graph profiling. It is evident that the reduction in overheads for parser, gzip and bzip2 is due to a substantially lower number of instruction and data cache accesses. When the sampling window size is reduced, fewer instruction and data cache accesses result, due to a higher number of profile instructions being ignored and a lower number of calls to the external mcount routine.

The effects of sampling on the accuracy of the call-graph profile are also varied. Sampling causes a negligible drop in profile accuracy for mcf, bzip2, ammp and equake.
              0%       10% profiling      25% profiling      50% profiling     100%
Benchmark    Ovhd.     Ovhd.    Acc.      Ovhd.    Acc.      Ovhd.    Acc.     Ovhd.
197.parser    1.12     10.77   83.09      24.04   88.44      45.14   92.80     91.33
164.gzip      1.01      8.12   79.50      18.44   94.60      36.06   95.80     69.08
181.mcf       0.01      1.12   99.62       2.79   99.67       5.52   99.88     10.99
255.bzip2     1.25     13.07   99.96      30.44   99.95      59.23   99.99    104.66
179.art      -0.09     -0.08   71.82      -0.08   91.44      -0.08   97.11     -0.07
188.ammp     -0.02      0.03   99.45       0.11   99.62       0.23   99.90      0.49
183.equake    0.07      1.71   99.93       4.10   99.94       8.19   99.97     16.16
Average       0.48      4.96   90.48      11.41   96.24      22.04   97.92     41.81

TABLE IV
Percentage overhead and accuracy for call-graph profiling with selective instruction dispatch
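The sampling schedule that produces these configurations can be expressed as a simple predicate on the retired-instruction count. Placing the on-window at the start of each 20-million-instruction period is an assumption of this sketch; the text does not specify where within the period profiling is switched on.

```c
#include <stdint.h>

/* Sampling policy from Section B: profiling is switched on once per
 * 20-million-instruction period for a fixed duration.  Durations of
 * 2, 5 and 10 million instructions yield the 10%, 25% and 50%
 * profiling configurations respectively. */
#define PERIOD 20000000ULL

static int profiling_enabled(uint64_t icount, uint64_t duration)
{
    /* On during the first `duration` instructions of each period. */
    return (icount % PERIOD) < duration;
}
```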
Fig. 5. Call-graph profiling with selective instruction dispatch: level 1 instruction and data cache accesses for
different sampling window sizes.
parser and gzip have a large number of call sites with distributed edge counts. Due to the phased behavior of these programs, sampling at 10% misses some call sites altogether, lowering the profile accuracy to 83% and 79% respectively. On the other hand, art has just four hot edges with extremely low edge counts. This leads to an exaggerated error value even if sampling results in a few call sites being missed. Profiling at 10% results in an average accuracy of 90.48%, which increases to a more acceptable level of 96.24% when profiling is on for 25% of the instruction window.

Table V illustrates the impact of sampling based on selective instruction dispatch on basic block profiling. Profiling at 25% reduces average overheads from 26.42% to 16.31%, a drop of 38.3%. We also observe that the overheads are lower than those for call-graph profiling, mainly due to the characteristics of the instrumented code.
In basic block profiling, the instrumented instructions are independent of the rest of the code, do not transfer control to external routines, and show good data locality.

On the other hand, the overheads when profiling is completely disabled are higher for basic block profiling than for call-graph profiling. These overheads also account for a higher fraction of the total overhead. This observation is substantiated by Figure 6, which plots the variations in instruction and data cache accesses for different sampling window sizes. We see that the number of data cache accesses decreases almost proportionately with the decrease in the sampling window size. However, the number of instruction cache accesses is close to its maximum even when profiling is completely disabled. This is due to the very nature of selective instruction dispatch and the fact that, unlike in call-graph profiling, ignoring profile instructions here has no benefits beyond saving their own execution overheads. Since the number of profile instructions encountered in basic block profiling is fairly large (Table III), the overheads incurred even when profiling is completely turned off are almost 50% of those when profiling is enabled. We conclude that in scenarios such as basic block profiling, where overheads due to fetching profile instructions are not negligible, selective instruction dispatch must be complemented with the ability to skip profile instructions to have a larger impact on the overall profiling overheads.

Apart from a steady increase in overall overheads, we also notice a marginal rise in the number of i-cache accesses as the sample window size increases. One might expect the number of i-cache accesses to remain constant, since increasing the sampling window size does not cause any additional instructions to be fetched. We conjecture that this is an interesting side effect of increasing the sampling window size. A larger sample window causes a higher number of profile instructions to be executed and hence slows down the flow of instructions through the processor pipeline. Thus branch instructions, which would have executed earlier had the processor ignored more of the profile instructions, take longer to resolve. As a result, the processor speculatively fetches a larger number of instructions during this interval. This causes greater mis-speculation penalties and explains the marginal increase in i-cache accesses.

We also observe that the accuracy of block profiles reduces considerably for some benchmarks when we profile at 10%. The profiles indicate that the sampling procedure was not able to capture the phased behavior of these benchmarks. However, when the sampling window size is increased to cover 25% of the instruction window, we see a drop in profile error to less than 5% on average.
              0%       10% profiling      25% profiling      50% profiling     100%
Benchmark    Ovhd.     Ovhd.    Acc.      Ovhd.    Acc.      Ovhd.    Acc.     Ovhd.
197.parser    7.75      8.51   86.49       9.55   92.62      11.31   94.91     14.96
164.gzip     16.39     17.46   82.68      19.00   94.38      21.63   96.19     26.57
181.mcf      18.98     22.98   98.13      28.82   98.82      38.46   99.35     57.78
255.bzip2    21.00     21.57   99.92      22.38   99.94      23.73   99.98     25.85
179.art      23.66     26.36   79.24      30.06   97.37      38.39   98.72     51.77
188.ammp      0.69      0.71   97.50       0.74   98.00       0.80   99.24      0.91
183.equake    2.38      2.86   79.54       3.63   92.27       4.77   94.11      7.08
Average      12.98     14.35   89.07      16.31   96.20      19.87   97.50     26.42

TABLE V
Percentage overhead and accuracy for basic block profiling with selective instruction dispatch
Fig. 6. Basic Block profiling with selective instruction dispatch: level 1 instruction and data cache accesses for
different sampling window sizes.
C. Selective instruction dispatch with profile flag prediction

In evaluating the performance of profile flag prediction, we choose an organization that does not incur substantial hardware costs while still being able to capture the behavior of most profile blocks encountered. We studied the impact of block descriptor table size on profile overhead and found a 2-way associative table with 64 sets, occupying approximately 1KB, to be adequate. Increasing the size or associativity beyond these values does not help significantly.

Another predictor design issue was the minimum profile block size for which skipping fetches is beneficial. If a profile block is too small, it might be less expensive not to skip fetching the block and to let the profile instructions be ignored by the selective dispatch hardware. We studied the performance impact of skipping fetches for profile blocks of various lengths.
Benchmark    0% profiling   10% profiling   25% profiling   50% profiling   100% profiling
197.parser        4.82           5.88            7.32            9.80           14.96
164.gzip         10.18          11.90           14.37           18.58           26.57
181.mcf          11.17          15.93           22.94           34.52           57.69
255.bzip2        14.75          15.98           17.76           20.72           25.59
179.art          13.11          16.82           21.86           33.22           51.39
188.ammp          0.52           0.55            0.61            0.71            0.90
183.equake        0.07           1.71            4.10            8.19           16.16
Average           7.96           9.84           12.52           17.39           26.31

TABLE VI
Percentage overhead for basic block profiling with selective instruction dispatch and profile flag prediction
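A block descriptor table with the organization described in this section (2-way associative, 64 sets) might be modeled as follows in C. The descriptor fields, the LRU replacement policy and the function names are illustrative assumptions, not the hardware's actual design.

```c
#include <stdint.h>

#define SETS 64          /* organization from the text:            */
#define WAYS 2           /* 2-way associative, 64 sets (about 1KB) */
#define MIN_SKIP_LEN 4   /* skipping pays off only for blocks of   */
                         /* length 4 or more                       */

/* A block descriptor maps the PC of a profile block's leader
 * instruction to the block's length in instructions. */
struct descriptor { uint64_t tag; int length; int valid; int lru; };
static struct descriptor table[SETS][WAYS];

static void insert_block(uint64_t pc, int length)
{
    struct descriptor *set = table[pc % SETS];
    int victim;
    if (!set[0].valid)       victim = 0;
    else if (!set[1].valid)  victim = 1;
    else                     victim = set[0].lru ? 0 : 1;  /* evict LRU way */
    set[victim].tag = pc;
    set[victim].length = length;
    set[victim].valid = 1;
    set[victim].lru = 0;
    set[1 - victim].lru = 1;
}

/* Number of instructions the fetch unit may skip at `pc` when the
 * profile flag is off: the block length on a hit against a block long
 * enough to be worth skipping, 0 otherwise. */
static int fetch_skip(uint64_t pc, int profile_flag_on)
{
    if (profile_flag_on)
        return 0;                        /* profiling on: fetch the PCIs */
    struct descriptor *set = table[pc % SETS];
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == pc) {
            set[w].lru = 0;              /* mark this way most recent */
            set[1 - w].lru = 1;
            return set[w].length >= MIN_SKIP_LEN ? set[w].length : 0;
        }
    }
    return 0;                            /* miss: fetch normally */
}
```

On a miss or for short blocks the fetch unit proceeds normally and the selective dispatch hardware simply ignores the profile instructions, so a wrong or absent descriptor never affects correctness, only the overhead.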
Our results indicate that skipping is profitable only for profile blocks of length 4 or more. We assume that for cases such as call-graph profiling, where the length of the profile block is less than 4, the predictor would not initiate skips, preventing smaller blocks from adversely affecting the execution time. We therefore restrict ourselves to the impact of profile flag prediction on basic block profiling in the rest of this section.

Table VI illustrates the overheads involved in basic block profiling with selective instruction dispatch supported by profile flag prediction. We observe that profile flag prediction reduces average overheads from 26.31% to 12.52% when profiling is enabled for 25% of the window, a drop of 52.4%. Profile flag prediction does not alter the accuracy of profiles, since the only instructions skipped are the ones that would have been ignored by the selective dispatch hardware anyway.

We also note that when profiling is completely disabled, profile flag prediction reduces average overheads by about 5 percentage points, from 12.98% to 7.96%. As shown in Figure 7, the reduction is predominantly due to a decrease in the number of instruction cache accesses. As the sample window size increases and profiling is enabled for longer durations, the profile flag predictor has fewer opportunities to skip profile blocks. The predictor has negligible impact on performance when profiling is enabled throughout the execution window.

Fig. 7. Basic block profiling with selective dispatch and profile flag prediction: level 1 instruction cache accesses for different sampling window sizes.
Observe that the average overhead is 7.96% even when profiling is completely disabled. Apart from the side effects of instrumentation, this overhead can be attributed to the following reasons:
• Even when a profile block is predicted ignored, the leader instruction always goes through to the decode stage. This instruction might eventually be dropped by the decode logic, but it has already used up fetch and decode bandwidth.
• On most architectures, the discontinuous fetch introduced by skipping a profile block has a small penalty associated with it.
These effects are more pronounced in basic block profiling because of the large number of profile blocks encountered, resulting in a non-negligible overhead even when profiling is completely disabled.

VIII. Conclusions

We believe that next generation processors could help dynamic environments achieve improved performance by catering to some of their special requirements. Given the importance of online profiling and instrumentation in such environments, this paper suggests a profile-aware microarchitecture that provides dedicated support for online instrumentation. We propose minor but significant extensions to the processor microarchitecture, in the form of instruction set support, a hardware profile flag, modified branch prediction, and extended decode logic for selective instruction dispatch and profile flag prediction, which combine to make a highly flexible, light-weight and tunable framework. Results of simulations for call-graph and basic block profiling show that accurate profiles can be collected at substantially lower overheads by sampling the execution of profile instructions using this framework. Average overheads for naive implementations of call-graph and basic block profiling can be brought down from 41.81% and 26.42% to 11.41% and 12.52% respectively by profiling for 25% of the execution window, while retaining a profile accuracy of over 96%.

References
[1] Thomas M. Conte, Burzin Patel, Kishore N. Menezes, and J. Stan Cox, "Hardware-Based Profiling: An Effective Technique for Profile-Driven Optimization," International Journal of Parallel Programming, 1996.
[2] Matthew C. Merten, Andrew R. Trick, Christopher N. George, John C. Gyllenhaal, and Wen-mei W. Hwu, "A Hardware-Driven Profiling Scheme for Identifying Program Hot Spots to Support Runtime Optimization," in Proceedings of the 26th Annual International Symposium on Computer Architecture, June 1999.
[3] Jeffrey Dean, James E. Hicks, Carl A. Waldspurger, William E. Weihl, and George Chrysos, "ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors," in Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, December 1997, pp. 292–302.
[4] Timothy Heil and James E. Smith, "Relational Profiling: Enabling Thread-Level Parallelism in Virtual Machines," in Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, December 2000.
[5] Craig Zilles and Gurindar Sohi, "A Programmable Co-processor for Profiling," in Proceedings of the 7th International Symposium on High Performance Computer Architecture, January 2001.
[6] Matthew Arnold and Barbara Ryder, "A Framework for Reducing the Cost of Instrumented Code," in Proceedings of the ACM SIGPLAN Conference on Programming Languages Design and Implementation, 20–22 June 2001, pp. 168–179.
[7] Thomas Ball and James R. Larus, "Optimally Profiling and Tracing Programs," ACM Transactions on Programming Languages and Systems, 16(4):1319–1360, July 1994.
[8] Thomas Ball and James R. Larus, "Efficient Path Profiling," in Proceedings of the 29th Annual International Symposium on Microarchitecture, 1996, pp. 46–57.
[9] Omri Traub, Stuart Schechter, and Michael D. Smith, "Ephemeral Instrumentation for Lightweight Program Profiling," Technical Report, Harvard University, 2000.
[10] B. Calder, P. Feller, and A. Eustace, "Value Profiling," Journal of Instruction Level Parallelism, March 1999.
[11] Michael D. Smith, "Overcoming the Challenges of Feedback-Directed Optimization," in Proceedings of the ACM SIGPLAN Workshop on Dynamic and Adaptive Compilation and Optimization (Dynamo 2000), January 2000.
[12] Kapil Vaswani, Y. N. Srikant, and T. Matthew Jacob, "Architectural Support for Online Program Instrumentation," Technical Report, Department of Computer Science and Automation, Indian Institute of Science, February 2003. (Available at http://wwwcompiler.csa.iisc.ernet.in/)
[13] Doug Burger, Todd M. Austin, and Steve Bennett, "Evaluating Future Microprocessors: The SimpleScalar Toolset," Technical Report CS-TR-96-1308, University of Wisconsin-Madison, July 1996.
[14] The Standard Performance Evaluation Corporation, "SPEC CPU2000 Benchmarks," http://www.spec.org/cpu2000/.
[15] James E. Smith and Gurindar Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, 83(12):1609–1624, December 1995.
[16] Glenn Ammons, Thomas Ball, and James R. Larus, "Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling," in Proceedings of the ACM SIGPLAN '97 Conference on Programming Languages Design and Implementation, 1997, pp. 85–96.
[17] K. Pettis and R. C. Hansen, "Profile Guided Code Positioning," in Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation (PLDI), 20–22 June 1990, pp. 16–27.
[18] Matthew Arnold, "Online Instrumentation and Feedback-Directed Optimization of Java," Technical Report DCS-TR-469, Department of Computer Science, Rutgers University.
[19] Matthew Arnold, Stephen Fink, Vivek Sarkar, and Peter F. Sweeney, "A Comparative Study of Static and Profile-Based Heuristics for Inlining," in Proceedings of the ACM SIGPLAN Workshop on Dynamic and Adaptive Compilation and Optimization, ACM Press, January 2000, pp. 52–64.
[20] Susan L. Graham, Steven Lucco, and Robert Wahbe, "Adaptable Binary Programs," in Usenix, 1995.
[21] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.