AUTOMATIC PROFILER-DRIVEN
PROBABILISTIC COMPILER OPTIMIZATION

by

Daniel Coore

Submitted to the
DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
in partial fulfillment of the requirements
for the degrees of
BACHELOR OF SCIENCE
and
MASTER OF SCIENCE
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June, 1994

© Daniel Coore, 1994. All rights reserved.

The author hereby grants to MIT permission to reproduce and to
distribute copies of this thesis document in whole or in part.
Signature of Author
Dept. of Electrical Engineering and Computer Science, May 15, 1994

Certified by
Thomas Knight Jr., Principal Research Scientist, Thesis Supervisor
Certified by
Dr. Peter Oden, Company Supervisor, IBM

Accepted by
F. R. Morgenthaler, Chairman, Departmental Committee on Graduate Students
Automatic Profiler-Driven Probabilistic Compiler Optimization
by
Daniel Coore
Submitted to the Department of Electrical Engineering and Computer Science
on May 15, 1994 in partial fulfillment of the
requirements for the Degrees of Bachelor of Science and Master of Science in
Computer Science
Abstract
We document the design and implementation of an extension to a compiler in the IBM XL family, to
automatically use profiling information to enhance its optimizations. The novelty of this
implementation is that the compiler and profiler are treated as two separate programs which communicate
through a programmable interface rather than as one integrated system. We also present a new
technique for improving the time to profile applications which resulted from the separate treatment
of the compiler and profiler. We discuss the design issues involved in building the interface, and the
advantages and disadvantages of this approach to building the feedback system.
As a demonstration of the viability of this implementation, branch prediction was implemented as
one of the optimizations using the feedback mechanism. The benefits of the implementation are
evident in the ease with which we are able to modify the criterion that the compiler uses to decide
which type of branch to penalize.
We also address some of the theoretical issues inherent in the process of feedback-based optimization
by presenting a framework for reasoning about the effectiveness of these transformations.
VI-A Company Supervisor: Dr. Peter Oden
Title: Manager, Compiler Optimization Group, Dept 552C
Thesis Supervisor: Thomas F. Knight Jr.
Title: Principal Research Scientist
Acknowledgements
Peter Oden, my supervisor, and Daniel Prener, a second supervisor of sorts, spent much
time helping me with many of the technical problems I encountered in the implementation;
without their patience and willingness to help, I would never have completed it.
Ravi Nair wrote the profiler and provided all the technical support for it that I needed
for this thesis. He also contributed a major part to the efficient node profiling algorithm
developed. Jonathan Brezin, as well as the rest of the RS6000 group, provided much support
during my time at IBM.
Equally important was the support I received, while back at MIT, from the AI Lab Switzerland group. They readily accepted me as one of the group, and helped me prepare my oral
presentation. Michael "Ziggy" Blair provided valuable insight into aspects of the subject
matter that I had not considered before. Tom Knight provided useful guidance in the
presentation of this thesis.
Throughout the development of this thesis, I have relied on many friends for support and
guidance. Radhika Nagpal helped me define the thesis when all my ideas were still vying for
my attention in the early stages. Bobby Desai provided precious time listening to my raw
ideas, and helping me refine them; he perhaps knew more about my thesis than anyone
else. Sumeet Sandhu helped me endure through the long arduous hours and helped me keep
my wits about me.
All my friends from home who were always interested to know the progress of my thesis
helped me stay on track; Tanya Higgins in particular has my wholehearted appreciation for
that. Last, but by no means least, I owe many thanks to my family, to whom I could turn
when all else failed, and who have been very supportive throughout the entire endeavour.
Finally, I offer a prayer of thanks to my God, the source of my inspiration.
Contents

1 Introduction And Overview
  1.1 Introduction
    1.1.1 Background
    1.1.2 Implementations of Feedback Based Compilation
    1.1.3 Motivations
    1.1.4 Objectives
    1.1.5 Features of This Implementation
  1.2 Overview of The Implementation
    1.2.1 Compiling With Feedback
    1.2.2 Unoptimized Code Generation
    1.2.3 Flowgraph Construction
    1.2.4 Profiling The Unoptimized Code
    1.2.5 Annotating The Flowgraph
    1.2.6 Optimized Code Generation
  1.3 Summary and Future Work
    1.3.1 Summary
  1.4 Bibliographic Notes

2 Design and Implementation
  2.1 Designing The Feedback Mechanism
    2.1.1 Necessary Properties for Feedback
    2.1.2 Types of Data to Collect
    2.1.3 When to Collect the Data
  2.2 The Implementation
    2.2.1 The Compiler
    2.2.2 Producing Code for the Profiler
    2.2.3 Reading Profiler Output
    2.2.4 Identifying Profiler References
    2.2.5 Interfacing with the Compiler
  2.3 The Profiler
    2.3.1 The xprof Profiler
  2.4 Bibliographic Notes

3 Efficient Profiling
  3.1 Introduction
    3.1.1 Description of the Problem
    3.1.2 Efficient Instrumentation
  3.2 Background To The Algorithm
  3.3 Description of The Algorithm
    3.3.1 Choosing the Edges to be Instrumented
    3.3.2 Instrumenting the Chosen Edges
    3.3.3 Computing the Results
  3.4 Correctness of the Algorithm
  3.5 Optimality of The Algorithm
    3.5.1 An Alternate Matrix Representation
    3.5.2 Bounds for the Number of Instrumented Edges
    3.5.3 Precisely Expressing the Required Number of Instrumented Edges
    3.5.4 Proof of Optimality
  3.6 Running Time of the Algorithm
  3.7 Matroids
  3.8 Summary of Results
  3.9 Discussion
  3.10 Bibliographic Notes

4 Optimizing with Feedback
  4.1 Introduction
  4.2 Trace Based Optimizations
  4.3 Non-Trace-Based Optimizations
    4.3.1 Using Execution Frequencies
    4.3.2 Using Other Types of Feedback Information
  4.4 Branch Prediction
    4.4.1 Implementing the Transformation
  4.5 Bibliographic Notes

5 A Theoretical Framework for Evaluating Feedback-Driven Optimizations
  5.1 Introduction
  5.2 The Generalized Framework
    5.2.1 An Input-Independent Assessment Function
    5.2.2 Quantitative Comparison of Transformations
  5.3 Cycle-Latency Based Assessment
    5.3.1 Parameter Specification
    5.3.2 Handling Infinite Loops
    5.3.3 Computing the Average Cost
  5.4 Graph Annotation
  5.5 Analysis of Branch Prediction
  5.6 Bibliographic Notes

6 Conclusions
  6.1 Evaluation of Goals
    6.1.1 The Design and Implementation
    6.1.2 The Optimizations
    6.1.3 Theoretical Analysis
  6.2 Future Work
    6.2.1 The Node Profiling Algorithm
    6.2.2 Optimizations
    6.2.3 Theoretical Issues

A The Specification File
  A.1 Profiler-Related Commands
  A.2 Parsing The Profiler Output
  A.3 Accumulation Switches
  A.4 Computing Annotation Values

B Invoking The Compiler

C Sample Run
List of Figures

3.1 Replacing an edge with an instrumenting node
3.2 An example of a control flowgraph
3.3 Redundant Node Information
4.1 Reordering Basic Blocks for Branch Prediction
4.2 Inserting Unconditional Branches to Reorder Instructions
4.3 Filling the Pipeline between Branch Instructions
5.1 The Branch Prediction Transformation
Chapter 1
Introduction And Overview

1.1 Introduction
Traditionally, the transformations performed by optimizing compilers have been guided
solely by heuristics. While this approach has had partial success in improving the running
times of programs, it is clear that heuristics alone are not sufficient to allow us to get as
close to optimal performance as we would like. What is needed is profiling to detect the hot
spots of the program so that transformations can focus on these areas [1].
1.1.1 Background
In the majority of compiler implementations, if any profiling was used, it was done manually,
as a separate process from compilation. The data from the profiling would then be used
(manually) either to modify the source code, or to provide hints to the compiler so that it
would do a better job. Ideally, we would like that feedback be provided automatically to
the compiler and for it to be fully dynamic. The compiler should be able to compile code,
profile the execution of the output, and then attempt to recompile based on the additional
information from profiling. The whole process would probably have to be iterated
several times until a satisfactory proximity to optimality was achieved; even this state
would not be a final one, since the application would be recompiled as its usage pattern
changed.
The idea of automatically providing feedback to the compiler to improve the effects of its
optimization phase has been attempted, but research in the area has been scant. Most
of the past work has emerged in connection with research on VLIW machines; however,
with the decline in popularity of that architecture, particularly in industry, research in
feedback based compiler optimizations has flagged commensurately.
Probably the earliest works involving feedback based compilation were done by Fisher [11].
He described Trace Scheduling as a technique, relying on information from a profiler, to
determine paths through a program which were most likely to be executed sequentially.
Ellis describes in detail [9] the implementation and application of Trace Scheduling to
compilation techniques. Both Ellis' and Fisher's works were produced through investigative
efforts in VLIW architecture.
More recently, there have been other efforts at using feedback from profilers to direct compiler optimizations. The bulk of this effort has come from the University of Illinois where
an optimizing compiler using profiler information has already been implemented and is currently being maintained [4]. Most of the background research for this thesis was centered
around the research results of that compiler.
1.1.2 Implementations of Feedback Based Compilation
One of the important problems with feedback based optimization is that it is inherently expensive both in time and space. First, the need to profile when tuning is desired means that
compilation time will now be dependent upon the running time of the programs compiled¹.
The space problems arise because of the need to constantly recompile applications. If a
fully dynamic system is to be realized, the compiler must be readily accessible, probably
lying dormant within the runtime system, but ready to be invoked whenever it is deemed
necessary to recompile an application. In a sense, the compilation process is distributed
over much longer periods which means that data structures have to be maintained between
optimization attempts, thereby increasing the space required for the runtime system and
the whole compilation process in general. Under this ideal view, compilation is an ongoing process and continues for as long as the application is being used. While there are no
implementations of such a system, there are many of the opinion that it is an attainable
ideal [8], and in fact, one that represents the next generation of compiler and runtime-system
interaction [3].
In all the implementations documented, the system implemented was a compromise between
the way compilation is currently done and the ideal described above. Typically, there
was a database per application which was updated every time the application was profiled.
The compiler was called manually whenever there was a chance for improvement in the
application's performance. The profiler was called automatically from the compiler, and
the recompilation using the additional profiled data was also done automatically.
Given the computing power available today, this implementation is a reasonable compromise
since our present technology lacks the ability of self-evaluation that the compiler would need
to be able to automatically determine when recompilation was advisable. By automating
the compiler-profiler interaction, much of the tedium is expedited automatically. At the
same time, maintaining manual control over the frequency of compilation ensures that the
runtime system is not spending too much time trying to make programs better without
actually using them.
1.1.3 Motivations
The idea of using feedback to a compiler to improve its optimization strategy has never been
very popular in industrial compilers and has remained mostly within the academic research
domain. The IBM XL family of compilers is not an exception, and we would like to explore
the costs and benefits of implementing such a feedback system within the framework already
present for these compilers.

¹ Profilers usually limit the amount of data collected, so there will be an upper bound on the running
time, though it may be quite high.
In light of the rapidly increasing size and scope of application programs and the constant demand for efficiency, coupled with ever-increasing processor speeds that mollify concerns about
the time investment required, we feel that it is worthwhile to research this approach to
compiler optimization. Also, from an academic standpoint, the large space of unexplored
problems in this area promises to provide exciting research with highly applicable results.
1.1.4 Objectives
This thesis will describe the design and implementation of such a feedback based compiler
system, with a specific focus on the RS6000 superscalar architecture. We hope also to
address some theoretical issues inherent in the process which are still largely unexplored.
The primary focus will be on the design issues involved in the implementation of the feedback mechanism within the XL family of compilers. The secondary goal is a two-pronged
endeavour: a theoretical approach to reasoning about the effectiveness of feedback-based
optimizations, and an exploration of the ways in which feedback information can be applied.
1.1.5 Features of This Implementation
One important difference between the goals of this thesis and the implementations aforementioned is that the compiler and the profiler are considered as separate tools, which can
be enhanced separately as the need arises. In the past implementations, the profiler was
incorporated into the compiler and the two functioned as a single program. While there are
certain advantages to this type of implementation, such as greater efficiency, the compiler
is restricted to receiving only one type of feedback information: whatever the embedded
profiler happens to return. Having the option to easily change the profiling tool allows
greater flexibility in the type of information fed back, and can, for example, be invaluable
in investigating how changes in program structure affect various run-time observables.
Another difference in this implementation is that this work was an extension of an existing
compiler to allow it to take advantage of the additional information (from profiling). This
fact meant that there were several imposed constraints that restricted the solutions to some
problems that would otherwise have been solved by the past implementations. For example,
the order in which optimizations were done was critical in the compiler, and this restricted
the points in the optimization phase at which the feedback data could be used.
Finally, we would like to point out that this implementation does not attempt to achieve
the ideal setting of a dynamic compiler constantly recompiling applications as the need
arises. It is still invoked manually, as the user sees the need for improvement in the code,
but it does perform all the profiler interaction automatically. While this approach is not
ideal, we feel that the results that this research yields will provide important groundwork
for implementing the ideal environment described earlier.
Theoretical Work
Although much work has been done to show the experimental success of feedback based
optimizations, those experiments were performed under very controlled environments. Typically the results of those experiments, and their ramifications, were specific to the particular
programs compiled. These results were often used to cite the benefits of the particular optimization under investigation, but we would like to be able to show with more confidence
that the optimization is valuable in general (i.e. "on average") rather than merely for the
programs that happened to be in the test suite. Although attempts have been made to
theoretically model the behaviour of feedback-based optimizations, no models have been
able to predict their stability nor even evaluate the suitability of an input as a predicting
input² to the application.
While the design and analysis of such a model in its full specification is an ambitious task,
one of the goals of this thesis was to address the issues involved with that task. Although the
model developed is incomplete in that its accuracy has not been measured, we present some
of the necessary features of any such model and put forward arguments for why the model is
reasonable. We have presented the model, despite its incompleteness, in the hope that it will
provide essential groundwork for furthering our understanding of the subject matter. Any
progress in this direction will help to solidify cost-benefit analyses and will certainly further
our understanding of program-input interaction and the relationships between program
structure and program performance.
1.2 Overview of The Implementation
The feedback system was designed to allow the most naive of users to be able to use it
without having to worry about issues that were not present before the feedback mechanism
was introduced. At the same time we wanted to make it powerful enough to allow much
flexibility in the manipulation of the statistical data without exposing more details of the
compiler than were necessary. The few aspects of the feedback mechanism which require
user interaction are hidden from the user who is not interested in them, through the use of
suitable defaults.
In this manner, the user who simply wants the benefit of feedback in optimization need not
worry about the type of profiler being used or any of the other issues that are not directly
related to the application itself. At the same time, another user who wanted to see the
effects on a particular optimization when decisions are based on different types of statistics
(which may need different profilers to provide them) could specify this in a relatively painless
manner without having to know anything about the internals of the compiler. Note, however,
that the second user would have to know more about the system than the first, in order to
be able to change the defaults as appropriate. The general idea is to provide suitable layers
of abstraction in the system that expose only as much detail as is desired by users without
compromising the benefits of feedback based compilation.
² A predicting input is one which is used with the application at profile time to provide statistics that
may guide the optimizations later.
1.2.1 Compiling With Feedback
There are two phases to compilation. The high-level decomposition of the whole process is
as follows:
1. Phase 1
(a) Generate unoptimized executable code from source
(b) Construct the corresponding control flowgraph
2. Phase 2
(a) Profile the unoptimized code from phase 1
(b) Annotate the control flowgraph with the statistics
(c) Generate optimized code based on both feedback information and on traditional
heuristics
All this is abstracted away from the user through a high level program which manages the
processing of the two phases without input from the user. This high level program is the
vehicle through which the user interface is presented since this is the only program that the
user needs to be aware of to use the feedback mechanism.
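As a rough sketch of this control flow (and only a sketch), the driver might look like the following; the helper names are hypothetical placeholders passed in as parameters here so the example is self-contained, not the actual commands of the implementation.

    # Sketch of the two-phase driver described above (hypothetical helpers).
    def compile_with_feedback(source_file, input_set,
                              compile_unoptimized, run_profiler,
                              annotate_flowgraphs, compile_optimized):
        # Phase 1: generate unoptimized code and the control flowgraphs.
        executable, flowgraphs = compile_unoptimized(source_file)

        # Phase 2: profile once per predicting input, accumulate the
        # per-tag basic block counts, annotate the flowgraphs, reoptimize.
        counts = {}
        for args in input_set:
            for tag, freq in run_profiler(executable, args):
                counts[tag] = counts.get(tag, 0) + freq

        annotate_flowgraphs(flowgraphs, counts)
        return compile_optimized(source_file, flowgraphs)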
1.2.2 Unoptimized Code Generation
The first phase is mainly concerned with generating executable code in a manner that
preserves some kind of context about the instructions so that when the profiled data is
returned, the compiler knows to which instructions the statistics pertain. Since the compiler
is called twice, it must be able, in the second phase, to reproduce the flowgraph generated
in the first phase, if it is to perform different operations on the output code in both phases
with any level of consistency between the phases.
To achieve that consistency, the code is largely unoptimized so that the second phase will
generate the same flowgraph as the first phase does. Were the code optimized within the
first phase, then the profiler, which looks at the object code produced, would provide data
on instructions that the compiler would have difficulty tracking the source of. This difficulty
arises because of the extensive transformations that occur when optimization is turned on
within the compiler. Therefore, when the profiler returned a frequency count for a particular
basic block, the compiler would not necessarily be able to figure out the connections between
the flowgraph at the early stages of compilation and that basic block. The alternative would
have been to have the compiler fully optimize in both passes, so that the flowgraphs would
be consistent between phases, but unfortunately the implementation of the compiler does
not allow certain optimizations to be repeated. Therefore, if the code were fully optimized
before the statistics from profiling could be obtained, the code would not be amenable to
several of the possible transformations that take advantage of the feedback data.
Actually, the requirement for producing a consistent mapping from profiled data to data
structure is less stringent than pointed out above. We really only need to capture the feedback data at some point in the optimization phase, and then introduce that data at the same
point in the second phase (see Chapter 2). The reason for this particular implementation
strategy is that the compiler would attempt to do all the optimizations it knew about if
optimizations were turned on. Only if they had been disabled from the command line
would they be ignored. This user-directed choice of optimization, while a good idea
in general, caused a problem in identifying a point in the optimization process that could
be repeatedly reached with the same state on every pass. The problem is that once a point
in the process has been chosen, we want to turn off all the optimizations that come after
that point, and go straight to the next phase of compilation (register allocation in this
implementation). However, a simple jump to the next stage is not good enough because there
were some processes that had to always be done, but which followed optimizations that
were conditionally done. Therefore, the solution would have been to find all the essential
subroutines that had to be executed and then ensure that, no matter which point was chosen to capture the feedback data, all those subroutines would get executed. It was difficult,
though not impossible, to separate these subroutines from the conditional optimizations,
and so we opted for the extreme choices of all or no optimizations.
It turned out that there were a few optimizations that could be performed early and were
repeatable and so could be done in the first phase. Ideally we would have liked to optimize
the code as much as possible, while observing all the constraints imposed. The executable
generated from the first phase has to be profiled on several inputs, which means that it
must be executed for each of those inputs. The profiling process is the bottleneck in the
whole compilation process, and so it is highly advantageous to have this code run as fast as
possible.
Perhaps the biggest drawback to having the profiler and compiler separated is that there is
no a priori means of communication between the two tools. Specifically, there is no way for
the compiler to associate the frequency data returned by the profiler with the instructions
that it generates. Typically the profiler returns the addresses of the instructions along with
the frequencies of execution, but this is useless to the compiler since the addresses were
assigned at bind (link) time which happens after compilation is done. The solution used
by this implementation was to use the space in the object format for line numbers to store
unique tags per basic block. The profiler could return the line number (actually what it
thought was the line number) with the instruction frequencies and the compiler would use
that "line number" to decipher which instruction was being referenced and hence figure out
its execution frequency. This solution has the advantage that many profilers will already
have the ability to return line numbers along with instructions, so compatibility with the
compiler remains easy to achieve. The disadvantage is that the association set up with this
scheme is rather fragile, and so does not make keeping databases of profiled runs feasible
since minor changes in the source will cause a complete breakdown of the tag to basic block
associations.
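The association can be pictured with a small sketch; the data structures here are hypothetical stand-ins for the compiler's own, and only the idea matters: the compiler remembers which tag it placed in each block's line-number field and later resolves the profiler's records against that table.

    # Illustrative sketch of resolving profiler output against the tags
    # stored in the line-number fields (hypothetical data structures).
    def build_tag_table(basic_blocks):
        # basic_blocks: ordered list of block identifiers for one file.
        # Give each basic block a tag that is unique within the file.
        return {tag: block for tag, block in enumerate(basic_blocks, start=1)}

    def resolve_profile(tag_table, profiler_records):
        # Each record carries what the profiler believes is a line number;
        # it is really the tag the compiler stored in that field.
        frequencies = {}
        for reported_line, count in profiler_records:
            block = tag_table.get(reported_line)
            if block is not None:
                frequencies[block] = frequencies.get(block, 0) + count
        return frequencies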
1.2.3 Flowgraph Construction
Compilation is done on a per file basis which is further broken down on a per procedure
basis. A flowgraph is built for each procedure in the file. Originally, only one flowgraph was
constructed at a time, and was processed all the way to code generation before processing
the next procedure. In the modified implementation, we constructed all the flowgraphs for
the procedures at once, so that when the statistical data was read in the second phase,
only one pass of the data file would be necessary. This approach is more efficient in terms
of time since the interaction with the file system is reduced to a minimum; however, there
is a corresponding space tradeoff since all the flowgraphs have to be contained in memory
at once. In practice, this memory constraint does not prove to be a problem because files
are typically written to contain only one major procedure. It is extremely rare that files
contain several large procedures.
One of the types of information that is used by most transformations is the frequencies of
flow control path traversals (edge frequencies). However, this information can be expensive
to obtain if the profiler does not have any more information than the program and inputs
it is given to profile. In previous implementations where the compiler and profiler are
integrated, the profiler can analyze the flowgraph and then decide how to optimally profile
it to obtain all the desired information. This feature is lost in this implementation since
the profiler and compiler are separated; however, the edge frequency information is still
as important. The solution was to require that the profiler be able to produce frequency
counts for basic blocks - the nodes of the control flowgraph - and place no constraints
on its ability to produce edge information. This requirement also had the side effect of
making compatibility easier. In order to obtain the edge frequencies solely from the node
frequencies, the compiler had to modify the flowgraphs so that there would be enough node
information returned by the profiler to solve for the edge frequencies. The modification was
done according to a newly developed algorithm which is based on the fact that control flow
obeys Kirchhoff's current law at the nodes of the flowgraph. The algorithm modifies the
flowgraph optimally in the sense that the minimal number of edges are instrumented, and
it works on arbitrary topologies of flowgraphs.
Since in this stage, no data is available to actually solve for the edges, each edge is annotated
with a symbolic expression which represents a solution in terms of the node frequencies and
possibly other edge frequencies, which are all determined immediately after profiling. After
the profiling is done, the actual numbers are substituted into the expressions which are
then evaluated to yield the frequencies for the edges. The details of the implementation
and the algorithm, including arguments for its correctness and optimality, are given in Chapter 3; however, the details therein are not necessary for appreciating the practicality of the
algorithm.
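As a concrete, much-simplified illustration of the idea, the sketch below propagates flow conservation numerically: whenever all but one of a node's incoming or outgoing edge counts are known, the remaining count is forced to be the node's own count minus the sum of the known ones. The actual algorithm of Chapter 3 records these deductions as symbolic expressions before any profiling happens and also chooses a minimal set of edges to instrument; the function below simply assumes the instrumented edge counts are already known.

    # Simplified sketch: derive edge frequencies from node frequencies by
    # Kirchhoff's current law.  All dictionaries are hypothetical inputs:
    #   node_freq {node: count}, in_edges/out_edges {node: [edge, ...]},
    #   edge_freq {edge: count} holding the instrumented edges' counts.
    def solve_edge_frequencies(node_freq, in_edges, out_edges, edge_freq):
        changed = True
        while changed:
            changed = False
            for node, count in node_freq.items():
                for edges in (in_edges[node], out_edges[node]):
                    unknown = [e for e in edges if e not in edge_freq]
                    if len(unknown) == 1:
                        # Flow into a node equals flow out of it equals the
                        # node's own count, so the single unknown is forced.
                        known = sum(edge_freq[e] for e in edges if e in edge_freq)
                        edge_freq[unknown[0]] = count - known
                        changed = True
        return edge_freq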
1.2.4 Profiling The Unoptimized Code
The interaction between the compiler and profiler is summarized in a specification file.
This file contains all the templates for reading the output of the profiler, and also all the
instructions for invoking the profiler. This file is actually provided by the user, but by
producing default ones for popular profilers, casual users need not worry about its
details.
The other requirement for profiling the code is the data set on which the code should be
executed while it is being profiled. This is provided through yet another file which contains
all the inputs to the application that the user would like the code to be profiled with. Each
line of this input file contains exactly the sequence of characters that would come after the
invocation of the program under manual control. Although all users must specify this file
and create it themselves, it is a relatively painless process, since the data that is put into
this file is no more than the user would have had to know about in any event. The data
presented within this file is called the predicting input set since these are the inputs which
ultimately get used as a basis for assigning probabilities to each branch of the flowgraph.
Thus they "predict" which paths through the flowgraph are likely.
1.2.5 Annotating The Flowgraph
The profiler produces one output file per profiled run, and one run is done per input in the
input file. Therefore, at the end of the profiling stage, there are as many profiler output
files as there are inputs in the input set. The frequencies for the basic blocks, produced
in each file, are accumulated and then associated with the appropriate tag for the basic
block (the tag was produced within the file along with the basic block it was associated
with). These accumulated values are then used to annotate the appropriate node of the
appropriate flowgraph - as dictated by the tag. The final step in the annotation process
is to substitute the actual accumulated values into the expressions computed for the edge
frequencies in the flowgraph construction stage to yield the values for the accumulated edge
frequencies of the flowgraph.
1.2.6 Optimized Code Generation
Although there are several different optimizations that are possible with feedback information available (see Chapter 4), we had enough time to implement only one of them - namely
branch prediction. Even so, it was only implemented for handling the limited case of an
occurrence of an if-then-else construct. This optimization was heavily machine dependent
and will only be beneficial to machines with uneven penalties for conditional branch paths.
This optimization was apt for the RS6000 which has a branch penalty of up to three cycles
for conditional branches taken and zero cycles for conditional branches not taken; unconditional
branches have zero penalty provided there are at least three instructions before the
next branch at the unconditional branch target address.
The compiler assigned probabilities to each edge based on the frequencies measured from
the flowgraph annotation stage. Each edge was also assigned a cost dependent upon the
inherent cost of traversing that edge and upon the number of instructions within the basic
block to which the edge led. From these two quantities, a weighted cost was computed and
was compared with that of the other edge leaving the same node. If the edge with the higher
weighted cost was sufficiently high (that is, above the other by some fixed threshold), then
the compiler ensured that that branch corresponded to the false condition of the predicate of
the conditional branch instruction. In this manner, edges with larger latencies and smaller
probabilities are penalized more than those with smaller latencies and larger probabilities.
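The decision rule itself can be sketched as follows. The particular weighting shown (probability times an estimated path latency) is an assumption made for illustration, since the exact formula is not spelled out above; the comparison against a fixed threshold, however, follows the description.

    # Sketch of the branch-sense decision for an if-then-else.  Each edge
    # is a dict with 'prob' (measured probability), 'latency' (inherent
    # cost of traversing the edge) and 'block_len' (instructions in the
    # target block); the weighting below is an illustrative assumption.
    def choose_fallthrough(edge_a, edge_b, threshold=0.1):
        def weighted_cost(edge):
            return edge['prob'] * (edge['latency'] + edge['block_len'])

        cost_a, cost_b = weighted_cost(edge_a), weighted_cost(edge_b)
        # The edge whose weighted cost is sufficiently higher is made the
        # false condition of the predicate, i.e. the not-taken path that
        # carries no branch penalty on the RS6000.
        if cost_a > cost_b + threshold:
            return edge_a
        if cost_b > cost_a + threshold:
            return edge_b
        return None  # too close to call; leave the branch as generated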
1.3 Summary and Future Work
In our paradigm for the user interface, we wanted to go even further and provide a means
of abstractly specifying optimizations in terms of the transformations they would represent
on the control flowgraph.
This would have been achieved by developing a language for
manipulating the internal data structures of the compiler that could simultaneously refer to
data being produced by the profiler. This would have allowed yet another level of user, namely the compiler researcher, a convenient abstraction to the compiler. Already there
are two extremes of users who are accommodated: the naive user who is not concerned with
the details of the profiler being used, and the performance-driven user who is constantly
attempting to find the most suitable type of statistics for yielding the best improvement
in code performance. The third type of user would have been the user who not only was
interested in the type of statistics being used, but also with what was being done with the
statistics.
This extension to the compiler interface is, however, a major undertaking, and could not be
accomplished within the time frame set out for this thesis. An implementation of such an
interface would prove to be extremely useful since it would not be necessary for someone to
understand the internals of the compiler before being able to specify an optimization.
Numerous other optimizations could have been implemented as they have been in the other
implementations of feedback-based systems. For example, the entire family of trace based
optimizations would make excellent use of the feedback information and would have probably yielded favourable performance enhancements. These optimizations, although highly
applicable, were not focussed upon since their effects have been widely investigated with
positive results. Other optimizations which could not be completed in time, but which
would be interesting to investigate include applications of feedback to code specialization
and determining dynamic leaf procedures³.
1.3.1 Summary
Although rudimentary in its implementation, and lacking many of the interesting uses of
feedback data, the feedback based compiler has provided useful information about dealing
with feedback systems and has raised many issues to be further researched. This thesis
established a modular implementation of a compiler with a feedback mechanism to guide
its optimizations. The knowledge obtained from tackling the design problems that arose
in the implementation will provide useful guidelines to future implementations. One of the
useful results is that the expense of profiling arbitrary programs can be reduced by reducing
the amount of instrumentation that would normally be done. This reduced cost of profiling
enhanced the viability of a modular implementation since, under normal circumstances, the
profiler would not have the information that is available in integrated implementations and would
therefore be expensive to run in the general case. We have also developed a framework for
assessing the effect of optimizations on programs using feedback information.
1.4 Bibliographic Notes
Aho, Sethi and Ullman state that we can get the most payoff for the least effort by using
profiling information to identify the most frequently executed parts of a program [1]. They
advocate using a profiler to guide optimizations, although they do not make any mention
of automating the process.

³ A leaf procedure is one that does not call another procedure; it is a leaf in the call graph for the program.
At MIT, members of the Transit Project group and Blair [8, 3] independently advocate the
adaptive dynamic feedback-based compiler as the archetype compiler of the future.
In the Coordinated Science Laboratory at the University of Illinois, researchers have implemented IMPACT [4], a feedback-based optimizing compiler that uses the feedback to implement several different optimizations, ranging from trace scheduling with classic optimizations
to function inlining.
Fisher first came up with trace scheduling in 1981 [10] as a means of reducing the length
of instructions given to a VLIW machine. Ellis later describes in detail how this is done in
his PhD thesis under Fisher [9].
Chapter 2
Design and Implementation
This chapter describes the details of the implementation and discusses the design issues
that were involved. The first section describes in detail the general requirements of the
compiler, profiler and the interface for them to cooperate in a mechanism for providing
feedback to the compiler. The second section describes the possible implementations for
the general requirements outlined in the first section, the actual implementation that was
used and the reasons for choosing it. In many cases, a problem was solved in a particular
way because it was the best solution in terms of accommodating the constraints imposed by
the conventions of the existing tools and environment. In those cases, we try to point out
what might have been ideal and any major disadvantages of the chosen implementation.
As mentioned before, we chose to implement the profiler and the compiler as two separate
tools for two main reasons. The first is that the system is more modular in this way, because
now the compiler and the profiler can be upgraded individually so long as the protocol with
their interface is not violated; therefore, maintenance is simplified. The second reason is that
the compiler can be easily configured to optimize according to different types of statistics.
For example, cache miss data per basic block or per instruction would provide guidance
to the compiler that may help in laying out the code to improve cache behaviour. The
flexibility offered by the modularity provides a powerful research tool which would be much
more difficult to implement and manipulate with an integrated compiler-profiler system.
2.1 Designing The Feedback Mechanism
The first step in the design of the interface was to identify the necessary properties of the
compiler and the profiler for them to be able to interact in a meaningful manner. The
next step was to decide how the dynamic data (from the profiler) should be captured, what
should be collected and how it should be stored. In other words, the first consideration was
a question of how the data should be obtained and the second was a matter of what the
collected data could be used for.
2.1.1 Necessary Properties for Feedback
The compiler produces executable code as input to the profiler, and it also reads in the
output of the profiler as part of its own input. Therefore, the compiler must be able to
produce an executable suitable for the profiler and also be able to parse the output of the
profiler. Suiting the needs of the profiler may be done in several ways according to the
profiler being used. Parsing the output of the profiler is again highly dependent on the
profiler being used. Even beyond these problems of protocol specification, the compiler
needs to be able to identify the instructions being referred to by the profiler when it returns
its frequency counts. Summarizing, there are three important properties that the feedback-based compiler must have:
1. It must be able to produce code suitable for the profiler.
2. It must understand the format of the profiler output.
3. The compiler must understand how the profiler refers to the instructions that are
presented to it for profiling.
To elaborate the point, since the compiler and the profiler are two separate programs,
they do not share any internal data structures. Therefore, the compiler's view of the code
generated is very different from the profiler's view of the code executed. In particular, the
compiler never sees the absolute addresses assigned to the instructions, whereas that is all
the profiler has to refer to the instructions by. Furthermore, the data actually desired is
basic block counts, so the compiler has the additional problem of identifying the basic block
to which a particular instruction in the final output belonged.
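Viewed abstractly, the three properties amount to an interface of roughly the following shape. The class and method names are invented purely for this sketch; in the implementation the corresponding information is supplied through the specification file rather than through code.

    # Hypothetical compiler-side view of a profiler, mirroring the three
    # properties listed above.
    from abc import ABC, abstractmethod

    class ProfilerAdapter(ABC):
        @abstractmethod
        def prepare_executable(self, object_files):
            """Produce an executable suitable for the profiler (property 1)."""

        @abstractmethod
        def parse_output(self, output_text):
            """Extract (reference, frequency) pairs from the profiler's
            output format (property 2)."""

        @abstractmethod
        def reference_to_block(self, reference):
            """Map the profiler's name for an instruction back to the
            compiler's basic block (property 3)."""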
We mention that we desire basic block counts although profilers can usually return frequency
data on many granularities of code ranging from machine operations to basic blocks. Basic
block information is preferred because that granularity is coarse enough to provide a good
summary of control flow from the data, and fine enough to allow information on finer
granularities to be obtained. For example, knowing the frequency of execution of all the
basic blocks, along with the instructions that are in the blocks, is sufficient to deduce the
instruction distribution. Each instruction within a basic block is executed as many times
as the basic block itself is, so by summing the frequencies for each basic block in which
a particular type of instruction appears, we can obtain the total number of times that
instruction type was executed.
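The deduction just described is simple enough to state as a short sketch (hypothetical data structures): each instruction contributes its block's frequency once for every occurrence in the block.

    # Sketch: dynamic instruction-type distribution from basic block counts.
    #   blocks:     {block_name: [opcode, opcode, ...]}
    #   block_freq: {block_name: measured execution count}
    def instruction_distribution(blocks, block_freq):
        totals = {}
        for name, opcodes in blocks.items():
            freq = block_freq.get(name, 0)
            for op in opcodes:
                # Each instruction runs once per execution of its block.
                totals[op] = totals.get(op, 0) + freq
        return totals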
2.1.2 Types of Data to Collect
Deciding what kind of data to store is largely determined by the features of the profiler.
In most cases, the profiler will give only various types of frequency summaries for some
granularity of code - basic block, instruction, or operation for example. For these, deciding
what to store is not much of a problem since we can typically store all the frequencies.
However, if the profiler has the ability to return many possible types of summaries it may
be too expensive to store all the possibilities. For example, one could imagine a profiler that
had the ability to return information about the values of variables within the basic block,
so that at the end of the profiled run, it would also return a distribution of the range of
values assumed by the variables within a block. This type of information would be rather
voluminous to store, especially if it were done for several variables across several basic blocks
and for an especially long profiled run. In cases like these, decisions have to be made (by
the designer) about what kinds of data are relevant for the uses he has in mind.
Therefore, deciding what the feedback data will be used for is highly dependent on the kind
of data that the profiler can return, and conversely, deciding what kinds of data to store - in the cases where we have a choice - is largely dependent on the proposed uses of the data.
This interdependence seems to suggest that the data stored should, in some way, be linked
to a semantic model of the program, since the semantics will be the indicators of what kinds
of optimizations can be done in the later phases of compilation. Also, the semantic model
used should be one that the compiler is already familiar with (or will be eventually) since it
represents what the compiler "knows" about the program. By annotating this model with
the feedback data, the compiler will - in some sense - be able to ascribe meaning to the
data.
Having data that does not directly relate to any of the compiler's semantic models will
probably be difficult to use in the optimization phase, since those statistics will be hard to
describe in terms of the model that the optimization is implemented in. For example, if the
semantic model for a particular optimization was the control flowgraph, and the data that
the profiler returned included the frequencies of procedures called, then that data would
not be as effective as if the semantic representation had been a call graph instead. In that
example, if there was no consideration ever given to the call graph in the optimization
phase of compilation, then it was probably not worth the space or the time to collect that
particular type of data.
We see that there is an interdependence between the data that is available in the first place,
the data that is stored and the possible uses of it. In some sense, the semantic models that
are available determine what kinds of data it makes sense to store, and also what might
be done with the data by the optimizer. Of course, the programmer may decide that he
wants to use some particular type of data for a particular optimization, but if there is no
semantic model to support the data directly, then he must implement a model which can
take advantage of the data he is interested in using. Not surprisingly, both the kind of data
being returned and the intended uses for the data will affect when and how, during the
compilation process, the data is captured.
2.1.3 When to Collect the Data
When the data ought to be captured is determined by properties of the compiler and by
the intended uses for that data. How the feedback data is captured also depends on the
compiler. In general, code generation has to occur at least twice: once to give the profiler
code to profile, and a second time to use the feedback data with the optimizer. How these
two passes of the code generator are done is subject to two general constraints:
1. The code generated on the first pass must be done using only the state of the semantic
models at the time of data capture.
2. After the point of data capture, all of the semantic models for which data will be
collected must either be exactly reproducible if necessary or remain intact after the
first pass of code was generated.
The first constraint guarantees that the code to be optimized using feedback is the same
as what was profiled. The second ensures that the data retrieved is associated with the
semantic models in the appropriate way since any assumptions about the code in the first
pass of code generation will not be violated afterwards. When the profiler results are
returned, they ought to be used to annotate the semantic models present after that data
has been retrieved. In the case when this pass is a completely different run from the first
pass of the code generator, the second constraint ensures that the semantic model that
is annotated with the data will be generated to have the same structure as the semantic
data structure that produced the code that was profiled. The first condition guarantees us
that that data structure is in fact the one that corresponds to the code that was produced
in the first pass and profiled. In the case when the semantic models are preserved across
the first pass of code generation, the first constraint again ensures that the data returned
corresponds with the models.
Since the code produced will be profiled for each of the inputs given in the input set, it
should be optimized as much as possible. In this way, the overall profiling time will decrease.
There is a tradeoff involved, though, that should be noted. There are many optimizations
that we would like to use in conjunction with feedback information such as loop unrolling,
and common subexpression elimination [1] with large traces obtained from application of the
feedback data. Therefore, it would seem reasonable to collect the feedback data before these
optimizations took place. However, we cannot use them in the code that will be profiled
unless we are willing to repeat them. While repeating those optimizations might not be a
bad idea in general, there is the question of whether the optimizations will adversely affect
each other when they are used in the second pass of code generation.
The tradeoff then, is whether to try to produce code that can be profiled quickly, or to
sacrifice time in the profiling stage of the compilation process, in the hope that the resulting
optimized code will be much faster than otherwise. The two options indicate collecting the
feedback data at two different possible extremes: either very early when there is no first
pass optimization and therefore greater opportunity for application of feedback data or very
late, when the output code will be very fast, but the feedback data will not be used very
much to improve the code generated. It seems that duplicating the optimizations so that
we get both benefits is the most favourable option. Failing that however, choosing some
suitable compromise in optimization of the profiled code seems to be the correct action to
take. A judicious choice can be made by using some of the optimizations that do not stand
to gain much from feedback information, such as constant folding and propagation on the
code in the first pass of code generation.
2.2 The Implementation
This section discusses the possible implementations given the design requirements discussed
in the first section, and gives justification for the implementation chosen. In many instances,
the chosen implementation was dictated by the constraints of the already existing system
rather than any deliberate decision. For these, we discuss how the implementation could
possibly be improved if these constraints were relaxed.
2.2.1 The Compiler
The compiler used was IBM's production compiler called TOBEY (Toronto Optimizing Back End with Yorktown). It is written in PL.8, a
subset of PL/1, which was developed several years ago explicitly for implementing an optimizing compiler for the then new RISC architecture. The TOBEY compiler has had several
optimizations implemented in it throughout its lifetime, resulting in a highly optimizing
compiler; however, it has also grown to immense proportions, with many esoteric intricacies
that often proved to be obstacles to successful code manipulation.
The compiler's operations are broken down into five stages. The first stage includes all the
lexing and parsing necessary to build the data structures that are used in the subsequent
stages. The second stage generates code in an intermediate language called XIL, a register
transfer language that has been highly attuned to the RS6000 architecture. The third stage
performs optimizations on the XIL produced from the second stage and also produces XIL.
Register allocation is done in the fourth stage and further optimizations are then performed
on the resulting code. The fifth and final stage generates assembly code from the XIL
produced in the fourth stage.
There are essentially two levels of optimizations possible within the compiler; the actual
level to be used for a particular compilation is given as a compile-time parameter passed
with the call to the compiler. At the lower level of optimization, only commoning is done
and only within basic blocks. The higher level of optimization implements all the classic
code optimizations such as common subexpression elimination and loop unrolling as well
as a few which were developed at IBM such as reassociation [5]. In this implementation,
there is actually a third level, not much higher than the lower one, which performs minor
optimizations within basic blocks such as value numbering and dead code elimination. This
was the level at which code was generated for profiling since all the optimizations at this
level satisfied all the constraints mentioned earlier.
One unfortunate characteristic of the optimizations done within the TOBEY compiler is
that the operation of some optimizations is not independent of others. Therefore, the XIL
produced by some optimizations is not always suitable as input to those same optimizations
and to others. The reason this is unfortunate is that the order in which optimizations are
done is very important and requires detailed knowledge of the interdependencies to be safely
modified. Two main problems arose in the implementation of the feedback mechanism
because of this: modifications to the code to accommodate the data structures for the
mechanism were often difficult or cumbersome, and some optimizations that stood to benefit
from feedback information could not be done on the code that was to be profiled.
2.2.2 Producing Code for the Profiler
In this implementation, the code was generated twice in two separate runs of the whole
compiler. The profiler was given only the code from the first pass, and this code had minimal
optimizations performed on it. On the second pass, all the data structures constructed
for code generation in the first pass were duplicated exactly, just as was discussed in the
constraints for data collection and code generation. Since the compiler is deterministic, the
code structure produced is exactly the same as that of the first pass, and hence the same
as that given to the profiler. Therefore, any reference to the data structures for the code
(say a control flow dependency graph) of the first pass is valid and has the same meaning
as the equivalent reference in the second pass.
We could not attempt the preferable option of duplicating the optimizations used between
the two code generation passes because of the incompatibility of some optimizations with
others as mentioned above. Within the TOBEY compiler, there are dependencies between
the optimizations themselves, so that some cannot be performed after others, and some
cannot be repeated on the same XIL code. These constraints prevented the first pass of
code from being optimized too highly.
2.2.3 Reading Profiler Output
As mentioned in the earlier section, the second requirement of the compiler is that it needs
to know the format of the output of the profiler so that it can automatically extract the
appropriate statistics from the profiler's output. This requirement is essentially for automation only. The first requirement would have been necessary even if the data was going
to be fed back to the compiler manually. Automating the feedback process requires that
the compiler be able to identify important portions of the profiler output and ascribe the
appropriate meaning to them. For example, the compiler must know where in the output
to find the frequency count for a particular basic block.
As part of the understanding that the compiler has of the profiler's output, the compiler
must know not only where to find the appropriate data, but also what ought to be done
with that data. For example, it may be that the profiler produces frequency counts as well
as the length of the basic block in terms of instructions. If we care about both items of
information, then the compiler needs to know not only where to find these items in the whole
output, but also how these data should be used in conjunction with its semantic models.
This particular detail posed a small problem in the interface since it was not obvious how
the user could describe what data was supposed to be interesting and how it would be
interesting, without having to know too much of the compiler's innards. The solution was
to settle for the fact that a control flowgraph could safely be assumed to be used by the
compiler. Then any data was considered to be for annotation of that flowgraph. Ultimately,
we would like something more universal than this, but that would require a sophisticated
means of specifying the semantic model to use and further, the way in which the data should
be used with that model.
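As a concrete but purely illustrative sketch of this arrangement, the fragment below, written in Python rather than in the compiler's implementation language, parses a hypothetical "tag count" line format and attaches the counts to a flowgraph. The line format, field names and data structures are assumptions made for the example, not the actual profiler output format.

    # Minimal sketch: read profiler output and annotate a control flowgraph.
    # The "tag count" line format and the attribute name "frequency" are
    # illustrative assumptions, not the real interface.

    def annotate_flowgraph(profiler_lines, flowgraph):
        """flowgraph maps a basic-block tag to a dictionary for that block."""
        for line in profiler_lines:
            fields = line.split()
            if len(fields) < 2:
                continue                     # ignore lines we do not recognize
            tag, count = fields[0], int(fields[1])
            block = flowgraph.get(tag)
            if block is not None:
                block["frequency"] = count   # annotate the node with its count

    # Example with made-up data:
    graph = {"B1": {}, "B2": {}}
    annotate_flowgraph(["B1 1200", "B2 37"], graph)

The point of the programmable interface is precisely that the parsing rule and the choice of which fields become annotations are supplied by the specification file rather than being fixed in code like this.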
2.2.4 Identifying Profiler References
The third requirement of the compiler is that there must be some means of identifying the
instructions (and the basic blocks) that the profiler refers to when it returns its statistics.
The solution employed was to generate a tag per basic block which was unique within a file.
This tag was then written out with the rest of the object code into the object files, in such
a way that it would be preserved throughout linking and loading and finally dumped into
the executable where it would be visible to the profiler. The profiler would then report the
tag along with the frequency for the instruction associated with the tag.
There were two potential problems with this proposed solution. The first was that while the
format of the object file was rather liberal and would allow such tags within it, there was no
simple way to force the linker and loader to reproduce those tags within the executable since
the format of the executable, XCOFF², is quite rigid, and the linker discards unreferenced
items. The second problem was that since this data inserted by the compiler was custom
data within the executable, the average profiler would not have been capable of returning
that data - defeating the central goal of wide compatibility.
The solution to both problems was to use the line number fields within the executable to
contain the tags. Since the XCOFF format already had space to accommodate line numbers,
no additional modification was necessary, either to the XCOFF format or to the linker and
loader, to preserve the tags from the object files. Also, since the line number is a standard
part of the XCOFF format, the chances that the profiler will be able to return that data field
are high. The limitations of this solution are that the tags are restricted to being at most
16 bits wide which turned out to be suboptimal, and that the line number information,
which is very useful while debugging, is unavailable while the tags are being reported. That
second limitation is not a problem in general since on the second pass, we can restore the
actual line numbers to the fields. In the second pass, there is no need to have profiler tags
since the compiler will not be profiling the output of the second phase. Were the compiler to
continually massage the same code over and over by repeatedly profiling and recompiling,
then we would be forced to permanently substitute tags for line numbers, in which case an
alternate solution would probably have to be found.
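The tagging scheme itself is simple enough to sketch. The fragment below (Python, with invented names) only illustrates the bookkeeping: every basic block in a file receives a tag that fits in the 16-bit line-number field, and the real line numbers are saved so that the second compilation pass can restore them.

    MAX_TAG = 0xFFFF                 # tags are limited to 16 bits

    def assign_tags(basic_blocks):
        """Give each block a unique tag and remember its real source line."""
        tags, saved_lines = {}, {}
        for tag, block in enumerate(basic_blocks, start=1):
            if tag > MAX_TAG:
                raise ValueError("more basic blocks than 16-bit tags allow")
            tags[block["id"]] = tag
            saved_lines[block["id"]] = block["line"]  # kept for the second pass
            block["line"] = tag                       # tag written in place of the line number
        return tags, saved_lines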
2.2.5 Interfacing with the Compiler
Designing the interface between the compiler and the profiler is, perhaps, the most important part of the entire project. The specification of the interface determines the kinds of
profilers that are compatible with the compiler, and what features, if any, of the compiler
will need to be preserved when it is upgraded. A second aspect of the interface was that
between the compiler and the user. This aspect allows the user to be able to specify the
manner in which the feedback data should be processed as well as indicate to the compiler
which parts of the profiler data should be considered important.
An integrated user/profiler interface was developed because of the close interrelation between their functions. It should be noted, however, that the user interface aspect was not stressed: while the interface provides the user with a means of expressing manipulations on the feedback data, it offers only modest expressive power and an admittedly esoteric syntax for using it.
All the programmable aspects of the interface were specified in one file called the specification
file. This file contained all the directives to the compiler to allow it to communicate with
the profiler according to the requirements outlined above. In particular, the specification
file contained information on how the compiler should call the profiler, parse its output and
also choose which parts of the profiler output were important.

² IBM's extension to the UNIX standard COFF format [13]
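The specification-file syntax itself is not reproduced here. Purely as a hypothetical illustration of the kinds of directives just described, the information in such a file might boil down to something like the following Python dictionary; the command line, the output pattern and the attribute names are all invented for the example.

    example_spec = {
        # how the compiler should invoke the profiler (hypothetical command line)
        "invoke": ["<path-to-profiler>", "<executable>", "<sample-input>"],
        # how one line of profiler output is parsed (hypothetical format)
        "line_pattern": r"^(?P<tag>\S+)\s+(?P<count>\d+)",
        # which parsed fields matter, and what they annotate on the flowgraph
        "annotations": {"count": "node_frequency"},
    }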
One disadvantage to incorporating this type of generality in the interface is that there is
a danger of escalating the complexity to such levels that even simple tasks become tedious
or difficult to express. To alleviate this problem, a specification file was designed for the
particular profiler being used. One of the input parameters to the compiler was the name
of the specification file to use in interfacing with the profiler. This particular specification
file was set up as the default, so that a simple call to the compiler with feedback turned
on would use the default profiler, without the user having to worry about it. One could
envision several specification files being implemented, one for each popular profiler available,
so that many different types of feedback might be available without the user having any
detailed knowledge of the system. At the same time, the compiler researcher who may care
to investigate the effects of different types of feedback information available from the same
profiler would want the expressive power offered by the programmable interface to allow
him to pursue his research. We feel that this implementation of the interface provides a
suitable compromise between simplification of use and expressive power available.
2.3 The Profiler
Almost any profiler could be used as part of a feedback mechanism, and so it would not be
very instructive to try to identify essential features of a profiler for this purpose. However,
there are several restrictions on the profiler that can be used with this implementation of the feedback mechanism, and these we discuss in detail below.
Since a central goal of this implementation was for the system to be able to use the results of many profilers, the requirements on them were kept to a minimum. The essential
requirements of any profiler compatible with the system are three:
1. It must be able to produce frequency counts for basic blocks.
2. It must be able to return the source line number corresponding to an instruction in
the executable.
3. It must return along with that line number, the source file name.
The first requirement is necessary - even if that data will not be used for optimizations - because the algorithm that determines edge frequencies depends on these values. Although a limitation of sorts, it is not very serious, since in most instances this piece of information is exactly what an optimization is looking for. The real limiting factor of this requirement is that exact counts are needed, as opposed to a program-counter-sampled profile of the program. This immediately excludes a large class of profilers from being compatible, although this limitation is mitigated by the fact that profilers of that class typically work at a coarser granularity than the basic block. Most of those profilers - prof and gprof are examples - report sampled timing information on procedures and so tend to provide useful information for annotating a call graph, rather than a control flowgraph.
The ideal situation is for a profiler to be able to return both types of frequency counts and
have both the call graph and control flowgraph as views of the program at two different
granularities. The call graph which would be annotated by the sampling profiles would help
direct the basic block profiler by indicating the hot procedures. This feature would be extremely important in an adaptive dynamic implementation of feedback based optimization,
but in this static setting, profiling at the coarser level is left up to the user.
2.3.1 The xprof Profiler
The profiler used with this implementation was the xprof profiler, designed and implemented
by Nair [14]. One important feature of this profiler was that it did not require the instrumentation of nodes to be done by the compiler. That is, unlike prof and gprof, there was no
parameter that needed to be passed to the compiler to allow the output to be compatible
with xprof. This profiler could return various types of frequency information ranging from
instruction type summaries to basic block frequencies. The primary mode of usage in the
implementation for this thesis was for it to return basic block frequencies.
Along with the frequency of execution of a basic block and its tag (line number), the xprof profiler could also return the start and end addresses of the basic blocks it encountered. These two values together indicated the size of the basic block in terms of instructions, which was used to help weight the branch probabilities in the implementation of branch prediction (see Chapter 4).
2.4 Bibliographic Notes
gprof and prof are profilers that are part of the UNIX standard. They both sample the
machine state periodically and report a summary of the relative frequencies of the procedures that were executed. In contrast, xprof returns actual basic block counts for all the
procedures executed.
Reassociation was developed by John Cocke and Peter Markstein at IBM [5] and was a
more general technique than the similar, independent implementation by Scarborough and
Kolsky [17].
Chapter 3
Efficient Profiling
3.1 Introduction
In this chapter we describe how the application being compiled was processed so that:
1. profiling the output code could be done efficiently,
2. the data returned contained enough frequency information to allow all the control
flow statistics to be determined.
Processing the application in this way has two direct advantages. First, more profilers are compatible with the compiler, since a profiler need not be able to instrument all the jumps and branches within a program to be compatible. Second, performance will be improved for most profilers that are capable of providing all the control flow statistics, since they typically provide that information by instrumenting all jump and branch instructions. In this implementation there are fewer instrumentation points, and so we expect that the profiling process will be faster overall.
3.1.1 Description of the Problem
Profilers give us frequency information about the execution of a program on a particular
input. The type of frequency information returned varies widely with profilers and can
range from estimates on the time spent within particular basic blocks to an actual tally of
the number of times a particular instruction was executed.
While a frequency count of each instruction executed is extremely useful for pointing out
interesting parts of a program (like hotspots), it is often not enough for some types of
analysis. Typically we not only want to know the number of times an instruction was
executed, but we also want to know the context in which that instruction was executed,
and we want statistics on that context. To illustrate the point, imagine we had a bounds
checking routine that called the same error handling routine with different arguments for
different errors. If we were curious to know which particular error happened most frequently,
it would be insufficient to know only the total execution tally of the error handling routine.
We would need to know the number of times control was transferred from the point of error
detection to the error handling routine.
This kind of frequency information for the context of instruction execution is usually available as a standard part of most profilers; however, there is usually a substantial price in performance to pay for the added information. For example, the xprof¹ profiler [14] suffers up to a 1500% performance degradation when keeping additional statistics for the control flow between the nodes executed. This degradation comes mainly from the instrumentation involved with each branch instruction. Instrumentation is the process of modifying a part of the original program so that tally information is automatically updated every time that code fragment is executed. Instrumentation usually involves inserting a branch instruction before the code fragment under investigation. The target of this branch instruction contains some code to increment the counter for that code fragment, followed by a branch instruction back to the code fragment. Hence, to obtain all the context statistics (edge frequencies), the instrumented code is larger and usually slower than the original code, since the modified code now involves extra computation (including at least one other branch) per branch instruction to increase its counter.
3.1.2 Efficient Instrumentation
Clearly we would prefer to have the profiler instrument as little of the code as possible.
However, we would also like to obtain the statistics concerning the context of the execution
of all the basic blocks. For our purposes, we assume that the profiler cannot be altered,
and that it has the capability of returning the frequency information for all the basic blocks
in a program. We shall demonstrate how to obtain all the desired statistics, using a given
profiler that returns at least the basic block execution frequencies, by instrumenting selected
sections of the code before applying the profiler to it.
The reason for the restriction on not altering the profiler is that we would like to consider
the profiler as a tool for which the behaviour is limited to that specified by its input/output
specifications. We make this clarification because there are techniques for reducing the
amount of work done in instrumentation beyond even what we describe here [2]. However,
those techniques all involve modifying the profiler. An example application of one of those
techniques would be to have the profiler not instrument some nodes whose frequencies
could be derived from other nodes' frequencies. While these techniques are useful to those
designing profilers, they cannot be applied by end-users of the profiling tool. The techniques
we describe show how to use the profiling tool to efficiently return all the information we
want without having to know implementation details about the tool.
In maintaining our stance of keeping the compiler and profiler separate, we were forced to
modify the program rather than the profiler so that we could obtain the information needed
efficiently. Had the implementation been an integrated one, where the compiler and the
profiler were built as one tool, then the profiling techniques could have been modified to
obtain the information - possibly even more efficiently. Notice that there was, nonetheless,
not much loss in efficiency, since having the profiler integrated would in no way help the compiler to modify the flowgraph. The possible increase in efficiency would have come about because of improvements in profiling techniques which might have avoided doing extra work when the profiler could detect that it was safe to do so.

¹ This is the profiler that was used throughout the development of this thesis work.
3.2 Background To The Algorithm
It is clear that it is possible to obtain enough information using just a basic block profiler
to deduce all the necessary control flow information since we could simply instrument every
branch instruction of the original program (in the manner the profiler would have done it
in the first place). But this is inefficient as pointed out earlier and we shall provide upper
and lower bounds on the amount of code that will need to be instrumented. In addition, the algorithm described herein is guaranteed to instrument the minimum amount of code necessary.
To argue formally about our algorithm, we model a program as a control dependency
flowgraph where each node represents a basic block from the program. A directed edge
connects nodes u and v if it is possible for control to be transferred from basic block u to
basic block v. Thus, in the context of flowgraphs, our problem is to deduce all the edge
frequencies given a profiler which returns only node frequencies.
A weighted digraph is said to obey Kirchhoff's current law (KCL) if, for each node in the graph, the following three quantities are all equal: the sum of the edge weights leaving the node, the sum of the edge weights entering the node, and the node weight.
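In symbols, writing $p(v)$ for the weight of node $v$ and $w(u, v)$ for the weight of edge $(u, v)$ (the notation used in the proofs below), KCL requires that

$$\forall v \in V:\qquad \sum_{(u,v) \in E} w(u, v) \;=\; \sum_{(v,u) \in E} w(v, u) \;=\; p(v).$$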
Note that Kirchhoff's current law puts severe constraints on the weighted digraph. In particular, there can be no nodes that do not have edges either going into or coming out of them. A control dependency flowgraph weighted with frequency counts on the nodes disobeys KCL only at the entry and exit points in the graph, so the law permits analysis of the flowgraph only on the internal nodes (i.e., those that are neither entry nor exit nodes).
This problem can be solved by creating a single entry node going to all other entry nodes
and similarly, a single exit node to which all other exits are connected, then finally adding
an edge from the exit node to the entry node. We use this construction in our later proofs,
but since that last edge added is never instrumented, in practice we do not insert it. The
following theorem demonstrates that it is indeed possible to gather all the edge weights using only node weights. It is merely a formal restatement of the informal argument presented
above about instrumenting every edge.
Theorem 3.1 Given a profiler P which only returns node counts of a program's control dependency flowgraph, it is possible to deduce the edge counts using only the information from P.
Proof: Let G be the control dependency flowgraph. Then $G = (V, E)$, where $V$ is the set of vertices and $E$ the set of edges, represented as ordered pairs, so $E \subseteq V \times V$. Let $p(v)$ represent the weight of a vertex $v$ in $V$ as assigned by profiler P.
To obtain weights for each edge in $E$, we construct a graph $G' = (V', E')$ from $G$ by replacing each edge $(u, v)$ with a new node $w$ and new edges $(u, w)$ and $(w, v)$, as in Figure 3.1. This construction is equivalent to instrumenting each edge.
Figure 3.1: Replacing an edge with an instrumenting node
The new graph $G'$ will have the following properties:
1. $|V'| = |V| + |E|$
2. $|E'| = 2|E|$
3. $V \subseteq V'$
4. For every $(u, v) \in E$ there exists $w \in V' - V$ such that:
   (a) $\{(u, w), (w, v)\} \subseteq E'$
   (b) $(u, v) \notin E'$
   (c) for all $x \in V'$, $(x, w) \in E'$ implies $x = u$, and $(w, x) \in E'$ implies $x = v$
   (d) the conditions under which edge $(u, w)$ is traversed in $G'$ are exactly the same as those under which the edge $(u, v)$ would have been taken in $G$.
From property 4, $|V' - V| \geq |E|$. So properties 1 and 3 together imply that $|V' - V| = |E|$. Therefore, for every $v \in V' - V$ there is a unique $(u_1, u_2) \in E$ such that $(u_1, v) \in E'$ and $(v, u_2) \in E'$. Now, by KCL, $p(v) = \mathrm{weight}(u_1, v) = \mathrm{weight}(v, u_2)$. All the conditions of property 4 ensure that $\mathrm{weight}(u_1, u_2) = \mathrm{weight}(u_1, v) = \mathrm{weight}(v, u_2)$. Hence, $p(v) = \mathrm{weight}(u_1, u_2)$.
Since the last property guarantees that there is a corresponding node in $V' - V$ for each edge in $E$, the information returned by P on $G'$ is sufficient to determine all the edge weights of $G$. ∎
Theorem 3.1 used a construction which involved adding $|E|$ nodes to the graph, where $|E|$ is the number of edges in the original graph. We shall show that the edge information can always be obtained with fewer than $|E|$ nodes added. We shall also show that the algorithm presented is optimal in that it adds the minimal number of nodes required to deduce all the edge weights.
3.3 Description of The Algorithm
The overall algorithm is broken into three stages:
1. Choose the edges to be instrumented
2. Instrument the chosen edges
3. Measure the node frequencies (weights) and solve for the edges.
The most important stage is the first since the edges marked chosen will be those to be
profiled. The number of these so marked edges will be the determining factor on the time
that profiling takes. When we say that this algorithm is optimal, we mean that it chooses
the fewest possible edges to allow the rest of edges to be deduced from only the node weights
measured.
Before discussing the algorithm in detail, we shall need the following definition.
Definition 3.1 A weighted adjacency matrix, $W$, is the matrix constructed from a weighted digraph $G = (V, E)$ such that
$$W_{ij} = \begin{cases} \text{weight of edge } (i, j) & \text{if } (i, j) \in E \\ 0 & \text{otherwise.} \end{cases}$$
The following construction will also be useful in proving the correctness and optimality of
the algorithm.
Let $A_W$ be the weighted adjacency matrix of $G$, a connected graph obeying KCL, and let $W_V$ be the column vector of node weights, so that $W_{V_i}$ is the weight of node $i$. Using KCL, we can get two systems of equations: one for the outgoing edges of each node, and the other for the incoming edges of each node. That is, if $c$ is a constant column vector of size $|V|$ such that $c_i = 1$ for all $i$, then we may write:
$$A_W c = W_V \qquad (3.1)$$
$$(c^T A_W)^T = W_V \qquad (3.2)$$
The second equation can be rewritten as:
$$A_W^T c = W_V \qquad (3.3)$$
As an example, consider the graph in Figure 3.2, whose nodes and edges have been numbered in some arbitrary order. The resulting weighted adjacency matrix $A_W$ has the variable $e_k$ in entry $(i, j)$ whenever edge $e_k$ runs from node $i$ to node $j$, and zeros elsewhere; in particular, its first row contains exactly the variables $e_1$ and $e_2$, the two edges leaving node 1.
Note that, in general, the matrix $A_W$ has no zero rows or columns, since every node is connected to at least one other node and every node has at least one node connected to it. (This was guaranteed by construction and the conditions on $G$.) Therefore, Equations 3.1 and 3.3 represent two systems, each with $|V|$ equations and involving the same variables.
Figure 3.2: An example of a control flowgraph
3.3.1 Choosing the Edges to be Instrumented
First we construct the weighted adjacency matrix, Aw, for the given control flowgraph, G,
using a unique variable to represent each edge weight, since they are as yet unknown. There
are two possible marks that can be assigned to a variable: known or chosen, and at the
beginning all variables in Aw are unmarked. We proceed as follows:
1. For each row $A_{W_i}$ of $A_W$: if $A_{W_i}$ contains exactly one unmarked variable, then mark that variable as known.
2. For each column of $A_W$ (that is, each row of $A_W^T$): if it contains exactly one unmarked variable, then mark that variable as known.
3. If all variables have been marked (either known or chosen) then halt.
4. If at least one variable was marked in the most recent pass then go back to step 1.
5. Choose any variable that is unmarked and mark it chosen.
6. Go back to step 1.
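As an illustration only (the compiler operates on its own internal flowgraph structures), the marking stage can be sketched in Python as follows; the edge-list representation and the function name are assumptions made for the sketch.

    def mark_edges(edges):
        """edges: list of (src, dst) pairs.  Returns (chosen, known): the edge
        indices to instrument, and the remaining edges in deduction order."""
        unmarked = set(range(len(edges)))
        out_edges, in_edges = {}, {}           # node -> unmarked edge ids (rows / columns)
        for i, (u, v) in enumerate(edges):
            out_edges.setdefault(u, set()).add(i)
            in_edges.setdefault(v, set()).add(i)

        known, chosen = [], []

        def mark(i, as_known):
            unmarked.discard(i)
            u, v = edges[i]
            out_edges[u].discard(i)
            in_edges[v].discard(i)
            (known if as_known else chosen).append(i)

        while unmarked:
            progress = True
            while progress:                    # steps 1-4: propagate "known" marks
                progress = False
                for groups in (out_edges, in_edges):
                    for ids in groups.values():
                        if len(ids) == 1:
                            mark(next(iter(ids)), as_known=True)
                            progress = True
            if unmarked:                       # steps 5-6: instrument an arbitrary edge
                mark(min(unmarked), as_known=False)

        return chosen, known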
3.3.2 Instrumenting the Chosen Edges
After all the variables have been marked, we instrument the edges corresponding to the
variables that were marked chosen. In the model of the graph, this means adding a node
with exactly one incoming and one outgoing edge. The source of the incoming edge and
the destination of the outgoing edge are exactly the same as the source and destination,
respectively, of the edge being instrumented. The edge being instrumented is then deleted
from the graph. In terms of the program, this means that the branch instruction corresponding to the edge being instrumented is modified so that its target is now the address
of the instrumenting code. At the end of that instrumenting code is a branch instruction
with the same branch target as that of the original branch instruction.
Under this construction, all the conditions of property 4 in the above theorem are satisfied. In particular, condition (d) is satisfied since only the branch target of the branch
instruction is modified. Hence, any condition which would have transferred control to the
original branch target address will now transfer control to the basic block containing the
instrumentation code. Under the graph model this means that whenever the removed edge
would have been traversed, the incoming edge of the new node will be traversed.
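In the graph model this transformation is a simple edge split, as in Figure 3.1. A minimal sketch with illustrative names:

    def instrument_edge(edges, edge_index, counter_node):
        """Replace edge (u, v) by a fresh counting node: u -> counter_node -> v.
        The counting node's measured frequency equals the removed edge's weight."""
        u, v = edges[edge_index]
        new_edges = [e for i, e in enumerate(edges) if i != edge_index]
        new_edges.extend([(u, counter_node), (counter_node, v)])
        return new_edges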
3.3.3 Computing the Results
In the final stage of the algorithm, we pass the modified program to the profiler, which returns the execution frequencies of all the basic blocks in the modified program (i.e., the node weights). Note that the order in which the edge variables were marked known implies the order in which they should be solved, since any variable that becomes the lone variable in its column or row after another was marked known will only be deducible after that marked variable has been solved for.
As an optimization, we constructed the expression for each variable, as it was being marked, in terms of the node frequencies and the other edge frequencies. Then, after the profiler returned the frequencies for all the nodes and the instrumented edges, the values were substituted into the expressions, which were then solved in the order in which they were constructed (the same as the order in which the variables were marked). This optimization allowed all the edge frequencies to be deduced in one pass, since no expression was evaluated until all the information for it was available.
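A sketch of that deferred evaluation, assuming each recorded expression is simply a pair (node, other edges in the same row or column); this pair representation is one possibility assumed for the sketch, not necessarily the one used in the implementation.

    def solve_in_order(expressions, node_freq, instrumented_freq):
        """expressions: list of (edge_id, (node, other_edge_ids)) in marking order.
        Each edge's value is its node's measured weight minus the already-solved
        edges that share that row or column."""
        edge_value = dict(instrumented_freq)     # chosen edges were measured directly
        for edge_id, (node, others) in expressions:
            edge_value[edge_id] = node_freq[node] - sum(edge_value[e] for e in others)
        return edge_value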
3.4 Correctness of the Algorithm
We first show that the algorithm is correct in that it will return the correct execution
frequencies for the edges so long as all the reported node frequencies are correct.
Claim 3.4.1 The algorithm for labelling the variables terminates.
Proof: Step 4 guarantees that on each pass, at least one variable will get marked either
known or chosen. Since variables are never remarked then the total number of unmarked
variables decreases by at least 1 per pass. Since the total number of unmarked variables at
the beginning of the stage is finite then the condition in Step 3 will eventually evaluate to
true, causing the algorithm to halt. ·
Claim 3.4.2 If a variable e is marked chosen then given the node weights of the flowgraph,
the value of e can be uniquely determined.
Proof: By construction, the frequency of the instrumenting node for the edge corresponding
to e is equal to e. Since this frequency will be returned by the profiler after stage 3, the
value of e will be uniquely determined. ·
Claim 3.4.3 If a variable e is marked known in some pass t, then given the node weights
of the flowgraph and the values of all the variables already marked on pass t, the value of e
can be uniquely determined.
Proof: We prove this by induction. If a variable is marked known on the first pass then
it must have been the only variable in either its row or column according to steps 1 and 2.
The value of this variable is immediate once the frequencies have been measured as implied
by equation 3.1 and equation 3.3.
Assume now that on pass t - 1 every variable that was marked known was correctly computable. If e is marked on the tth pass then it must have been the only unmarked variable
in either its row or its column on pass t - 1. Since e will be marked on pass t, it must be
that all the other variables in its row (column) have been marked either known or chosen.
Given the values of all the variables marked before pass t, we can find the sum of all the
other variables in the same row (column) as e. The value of e is then given by the difference
of the measured node weight and the computed sum of previously marked variables (i.e.
the edge weights) as implied by equation 3.1 (equation 3.3). ·
It follows from the preceding claims that the algorithm is correct. The first claim guarantees
that all the variables get marked and the second and third show that a marked variable can
be correctly computed. Hence all variables will be correctly computed. That is to say that
the algorithm returns correct edge weights based upon the node weights returned by the
profiler.
3.5 Optimality of The Algorithm
The proof of optimality will require the use of a few Theorems from Linear Algebra.
Theorem 3.2 If $A$ is an $m \times n$ matrix with rank $n$, and $x$ is an $n \times 1$ vector, then the equation $Ax = 0$ has $x = 0$ as its only solution.
Corollary 3.2.1 For any $m \times 1$ vector $b$, the equation $Ax = b$ has at most one solution.
Theorem 3.3 Exchanging any two columns of the coefficient matrix of a system of equations will not alter the constraints of the system on the variables.
Proof: Consider the system
Ax = b
If A' is obtained from A by exchanging columns i and j, then we can construct x' and b' by
exchanging elements i and j of x and b respectively. Now, the system A'x' = b' is identical
to the given system, hence the constraints of the two systems are also identical.
3.5.1 An Alternate Matrix Representation
Recall Equations 3.1 and 3.3: within $A_W$ there are exactly $|E|$ variables, each one representing an edge in the graph. So we could rewrite each system in terms of a coefficient matrix of 1's and 0's, the vector of all $|E|$ edge variables, and the $|V|$ node weights. That is, for Equation 3.1, say, we could write
$$M_R W_E = W_V$$
where $M_R$ is the coefficient matrix and $W_E$ is the vector of edge weights. Note that $M_R$ is a $|V| \times |E|$ matrix and that, by writing the right-hand side as $W_V$, we have forced the $i$th row of $M_R$ to correspond in some manner to the $i$th row of $A_W$ and hence to the $i$th node. This correspondence is actually quite straightforward to see: we simply have to make the expression $M_{R_i} W_E$ algebraically equivalent to the expression $A_{W_i} c$, where $M_{R_i}$ and $A_{W_i}$ refer to the $i$th rows of $M_R$ and $A_W$ respectively. So, for example, in the adjacency matrix of Figure 3.2 given earlier, the first row of $A_W$ shows that the variables $e_1$ and $e_2$ sum to the frequency of node 1, so $M_R$ has two 1's in its first row, in the column positions corresponding to $e_1$ and $e_2$. The rest of $M_R$ is built in the same way: row $i$ has a 1 in column $j$ exactly when edge $e_j$ leaves node $i$, and 0's elsewhere.
This construction yields the following properties within $M_R$:
1. $M_R$ is a $|V| \times |E|$ matrix of 1's and 0's.
2. Each row of $M_R$ contains at least one 1.
3. Each row of $M_R$ has at most $|V|$ 1's in it.
4. Each column of $M_R$ has exactly one 1 in it.
The first property is obvious from the dimensions of $W_E$ and $W_V$. The second and third properties come from the fact that none of the rows of $A_W$ were all zero and that its dimensions were $|V| \times |V|$, which means that there was at least one edge weight variable, but at most $|V|$ of them, in a row. The last property may seem a little startling at first, but it is actually quite simple to see why it holds. By the constraint governing the construction of $M_R$, we see that if element $M_{R_{ij}}$ is a 1, then the variable $W_{E_j}$ appears in row $A_{W_i}$. Now, since a variable can be in at most one row of $A_W$ (that is, every edge has at most one source node), each column of $M_R$ has at most one 1. Also, since the variable was obtained from $A_W$ in the first place, it must sit in some row (that is, every edge has at least one source node). Therefore, each column of $M_R$ has at least one 1; hence exactly one 1. The last property is, in fact, just another way of stating that every edge has exactly one source node.
We can do a similar restatement of the system implied by Equation 3.3 to yield a coefficient matrix $M_C$. The important difference is that the constraint on the construction of $M_C$ is that $M_{C_i} W_E$ must now be algebraically equivalent to $(A_W^T)_i c$. Now, $M_C$ has all the same properties as $M_R$ does, but they hold because of the column properties of $A_W$ instead of its row properties (since the rows of $A_W^T$ are the columns of $A_W$). The equivalent restatement of property 4 for $M_C$ is that every edge has exactly one destination node. So, for our current example, row $i$ of $M_C$ has a 1 in column $j$ exactly when edge $e_j$ enters node $i$, and 0's elsewhere.
Finally, we can combine the two systems into one to write:
$$\begin{bmatrix} M_R \\ M_C \end{bmatrix} W_E = \begin{bmatrix} W_V \\ W_V \end{bmatrix} \qquad (3.4)$$
or
$$M W_E = W_N \qquad (3.5)$$
where $M = \begin{bmatrix} M_R \\ M_C \end{bmatrix}$ and $W_N = \begin{bmatrix} W_V \\ W_V \end{bmatrix}$.
That is, $M$ is the $2|V| \times |E|$ matrix obtained by placing the rows of $M_R$ above those of $M_C$, and $W_N$ is the $2|V| \times 1$ column vector obtained by concatenating $W_V$ to itself. This final form will allow us to prove the theorems that follow.
3.5.2 Bounds for the Number of Instrumented Edges
Theorem 3.4 Given a control dependency flowgraph $G = (V, E)$ with no unreachable nodes, if $|E| \geq 2|V|$ then the edge weights cannot, in general, be uniquely determined given only the node weights of the unmodified graph, G.
Proof: We shall first construct a graph $G' = (V', E')$ which has $G$ as a subgraph and which obeys KCL at every node. To do this, if there is at least one exit node, we create a single entry node and a single exit node, which are linked by an edge going from the exit node to the entry node. (Note that we will always have at least one entry node.) Formally, let $S$ and $T$ be the sets of entry and exit nodes respectively. We obtain $V'$ and $E'$ as follows:
$$V' = \begin{cases} V \cup \{s, t\} & \text{if } |T| \geq 1 \\ V \cup \{s\} & \text{otherwise} \end{cases}$$
$$E' = \begin{cases} E \cup (\{s\} \times S) \cup (T \times \{t\}) \cup \{(t, s)\} & \text{if } |T| \geq 1 \\ E \cup (\{s\} \times S) \cup \{(s, s)\} & \text{otherwise} \end{cases}$$
where $s$ and $t$ are distinguished nodes not already in $V$. Note that there are no nodes in $G'$ without either incoming or outgoing edges.
$G'$ obeys KCL by construction, and a consistent set of weights is obtainable if the weights of nodes $s$ and $t$ and of edge $(t, s)$ are all set to 1. This is also consistent with the notion that the program corresponding to the flowgraph exits once. (Indeed, it would be unclear what it would mean for there to be a weight greater than 1 on node $t$.)
If it were possible to solve for the edge weights of G given the conditions in the theorem
statement, then it would also be possible to solve for the edges in G' since all the edges in
E' - E are immediately determined once the node weights are known. Hence, by showing
that it is impossible to solve uniquely for the edges of G' we can conclude that the theorem
statement is true.
Case 1: $|E| > 2|V|$
The two systems above yield $2|V|$ equations in the edge weight variables; therefore, in general, there are not enough equations to solve for all edges.
Consider matrix $M$: there are $2|V|$ rows, therefore $\mathrm{Rank}(M) \leq 2|V| < |E|$. Since the number of columns is equal to $|E|$, Theorem 3.2 implies that the solution for the vector $W_E$ is not unique. Thus, all the edges cannot be uniquely solved in general. The edges can possibly be solved uniquely if enough of them can be determined to be zero - by being connected to a node with zero weight - so as to reduce the number of unknown edge weights to less than $2|V|$.
Case 2: $|E| = 2|V|$
Consider matrices $M_R$ and $M_C$. In both matrices the sum of all the rows is the all-ones row vector (since each column contains exactly one 1). Therefore, in matrix $M$, at least one row can be reduced to all zeros by row operations. Hence $\mathrm{Rank}(M) < 2|V| = |E|$, which by Theorem 3.2 implies that the edges cannot be uniquely solved in general. ∎
Theorem 3.4 tells us that if the given graph has at least as many edges as twice the number of
vertices then the node frequencies alone will not be sufficient to compute the edge frequencies
uniquely in the general case. Note that we are not considering the added information that
there might be in the actual numbers. For example, if we knew that the frequency at a
particular node was zero then all the edges going into and leaving it would be forced to
have zero weight. Rather, the theorem refers to the a priori ability to determine the edge
weights - that is, before we know what any of the node weights actually are, we want to
know how many edges we will be guaranteed of being able to solve for.
Having thus established an upper bound on the number of edges for which we could hope to be able to solve, we can also establish a lower bound. That is, we can determine the minimum number of edges for which we are guaranteed to be able to solve for all the edge weights uniquely from the node frequencies alone. This lower bound is stated and proved in Theorem 3.5.
Theorem 3.5 Given a control dependency flowgraph $G = (V, E)$ with no unreachable nodes, if $|E| \leq |V| + 1$ then the edge weights can, in general, be uniquely determined given only the node weights of the unmodified graph, G.
Proof: Case 1: $|E| < |V| + 1$
Since there are no nodes without edges leaving or entering them, $|E| \geq |V|$. So this case reduces to $|E| = |V|$. In this case, $M$ is a $2|V| \times |E|$ matrix. All the properties of the sub-matrices $M_R$ and $M_C$, of which $M$ is composed, still hold. From properties 2 and 4, since there are $|E| = |V|$ columns each containing exactly one 1 and $|V|$ rows each containing at least one 1, every row of $M_R$ and of $M_C$ contains exactly one 1. There are exactly $|V|$ different possible rows of length $|V|$ that contain exactly one 1; therefore every row that appears in $M_R$ must also appear in $M_C$. Hence, by row operations on $M$, we can obtain exactly $|V|$ rows of zeros. If we choose to reduce those rows from the $M_C$ sub-matrix, we see that
$$\mathrm{Rank}(M) = \mathrm{Rank}(M_R) = |V| = |E|.$$
So by Theorem 3.2, there is a unique solution for $W_E$.
Case 2: $|E| = |V| + 1$
By the pigeon-hole principle, there will be exactly one row, $M_{R_i}$, in the $M_R$ sub-matrix and exactly one row, $M_{C_j}$, in the $M_C$ sub-matrix that contain two 1's. Now, it cannot be the case that $M_{R_i}$ and $M_{C_j}$ contain both their 1's in the same column positions, since that would imply that there were two distinct edges going from node $i$ to node $j$, which violates the definition of the graph. Therefore, there is at least one column of $M$ that contains a 1 in row $M_{R_i}$ but not in row $M_{C_j}$. Since that column must have two 1's, the other 1 comes from the $M_C$ block and is the only 1 in the row that contains it. Call that row $M_{C_k}$. Then $M_{C_k}$ and $M_{R_i}$ are two linearly independent rows which involve two edge variables. Therefore, both those variables can be determined given the frequencies of the nodes. A similar argument holds for $M_{C_j}$, so both the variables that it involves can be deduced from the node frequencies alone. Since all the other rows contain only one 1 each, the values of the edge variables corresponding to the positions of those 1's are immediate from the node frequencies. Hence all the edge variables can be deduced. ∎
3.5.3 Precisely Expressing the Required Number of Instrumented Edges
We now have an upper bound on the number of edges beyond which we certainly cannot solve uniquely for them all, and a lower bound below which we certainly can. But what about when the number of edges is between these two bounds? That is, we would like to know what criterion there is - if any - that determines whether or not we can solve for the edges uniquely, given that $|V| + 1 < |E| < 2|V|$. It turns out that when $|E|$ is in this range, the number of edges that need to be instrumented is determined only by the topology of the graph. This is true because the topology of the graph determines the dependencies among the rows and columns of $M$ and hence its rank and the sizes of its null spaces. The rank of $M$, in turn, determines the number of edges that can be determined uniquely, and so by observing the rank of $M$, we can decide how many edges will have to be instrumented. We formalize the preceding argument, and produce the exact number of edges that will be instrumented in terms of the row and column space attributes of $M$.
Lemma 3.1 The minimum number of edges that need to be instrumented in order for the rest of them to be determined uniquely is $|E| - \mathrm{Rank}(M)$.
Proof: Instrumenting an edge forces the profiler to measure its frequency directly, so it is equivalent to replacing the variable corresponding to that edge in the vector $W_E$ with the measured frequency. Let us suppose that we instrumented $n$ edges and we were able to solve for the rest uniquely. Then we could replace $n$ of the variables in $W_E$ with constants representing the frequencies measured. By Theorem 3.3, we can exchange the elements of $W_E$ and the corresponding columns of $M$ in such a way as to make the last $n$ entries in $W_E$ all constants. Thus, the elements of $W_E$ are partitioned into two sub-vectors: a $(|E| - n)$-length vector, $W_E'$, containing all variables, and an $n$-length vector, $W_E''$, containing all constants. Correspondingly, $M$ is partitioned into two sub-matrices: a $2|V| \times (|E| - n)$ matrix, $M'$, and a $2|V| \times n$ matrix, $M''$. So Equation 3.5 becomes:
$$M W_E = [\,M' \;\; M''\,] \begin{bmatrix} W_E' \\ W_E'' \end{bmatrix} = M' W_E' + M'' W_E'' = W_N,$$
so
$$M' W_E' = W_N - M'' W_E''. \qquad (3.6)$$
The right-hand side of the last equality will be a known vector after the profiler returns the node weights, and since by assumption we could solve the original system after instrumentation, Theorem 3.3 guarantees that we can solve the derived system. Then, since $M'$ has $|E| - n$ columns, Theorem 3.2 implies that $\mathrm{Rank}(M') = |E| - n$. But by construction, $\mathrm{Rank}(M') \leq \mathrm{Rank}(M)$, so $|E| - n \leq \mathrm{Rank}(M)$. That is, $n \geq |E| - \mathrm{Rank}(M)$. ∎
Corollary 3.5.1 The number of edges that need to be instrumented is $\dim(\mathcal{N}_R(M))$, where $\mathcal{N}_R(M)$ is the right null space of $M$.
Proof: Since $M$ is a $2|V| \times |E|$ matrix,
$$\mathrm{Rank}(M) + \dim(\mathcal{N}_R(M)) = |E|.$$
If $n$ is the minimum number of edges that need to be instrumented, Lemma 3.1 shows that
$$n = |E| - \mathrm{Rank}(M) \qquad (3.7)$$
$$\phantom{n} = \dim(\mathcal{N}_R(M)). \qquad (3.8)$$
Corollary 3.5.2 Provided $|E| \geq |V|$, we have $|V| \leq \mathrm{Rank}(M) \leq 2|V| - 1$.
Proof:
1. Theorem 3.4 tells us that if $|E| \geq 2|V|$ then we need at least one instrumented edge. In particular, if $|E| = 2|V|$ then $n = |E| - \mathrm{Rank}(M) \geq 1$, which implies that $\mathrm{Rank}(M) \leq 2|V| - 1$.
2. Theorem 3.5 indicates that we do not need any instrumented edges if $|E| \leq |V| + 1$. Since there are no unreachable nodes in the flowgraph and it obeys KCL, we also know that $|E| \geq |V|$. So for $n = |E| - \mathrm{Rank}(M) = 0$, we obtain $\mathrm{Rank}(M) = |E|$ when $|V| \leq |E| \leq |V| + 1$, implying that $\mathrm{Rank}(M) \geq |V|$ so long as $|E| \leq |V| + 1$. Now we note that $\mathrm{Rank}(M)$ is monotonically nondecreasing as edges are added, since at worst the new columns are linearly dependent on the existing ones, which does not change the value of $\mathrm{Rank}(M)$. Hence, we have $\mathrm{Rank}(M) \geq |V|$ in the general case when $|E| \geq |V|$. ∎
Corollary 3.5.2 puts bounds on the rank of the edge coefficient matrix, $M$, given that there are enough edges to connect all the vertices so as to obey KCL. The exact value of $\mathrm{Rank}(M)$ depends on the topology of some subset of the graph containing all the vertices and at most $2|V| - 1$ edges, and not on the total number of edges.
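For concreteness, this count can be computed directly from an edge list; the NumPy sketch below (illustrative only) builds $M$ as in Section 3.5.1 and returns $|E| - \mathrm{Rank}(M)$.

    import numpy as np

    def edges_to_instrument(edges, num_nodes):
        """Number of edges that must be instrumented for the flowgraph given as
        a list of (src, dst) pairs over nodes 0 .. num_nodes-1."""
        m = np.zeros((2 * num_nodes, len(edges)))
        for j, (u, v) in enumerate(edges):
            m[u, j] = 1                   # M_R block: edge j leaves node u
            m[num_nodes + v, j] = 1       # M_C block: edge j enters node v
        return len(edges) - np.linalg.matrix_rank(m)

For a simple three-node cycle, for instance, this returns 0, agreeing with Theorem 3.5.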
3.5.4 Proof of Optimality
We shall now show that the algorithm will select exactly $\dim(\mathcal{N}_R(M))$ edges to be instrumented and hence that it is optimal. First we show that by obtaining the value of a variable through instrumentation, we can reformulate the system of equations specified by Equation 3.5 using a matrix $M'$ with one less column and without using the variable that was marked "chosen". We then prove that the conditions under which a variable is marked "chosen" guarantee that none of the variables unknown at the time could have been derived from the ones already marked. That is, the variable marked "chosen" is independent of the other edge variables.
These two conditions imply that if $M'$ is the coefficient matrix obtained after reformulating the system and $M$ was the matrix before, then $\dim(\mathcal{N}_R(M')) = \dim(\mathcal{N}_R(M)) - 1$, since only one variable was marked and it was an independent variable. Since each instrumentation reduces the dimension of the right null space of the coefficient matrix by 1, we can instrument at most $\dim(\mathcal{N}_R(M))$ edges. By Lemma 3.1 we know we cannot instrument any fewer than $\dim(\mathcal{N}_R(M))$ edges, implying that we instrument exactly $\dim(\mathcal{N}_R(M))$ edges.
Independence of chosen variables
For this part of the proof of optimality, we use matroids as a representation of matrices, to simplify the reasoning about independent rows. The section on matroids (Section 3.7) describes matroids formally and shows how the set of rows and the sets of linearly independent rows of a matrix form a matroid. In this instance, we define a matroid $\mathcal{M} = (S, \mathcal{I})$ where $S$ is the set of rows of $M$, and a subset of $S$ is in $\mathcal{I}$ if the rows it contains are all linearly independent.
As constructs for the proof, we will also need the following functions on vectors and sets of vectors. For $r = (r_1, r_2, \ldots, r_n)$, an $n$-dimensional vector, we define
$$C(r) = \{\, i \mid r_i \neq 0 \,\}$$
and, extending this to sets of vectors,
$$C(A) = \bigcup_{r \in A} C(r).$$
Lemma 3.2 If every row of $M$ contains at least two 1's, then $|C(A)| > |A|$ for every set $A$ in $\mathcal{I}$.
Proof: Let $A$ be some subset of $S$ in $\mathcal{I}$. We prove $|C(A)| > |A|$ by induction on the subset sizes of $A$. When $|B| = 1$ for some $B \subseteq A$, then $|C(B)| \geq 2$, so the lemma is true for the basis case of a single row. Assume that the lemma holds for all the proper subsets $B$ of $A$ of size $k$; by the exchange property of $\mathcal{M}$ we know there is some $r \in A - B$ such that $B \cup \{r\} \in \mathcal{I}$; that is, $r$ is independent of all the rows in $B$.
Let $B' = B \cup \{r\}$; we need to show that the inductive hypothesis implies that $|C(B')| > |B'| = |B| + 1$. Since $C$ is monotone with respect to set inclusion, we know $|C(B')| \geq |C(B)|$, and $|C(B)| \geq |B| + 1$ by hypothesis. So $|C(B')| \geq |B'|$, and we only need to show that it is impossible for $|C(B')| = |B'|$.
Suppose $|C(B')| = |B'|$, and consider $t$, the total number of 1's in all the vectors within $B'$; we can obtain the possible range of values for $t$ from this supposition and the properties of $M$. Since there are at least two 1's per row and at most two 1's per column, we obtain $2|B'| \leq t \leq 2|C(B')|$. But $|C(B')| = |B'|$ by supposition; therefore $t = 2|C(B')| = 2|B'|$. This means that every row in $B'$ contains exactly two 1's, and for each 1 in each row $r$ of $B'$, there is another row $r'$ in $B'$ that contains a 1 in the same index position. From the construction of $M$, one of $r$ and $r'$ originated from $M_R$ and the other from $M_C$.
Hence we conclude that $|B'|$ is even, and that the sum of all the rows in $B'$ that came from $M_R$ is a vector with a 1 in every index position in $C(B')$. The same is true for all the rows of $B'$ that came from $M_C$, and by subtracting these two sums, we obtain a zero vector, implying that there is a linear dependence within the vectors of $B'$. This contradicts the exchange property of $\mathcal{M}$, from which $B'$ was constructed, and so it cannot be the case that $|C(B')| = |B'|$. That is, $|C(B')| > |B'|$ follows from the inductive hypothesis, and so we conclude that for every $A$ in $\mathcal{I}$, $|C(A)| > |A|$. ∎
Theorem 3.6 If the algorithm marks a variable "chosen", then none of the edge variables unmarked immediately prior to that marking could have been determined solely from the flow constraints.
Proof: First we note that the algorithm only marks an edge variable "chosen" when no row or column within the current weighted adjacency matrix contains exactly one unmarked variable. The corresponding state of $M$ is that every row of $M$ contains at least two 1's. Thus the stipulation of Lemma 3.2 is satisfied.
To be able to determine the value of an edge variable, there has to be a set of independent rows of $M$ that do not involve any more variables than there are rows in the set. Formally, there must be some $A$ in $\mathcal{I}$ for which $|C(A)| = |A|$, since $|C(A)|$ is the number of variables involved in a system whose matrix is made up of the rows within $A$. Invoking Lemma 3.2, we see that it is never the case for any $A$ in $\mathcal{I}$ that $|C(A)| = |A|$, and so we conclude that there is indeed no set of rows of $M$ which could lead to a unique solution for the variables still unknown. ∎
Reformulation After Instrumentation
Since marking a variable "chosen" means that its corresponding edge will be instrumented, the value of the variable can be considered as given. Let the selected edge be edge variable $i$; then we can reformulate the system of equations by:
1. Deleting column $i$ from $M$ to obtain $M'$.
2. Deleting element $W_{E_i}$ from the vector $W_E$ to get $W_E'$.
3. If the two rows of $M$ that contain a 1 in the $i$th column are the $j$th and $k$th rows, then obtaining $W_N'$ from $W_N$ by assigning $W_{N_j}' = W_{N_j} - W_{E_i}$ and $W_{N_k}' = W_{N_k} - W_{E_i}$.
For example, the system
$$\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \\ e_7 \end{bmatrix}
=
\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_1 \\ n_2 \\ n_3 \end{bmatrix}$$
would be transformed as follows if edge 5 were chosen to be instrumented:
$$\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_6 \\ e_7 \end{bmatrix}
=
\begin{bmatrix} n_1 \\ n_2 - e_5 \\ n_3 \\ n_1 \\ n_2 \\ n_3 - e_5 \end{bmatrix}$$
Note that the new matrix $M'$, like $M$, also has the following properties:
1. All the entries are either 0 or 1.
2. Each row has at most $|V|$ 1's in it.
3. Each column has exactly two 1's in it.
These properties, invariant on the coefficient matrix as the algorithm proceeds, guarantee that at every choice of a variable for instrumentation, all the conditions for Lemma 3.2 are satisfied. Therefore Theorem 3.6 holds for each successive $M'$, and hence we are guaranteed that every instrumentation was necessary.
3.6 Running Time of the Algorithm
Instrumenting the chosen edges and solving the expressions for the edge variables are relatively inexpensive steps especially since they can both be done in one pass of the variables.
Therefore, the running time of the whole algorithm, not including the time for the profiler
to return the frequencies, is dominated by the time to choose the edges to instrument.
The algorithm for choosing edges to be instrumented can search through at most IVI rows
and columns while checking for lone variables. Each check is cheap since there can be an
array for each row and column which keeps track of the number of variables in that row or
column and which variables are there. When a variable is marked, it is removed from the list for its row and for its column, and the sizes of the lists are updated. This was, in fact, how the algorithm was actually implemented. Since all this work is done for each variable that is marked, and we mark all the variables, the total number of steps is $O(|V|\,|E|)$.
The running time of the algorithm is actually irrelevant in this context, but is included for the sake of completeness. The time that the algorithm takes will be reflected in how long the compiler takes to complete a compilation, but will not affect the profiling time at all. Since it is the latter that we are mostly concerned with when using this algorithm, we try to minimize the number of edges instrumented.
3.7 Matroids
A matroid is an algebraic structure consisting of a pair of sets $(S, \mathcal{I})$ which has the following properties:
1. $S$ is finite.
2. $\mathcal{I}$ is a non-empty set of subsets of $S$ such that if $A \in \mathcal{I}$ and $B \subseteq A$ then $B \in \mathcal{I}$. This property is called the hereditary property of $\mathcal{I}$.
3. $\mathcal{I}$ satisfies the exchange property: if $A$ and $B$ are both in $\mathcal{I}$ and $|A| > |B|$, then there is an element $x \in A - B$ such that $B \cup \{x\} \in \mathcal{I}$.
We can obtain a matroid from a matrix by using the rows of the matrix as S, the finite set
of the matroid pair, and considering a subset of it independent (i.e. belonging to I) if the
rows in that subset are linearly independent. Using the columns instead of the rows also
yields a matroid, this type of matroid, derived from a matrix, is called a matric matroid [7].
Although originally inspired by their application to matrices [19], matroids are applicable
to many other areas of Computer Science. An example is the graphic matroid, derived from a graph, where the set $S$ is the set of edges and a subset $A$ of $S$ is considered to be independent (in $\mathcal{I}$) if and only if $A$ is acyclic. In particular, matroids are very useful for
determining when a greedy algorithm yields an optimal solution, because they exhibit the
optimal-substructure property. This property states that the optimal solution to a problem
contains optimal solutions to its subproblems, so that local optimizations at each stage
yield a globally optimal solution. The minimum-spanning-tree problem can be expressed
and solved using matroids, as well as a variety of other algorithms that can be solved
greedily. Given that the algorithm described in this chapter is a greedy one, it is no wonder
that matroids were applicable.
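As a small illustration of the matric matroid, independence of a set of rows can be checked by comparing the rank of the stacked rows with the size of the set; the NumPy sketch below is for illustration only.

    import numpy as np

    def independent(rows):
        """True exactly when the given rows are linearly independent."""
        return np.linalg.matrix_rank(np.array(rows, dtype=float)) == len(rows)

    # independent([[1, 0], [0, 1]]) is True; independent([[1, 1], [2, 2]]) is False.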
3.8 Summary of Results
It has been shown that edge weight profiling can be done efficiently using only a node
weight profiling tool. The number of edges chosen for instrumentation is minimized, and which particular edges are chosen is not dictated by the topology of the graph. The topology of the graph yields particular properties of the null spaces of the corresponding matrix for the flowgraph, which affect the number of edges that are chosen for instrumentation, but not which ones in particular.

Figure 3.3: Redundant Node Information
3.9 Discussion
The algorithm is optimal in the number of edges that it instruments, and works on arbitrary
flowgraphs. We are guaranteed that it will never instrument all the edges and so in all
instances, the algorithm performs better than a naive solution to the problem. It processes
the graph in a purely syntactic fashion in that its only concern is the topology of the graph
and not any meaning that may be ascribed to the nodes and edges. For example, one
possible optimization would have been to check if loop bounds were constant. If so, then
the frequencies for some of the edges within the subgraph representing the loop might be
deducible. In particular, the frequency of the back edge for the loop would be known.
However, this example is a semantic optimization in that it looks at the content of the node
and tries to deduce properties of the graph based on the meaning of that content.
The constraints imposed by the compiler and profiler being separate prevent even some
syntactic optimizations. For example, if we had the situation in Figure 3.3 then we would
not need the frequency information for node 4, since it would be equal to that of node 1.
However, such optimizations could not be implemented, because the profiler is assumed to
return all the basic block information without making any distinctions between them. Of
course, that particular optimization could very well be one within the profiler itself, but
the point is that the compiler has no control over how the nodes are measured, so it cannot
attempt to improve the performance in general by being clever about the nodes.
Another possible optimization that was not explored is the use of the values obtained in
previous runs to hint at which variables to choose to instrument. First of all, if a node
had frequency zero then automatically, all the incoming and outgoing edges would have
frequency zero. Secondly, when we have to choose an edge to be instrumented, instead of
choosing any available edge, we may actually wish to bias our choice towards the edge that
tends to be traversed less frequently. These optimizations involve feedback not only within
the compilation process, but also in the profiling. Therefore, they are quite practical in
an adaptive dynamic setting where programs are constantly being profiled, possible in the
static setting where a database of application profiles is kept, but infeasible in the setting
of this implementation. Were we to attempt to use feedback in the profiling phase of the
compilation, we would have to do two passes at the profiling, which is already the slowest
step by orders of magnitude, in the whole compilation process.
3.10 Bibliographic Notes
The xprof profiler was implemented by Ravi Nair at IBM T. J. Watson Research Center [14].
This profiler had the feature that it could profile binaries that had not been instrumented by
the compiler, unlike other profiling programs like prof and gprof. He was also instrumental
in the development of the algorithm presented here.
Ball and Larus present methods for optimally profiling programs [2] when the profiler has
complete control over where counters ought to be placed in the flowgraph to obtain all the
information accurately. In our implementation, we are constrained to instrument only edges
and every node will be profiled regardless of whether it needs to be or not. Therefore, the
constraints on this system are slightly more restrictive than those under which they worked.
Some of their lower bounds and ideas concur exactly with our results, despite this disparity
of constraints.
Cormen, Leiserson and Rivest present a more detailed description of matroids than was
presented here [7]. The section on matroids was taken almost wholly from the book. They
also indicate sources for a generalization on matroids called greedoids.
Chapter 4
Optimizing with Feedback
4.1 Introduction
The optimizations that can take advantage of feedback information are probably the most
interesting part of the whole compilation process; they certainly are among the least understood aspects of it. There are two related but distinct issues that are important and need
to be considered:
1. What kind of data should be used in the feedback process to drive the optimizations.
2. How should feedback optimizations be ordered, among each other and among conven-
tional optimizations.
One thing to bear in mind is that feedback based optimizations are probabilistic.
That
is, they are not guaranteed to improve the performance of the code fragment that was
transformed under all inputs. Unlike traditional optimizations that are, for the most part,
oblivious of the input to the program, and will generally reduce the execution time for
the code fragment that was transformed, feedback based optimizations are highly input
dependent. This means that it is possible for a feedback based transformation to actually
degrade the performance of a code fragment on certain classes of inputs. The justification,
though, for using this class of transformations is that based on the feedback information,
we expect that if the future usage patterns of the program correlate well to the past usage
patterns, then the transformations we perform will have a net benefit for most of those
anticipated inputs.
The business of assessing an optimization that is based on feedback data is an open topic. Indeed, the question of what the right measure is for assessing the benefit of an optimization has many possible answers, none of which is easily refuted. We provide a more detailed discussion
of this in Chapter 5.
In the remainder of the chapter, we shall discuss the optimizations that could have been
done and how they would have been implemented using the features provided by the system.
We shall keep the discussion in the light of the two issues raised earlier, and how their
consideration would affect the implementation of the particular optimization under question.
4.2 Trace Based Optimizations
Trace based optimizations are those that are based on traces constructed within the control
flowgraph using the frequency information of the feedback system. Fisher describes the process of Trace Scheduling as a technique for microcode compaction on VLIW machines [10].
The process involves grouping instructions that are likely to be executed sequentially according to the profiled data. These instructions are collectively known as a trace and are
usually larger than a basic block; Fisher's technique constructs these traces allowing more
parallelism to be extracted so that the wider instructions result in shorter code. To maintain
correctness, extra code has to be included to ensure that when control leaves a trace, the
state of the program is as it would have been before the trace was constructed. Ellis details
the algorithms involved in building traces from basic blocks and describes the techniques
for maintaining correctness and controlling code explosion [9]. Akin to trace scheduling is
trace copying described by Hwu and Mahlke [20]. Trace copying is very similar to trace
scheduling, but has simplified maintenance routines at the cost of higher code expansion
than trace scheduling. All the optimizations that pertain to traces can be implemented
with traces constructed by either method.
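To make the trace-construction step concrete, the following is a minimal sketch of the usual greedy strategy for growing traces from profiled frequencies, in the spirit of the techniques cited above; the data-structure names and the probability threshold are illustrative assumptions, not the XL implementation.

def build_traces(block_freq, edge_freq, successors, min_prob=0.6):
    """block_freq: {block: count}; edge_freq: {(u, v): count};
    successors: {block: [succ, ...]}.  min_prob is an assumed threshold."""
    placed, traces = set(), []
    # seed each trace with the hottest block not yet placed in any trace
    for seed in sorted(block_freq, key=block_freq.get, reverse=True):
        if seed in placed:
            continue
        trace, current = [seed], seed
        placed.add(seed)
        while True:
            succs = [v for v in successors.get(current, []) if v not in placed]
            if not succs or block_freq[current] == 0:
                break
            best = max(succs, key=lambda v: edge_freq.get((current, v), 0))
            # extend the trace only along a sufficiently likely edge
            if edge_freq.get((current, best), 0) / block_freq[current] < min_prob:
                break
            trace.append(best)
            placed.add(best)
            current = best
        traces.append(trace)
    return traces

# Example: a diamond whose left side is hot.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
bf = {"A": 100, "B": 90, "C": 10, "D": 100}
ef = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
print(build_traces(bf, ef, succ))   # [['A', 'B', 'D'], ['C']]

Growing disjoint traces in this way keeps the later bookkeeping simple, since each block belongs to exactly one trace.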
Although originally designed for VLIW machines, trace scheduling and its related techniques
are applicable to uni-processor machines because the traces, typically larger than basic blocks,
expose more opportunities for code optimizations. Basic blocks are typically no more than
five instructions long, and so peephole optimizations on these basic blocks are limited in
scope and effect. The larger traces allow classic optimizations to exploit code properties
that exist across basic blocks. Hence, trace constructing techniques coupled with classic
code optimizations can provide very effective methods for improving code performance, as
described by Chang, Mahlke and Hwu [4].
These optimizations should probably be attempted before, and probably even instead of,
the classic code optimizations. For transformations, like copy propagation, constant folding
and common subexpression elimination, that do not modify the structure of the flowgraph,
the global trace based optimizations would simply replace the classic ones since the basic
blocks would only be exposing a subset of the code exposed by traces. Other types of
optimizations that could modify the topology of the graph might present some difficulty in
maintaining the probabilities assigned to the edges in the graph, and so their feedback based
counterparts ought to be done first, and then, if necessary, another pass of the traditional
optimizations could be performed.
Although these optimizations were not implemented, this was no limitation of the design
of the feedback mechanism. The primary type of feedback information would be the edge
frequencies of the flowgraph which were already available in the system. These frequencies
would be used essentially for trace construction since the optimizations themselves are
mostly unchanged and are simply applied to the larger traces instead of to the basic blocks.
Therefore, from the point of view of having the feedback information, the system proves to
be adequate for implementing at least this class of optimizations.
4.3 Non-Trace-Based Optimizations
This class of optimizations uses the feedback information directly on the code from the
flowgraph and may use more than just the edge frequencies. The type of information being
fed back is of particular importance here because it determines what aspects of the program
may be improved. For example, if the feedback information contains cache-miss information
per basic block then a possible optimization might be to rearrange the layout of the code
purely for the sake of reducing the memory access latencies due to cache misses. As another
example, suppose the feedback contained statistics about the contents of variables on a
per basic block basis. We could imagine that if a dispatch loop, for example, were being
optimized, we might be able to notice correlations between the procedures called and the data
observed. From this, the compiler might decide that it is beneficial to inline a procedure,
or to reorder code layout to improve instruction cache hit ratios. The point of both those
examples was to show that when we consider more information than simply the execution
frequencies, we provide opportunities for improving performance in ways that are beyond
the typical considerations for optimization.
4.3.1 Using Execution Frequencies
An important optimization that would benefit immensely from feedback execution frequencies is branch prediction. The idea is to exploit the unequal delays along the two branch
paths (in the RS6000 in particular) by rearranging the code. The reason for the unequal
branch delays is that the RS6000 assumes that the fall-through branch is taken whenever
a conditional branch instruction is encountered. Therefore, the taken branch path has a
longer delay than the fall-through branch path and so conditional branches are better off
arranging the tested condition in such a way that the path most likely to be taken is the
fall-through branch path. The dynamic information is used to determine the likelihood of
a particular condition evaluating to true, thus facilitating this optimization.
In super-scalar architectures the branch unit tries to resolve upcoming branches while other
computations take place in the other units. Sometimes it is not possible to schedule a
condition code evaluation early enough to allow the branch unit to completely resolve a
branch before the instruction is actually encountered. In some of these cases, frequency
information on which branch tends to be taken would be very useful. Knowing which branch
is more likely, the compiler could replicate the first few instructions (if semantically sound)
of the target and insert them before the branch instruction. This is effectively another
version of branch prediction, but exploits a different feature of the RS6000 architecture.
Frequency information can also help in measuring the benefits of certain optimizations
under conditions that can only be determined at run-time. For example, in code motion,
when instructions are moved outside loops, there is usually some overhead involved with
doing so. In the case where the loop is actually being executed a very small number of times
(though it could potentially be executed a large number of times) it may be more efficient
not to perform the code motion, since the overhead would cause a significant delay relative
to the time spent within the loop. Typically, when these optimizations are implemented, the
loop is assumed to execute for some minimum number of times. So this optimization would
be implemented by checking to see if the execution frequency of the loop was indeed more
than the assumed minimum. If not, then the target code would not be modified.
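A guard of the kind just described might look like the following minimal sketch; the helper name and the assumed minimum trip count are illustrative, not taken from the implementation.

ASSUMED_MIN_TRIPS = 10   # trip count the optimizer would otherwise assume

def should_hoist(loop_header_freq, loop_entry_freq, min_trips=ASSUMED_MIN_TRIPS):
    """loop_header_freq: profiled executions of the loop header;
    loop_entry_freq: profiled entries into the loop from outside."""
    if loop_entry_freq == 0:
        return False                      # loop never entered in the profile
    avg_trips = loop_header_freq / loop_entry_freq
    return avg_trips >= min_trips         # otherwise skip the code motion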
The current implementation with the default profiler provides frequency counts as the feedback data; therefore all the optimizations outlined above are supportable by the interface
and the xprof profiler together. Unfortunately, time constraints would not allow the implementation of all of these, although branch prediction was attempted. The details of the
implementation of branch prediction are given below.
4.3.2 Using Other Types of Feedback Information
Although certainly a useful type of feedback data, the frequency information is not the only
type with merit. As mentioned above, if we had data that summarized values of variables
throughout the run, we might be able to use this to obtain correlations with procedures
called, or memory addresses referenced. One use of such data would be to improve cache
performance by reordering instructions or by hinting at cache lines to prefetch. Another
use of this type of data is in conjunction with techniques that statically analyze programs
to derive symbolic expressions for the probabilities and relative frequencies of edges in the
flowgraph such as the one described by Wang in [18].
Since the semantic model used is the control flowgraph, any type of data that is basic block
oriented would be suitable for directing optimizations. In particular, if we had a profiler
that returned cache miss information, say, per basic block, then we could implement the
optimization described above in the system as is with no need for modification to the
interface.
Unfortunately, a limitation of the implementation is that it assumes the data collected is for
the control flowgraph. While the drawbacks to this assumption are bearable, it would have
been cleaner to have the model that the data applied to specified in the interface. Then we
would be able to handle almost any type of feedback information that someone could think
of.
4.4 Branch Prediction
Branch Prediction was the only feedback based optimization that was implemented. The
compiler assigned probabilities to each edge based on the frequencies obtained from the
flowgraph annotation stage by dividing the frequencies on each edge by the sum of the frequencies of edges leaving the same node. Edges were also assigned costs which corresponded
to the lengths of the basic blocks plus the inherent cost of traversing the branch depending
on whether it represented a true or false predicate of the conditional branch statement. The
probabilities on each edge were then multiplied by the assigned costs to yield a weighted
cost for each branch. If the weighted cost for a branch was higher than the other by a fixed
threshold value, then the compiler rearranged the condition so that the higher weighted
cost was on the true branch.
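The decision criterion just described can be sketched as follows; this mirrors the prose rather than the actual XL code, and the threshold value shown is an invented placeholder.

FLIP_THRESHOLD = 2.0   # assumed fixed threshold on the weighted-cost difference

def weighted_costs(freqs, costs):
    """freqs, costs: {edge: value} for the two edges leaving the branch node."""
    total = sum(freqs.values())
    return {e: (freqs[e] / total) * costs[e] for e in freqs} if total else {}

def needs_rearranging(freqs, costs, threshold=FLIP_THRESHOLD):
    w = weighted_costs(freqs, costs)
    if not w:
        return False
    # rearrange the condition when one weighted cost exceeds the other by the threshold
    return max(w.values()) - min(w.values()) > threshold

# Example: taken edge seen 900 times with cost 7, fall-through 100 times with cost 4.
print(needs_rearranging({"taken": 900, "fall": 100}, {"taken": 7, "fall": 4}))   # True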
We used an expected cost criterion in light of Fisher's report of better results when "number
of instructions per branch taken" was used as a criterion for predicting branch directions
instead of just the raw probabilities [11]. Note that we could just as readily have used
Figure 4.1: Reordering Basic Blocks for Branch Prediction
the raw frequencies as the predicting data, by simply changing the expression for assigning
weights to the graph edges in the specification file. This relative ease of adjusting the
criterion is a clear advantage to the programmable interface and modular design of the
implementation.
4.4.1 Implementing the Transformation
There are two steps involved in this implementation of branch prediction:
1. Negate the condition test for the branch instruction.
2. Reorder the basic blocks so that the target of the previously true condition immediately
follows the branch.
The first step preserves the semantics of the program after the basic blocks are moved
and the second step moves the target basic block to the fall through position. After these
two steps have been completed, the fall through and taken basic blocks for the branch
instruction will have been exchanged. As an example, suppose code at label L1 is reached
when condition C evaluates to true; otherwise the code immediately following the branch
instruction (call that point L2) is reached (see Figure 4.1). Then to reorder the branches,
we want L1 to be located where L2 is currently, but the code should still have the same
semantics as before the transformation. Now, if L1 is to be placed immediately after the
branch instruction, then it can only be reached if the condition at that branch instruction,
C', evaluates to false. Since L1 was reached when C was true, but is now reached when C' is false, it
must be that C' is the negation of C. So we see that the two steps described will produce
a semantically correct reordering of the branch target basic blocks.
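The two steps can be illustrated on a toy representation of a conditional branch; the classes and the purely textual condition negation are only for illustration, since the real compiler works on XIL and flips the condition-register test instead.

class Branch:
    def __init__(self, condition, taken_label, fall_through_label):
        self.condition = condition                  # e.g. "C"
        self.taken_label = taken_label              # reached when the condition holds
        self.fall_through_label = fall_through_label

def negate(condition):
    # purely syntactic negation, adequate for the sketch
    return condition[4:] if condition.startswith("not ") else "not " + condition

def predict_branch(branch):
    """Exchange the taken and fall-through targets while preserving semantics."""
    branch.condition = negate(branch.condition)              # step 1
    branch.taken_label, branch.fall_through_label = (         # step 2
        branch.fall_through_label, branch.taken_label)
    return branch

b = predict_branch(Branch("C", taken_label="L1", fall_through_label="L2"))
print(b.condition, b.taken_label, b.fall_through_label)      # not C L2 L1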
In the actual implementation, by the time this optimization is to be done, the code has
already been laid out. This meant that it would have been difficult to reorder the basic
blocks according to the dictated order of the probabilities and costs. Instead, whenever a
54
Before
After
CL.6:
CL.6:
C L.
c4
cr318=gr315,gr3178,
BF
CL.6,cr318,...
c4
cr318=gr315,gr317
.
2:
CL.2:
CL.2:
B
CL.6
Figure 4.2: Inserting Unconditional Branches to Reorder Instructions
basic block needed to be relocated, an unconditional jump to it was inserted at the point
where that code would have been placed - typically immediately after a conditional branch.
Figure 4.2 illustrates how this reordering of branches was done. The code fragment is in XIL,
the intermediate language of the compiler. BF and BT mean "branch if false" and "branch
if true" respectively; B is the mnemonic for the unconditional branch instruction; and CL.n
is simply a compiler generated label. The remaining instructions are not important to the
discussion, although they indicate the relative positioning of the branch instructions.
The problem with this implementation was that the RS6000 does not carry its assumption
through: having assumed that the fall-through path would be taken, it will not then follow
the unconditional branch placed on that path. The solution to this problem was to
move a few instructions from the beginning of the target basic block (the one that should
have been moved had it been easy to do so) to immediately after the conditional branch
and unconditional branch. These instructions fill the pipeline so that when the RS6000
encounters the unconditional branch inserted there by the compiler, it can fetch the branch
target of the unconditional branch at no extra cost in cycles (see Figure 4.3). The net
effect is a reduction in the total number of cycles for the most likely case of the condition
code - based on the statistics generated by the input set given in the profiling stage.
4.5 Bibliographic Notes
Fisher originally presented the technique of Trace Scheduling as a method for compacting
microcode for a VLIW machine. The larger code fragments permitted more data-independent instructions to be identified so that they could be executed in parallel. The wider
instructions led to shorter code. This technique has been extended to uniprocessor compilers to allow the optimizer to work with the larger code fragments. It has been demonstrated
to have favourable results in the IMPACT compiler at the University of Illinois, where it has been implemented [4, 20]. In [4] Chang, Mahlke and Hwu show how a variety
of classic code optimizations can be implemented using frequency feedback information.
Fisher and Freudenberger recently presented some results on branch prediction that indicated that measuring the success of branch prediction can, and ought to be done, by
taking the cost of mispredicting branches into account, in addition to the fraction of such
branches [11].
Figure 4.3: Filling the Pipeline between Branch Instructions (before/after XIL fragments)
The idea of using information about the values of variables in the program was inspired
by informal talks with Michael Blair. He proposes to use such information to help direct
code specialization for particular data types within languages which are not type checked
at compile-time [3].
Chapter 5
A Theoretical Framework for Evaluating Feedback-Driven Optimizations
5.1 Introduction
In this chapter, we use an annotated control flowgraph as the working model for a procedure.
We present a general framework for reasoning about how a transformation is directed by
feedback data to modify the procedure, and how its relative benefit can be assessed. We
also present a technique for annotating the control flowgraph based on execution frequency
feedback data and a method for analyzing code transformations based on this technique.
This technique is presented in the context of the framework outlined, with specific meanings
assigned to the parameters of the framework. We then apply the general technique to branch
prediction in particular to show how that optimization would be analyzed using the theory
presented.
Ideally, we would like to view a program at two resolutions: the call graph and the control
flowgraph. In the call graph, each node represents a procedure and an edge from node A to
node B exists if procedure A can call procedure B. This graph represents the broad view of
the program, and profiler information for this model usually comes from sampling the state
of the machine at regular intervals. The relative frequencies with which the edges and nodes
are annotated indicate which procedures are called most often and from where. This model
would be used first, as a guide indicating which procedures ought to be analyzed further.
Deciding that, the control flowgraph would be used to model one of those procedures, where
we could then perform the analysis described in this chapter. This ideal situation would
require that both granularities of profiled information be available simultaneously, which
unfortunately, is not available within this current implementation. However, it is relatively
straightforward to manually obtain profiler information on the call graph so the effects of
that shortcoming are minimal.
5.2 The Generalized Framework
We characterize feedback based optimization on a deterministic procedure, P, with the
following parameters:
1. The set of inputs, I, that P accepts.
2. The set of procedures, A, derivable from semantic preserving transformations on P.
3. The control flowgraph, Gp = (Vp, Ep).
4. Cost functions Cv : Vp → R and Ce : Ep → R for the nodes and edges of Gp.
5. The feedback data, φP = (φv, φe), for the nodes and edges respectively, where φv : I → Fv = {f | f : Vp → R} and φe : I → Fe = {f | f : Ep → R}.
6. A function to compute the feedback, ΦP, for a set, S, of inputs in terms of φP(a) for each a ∈ S: ΦP : 𝒫(I) → Fv × Fe, where 𝒫(I) is the power set of I.
7. The set of feedback based transformations, T, that preserve the semantics of P: T = {τ | τ : Fv × Fe → {f | f : A → A}}.
8. A measuring function, m : A × I → R.
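As a purely illustrative aid (none of these names come from the thesis), the parameters above could be represented concretely as an annotated flowgraph together with the transformation and measuring functions kept as first-class values.

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Node = str
Edge = Tuple[Node, Node]
Procedure = Any       # stand-in for an element of A
ProgramInput = Any    # stand-in for an element of I

@dataclass
class AnnotatedFlowgraph:
    nodes: List[Node]
    edges: List[Edge]
    node_cost: Dict[Node, float] = field(default_factory=dict)      # C_v
    edge_cost: Dict[Edge, float] = field(default_factory=dict)      # C_e
    node_feedback: Dict[Node, float] = field(default_factory=dict)  # from the node feedback
    edge_feedback: Dict[Edge, float] = field(default_factory=dict)  # from the edge feedback

# A transformation in T maps the feedback annotations to a rewrite of the
# procedure; the measuring function m scores a procedure on one input.
Transformation = Callable[[AnnotatedFlowgraph], Callable[[Procedure], Procedure]]
Measure = Callable[[Procedure, ProgramInput], float]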
The cost functions can represent any innate features of the program that are not dependent
on the feedback, but are considered either by some transformations or by the measuring
function or both. The feedback data has been divided into two types of functions, one for
the edges and the other for the nodes of the control flowgraph. Note that both of these
functions result in functions that assign real numbers to the edges or nodes, as appropriate,
of the control flowgraph Gp; this models the fact that the feedback is used to annotate the
entire flowgraph. Although the resulting functions annotate the graph with only a single
real value, we could easily extend the definition to assign a set (vector) of real values, or
even values from some algebraic system other than a field, so long as they can be used
in the computation described by m.
The set T is the set of all the transformations that use feedback information and preserve
the semantics of P throughout their transformations. Each of the transformations within T
uses the annotation of the whole flowgraph to produce new code semantically equivalent to
P. The feedback data for a set of inputs, ΦP(S), is still a pair of functions from the sets Fv and
Fe, so that it can be used to direct a transformation in T. Therefore, T is not constrained
to contain only transformations that base their manipulations on feedback from only one
input, although these are certainly contained within T.
The measuring function is a generalization of the various means by which we could assess
the benefit of an optimization. In this framework, we will treat the function value of m
as one that needs to be minimized. That means that transformations which yield smaller
values for m are better than those that yield larger values of m. A good analogy for m is
the running time, although we do not restrict ourselves to considering the running time
as the only possible measure for how good an optimization is. One thing to note is that
m is defined on a pair of a procedure and an input, but this is slightly deceptive. In
fact m is allowed to depend upon any of the annotations of the graph since they are all
dependent upon either the program alone, or the input in combination with the program.
The measuring function will typically be in terms of the cost function or the graph and its
probabilities or any combination of them. The transformations in T produce programs in A
from programs in A, so that the results of the transformations are also valid arguments to
m. As an example, suppose that procedure P is profiled with input a, and then transformed
with a transformation τ which is based on the feedback results φP(a); then we will say that
τ benefits P on input a if m(τ[φP(a)](P), a) < m(P, a).
5.2.1 An Input-Independent Assessment Function
One thing to note is that m is not computed, but rather it is measured, since it involves
knowing the behaviour of a program on a particular input. In other words, although we
may write a condition on the relation between the values of m to indicate a beneficial
transformation, there isn't any actual way of computing m if it depends upon the input
in any non-trivial way, since in many instances such a function reduces to deciding the
halting problem. Measuring the value is infeasible, because that amounts to simulating the
procedure which can be as bad as actually running it, and therefore unacceptable because
of the amount of time it could take.
The basic assumption of the model is that we will be able to compute the probability of
traversing an edge given that the procedure has reached the node from which that edge
emanates. This probability should be based on the feedback data in some way and will
ultimately represent the accuracy of the model. These probabilities will be used to derive a
summary function m̂ that represents m "summarized" over all the inputs. In this fashion, we
obtain a function which is computable as well as independent of input and hence represents
a feasible method of assessing a transformation.
In this presentation we shall use the expected value of m as the summary; however, there
is no reason that m̂ should be restricted to this type of average value. For example, we
could imagine using the most likely path in the graph as the modal analogue to the average
instead; or, as a slightly more sophisticated value, we could use a weighted mean of the
fewest paths that contribute, say, 90% of the probability of reaching t from s. These modal
values would be more useful in the cases of noisy data, where highly uncommon inputs that
appear in the feedback data influence the mean so that it compensates for them spuriously.
A modal analysis ignores these data, so that the assessment is closer to reality.
We have mentioned two possible methods of doing an average case analysis for assessing
transformations. There are undoubtedly others that will have merit in some context or
other, however the point was to demonstrate how we can go about assessing the relative
benefits of transformations on procedures. In this manner, we will have a way of deciding
whether a certain optimization ought to be done, without having to actually implement it.
This analysis technique should prove to be especially useful for dynamic feedback where
the compiler automatically re-optimizes based on new feedback, and so needs to be able
to assess transformations without actually implementing them and simulating or observing
their running times.
Defining The Summary
We shall use the expression Pr[E | v] to denote the probability that event E occurs given that
the current state of computation is at node v in Gp. So Pr[v | u] represents the probability
assigned to the edge (u, v) and Pr[a | u] for some a ∈ I represents the probability that
the input a was used given that the computation reached node u in Gp. We also assume
the existence of two distinguished nodes s and t in Gp, representing the entry and exit
points of the procedure P. All computations are assumed to start from node s and if any
computation reaches node t it is considered completed. If there is more than either one of
these types of nodes, we can always insert a node of the appropriate type so that it becomes
the unique one of that type. For example, if there are two exit points, t' and t", then we
could insert an extra node t and connect t' and t" to it so that t would be the unique exit
node.
The summary function, m̂, is given in terms of m by:

\[
\hat{m} : A \to R, \qquad
\hat{m}(P) = \sum_{a \in I} m(P, a)\,\Pr[a \mid t]
\]

Since P is assumed to be deterministic, every input a corresponds uniquely to a (not
necessarily simple) path, p_a = (s, v_1, v_2, \ldots, t), in Gp, and we can express m(P, a) as a
function h of p_a, which is computable if p_a is known. So,

\[
\Pr[p_a \mid t] = \frac{\prod_{(u,v) \in p_a} \Pr[v \mid u]}{\Pr[t \text{ is reached}]}
\]

and we get

\[
\hat{m}(P) = \sum_{a \in I} h(p_a)\,\Pr[p_a \mid t] \tag{5.1}
\]
\[
\hat{m}(P) = \sum_{a \in I} h(p_a)\,\frac{\prod_{(u,v) \in p_a} \Pr[v \mid u]}{\Pr[t \text{ is reached}]} \tag{5.2}
\]

5.2.2 Quantitative Comparison of Transformations
Having defined our summary function, we now show how we would use the model to evaluate
the merit of a transformation. For any given transformation τ ∈ T, we assume that τ
includes rules for how the probabilities of new edges are assigned based on the original graph
annotations, since we do not have feedback data for those new edges to indicate how to assign
probabilities to them. Therefore, we can apply τ to P to obtain a new control flowgraph
Gp' from the resulting procedure P'. Since τ ∈ T, we know that for the set of inputs, S,
used in the profiling, P' = τ[ΦP(S)](P) is in A, and therefore the corresponding control
flowgraph, Gp', could be used to measure m(P', a) for any input a and hence to compute
m̂(P'). Hence, we can compare the values m̂(P') and m̂(P) to decide whether or not τ is a
worthwhile transformation under the input circumstances implied by S. If m̂(P') < m̂(P)
then the expected value of m - the function which defines how transformations are to
be compared - for P' over all the inputs, is no worse than that of P, assuming a usage
pattern represented by S. In this case, we would conclude that it is beneficial to perform τ
on P. Notice that we could extend this further by computing the difference m̂(P) - m̂(P')
for many different candidate transformations on P. The transformation with the maximal
difference would be selected as the one to yield the highest expected benefit.
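As a small worked illustration (the numbers are invented), suppose the flowgraph of P consists of a branch at s with two paths to t: the likely path, with probability 0.9, currently carries a 3-cycle penalty and costs 13, while the unlikely path, with probability 0.1, costs 10. If a transformation τ moves the penalty to the unlikely path without changing the probabilities, then

\[
\hat{m}(P) = 0.9 \cdot 13 + 0.1 \cdot 10 = 12.7, \qquad
\hat{m}(P') = 0.9 \cdot 10 + 0.1 \cdot 13 = 10.3 ,
\]

so m̂(P') < m̂(P) and τ would be judged beneficial, with an expected saving of 2.4 cycles per execution; this is essentially the quantity that reappears as 3(f13 - f12) in the branch prediction analysis of Section 5.5, expressed there in raw frequencies rather than probabilities.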
5.3 Cycle-Latency Based Assessment
Since the majority of compiler optimizations aim to improve the running time of the application being compiled, we shall show how the framework can be parameterized to assess
these optimizations. We use an estimation of the number of cycles a basic block will take to
execute (the cycle latency) as our representation of the execution time. There are actually
two possible types of annotations which we can work with: label the nodes with latencies, or
label the edges. Annotating the nodes allows for simpler calculations, but has the problem
that it does not distinguish between branch edges which may have different cycle latencies.
Annotating the edges does take into account the differences in branch latencies, but has
a more complicated analysis. Both methods are presented below, the only difference in
the parameters being the use of an edge function instead of a node function. As a side
effect of using the parameterizations given below, both models allow us to compute the
average running time of a procedure, all within the confines of the framework and the given
parameterization.
5.3.1 Parameter Specification
The feedback data is given by the frequencies with which the nodes and edges were traversed.
We use the feedback for the set of profiled inputs, (Φv, Φe) = ΦP(S), and derive the probability of traversing
an edge, given that we are currently at the node from which it emanates, by taking the
ratio of its frequency to the frequency of the node; that is:

\[
\Pr[(u, v) \mid u] = \frac{\Phi_e(u, v)}{\Phi_v(u)}
\]
We assume that some reasonable method was used to summarize several profiled runs
with several different inputs so that these probabilities will reflect the current usage pattern
of the procedure.
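A minimal sketch of this computation, assuming the summarized feedback is available as plain frequency maps (the function and argument names are illustrative):

def edge_probabilities(node_freq, edge_freq):
    """node_freq: {node: frequency}; edge_freq: {(u, v): frequency}."""
    probs = {}
    for (u, v), f in edge_freq.items():
        if node_freq.get(u, 0) > 0:
            probs[(u, v)] = f / node_freq[u]
        else:
            probs[(u, v)] = 0.0   # node never reached in the profiled runs
    return probs

# Example: node u executed 100 times; the (u, v) edge was taken 75 of those times.
print(edge_probabilities({"u": 100}, {("u", "v"): 75, ("u", "w"): 25}))
# {('u', 'v'): 0.75, ('u', 'w'): 0.25}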
Annotating Edges with Costs
The cost functions are specified by estimates on the number of cycles the instructions
within a basic block would take to execute. In some machines, for example the RS/6000,
the branches also have costs assigned to them, although they are not associated with any
code. The cost assigned to an edge is the cost of its destination node summed with its
inherent cost. The reason for the difference in inherent costs on branches is that taking
one branch may have a larger delay associated with it than taking an alternate one. In
the RS/6000, this difference in delay was exploited with branch prediction as described in
Chapter 4. The cost functions can be defined in terms of the architecture being used so
that architecture dependent properties are captured in the model. The difference of cost in
the branch taken in the RS/6000 is one example of such an architecture specific property;
but we can also extend the cost function to handle properties such as expected delay for
LOAD
instructions and some arithmetic operations that may take more than one cycle to
complete. By choosing the cycle delays according to the architecture of the machine, we can
61
improve the accuracy of the assessment, as well as customize the analysis for the underlying
architecture.
The value of the measuring function, m(P, a), is the sum of the costs of the edges in p_a,
where, as above, p_a is the path used in Gp by running P on a. Keeping the same notation,
the entry node is named s and the exit node t; given the probabilities and costs on each
edge, we define m as follows:

\[
m(P, a) = \sum_{e \in p_a} C_e(e)
\]

We summarize the average running time with the weighted average measurement. This
weighted average, m̂, is defined as follows:

\[
\hat{m}(P) = \sum_{p : s \to t} C(p)\,\Pr[p] \tag{5.3}
\]

where

\[
C(p) = \sum_{e \in p} C_e(e), \qquad
\Pr[p] = \prod_{(u,v) \in p} \frac{\Phi_e(u, v)}{\Phi_v(u)}
\]
We shall show later that m̂ is well defined when all edges have strictly positive probabilities.
In the cases where some of the edge probabilities are zero, causing the probability
on another edge to be one, we have to be a little more careful. A discussion of this is
presented in Section 5.3.2.
Annotating Nodes with Costs
In this model every node is assigned some cost which represents the number of cycles required to execute the code for the basic block represented by that node. The measuring
function, m, is defined in a completely analogous manner to the definition for edge annotations, except that we sum the costs of the nodes on a path instead of those of the edges
of the path. Likewise for the definition of m̂. Note that the probabilities are still on the
edges, as they were before. So only the definitions for m and C(p) need to be redefined:

\[
m(P, a) = \sum_{u \in p} C_v(u)
\]

where p is the path used in Gp by running P on a, and

\[
C(p) = \sum_{u \in p} C_v(u)
\]
5.3.2 Handling Infinite Loops
The weighted average m̂ is defined in terms of all the possible paths from s to t; therefore
we have to take care that it is well defined despite all the cycles possible within Gp. A point
to note is that m̂ is defined as the modeled cost between two distinguished nodes s and
t, and we condition our computation of m̂ on t being reached. This should automatically
eliminate any inputs that would cause P to loop infinitely, since they would all be assigned
probability zero, given that node t was reached.
In practice, this is not quite guaranteed, since we don't always terminate in running P
while collecting feedback - sometimes the profiler stops collecting information because it
has reached its buffering limit, for example. In cases, like these, the feedback information
is still useful, but we are not guaranteed to have reached node t in Gp. We can prevent
unfortunate infinities in our calculation of m̂ by observing the annotation on the graph
before doing any computation to see that there is at least one path from s to t that has
been annotated with a non-zero probability. If there are no such paths, then our average
costs are assumed to be infinity. To enable comparisons even with infinite values for m̂,
we can use different end points within Gp that have non-zero probability paths between
them. Doing that would still allow us to assess the effect of transformations on some of the
internal nodes of the procedure, though we would have to be careful about which nodes we
do evaluate, and how we use that evaluation.
We should distinguish between the case of the infinite loop mentioned above and the case of
the closure of a cycle in the graph. Note that in the average case analysis, we compute m̂ as
a weighted average of m over all inputs. This means that whenever we have a cycle in the
graph on a path from s to t (corresponding to a loop in the procedure P) we must average
over the inputs that would cause that cycle to be traversed. Since we do not perform any
semantic processing on the graph, we must assume that it is possible to traverse any cycle
an arbitrary number of times. That is, we must take the average over a countably infinite
number of (finite length) paths. The value of that average is the closure associated with
that cycle, which is computed as follows: if p is the probability of traversing the cycle and
c its cost, the value of the closure is given by:

\[
\sum_{n=1}^{\infty} n\,c\,p^{n}
\]

If there is some edge on the path that has a probability less than 1, then we are guaranteed
that this value is well defined, since the value of p in the expression above is less than 1 and hence that
expression converges to

\[
\frac{c\,p}{(1-p)^{2}}
\]
which is finite. Thus, all cycles that meet the stated criterion have a finite cost, all acyclic
paths also have finite cost, and there are a finite number of cycles in any graph; hence the
average cost computed for m̂ will be finite. We pointed out that we would have to check
first, before trying to compute m̂, for the existence of a path from s to t with non-zero
probability. Since we condition our probabilities on reaching node t in the calculation of
m̂, any cycle that falls on such a path cannot have a probability of 1.
Hence, while computing m̂, every cycle encountered on a valid path from s to t will satisfy
the criterion that at least one edge has a probability less than 1. That is to say, we need not
worry about infinite loops so long as there is a path from s to t with non-zero probability.
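For completeness, the closed form used above follows from the usual geometric-series manipulation (a routine check rather than an argument from the thesis):

\[
\sum_{n=1}^{\infty} n\,c\,p^{n}
  = c\,p \sum_{n=1}^{\infty} n\,p^{\,n-1}
  = c\,p\,\frac{d}{dp}\!\left(\frac{1}{1-p}\right)
  = \frac{c\,p}{(1-p)^{2}}, \qquad 0 \le p < 1 .
\]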
5.3.3 Computing the Average Cost
In the scheme outlined where the costs were associated with the nodes, we can view the
model as a Markov chain with rewards, since the calculation of m for a path has the Markov
property. We can see this easily, since

\[
m(u_1, u_2, \ldots, u_i) = m(u_1, u_2, \ldots, u_{i-1}) + C_v(u_i)
\]

so that the intermediate value along the path depends only on the previous value and the
current change in cost. The computation of m̂ is simply the calculation of the expected
accumulated reward for travelling from s to t. This particular problem happens to be a fairly
standard problem in the field and can be computed in a fairly efficient manner (polynomial
in number of nodes).
In the case where we use the cost annotated to the edges of the flowgraph, we have to
modify our model a bit to be able to apply Markov chain theory since it is defined on
rewards associated with nodes. One way to do this is to assign each node the sum of its
incoming edges' costs and then perform the analysis on this new graph. Another method
is to transform the graph so that the edges become nodes and the nodes become edges. In
either method, we can obtain a graph which fits the model used in Markov chain theory
with rewards, and so we can use the theory to compute the value of m̂.
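As a sketch of that computation (illustrative names, with numpy assumed; this is not the implementation discussed in the thesis), the expected accumulated cost from each node satisfies E[u] = C_v(u) + Σ_v Pr[v | u] E[v] with E[t] = 0, which is a small linear system:

import numpy as np

def expected_cost(nodes, edge_prob, node_cost, s, t):
    """nodes: list of node ids; edge_prob: {(u, v): Pr[v | u]};
    node_cost: {u: C_v(u)}.  Returns the expected cost of reaching t from s."""
    transient = [u for u in nodes if u != t]          # t is treated as absorbing
    index = {u: i for i, u in enumerate(transient)}
    n = len(transient)
    Q = np.zeros((n, n))                              # transitions among transient nodes
    c = np.array([float(node_cost[u]) for u in transient])
    for (u, v), p in edge_prob.items():
        if u != t and v != t:
            Q[index[u], index[v]] = p
    # (I - Q) E = c, which is solvable when every cycle has an edge with
    # probability < 1 -- the condition discussed in Section 5.3.2.
    E = np.linalg.solve(np.eye(n) - Q, c)
    return E[index[s]]

# Tiny example: s -> a with p = 0.9, s -> t with p = 0.1; a -> s with p = 0.5, a -> t with p = 0.5.
nodes = ["s", "a", "t"]
edge_prob = {("s", "a"): 0.9, ("s", "t"): 0.1, ("a", "s"): 0.5, ("a", "t"): 0.5}
node_cost = {"s": 2.0, "a": 5.0, "t": 0.0}
print(round(expected_cost(nodes, edge_prob, node_cost, "s", "t"), 3))   # 11.818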
5.4 Graph Annotation
Since the summary function relies on the probabilities assigned to the edges of the graph,
they determine its accuracy, and hence are extremely important. There are several methods
to obtain probabilities for the edges which have varying degrees of sophistication. The
method used in the technique for computing expected running time used a simple model
for the probability law for each edge. This may suffice for a static feedback setting since the
predicting set supposedly contains inputs which are equally likely to occur in actual usage.
We could also ascribe a weight to each input in the input set that would be used in the
computation of the branch probabilities; this would bias certain inputs of the set according
to the usage pattern.
In the dynamic setting, we would probably need a more sophisticated computation that
took into account feedback data from the previous runs. Thus, the probabilities would be
dependent upon their previous values and the most recent feedback data. This seems like a
reasonable starting point for investigating the type of probability law for these edges, since
intuitively, we would imagine that programs are used in similar ways with similar inputs
over short periods of time. This seems reasonable, since we typically work on one problem
at a time, and so the tools we use in the time we are working on that problem tend to get
the same types of inputs. Of course, we also want the system to respond to changes that
we make in our usage patterns. Herein lies the problem since we must experiment to decide
how to tune our probabilities so that they do not lag too far behind the actual usage, and
at the same time, they are not too sensitive to minor changes in usage.
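One possible control law of the kind being discussed, shown purely as a strawman (the update rule and the value of alpha are assumptions, not proposals from this thesis):

def update_probability(old_prob, new_prob, alpha=0.3):
    """Exponentially weighted update: alpha near 1 tracks recent usage
    aggressively, alpha near 0 lags behind it."""
    return (1.0 - alpha) * old_prob + alpha * new_prob

# Example: an edge previously believed taken 80% of the time, but taken only
# 40% of the time in the most recent profiled run.
print(update_probability(0.8, 0.4))   # 0.68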
Clearly, there is much work to be done in this area. At the moment, we still do not
even know what a suitable control law would be for determining the edge probabilities as
Figure 5.1: The Branch Prediction Transformation
more feedback data is collected. However, there is little we can do hypothetically, and all
progress here is bound to come from actual experiments with new models and with new
parameters for old models. This particular aspect of the field should prove to be an exciting
and challenging area for research.
5.5 Analysis of Branch Prediction
In this section, we analyze branch prediction on the RS/6000 as a specific example of an
optimization to demonstrate the application of the theoretical analysis described earlier. We
shall maintain the definition of m̂, the average cycle latency of the procedure P, along with
all the previously specified parameters from the description of the technique for measuring
average case performance. The only thing left to specify is τ, the transformation. In the
case of branch prediction, τ is an especially simple transformation since it does not change
the structure of the graph, but only rearranges some of the annotations. Figure 5.1 shows
the transformation that occurs when a branch satisfies the criterion for being rearranged.
The RS/6000 places a penalty of up to 3 cycles on a taken branch path of a branch instruction. The branch prediction transformation rearranges this penalty assignment so that the
branch path that is traversed less frequently is assigned the penalty. The expected value at
node 1 is the same for both branches, so the difference in expected value after the transformation is performed, will be due to the difference in expected cost starting from this
node. We also assume that both branches will reach the exit node. Therefore the expected
decrease in cycle latency will be
\[
\bigl(f_{12}\,c_1 + f_{13}(c_2 + 3)\bigr) - \bigl(f_{12}(c_1 + 3) + f_{13}\,c_2\bigr)
\]

which reduces to:

\[
3(f_{13} - f_{12})
\]

This represents a decrease in cycle latency if f12 < f13, which is essentially the condition under
which the transformation is applied. In the actual implementation, a threshold needs to be exceeded by the difference in
frequencies before the transformation takes place. This threshold reduces the sensitivity of
the transformation to noise in the feedback data.
5.6 Bibliographic Notes
Very little has been published on the theoretical aspects of feedback-based compilation.
Conte and Hwu [6] presented a theory for measuring the benefit of optimizations in terms
of two parameters: the change in the performance due to the optimization and the change
in the optimization decisions made as the inputs were varied.
Wang presents a method for annotating the flowgraph using static semantic information
combined with algebraic symbolic manipulations, instead of profiling information, to compute probabilities [18].
One method of measuring average execution times based on node costs and probabilities
was presented by Sarkar in connection with work done on a parallel compiler at IBM [15].
The method involved first transforming the flowgraph to a tree (details are given in [16])
and then computing the probabilities for the tree so derived.
The ideas for using Markov chain theory to compute m̂ were contributed in part by Richard
Larson and Robert Gallager, independently. The notes, written by Gallager, for the class they
have both recently lectured at MIT present an introduction to Markov chain theory with
rewards [12].
Chapter 6
Conclusions
6.1 Evaluation of Goals
The goals of this thesis were two-fold:
1. Design and implement a feedback mechanism, within an existing compiler, in a modular fashion that provided flexible methods for using various types of information in
various ways.
2. The second goal was further divided into two parts:
(a) Investigate the feedback based optimizations possible and any of the special
requirements they may have.
(b) Research the various ways in which programs and their inputs could be characterized so that the feedback-based compilation process could be analyzed in a
quantitative manner.
6.1.1 The Design and Implementation
The primary goal of this thesis has been realized. We have shown that it is possible, and
indeed feasible, to implement a feedback based compiler in a completely modular fashion.
The advantages of such an implementation were made clear by the flexibility in the types of
feedback available, and especially by the ease with which we were able to use these different
types of data. The disadvantages were primarily a degradation in running time and an
increase in the knowledge required to use the system. The latter, we demonstrated can be
offset by the use of defaults to facilitate casual users, and the former's effect is reduced, in
part, by clever implementations of aspects of the system such as the algorithm presented
in chapter 3. We believe that the disadvantages of this approach are outweighed by the
advantages, especially in the current state of the art of feedback based compilation, where
we are very much uncertain about what types of feedback are relevant and what the best
ways to use such information are.
One of the problems that arises directly from the modularity of the implementation is that
the profiler cannot assume anything about the program it is given to profile, and yet it must
try to produce the most general information possible. In the particular case of trying
to produce edge weights for the compiler, we demonstrated that this can be accomplished
efficiently by profiling just the node weights. We exploited this fact within the compiler
by requiring the profiler to profile only the basic blocks of the program given to it, and
then analyzing the graph to determine the optimal placement of instrumenting basic block
nodes. In this manner, the cost of the most expensive operation, profiling, was reduced
significantly. In our analysis of the algorithm, we obtained interesting results regarding the
effects of the graph's topology on the number of edges that needed instrumentation and the
applicability of its surprisingly simple greedy strategy which, in some sense, could afford to
be oblivious of the topology of the graph.
In addition to our particular implementation, we have identified features which any implementation should have, and we have pointed out ways in which they can be accomplished.
In identifying these necessary features, we hope not only to facilitate future implementations
and improvements of the current one, but also to establish the salient features of feedback
based compilers that would lead to more structured thinking and focussed effort in the area.
6.1.2 The Optimizations
The optimizations that stand to gain from feedback information are numerous, and many
were mentioned in the body of this thesis. Although the time available did not allow
us to explore this area satisfactorily, we have argued that the optimizations that can be
implemented are highly dependent on the types of feedback information being provided, and
on the semantic models used within the compiler. This field, being a relatively unexplored
one, has many unanswered fundamental questions. We have mentioned a few, such as what
kinds of feedback information ought to be considered in optimizations, and what kinds of
optimizations would be beneficial and in what ways. Although we do not have the answers
to these questions, we have tried to show the connections and interrelations that must exist
between them, based on the optimization methods we use today.
6.1.3 Theoretical Analysis
Our secondary goal had two aspects. The first was to explore the optimizations possible,
which we have just summarized. The second aspect was to address some of the theoretical
issues involved in the process of feedback based compilation. This latter part of the goal was a
rather ambitious one, mainly because of the scant research in that particular aspect of the
field. We have presented a theoretical framework for describing the salient aspects of the
process, and have demonstrated how it could be used to assess the benefit of optimizations.
Although incomplete, in that we do not have enough applications of it to actual optimizations, we feel that the framework set forth contains the key factors for analyzing any
feedback based transformation. By describing the process in formal mathematical terms,
we can demonstrate and investigate its various properties. More importantly, though, this
formalization helps us to answer some questions which would otherwise have been left up to
speculation. For example, in the particular specification outlined in chapter 5, we showed a
method for analyzing the average running time of a procedure. This method in itself can be
used as a general device to assess the relative merits of various transformations, assuming
the running time (actually number of cycles) is the measurement that we are interested in.
While this framework is quite general, it can be specialized to assess transformations in
various ways which are not limited to the typical notions of "faster is better". Although
that may ultimately be the case, the particular goal of a transformation may not be to
reduce the cycle time by modifying the instructions directly. It may, for instance, be a
transformation to reduce the size of the code so that some beneficial side effect takes place
(such as fewer cache misses) as a direct result of that transformation. Since there may
be many optimizations which will have immediate goals other than reduced cycle time,
it is perfectly reasonable not to specify what is being measured for relative assessment
of optimizations within the general framework. Rather, we leave it as a parameter to be
specified according to the particular set of transformations being compared.
The apparently general use and broad scope of the framework is promising, since it provides
us with a means of reasoning about feedback based transformations, and yet does not
presuppose anything about the type of data we are using for feedback nor the types of
optimizations we intend to perform, nor even how we intend to measure the benefit of the
proposed transformations. Of course, having an overgeneralized model prevents us from
reaching any non-trivial conclusions, and so is useless. However, we have demonstrated
that we can, in fact, specialize some of the parameters of the framework and still be able to
reason in very general terms about the transformations we may perform. This demonstrates
the viability of the framework since it is general enough to handle probably any feedback
transformation, yet sufficiently defined to produce results based on quantitative analysis of
the transformation.
6.2 Future Work
Due to the lack of time, there were many aspects of the implementation which could be
refined. An example is the user interface. Currently, the specification file contains an
expression in terms of the profiler's raw values that represents the data that ought to
be annotated to the internal data structures representing the application being compiled.
For efficiency reasons, the specification file is read first and that expression compiled and
incorporated with the rest of the compiler, so that the expression (which must be evaluated
for every line of output of the profiler) will be executed as compiled code instead of being interpreted.
This need for the user to recompile the compiler every time the specification file is changed
could be obviated by automating the process. On a more fundamental level, however, much
thought needs to be given to the interfacing of the three most important parts of the system:
the compiler, the profiler and the user. We have outlined what the necessary features are for
a basic interface between the compiler and the profiler, but we have not said much on the
user interface, nor have we mentioned how that would be affected by the other interface, as
it indeed is.
We have outlined the features we believe to be the keys to a good feedback mechanism,
however, many other profilers ought to be used with an implementation of such a compiler
to see if these guidelines will truly bear out in the most general of cases. On this matter, our
current implementation, although it can handle various types of information, can only do
so one at a time.
use them all in the same pass of compilation. This might require the use of two profilers or
perhaps the same profiler in two different configurations, but it would require an expansion
of the amount of data being stored and more importantly, a means of linking the data
stored to the relevant data structure. In this implementation, it was assumed that there
was only one data structure and that it was the control flowgraph. However, this is not
general enough, and ideally we would like to be able to store data for arbitrarily many
possible data structures for use with various possible optimizations. An example of this
need for annotating more than just a control flowgraph was pointed out earlier, where we
mentioned that ideally we would like to have two granularities of representation for the
program: a lower resolution one (a call graph, for example) so that procedure interaction
could be summarized easily, and a higher resolution one that focusses on the internals of
procedures. Both of these semantic models would be useful together when annotated with
feedback data and so the feedback mechanism should have the facility to accommodate
them both.
6.2.1 The Node Profiling Algorithm
Regarding the algorithm, although a detailed treatment was given in chapter 3, there were
still a few points of interest left to investigate. For example, although we know that the
topology of the graph determines what the dimension of the right null space of the coefficient
matrix, M, will be, we do not have a good way to describe what it means for the graph
when we know the dimension of the null space of the matrix. In solving the problem in the
algorithm, we work with the matrix and determine its right null space in the end, without
ever deducing the properties of the graph itself. The second unexplored area with respect
to the algorithm is that we could conceivably improve the performance of the code by
performing a second pass of the profiling stage, so that even the profiling could benefit from
the feedback. An example is that if a node were assigned a zero frequency count, then every
edge connected to it, automatically gets a zero assigned to it. The possible improvements
along these lines are explained in detail in chapter 3.
6.2.2 Optimizations
The optimizations possible within a feedback system are certainly among the most interesting aspects of the feedback based system. To start with, all the Trace Scheduling based
optimizations which have been implemented before, have been documented to be successes
and so should be implemented. Further, some of the optimizations mentioned above have
non-feedback analogues which work well and hence show promise for the feedback version.
For example, branch prediction has been implemented on non-feedback based compilers
using heuristics to determine which way branches ought to be predicted. These have produced relatively good results [2] and so would probably yield net benefits when applied to
feedback information.
Some of the other types of optimizations mentioned in chapter 4 involve using different types
of information than mere frequency counts. The example of using cache miss data to determine
better layout is an interesting one, and depends on factors that are typically outside of
the normal considerations for optimizations. Further investigation also needs to be done on
what sort of data should be stored, judged by the relative benefit of transformations that use
that data. That research would also reveal relationships between program structure, program performance and various qualities of the input. It promises to be quite exciting since
it poses several fundamental questions which we have no good way of answering offhand.
6.2.3 Theoretical Issues
We have attempted to address some of the theoretical issues involved in the feedback compilation process with the framework presented. Although the example of its application to
measuring average running time lends credence to the usefulness of the model, we need
to show that it can be used for other parameter specifications with equal success. In the
description of measuring the average running time, we showed that the average described
always existed as long as none of the looping edge probabilities were 1. We also mentioned
how that value could be computed, but it is a rather expensive operation and we would like
to find more efficient implementations, if indeed they exist.
The specification of the framework presented, itself, is general enough to be applied to
many of the optimizations mentioned, as well as some that are not particularly feedback-based, but could be adapted to be so. Note that measuring the average cost need not be as
complicated as was specified. However, with the given specification, we model the individual
contributions of all the edges in the paths through the graph so that the average so obtained
will be quite accurate. Further work in this area could produce interesting alternatives that
are easier to compute, and just as representative of the actual costs incurred by executing
the procedure.
One aspect of the framework which we did not address in detail, but which is extremely
important, is the way in which the graph is annotated. In the case of the given specification, the probabilities that are assigned to the edges will determine how accurate the overall
model is. Here, there is a distinction in the type of feedback compilation since the manner of annotation is very different when the application will be continuously profiled and
recompiled (the dynamic setting) from when it will not. The dynamic setting is in contrast
to when the compiler profiles the code once on a suite of inputs, and based on that suite
alone produces a set of probabilities for the edges. Since the latter is the current method of
implementation, and it probably will be until machines become powerful enough to be able
to reasonably support dynamic feedback, we need research in this area for the short term.
However, we also need research in dynamic feedback since the approaches to annotation are
different in the two settings. In the static setting, we annotate based on the input suite,
keeping in mind that those annotations won't change until the next time compilation is
done (manually). In the dynamic setting, we anticipate changing the probabilities assigned
and so we will tend to match the current data as near as possible to represent the current
usage pattern, whereas in the static setting, we will probably use a more balanced average
of all the values. In either case, it is quite clear that much more work needs to be done to
determine the appropriate methods for annotation of semantic data structures, such as the
control flowgraph.
Appendix A
The Specification File
The specification file consists of several fields which indicate to the compiler how to interface
with the profiler. The fields are:
* The profiler preamble
* The profiler execute command
* Output separators
* The output template
* Switches to decide whether to accumulate frequencies and costs
* The expressions for computing the weights to be annotated to the graph
A.1 Profiler-Related Commands
The profiler preamble specifies what needs to be done before the profiler is called. In this
particular implementation, the xprof profiler required that the executable be instrumented
first, as a separate process from the actual profiling. Other profilers require that the executable
be generated with certain flags set at compile time; such requirements would be specified here in the
preamble. The preamble is a string that will be executed as if it were passed to the shell as
a shell script.
The profiler execute command indicates how the profiler ought to be called. It is here
that the user would specify the mode the profiler should be in (if many are available) and
therefore what kinds of data ought to be fed back.
A.2 Parsing The Profiler Output
The output separators indicate which characters in the output file of the profiler act as separators between the data. Typically, profiler output files are human readable, which means
that the data is usually separated by spaces or some other non-numeric character. To preserve
generality, no assumptions were made about the format of the output file of the profiler,
hence these separator characters had to be specified.
Another aspect of the profiler output format that had to be specified was the output template. One assumption made about the profiler output was that the data would
be presented on a line-by-line basis, so one entire line was considered to correspond to one
record of data; furthermore, the arrangement of the fields within that record was assumed
to be consistent from line to line. The output template specified the arrangement of the
fields within each line. The output template was a string of names separated by one or
more of the characters from the output separators described previously. Here is an example
output template:
"freq bbstart bbend tag:file"
This would indicate that the profiler's output consisted of lines with five fields. The first
substring on a line surrounded by spaces would be given the name freq, the second similarly
demarcated substring on the same line would be called bbstart, and so on up to the fourth
value read on that line; the right separator for that fourth substring would be a colon, and it
would be called tag. The rest of the line (after the colon) would be named file. Each of
these substrings would have some significance as specified by the expressions.
The three fields freq, tag, and file have reserved names, since these are required fields for all
profilers. The file field is specified to make the reference in tag unique, as would have
been necessary in any case had the tags been line numbers. All other fields specified are
peculiar to the profiler being used, and can have any names not matching the three reserved
words mentioned above. The substrings within a given line that have been assigned to the
names in the output template are used internally either in the manner indicated by the
expressions or, in the case of a reserved name, as the compiler was preprogrammed to do.
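As an illustration of this parsing scheme, a line can be split according to the separators and the output template as sketched below. The sketch is purely illustrative (the real parsing is built into the compiler); the separator set, the sample line and the field values are hypothetical.

# Illustrative only: split one profiler output line into named fields using the
# output separators and the output template. Separators, template and sample
# line below are hypothetical.
import re

separators = " :"
template = "freq bbstart bbend tag:file"

def parse_line(line, template, separators):
    # The template is split on the separator characters to obtain the field
    # names; the last field takes the remainder of the line, so a trailing
    # file name is not broken up further.
    pattern = "[" + re.escape(separators) + "]+"
    names = [n for n in re.split(pattern, template) if n]
    parts = re.split(pattern, line.strip(), maxsplit=len(names) - 1)
    return dict(zip(names, parts))

# A hypothetical line of profiler output: frequency, block start/end, tag, file.
record = parse_line("1042 0x100 0x11c 17:test.pl8", template, separators)
print(record)   # {'freq': '1042', 'bbstart': '0x100', 'bbend': '0x11c', 'tag': '17', 'file': 'test.pl8'}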
A.3 Accumulation Switches
While parsing the output file from the profiler, there has to be a policy on how to deal with
repeated occurrences of basic blocks. Basic block references may be repeated throughout
the output for various reasons, depending on the details of the profiler. In this particular
implementation, some basic blocks, as constructed by the compiler, actually become multiple blocks in the executable and so appear to the profiler as multiple basic blocks. They
will all have the same tag, however, so the information is decipherable, but care has to be taken
while accumulating over the file, since some values may need to be accumulated while others
may not. There are as many switches as there are types of data to annotate the flowgraph
with, so in this implementation there are two independent switches: one for accumulating
frequencies, the other for accumulating costs.
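A minimal sketch of this accumulation policy is given below. It is illustrative only (the records, tags and switch settings are hypothetical); the actual bookkeeping is performed internally by the compiler while it reads the profiler output.

# Illustrative only: records with the same tag either have their values summed
# (switch on) or simply overwrite one another (switch off).
accumulate_freq = True      # sum frequencies of blocks sharing a tag
accumulate_cost = False     # keep only the last cost seen for a tag

records = [
    {"tag": "17", "freq": 500, "cost": 12},
    {"tag": "17", "freq": 542, "cost": 12},   # same compiler block, split in the executable
    {"tag": "23", "freq": 60,  "cost": 4},
]

by_tag = {}
for r in records:
    entry = by_tag.setdefault(r["tag"], {"freq": 0, "cost": 0})
    entry["freq"] = entry["freq"] + r["freq"] if accumulate_freq else r["freq"]
    entry["cost"] = entry["cost"] + r["cost"] if accumulate_cost else r["cost"]

print(by_tag)   # {'17': {'freq': 1042, 'cost': 12}, '23': {'freq': 60, 'cost': 4}}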
A.4 Computing Annotation Values
The last part of the specification file contains expressions for assigning values to each of
the possible annotations to the control flowgraph. In this implementation, we have only two
different types of information that can annotate the flowgraph, so we need only two different
expressions. The expressions can be arbitrary arithmetic expressions and must be in terms
of the names of the fields specified by the output template. So, if the output template had
been specified as above, valid expressions are:
costexp = bbend - bbstart
wtexp = freq * cost
The expressions are evaluated in the order that they are written in the specification file,
so the above examples are well defined. From this we see that if wt was the data on the
flowgraph used to direct decisions, then we could change the criterion being used by simply
adjusting the expressions in the specification file.
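The following sketch illustrates this in-order evaluation. It is not the compiler's actual expression evaluator; the expression names and field values are hypothetical, chosen to mirror the example above, where each expression may refer to template fields and to the results of earlier expressions.

# Illustrative only: evaluate the specification-file expressions in order.
# Expression names and field values below are hypothetical.
expressions = [
    ("cost", "bbend - bbstart"),   # first expression: cost of the block
    ("wt",   "freq * cost"),       # second expression: may use the first result
]

fields = {"freq": 1042, "bbstart": 0x100, "bbend": 0x11c}

env = dict(fields)
for name, expr in expressions:
    # eval() is used only for brevity; the expressions are simple arithmetic
    # over the named fields and earlier results.
    env[name] = eval(expr, {"__builtins__": {}}, env)

print(env["cost"], env["wt"])   # 28 and 1042 * 28 = 29176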
Appendix B
Invoking The Compiler
Below is the script that is called by the user, which in turn calls the compiler the appropriate
number of times with the appropriate options, depending on whether feedback was turned
on or not.
plix is the name of the program that calls the compiler proper.
#!/bin/sh
#If feedback optimization is turned on, then run compiler in two passes.
#On the first pass, turn off optimization and leave debug information
#in the xcoff file (take out -O and -qpredset, and add -g).
#The second pass is simply the original call to the compiler.
#The prefix specifies where to find the appropriate compiler.
echo "$*" | awk '
BEGIN {
    prefix  = "-B./ -tc "
    outstr  = ""
    newcmd  = "-g"
    fdbk    = "false"
    noexec  = "false"
    srcfile = ""
}
{ cmd = $0 }                            # remember the original command line
index($0, "-qpredset") != 0 {
    fdbk = "true"
    for (i = 1; i <= NF; i++)
        if (substr($i, 1, 9) == "-qpredset")
            ;                           # drop the feedback option on the first pass
        else if (substr($i, 1, 2) == "-o") {
            if (noexec == "true")
                outstr = "-qexfn=" substr($i, 3)
            newcmd = newcmd " " $i
        }
        else if (substr($i, 1, 2) == "-c") {
            noexec = "true"
            outstr = ""                 # ignore output name
            newcmd = newcmd " " $i
        }
        else if (split($i, fn, ".") > 1) {
            if (index("pl8 pl9 plix pl8pln c C f", fn[2]) > 0)
                srcfile = fn[1]         # recognized source language
            newcmd = newcmd " " $i
        }
        else
            newcmd = newcmd " " $i
}
END {
    if (fdbk == "true") {
        print("plix " prefix "-qdebug=fdbk1 " newcmd)
        res = system("plix " prefix "-qdebug=fdbk1 " newcmd)
        if (res == 0) {
            system("mv " srcfile ".lst " srcfile ".lst1")
            print("plix " prefix "-qdebug=fdbk2 " cmd " " outstr)
            system("plix " prefix "-qdebug=fdbk2 " cmd " " outstr)
        }
    }
    else {
        print("plix " prefix cmd)
        system("plix " prefix cmd)
    }
}'
Appendix C
Sample Run
To demonstrate how the system would be used, we present a small example in this appendix.
Suppose we had the following PL.8 code to compute factorial in a file called test.pl8:
$ep: procedure(inpstr);
  dcl
    inpstr char(*);
  dcl
    num integer;

  fact: procedure(n) returns(integer);
    dcl
      n integer;
    dcl
      i integer,
      prod integer;

    prod = 1;
    do i = 1 to n;
      prod = prod*i;
    end do i;
    return(prod);
  end procedure fact;

  num = integer(inpstr);
  display("n: " || char(num) || "   n!: " || char(fact(num)));
end procedure $ep;
Now, if we wanted to compile it using the feedback-based compiler, we must first identify a
predicting set of inputs to be used during the profiling of the program. Let us construct a
predicting set of 4 elements, say 7, 5, 4 and 8. The predicting input set file, which we shall
call testinp, would look like this:
7 5 4 8
The command to compile the program (with branch prediction turned on) involves a call to
tplix as follows: tplix test.pl8 -qpredset=testinp. If the compilation is done with
the appropriate compiler options, we can get a listing file for the program containing the
transformation the code underwent due to the feedback information. Following is a fragment
of the listing file which demonstrates the details of the transformation described in Chapter 4.
** Procedure List #5 for fact Before Branch Prediction **

[Intermediate-code listing omitted: the basic blocks of fact, with the loop closed by a
compare (C4) and a conditional branch (BF) back to label CL.6, and the return block at
CL.2/CL.7.]

** End of Procedure List #5 for fact Before Branch Prediction **

** Procedure List #5 for fact After Branch Prediction **

[Intermediate-code listing omitted: the same basic blocks after the transformation, with
the sense of the loop-closing conditional branch changed (BT) and part of the loop body
duplicated behind an unconditional branch (B) to a new label CL.9.]

** End of Procedure List #5 for fact After Branch Prediction **
Of course, if we were not interested in knowing what took place, we would never need to
generate and look at such a file; we would simply compile the source and run the executable.
Note how simple it is for the user to use the feedback information: no prior knowledge of the
system or its internals is required. In fact, the specification file is not even mentioned on the
command line, since that is taken care of by the default specification file. The only extra
parameter is the name of the predicting input set, which is required under any circumstances
and must be supplied by the user.