Extensibility Study Report: Source-Level Instrumentation Adam Leko 9/24/2005

Instrumentation Levels (Review)
Binary (object code)
More accurate, might not require any recompilation for user
Hard to do, tends to raise platform-specific issues, hard to relate data
back to source code
Source
Most flexible, can retain source-code correlation
Tends to be least accurate, can impede compiler optimizations
Compiler
Potentially most accurate, compiler can do program transformations and
still correctly instrument
Requires lots of cooperation with compiler developers
Operating system
Not useful unless code relies on lots of system calls
Need for Source Instrumentation
Ideal case
Avoid source instrumentation
Keep all possible optimizations (binary and compiler/runtime)
Still relate data to source code at function- and line number-levels
Compiler/runtime instrumentation (UPC) will take time to get going
Instrumenting libraries only (SHMEM) limits data that can be collected
Binary instrumentation impossible on some platforms
Need source-level instrumentation
Can get higher-level source information
Serves as a preliminary instrumentation technique for UPC+SHMEM
code now
Automatic Source Instrumentation
Based on our tool evaluations, found that
Automatic instrumentation necessary for tools
Problems with automatic instrumentation can cause frustration and
decrease confidence in tool (SvPablo)
Need a high-quality “preprocessor” that can
Take UPC/SHMEM code
Instrument it based on what analysis a user wants
Hand the instrumented code to the compiler
Can either write a parser from scratch, or use an existing system
Requirements for instrumentation system
Accurate instrumentation
Works with any valid C99/UPC program
Shouldn’t this be easy?
Source Instrumentation Challenges
Possible simple route: scan source code for tokens that look like
shared accesses, add instrumentation
Macro expansion (can pass through cpp before)
Scoping rules (need a good symbol table)
“Implicit” communication (shared variables aren’t treated any differently
in UPC syntax)
if/else statements without brackets, varargs with printf, interaction with
gotos and case statements, …
Trying to instrument code without a full-blown parser will result in
buggy code!
C syntax cannot be described with simple finite state machine (regular
expressions, etc)
Need a context-free grammar (parser) to correctly interpret and
instrument source code
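To make the braceless if/else problem concrete, here is a minimal sketch of the "simple route" (in OCaml, the language used for the examples later in this report). It assumes a purely line-based scanner; the probe name PROBE_shared_access and the helper names are hypothetical and not part of any real tool interface.

```ocaml
(* Hypothetical probe call; not part of any real tool interface. *)
let probe = "PROBE_shared_access();"

(* [contains s sub] is true when [sub] occurs somewhere in [s]. *)
let contains s sub =
  let n = String.length s and m = String.length sub in
  let rec go i = i + m <= n && (String.sub s i m = sub || go (i + 1)) in
  go 0

(* Naive instrumentor: insert a probe line before every line that
   textually mentions [pattern]. No parsing, no symbol table. *)
let naive_instrument pattern src =
  String.split_on_char '\n' src
  |> List.map (fun line ->
         if contains line pattern then [probe; line] else [line])
  |> List.concat
  |> String.concat "\n"

(* C fragment with a braceless if/else; assume [a] is a shared array. *)
let input = "if (i < n)\n  a[i] = 0;\nelse\n  done = 1;"

let output = naive_instrument "a[" input

let () = print_endline output
```

In the output the probe becomes the body of the if, the assignment to a[i] now runs unconditionally, and the else no longer has a matching if, so the instrumented code does not even compile. Avoiding this requires knowing the statement structure, which is what a real parser provides.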
Writing Parser From Scratch
Writing grammar for C relatively easy
Tools that can make this easier
(many more)
A few ambiguities in C grammar, like expression vs. declaration problem
Can get around these with GLR parsers, or other tricks
Grammar isn’t everything…
Once you have parse tree, you still need to correctly interpret it
Reporting user-friendly parse errors is also difficult unless you have a
recursive-descent parser, which takes a long time to write
Supporting compiler extensions to C syntax can be difficult
Should avoid writing our own parser!
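The expression vs. declaration problem can be illustrated with the C statement "T * x;": it declares x as a pointer if T is a typedef name, and is a multiplication otherwise. A minimal sketch, assuming the typedef names in scope are tracked as a plain list of strings (the type and function names here are hypothetical):

```ocaml
(* The two possible readings of the C statement "T * x;". *)
type reading =
  | Declaration  (* x is declared as a pointer to type T *)
  | Expression   (* T is multiplied by x and the result discarded *)

(* The grammar alone cannot decide; the answer depends on whether the
   left-hand name is a typedef currently in scope. *)
let classify typedefs lhs =
  if List.mem lhs typedefs then Declaration else Expression

let describe = function
  | Declaration -> "declaration"
  | Expression -> "expression"

let with_typedef = classify ["T"] "T"
let without_typedef = classify [] "T"

let () =
  Printf.printf "typedef in scope: %s\n" (describe with_typedef);
  Printf.printf "no typedef:       %s\n" (describe without_typedef)
```

This is why a correct C parser has to feed symbol-table information back into parsing, one of the details that makes writing one from scratch harder than it first appears.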
Some Real-World Observations
“If you do not use CIL and want instead to use just
a C parser and analyze programs expressed as
abstract-syntax trees then your analysis will
have to handle a lot of ugly corners of the
language (let alone the fact that parsing C itself
is not a trivial task).”
-- authors of CIL (parser for C99 that supports
GNU and MS extensions)
Some Real-World Observations [2]
“When I (George) started to write CIL I thought it was going to take two
weeks. Exactly a year has passed since then and I am still fixing
bugs in it. This gross underestimate was due to the fact that I
thought parsing and making sense of C is simple. You probably think
the same. What I did not expect was how many dark corners this
language has, especially if you want to parse real-world programs
such as those written for GCC or if you are more ambitious and you
want to parse the Linux or Windows NT sources (both of these were
written without any respect for the standard and with the expectation
that compilers will be changed to accommodate the program).”
-- authors of CIL (parser for C99 that supports GNU and MS
extensions)
Some Real-World Observations [3]
“I’d rather not touch the translator [for adding
instrumentation for the UPC perf. tool interface].
Many of the bugs in the Berkeley UPC compiler
are in the translator.”
-- loosely paraphrased, Dan Bonachea (talking
about the UPC performance tool interface at the
’05 UPC workshop)
[can check at http://upc-bugs.lbl.gov/bugzilla/ by
searching for translator in bug description]
Some Real-World Observations [4]
“Parser development is still a black art.”
--Paul Klint et al., “Towards an engineering
discipline for GRAMMARWARE,” in ACM
TOSEM, May 2005.
Some Real-World Observations [5]
89% of the development is directly or
indirectly related to writing instrumentation
--paraphrase, Luiz DeRose, Bernd Mohr and
Kevin London, “Performance Tools 101:
Principles of Experimental Performance
Measurement and Analysis”, SC2003
Tutorial M-11
Re-Use Compiler Frontend?
Observation: compilers can correctly parse and analyze C, what if
we re-use a compiler frontend?
Several candidates
Open64 (used by Berkeley UPC), Trimaran/IMPACT (uses EDG
frontend), Zephyr (uses EDG frontend)
EDG frontend
Biggest argument against: complexity
GCC-UPC compiler: ~650kloc
EDG frontend: ~700kloc
Berkeley UPC: takes up 1GB when compiled(!)
Have spent a lot of time looking at EDG, GCC, and Berkeley UPC
Are all very reliable (especially EDG)
But, very difficult to modify (too heavyweight)
Re-Use Compiler Frontend? [2]
Have been some other efforts to reuse compiler frontends
GCC-XML (missing function declarations though)
BisonXML/gccXfront (have bison output parses in XML)
gcc -fdump-translation-unit (used by g4re, is supported by GCC-UPC)
These methods generally produce extremely large intermediate files
For nontrivial code, can be as large as several hundred MBs, even after a
reduction phase
Example: g4re with Fluxbox source (30kloc): ~500MB intermediate files
Still need to translate these intermediary files back to C/UPC
Intermediate format might change between versions of compiler
Format might also depend on names used for grammar terminals and
nonterminals (BisonXML)
Not a very attractive alternative for source instrumentation
Quick Review of Other Options
PDToolkit (most obvious choice)
Used by KOJAK and TAU for source instrumentation
Relies on EDG frontend (high quality, robust parser)
Some disadvantages
.PDB files can get large
.PDB files alone probably not enough to correctly instrument complicated
UPC expressions
PDToolkit is a large download (~38MB)
Current version doesn’t support UPC
After parsing by PDToolkit, still relies on scripts to read .PDB file and
correctly place instrumentation in user code
Also shares the problems of any project that relies on EDG
(discussed later)
Quick Review of Other Options [2]
Keystone C++ parser
C++-specific parser
Has some problems parsing real C++ code (GCC header files), might
not work with all C99 code
Large code base
SUIF
Older source-to-source compiler infrastructure from Stanford
Uses EDG frontend to parse C code
Project not updated in several years
Sage++
Older source-to-source compiler infrastructure
Project seems to have been abandoned (last update: 1997)
Unlikely to support new versions of C (C99 as required by UPC spec)
Was used by TAU, but deprecated (PDToolkit now used)
Top 3 Candidates
After examining many systems and reading
many papers, have come up with
EDG frontend
Will discuss each in more detail in following
Source-to-source compilation system written by researchers at
Uses ANTLR parser-generator
C grammar is a modified/(similar?) version of the C grammar provided
Cetus is written in Java
Project seems to be under active development, geared specifically
towards source-to-source transformations
No built-in support for UPC (but UPC grammar a simple extension of
C99 grammar)
Not clear how robust C parser is (copyright date 1997, probably doesn’t
support C99)
Java support needed for all platforms (Cray X1 javac?)
Interesting Comments by Cetus Authors
“Documentation for GCC is abundant. The difficulty is that the sheer
amount easily overwhelms the user. Generally, we have found that
there is a very steep learning curve in modifying GCC, with a big
time investment to implement even trivial transformations.”
“Both SUIF and Cetus fall into the category of extensible source-to-source
compilers, so at first SUIF looked like the natural choice for
our infrastructure. Three main reasons eliminated our pursuit of this
option. The first was the perception that the project is no longer
active - the last major release was in 2001 and does not appear to
have been updated recently. …”
EDG Frontend
Contains full-featured C, C++, and UPC parser(!)
High-quality commercial front-end for compilers
Recursive descent
Gives good messages on syntax errors
Can understand several compiler extensions (e.g., GCC extensions)
To use for instrumentation
Basic workflow
User code is parsed by frontend executable
Executable creates intermediary representation in memory (can also store to
Intermediate format (IL) is converted to executable code or source code by
So need to do
Source -> IL
Instrument IL (probably in memory)
Instrumented IL -> UPC/C-generating backend
EDG Frontend: Drawbacks
Frontend is intellectual property of EDG
Can only be used for noncommercial projects
Redistribution of source code not allowed
We can only redistribute compiled versions for each platform
Implies we cannot support (at all) any platforms that we cannot compile
EDG’s frontend on
Would be nice to allow vendors to bundle our performance tool along
with their UPC compilers
Code is extremely complex
About 700kloc of ANSI C code
Manual describing code and IL is over 500 pages long
Have to “pay” for added complexity because frontend also supports C++
(which we don’t need)
Best Candidate: CIL
Source-level analysis and transformation framework for ANSI and
C99 C code with GNU and MS compiler extensions
Heavily tested
Successfully parses SPECINT95 benchmarks, the Linux kernel, GIMP,
many others
Fails only 23 of the GCC torture tests (GCC itself fails 19), out of over
900 tests
Pretty compact code (only about 40kloc for entire system)
Ideal for adding static analyses to our performance tool
Robust, heavily tested C parser
Released under BSD license
Written in OCaml
Have to add UPC extensions to grammar files
Best Candidate: CIL [2]
Have already made simple modifications to code
Added upc_forall statement support
Was very easy to modify grammar
We should be able to re-use much of GCC-UPC’s YACC grammar
OCaml easy to learn
Modern functional language with elements of imperative languages
According to Wikipedia, OCaml commonly used for writing compilers
Seems like a language well-suited to the task
OCaml compiler/interpreter supported on many platforms
Consists of ~2.4MB (gzipped) of ANSI C code
I have compiled and run the compiler and interpreter on our 32-bit Linux,
64-bit Linux (including Altix and Opteron), and Tru64 systems
Known to run on FreeBSD, OpenBSD, NetBSD, HPUX, IRIX, Solaris,
and many other platforms
Ironically seems more portable than Java!
Best Candidate: CIL [3]
CIL brief overview
Parser is a heavily-modified version of FrontC, uses ocamllex and
C code is parsed to a simple intermediate format, CABS, that contains
type information, code structure, etc
CABS is converted to simpler subset of C code, CIL, or back to C code
Analyses are done with simplified CIL
Code can be converted back to C and fed into a C compiler
The “cilly” driver script manages this whole process
How to use CIL
Need to retain original source code structure as much as possible (keep
So, should do parse -> CABS -> instrumented CABS -> C code
Can keep around other code (CIL reduction, etc) if we want to do static
analysis later
Might also want to investigate keeping CIL reductions…
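The parse -> instrument -> C code route can be sketched on a toy syntax tree. This is not CIL's CABS API: the expr and stmt types, the probe name probe_shared_access, and the list of shared variable names are all assumptions made up for illustration. The point is only that, once the code is a tree, instrumentation is a structural rewrite that cannot break if/else nesting.

```ocaml
(* Toy syntax tree standing in for a real one such as CABS (hypothetical). *)
type expr =
  | Var of string
  | Num of int
  | Index of string * expr          (* a[e] *)
  | Less of expr * expr             (* l < r *)

type stmt =
  | Assign of expr * expr           (* l = r; *)
  | Call of string                  (* f(); *)
  | Block of stmt list              (* { ... } *)
  | If of expr * stmt * stmt option (* if (c) t [else e] *)

(* Does an expression mention any variable in the [shared] list? *)
let rec touches shared = function
  | Var v -> List.mem v shared
  | Num _ -> false
  | Index (a, e) -> List.mem a shared || touches shared e
  | Less (l, r) -> touches shared l || touches shared r

(* Wrap every statement that touches shared data in a block that first
   calls the (hypothetical) probe. *)
let rec instrument shared = function
  | Assign (l, r) as s when touches shared l || touches shared r ->
      Block [Call "probe_shared_access"; s]
  | If (c, t, e) ->
      let e' =
        match e with
        | None -> None
        | Some s -> Some (instrument shared s)
      in
      let body = If (c, instrument shared t, e') in
      if touches shared c then Block [Call "probe_shared_access"; body]
      else body
  | Block ss -> Block (List.map (instrument shared) ss)
  | s -> s

(* Convert the tree back to C source text. *)
let rec show_expr = function
  | Var v -> v
  | Num n -> string_of_int n
  | Index (a, e) -> a ^ "[" ^ show_expr e ^ "]"
  | Less (l, r) -> show_expr l ^ " < " ^ show_expr r

let rec show_stmt = function
  | Assign (l, r) -> show_expr l ^ " = " ^ show_expr r ^ ";"
  | Call f -> f ^ "();"
  | Block ss -> "{ " ^ String.concat " " (List.map show_stmt ss) ^ " }"
  | If (c, t, None) -> "if (" ^ show_expr c ^ ") " ^ show_stmt t
  | If (c, t, Some e) ->
      "if (" ^ show_expr c ^ ") " ^ show_stmt t ^ " else " ^ show_stmt e

(* if (i < n) a[i] = 0; else done = 1;   with [a] assumed shared *)
let program =
  If (Less (Var "i", Var "n"),
      Assign (Index ("a", Var "i"), Num 0),
      Some (Assign (Var "done", Num 1)))

let instrumented = show_stmt (instrument ["a"] program)

let () = print_endline instrumented
```

This prints: if (i < n) { probe_shared_access(); a[i] = 0; } else done = 1;
The braceless if body is wrapped in a block, so the else still matches and the original control flow is preserved.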
Some OCaml Examples
Hello, world
  print_endline "Hello world!";;
Factorial
  let rec fact = function
    | 0 -> 1
    | n -> n * fact (n - 1);;
Quicksort
  let rec quicksort = function
    | [] -> []
    | head :: tail ->
        let left, right = List.partition (fun x -> x < head) tail in
        (quicksort left) @ (head :: (quicksort right));;
Full tutorial available at http://www.ocaml-tutorial.org/
Conclusions
Can’t afford to shortchange our source instrumentor
Instrumentation system vital to our tool’s success
Source instrumentation necessary until we can get binary and
library/compiler instrumentation going at full speed
Writing source instrumentation system is challenging
Must work on any C99 C or UPC code (compiler extensions nice)
Must not have bugs
Bugs == aggravated users
Empirical evidence shows we should not take this task lightly
Should reuse existing systems as much as possible
Don’t want to waste all our time writing a good source parser and
CIL looks like best bet right now
EDG is a good fallback option if CIL gives us problems (but preliminary
experiences have been very positive)