Extensibility Study Report: Source-Level Instrumentation Adam Leko 9/24/2005

UPC Group
HCS Research Laboratory
University of Florida
Instrumentation Levels (Review)

Source
    Most flexible, can retain source-code correlation
    Tends to be least accurate, can impede compiler optimizations
Binary (object code)
    More accurate, might not require any recompilation for user
    Hard to do, tends to raise platform-specific issues, hard to relate data back to source code
Compiler/runtime
    Potentially most accurate, compiler can do program transformations and still correctly instrument
    Requires lots of cooperation with compiler developers
Operating system
    Not useful unless code relies on lots of system calls
Need for Source Instrumentation

Ideal case
    Avoid source instrumentation
    Keep all possible optimizations (binary and compiler/runtime instrumentation)
    Still relate data to source code at the function and line-number levels
However,
    Compiler/runtime instrumentation (UPC) will take time to get going
    Instrumenting libraries only (SHMEM) limits data that can be collected
    Binary instrumentation impossible on some platforms
Need source-level instrumentation
    Can get higher-level source information
    Serves as a preliminary instrumentation technique for UPC+SHMEM code now
Automatic Source Instrumentation

Based on our tool evaluations, found that
    Automatic instrumentation necessary for tools
    Problems with automatic instrumentation can cause frustration and decrease confidence in tool (SvPablo)
Need a high-quality “preprocessor” that can
    Take UPC/SHMEM code
    Instrument it based on what analysis a user wants
    Hand instrumented code to compiler
Can either write a parser from scratch, or use an existing system
Requirements for instrumentation system
    Accurate instrumentation
    Works with any valid C99/UPC program
Shouldn’t this be easy?
Source Instrumentation Challenges


Possible simple route: scan source code for tokens that look like shared accesses, add instrumentation
Problems
    Macro expansion (can pass code through cpp first)
    Scoping rules (need a good symbol table)
    “Implicit” communication (shared variables aren’t treated any differently in UPC syntax)
    if/else statements without braces, varargs with printf, interaction with gotos and case statements, …
Trying to instrument code without a full-blown parser will result in buggy code!
    C syntax cannot be described with a simple finite state machine (regular expressions, etc.)
    Need a context-free grammar (parser) to correctly interpret and instrument source code
Writing Parser From Scratch

Tools that can make this easier
    flex/bison/yacc
    ANTLR
    (many more)
Writing grammar for C relatively easy
    A few ambiguities in C grammar, like expression vs. declaration problem
    Can get around these with GLR parsers, or other tricks
Grammar isn’t everything…
    Once you have parse tree, you still need to correctly interpret it
    Reporting user-friendly parse errors is also difficult unless you have a recursive-descent parser, which takes a long time to write
    Supporting compiler extensions to C syntax can be difficult
Should avoid writing our own parser!
Some Real-World Observations
“If you do not use CIL and want instead to use just
a C parser and analyze programs expressed as
abstract-syntax trees then your analysis will
have to handle a lot of ugly corners of the
language (let alone the fact that parsing C itself
is not a trivial task).”
-- authors of CIL (parser for C99 that supports
GNU and MS extensions)
http://manju.cs.berkeley.edu/cil/
Some Real-World Observations [2]
“When I (George) started to write CIL I thought it was going to take two
weeks. Exactly a year has passed since then and I am still fixing
bugs in it. This gross underestimate was due to the fact that I
thought parsing and making sense of C is simple. You probably think
the same. What I did not expect was how many dark corners this
language has, especially if you want to parse real-world programs
such as those written for GCC or if you are more ambitious and you
want to parse the Linux or Windows NT sources (both of these were
written without any respect for the standard and with the expectation
that compilers will be changed to accommodate the program).”
-- authors of CIL (parser for C99 that supports GNU and MS
extensions)
http://manju.cs.berkeley.edu/cil/
Some Real-World Observations [3]
“I’d rather not touch the translator [for adding
instrumentation for the UPC perf. tool interface].
Many of the bugs in the Berkeley UPC compiler
are in the translator.”
-- loosely paraphrased, Dan Bonachea (talking
about the UPC performance tool interface at the
’05 UPC workshop)
[can check at http://upc-bugs.lbl.gov/bugzilla/ by
searching for translator in bug description]
Some Real-World Observations [4]
“Parser development is still a black art.”
--Paul Klint et al., “Towards an engineering
discipline for GRAMMARWARE,” in ACM
TOSEM, May 2005.
Some Real-World Observations [5]
89% of the development is directly or
indirectly related to writing instrumentation
software.
--paraphrase, Luiz DeRose, Bernd Mohr and
Kevin London, “Performance Tools 101:
Principles of Experimental Performance
Measurement and Analysis”, SC2003
Tutorial M-11
Re-Use Compiler Frontend?


Observation: compilers can correctly parse and analyze C, so what if we re-use a compiler frontend?
Several candidates
    GCC/GCC-UPC
    Open64 (used by Berkeley UPC), Trimaran/IMPACT (uses EDG frontend), Zephyr (uses EDG frontend)
    EDG frontend
Biggest argument against: complexity
    GCC-UPC compiler: ~650kloc
    EDG frontend: ~700kloc
    Berkeley UPC: takes up 1GB when compiled(!)
Have spent a lot of time looking at EDG, GCC, and Berkeley frontends
    Are all very reliable (especially EDG)
    But, very difficult to modify (too heavyweight)
Re-Use Compiler Frontend? [2]

Have been some other efforts to reuse compiler frontends
    GCC-XML (missing function bodies though)
    BisonXML/gccXfront (have bison output parses in XML)
    gcc -fdump-translation-unit (used by g4re, is supported by GCC-UPC)
These methods generally produce extremely large intermediate files
    For nontrivial code, can be as large as several hundred MB, even after a reduction phase
    Example: g4re with Fluxbox source (30kloc): ~500MB intermediate files
Drawbacks
    Still need to translate these intermediate files back to C/UPC
    Intermediate format might change between versions of compiler
    Format might also depend on names used for grammar terminals and nonterminals (BisonXML)
Not a very attractive alternative for source instrumentation
Quick Review of Other Options

PDToolkit (most obvious choice)
    Used by KOJAK and TAU for source instrumentation
    Relies on EDG frontend (high quality, robust parser)
Some disadvantages
    PDToolkit is a large download (~38MB)
    Current version doesn’t support UPC
    After parsing by PDToolkit, still relies on scripts to read .PDB file and correctly place instrumentation in user code
        .PDB files can get large
        .PDB files alone probably not enough to correctly instrument complicated UPC expressions
    Also shares other problems for any project that relies on EDG (will discuss later)
Quick Review of Other Options [2]

Keystone C++ parser
    C++-specific parser
    Has some problems parsing real C++ code (GCC header files), might not work with all C99 code
    Large code base
SUIF/SUIF2
    Older source-to-source compiler infrastructure from Stanford
    Uses EDG frontend to parse C code
    Project not updated in several years
Sage++
    Older source-to-source compiler infrastructure
    Project seems to have been abandoned (last update: 1997)
    Unlikely to support new versions of C (C99 as required by UPC spec)
    Was used by TAU, but deprecated (PDToolkit now used)
Top 3 Candidates

After examining many systems and reading many papers, have come up with
    Cetus
    EDG frontend
    CIL
Will discuss each in more detail in following slides
Cetus

Source-to-source compilation system written by researchers at Purdue
    Uses ANTLR parser-generator
    C grammar is a modified/(similar?) version of the C grammar provided by ANTLR
        http://www.codetransform.com/gcc.html
Advantages
    Cetus is written in Java
    Project seems to be under active development, geared specifically towards source-to-source transformations
Disadvantages
    No built-in support for UPC (but UPC grammar a simple extension of C99 grammar)
    Not clear how robust C parser is (copyright date 1997, probably doesn’t support C99)
    Java support needed for all platforms (Cray X1 javac?)
Interesting Comments by Cetus Authors
“Documentation for GCC is abundant. The difficulty is that the sheer
amount easily overwhelms the user. Generally, we have found that
there is a very steep learning curve in modifying GCC, with a big
time investment to implement even trivial transformations.”
“Both SUIF and Cetus fall into the category of extensible source-to-source compilers, so at first SUIF looked like the natural choice for
our infrastructure. Three main reasons eliminated our pursuit of this
option. The first was the perception that the project is no longer
active - the last major release was in 2001 and does not appear to
have been updated recently. …”
http://paramount.www.ecn.purdue.edu/ParaMount/Cetus/manual/ch07.html
EDG Frontend

Benefits
    Contains full-featured C, C++, and UPC parser(!)
    High-quality commercial front-end for compilers
        Recursive descent
        Gives good messages on syntax errors
        Can understand several compiler extensions (e.g., GCC extensions)
To use for instrumentation
    Basic workflow
        User code is parsed by frontend executable
        Executable creates intermediate representation in memory (can also store to file)
        Intermediate format (IL) is converted to executable code or source code by backend
    So need to do
        Source -> IL
        Instrument IL (probably in memory)
        Instrumented IL -> UPC/C-generating backend
EDG Frontend: Drawbacks

Frontend is intellectual property of EDG
    Can only be used for noncommercial projects
    Redistribution of source code not allowed
        We can only redistribute compiled versions for each platform
        Implies we cannot support (at all) any platforms that we cannot compile EDG’s frontend on
    Would be nice to allow vendors to bundle our performance tool along with their UPC compilers
Code is extremely complex
    About 700kloc of ANSI C code
    Manual describing code and IL is over 500 pages long
    Have to “pay” for added complexity because frontend also supports C++ (which we don’t need)
Best Candidate: CIL


Source-level analysis and transformation framework for ANSI and C99 C code with GNU and MS compiler extensions
Heavily tested
    Successfully parses SPECINT95 benchmarks, the Linux kernel, GIMP, many others
    Fails only 23 of the more than 900 GCC torture tests (GCC itself fails 19)
Advantages
    Pretty compact code (only about 40kloc for entire system)
    Ideal for adding static analyses to our performance tool
    Robust, heavily tested C parser
    Released under BSD license
Disadvantages
    Written in OCaml
    Have to add UPC extensions to grammar files
Best Candidate: CIL [2]

Have already made simple modifications to code
    Added upc_forall statement support
    Was very easy to modify grammar
    We should be able to re-use much of GCC-UPC’s YACC grammar
OCaml easy to learn
    Modern functional language with elements of imperative languages
    According to Wikipedia, OCaml commonly used for writing compilers
    Seems like a language well-suited to the task
OCaml compiler/interpreter supported on many platforms
    Consists of ~2.4MB (gzipped) ANSI C code
    I have compiled and run the compiler and interpreter on our 32-bit Linux, 64-bit Linux (including Altix and Opteron), and Tru64 systems
    Known to run on FreeBSD, OpenBSD, NetBSD, HPUX, IRIX, Solaris, and many other platforms
        http://caml.inria.fr/ocaml/portability.en.html
    Ironically seems more portable than Java!
Best Candidate: CIL [3]

CIL brief overview
    Parser is a heavily-modified version of FrontC, uses ocamllex and ocamlyacc
    C code is parsed to a simple intermediate format, CABS, that contains type information, code structure, etc.
    CABS is converted to simpler subset of C code, CIL, or back to C code
    Analyses are done with simplified CIL
    Code can be converted back to C and fed into a C compiler
    The “cilly” driver script manages this whole process
How to use CIL
    Need to retain original source code structure as much as possible (keep optimizations)
    So, should do parse -> CABS -> instrumented CABS -> C code
    Can keep around other code (CIL reduction, etc.) if we want to do static analysis later
    Might also want to investigate keeping CIL reductions…
Some OCaml Examples

Hello, world
    print_endline "Hello world!";;

Factorial
    let rec fact = function
      | 0 -> 1
      | n -> n * fact (n - 1);;

Quicksort
    let rec quicksort = function
      | [] -> []
      | head :: tail ->
          let left, right = List.partition (function x -> x < head) tail in
          (quicksort left) @ head :: (quicksort right);;

Full tutorial available at http://www.ocaml-tutorial.org/
Conclusions

Can’t afford to shortchange our source instrumentor
    Instrumentation system vital to our tool’s success
    Source instrumentation necessary until we can get binary and library/compiler instrumentation going at full speed
Writing source instrumentation system is challenging
    Must work on any C99 or UPC code (compiler extensions nice)
    Must not have bugs
        Bugs == aggravated users
    Empirical evidence shows we should not take this task lightly
Should reuse existing systems as much as possible
    Don’t want to waste all our time writing a good source parser and analyzer!
    CIL looks like best bet right now
    EDG is a good fallback option if CIL gives us problems (but preliminary experiences have been very positive)