Extensibility Study Report: Source-Level Instrumentation
Adam Leko
9/24/2005
UPC Group
HCS Research Laboratory
University of Florida

Instrumentation Levels (Review)
- Source
  - Most flexible, can retain source-code correlation
  - Tends to be least accurate, can impede compiler optimizations
- Binary (object code)
  - More accurate, might not require any recompilation for user
  - Hard to do, tends to raise platform-specific issues, hard to relate data back to source code
- Compiler/runtime
  - Potentially most accurate, compiler can do program transformations and still correctly instrument
  - Requires lots of cooperation with compiler developers
- Operating system
  - Not useful unless code relies on lots of system calls

Need for Source Instrumentation
- Ideal case
  - Avoid source instrumentation
  - Keep all possible optimizations (binary and compiler/runtime instrumentation)
  - Still relate data to source code at function and line-number levels
- However,
  - Compiler/runtime instrumentation (UPC) will take time to get going
  - Instrumenting libraries only (SHMEM) limits data that can be collected
  - Binary instrumentation impossible on some platforms
- Need source-level instrumentation
  - Can get higher-level source information
  - Serves as a preliminary instrumentation technique for UPC+SHMEM code now

Automatic Source Instrumentation
- Based on our tool evaluations, found that
  - Automatic instrumentation necessary for tools
  - Problems with automatic instrumentation can cause frustration and decrease confidence in tool (SvPablo)
- Need a high-quality “preprocessor” that can
  - Take UPC/SHMEM code
  - Instrument it based on what analysis a user wants
  - Hand instrumented code to compiler
- Can either write a parser from scratch, or use an existing system
- Requirements for instrumentation system
  - Accurate instrumentation
  - Works with any valid C99/UPC program
- Shouldn’t this be easy?
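
It is less easy than it looks. The following is a minimal C sketch of one way purely textual instrumentation goes wrong, assuming a tool that pastes a hook call in front of a statement without parsing the surrounding code. The hook `prof_note` and the three function names are hypothetical and exist only for this illustration; they are not part of any tool discussed here.

```c
/* Hypothetical instrumentation hook: counts how often it is called.
   It stands in for whatever a real tool would record. */
static int hook_calls = 0;
static void prof_note(void) { hook_calls++; }

/* Original user code: the store happens only when flag is nonzero. */
static int original(int flag) {
    int x = 0;
    if (flag)
        x = 1;
    return x;
}

/* Naive textual instrumentation: the hook call is pasted in front of
   the store. The brace-less if now guards only prof_note(), so the
   store runs unconditionally and the program's behavior changes. */
static int naive(int flag) {
    int x = 0;
    if (flag)
        prof_note(); x = 1;
    return x;
}

/* Structure-aware instrumentation: the guarded statement is wrapped
   in braces before the hook is added, so control flow is preserved. */
static int braced(int flag) {
    int x = 0;
    if (flag) {
        prof_note();
        x = 1;
    }
    return x;
}
```

With `flag` equal to 0, `original` and `braced` return 0 while `naive` returns 1. Getting the braced form right requires knowing where the statement begins and ends, which is information a parser provides and a token scan does not.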

Source Instrumentation Challenges
- Possible simple route: scan source code for tokens that look like shared accesses, add instrumentation
- Problems
  - Macro expansion (can pass through cpp before)
  - Scoping rules (need a good symbol table)
  - “Implicit” communication (shared variables aren’t treated any differently in UPC syntax)
  - if/else statements without braces, varargs with printf, interaction with gotos and case statements, …
- Trying to instrument code without a full-blown parser will result in buggy code!
  - C syntax cannot be described with a simple finite state machine (regular expressions, etc.)
  - Need a context-free grammar (parser) to correctly interpret and instrument source code

Writing Parser From Scratch
- Writing grammar for C relatively easy
  - Tools that can make this easier
    - flex/bison/yacc
    - ANTLR
    - (many more)
  - A few ambiguities in C grammar, like expression vs. declaration problem
    - Can get around these with GLR parsers, or other tricks
- Grammar isn’t everything…
  - Once you have parse tree, you still need to correctly interpret it
  - Reporting user-friendly parse errors is also difficult unless you have a recursive-descent parser, which takes a long time to write
  - Supporting compiler extensions to C syntax can be difficult
- Should avoid writing our own parser!

Some Real-World Observations
“If you do not use CIL and want instead to use just a C parser and analyze programs expressed as abstract-syntax trees then your analysis will have to handle a lot of ugly corners of the language (let alone the fact that parsing C itself is not a trivial task).”
-- authors of CIL (parser for C99 that supports GNU and MS extensions)
http://manju.cs.berkeley.edu/cil/

Some Real-World Observations [2]
“When I (George) started to write CIL I thought it was going to take two weeks. Exactly a year has passed since then and I am still fixing bugs in it. This gross underestimate was due to the fact that I thought parsing and making sense of C is simple. You probably think the same. What I did not expect was how many dark corners this language has, especially if you want to parse real-world programs such as those written for GCC or if you are more ambitious and you want to parse the Linux or Windows NT sources (both of these were written without any respect for the standard and with the expectation that compilers will be changed to accommodate the program).”
-- authors of CIL (parser for C99 that supports GNU and MS extensions)
http://manju.cs.berkeley.edu/cil/

Some Real-World Observations [3]
“I’d rather not touch the translator [for adding instrumentation for the UPC perf. tool interface]. Many of the bugs in the Berkeley UPC compiler are in the translator.”
-- loosely paraphrased, Dan Bonachea (talking about the UPC performance tool interface at the ’05 UPC workshop)
[can check at http://upc-bugs.lbl.gov/bugzilla/ by searching for translator in bug description]

Some Real-World Observations [4]
“Parser development is still a black art.”
-- Paul Klint et al., “Towards an engineering discipline for GRAMMARWARE,” in ACM TOSEM, May 2005.

Some Real-World Observations [5]
89% of the development is directly or indirectly related to writing instrumentation software.
-- paraphrase, Luiz DeRose, Bernd Mohr and Kevin London, “Performance Tools 101: Principles of Experimental Performance Measurement and Analysis”, SC2003 Tutorial M-11

Re-Use Compiler Frontend?
- Observation: compilers can correctly parse and analyze C, so what if we re-use a compiler frontend?
- Several candidates
  - GCC/GCC-UPC
  - Open64 (used by Berkeley UPC), Trimaran/IMPACT (uses EDG frontend), Zephyr (uses EDG frontend)
  - EDG frontend
- Biggest argument against: complexity
  - GCC-UPC compiler: ~650kloc
  - EDG frontend: ~700kloc
  - Berkeley UPC: takes up 1GB when compiled(!)
- Have spent a lot of time looking at EDG, GCC, and Berkeley frontends
  - Are all very reliable (especially EDG)
  - But, very difficult to modify (too heavyweight)

Re-Use Compiler Frontend? [2]
- Have been some other efforts to reuse compiler frontends
  - GCC-XML (missing function declarations though)
  - BisonXML/gccXfront (have bison output parses in XML)
  - gcc -fdump-translation-unit (used by g4re, is supported by GCC-UPC)
- Drawbacks
  - These methods generally produce extremely large intermediate files
    - For nontrivial code, can be as large as several hundred MB, even after a reduction phase
    - Example: g4re with Fluxbox source (30kloc): ~500MB intermediate files
  - Still need to translate these intermediate files back to C/UPC
  - Intermediate format might change between versions of compiler
  - Format might also depend on names used for grammar terminals and nonterminals (BisonXML)
- Not a very attractive alternative for source instrumentation

Quick Review of Other Options
- PDToolkit (most obvious choice)
  - Used by KOJAK and TAU for source instrumentation
  - Relies on EDG frontend (high quality, robust parser)
- Some disadvantages
  - .PDB files can get large
  - .PDB files alone probably not enough to correctly instrument complicated UPC expressions
  - PDToolkit is a large download (~38MB)
  - Current version doesn’t support UPC
  - After parsing by PDToolkit, still relies on scripts to read .PDB file and correctly place instrumentation in user code
  - Also shares other problems for any project that relies on EDG (will discuss later)

Quick Review of Other Options [2]
- Keystone C++ parser
  - C++-specific parser
  - Has some problems parsing real C++ code (GCC header files), might not work with all C99 code
- SUIF/SUIF2
  - Older source-to-source compiler infrastructure from Stanford
  - Uses EDG frontend to parse C code
  - Large code base
  - Project not updated in several years
- Sage++
  - Older source-to-source compiler infrastructure
  - Was used by TAU, but deprecated (PDToolkit now used)
  - Project seems to have been abandoned (last update: 1997)
  - Unlikely to support new versions of C (C99 as required by UPC spec)

Top 3 Candidates
- After examining many systems and reading many papers, have come up with
  - Cetus
  - EDG frontend
  - CIL
- Will discuss each in more detail in following slides

Cetus
- Source-to-source compilation system written by researchers at Purdue
- Uses ANTLR parser-generator
  - C grammar is a modified/(similar?) version of the C grammar provided by ANTLR
  - http://www.codetransform.com/gcc.html
- Advantages
  - Cetus is written in Java
  - Project seems to be under active development, geared specifically towards source-to-source transformations
- Disadvantages
  - No built-in support for UPC (but UPC grammar a simple extension of C99 grammar)
  - Not clear how robust C parser is (copyright date 1997, probably doesn’t support C99)
  - Java support needed for all platforms (Cray X1 javac?)

Interesting Comments by Cetus Authors
“Documentation for GCC is abundant. The difficulty is that the sheer amount easily overwhelms the user. Generally, we have found that there is a very steep learning curve in modifying GCC, with a big time investment to implement even trivial transformations.”
“Both SUIF and Cetus fall into the category of extensible source-to-source compilers, so at first SUIF looked like the natural choice for our infrastructure. Three main reasons eliminated our pursuit of this option. The first was the perception that the project is no longer active - the last major release was in 2001 and does not appear to have been updated recently. …”
http://paramount.www.ecn.purdue.edu/ParaMount/Cetus/manual/ch07.html

EDG Frontend
- Benefits
  - Contains full-featured C, C++, and UPC parser(!)
  - High-quality commercial front-end for compilers
  - Recursive descent
    - Gives good messages on syntax errors
  - Can understand several compiler extensions (e.g., GCC extensions)
- To use for instrumentation
  - Basic workflow
    - User code is parsed by frontend executable
    - Executable creates intermediate representation in memory (can also store to file)
    - Intermediate format (IL) is converted to executable code or source code by backend
  - So need to do
    - Source -> IL
    - Instrument IL (probably in memory)
    - Instrumented IL -> UPC/C-generating backend

EDG Frontend: Drawbacks
- Frontend is intellectual property of EDG
  - Can only be used for noncommercial projects
  - Redistribution of source code not allowed
    - We can only redistribute compiled versions for each platform
    - Implies we cannot support (at all) any platforms that we cannot compile EDG’s frontend on
  - Would be nice to allow vendors to bundle our performance tool along with their UPC compilers
- Code is extremely complex
  - ~700kloc of ANSI C code
  - Manual describing code and IL is over 500 pages long
  - Have to “pay” for added complexity because frontend also supports C++ (which we don’t need)

Best Candidate: CIL
- Source-level analysis and transformation framework for ANSI and C99 C code with GNU and MS compiler extensions
- Advantages
  - Heavily tested
    - Successfully parses SPECINT95 benchmarks, the Linux kernel, GIMP, many others
    - Fails only 23 of the over 900 GCC torture tests (GCC itself fails 19)
  - Pretty compact code (only about 40kloc for entire system)
  - Ideal for adding static analyses to our performance tool
  - Robust, heavily tested C parser
  - Released under BSD license
- Disadvantages
  - Written in OCaml
  - Have to add UPC extensions to grammar files

Best Candidate: CIL [2]
- Have already made simple modifications to code
  - Added upc_forall statement support
  - Was very easy to modify grammar
  - We should be able to re-use much of GCC-UPC’s YACC grammar
- OCaml easy to learn
  - Modern functional language with elements of imperative languages
  - According to Wikipedia, OCaml commonly used for writing compilers
  - Seems like a language well-suited to the task
- OCaml compiler/interpreter supported on many platforms
  - Consists of ~2.4MB (gzipped) ANSI C code
  - I have compiled and run the compiler and interpreter on our 32-bit Linux, 64-bit Linux (including Altix and Opteron), and Tru64 systems
  - Known to run on FreeBSD, OpenBSD, NetBSD, HP-UX, IRIX, Solaris, and many other platforms
  - http://caml.inria.fr/ocaml/portability.en.html
  - Ironically seems more portable than Java!

Best Candidate: CIL [3]
- CIL brief overview
  - Parser is a heavily-modified version of FrontC, uses ocamllex and ocamlyacc
  - C code is parsed to a simple intermediate format, CABS, that contains type information, code structure, etc.
  - CABS is converted to simpler subset of C code, CIL, or back to C code
  - Analyses are done with simplified CIL
  - Code can be converted back to C and fed into a C compiler
  - The “cilly” driver script manages this whole process
- How to use CIL
  - Need to retain original source code structure as much as possible (keep optimizations)
  - So, should do parse -> CABS -> instrumented CABS -> C code
  - Can keep around other code (CIL reduction, etc.) if we want to do static analysis later
  - Might also want to investigate keeping CIL reductions…

Some OCaml Examples
- Hello, world

    print_endline "Hello world!";;

- Factorial

    let rec fact = function
      | 0 -> 1
      | n -> n * fact(n-1);;

- Quicksort

    let rec quicksort = function
      | [] -> []
      | head::tail ->
          let left, right = List.partition (function x -> x < head) tail in
          (quicksort left) @ head::(quicksort right);;

- Full tutorial available at http://www.ocaml-tutorial.org/

Conclusions
- Can’t afford to shortchange our source instrumentor
  - Instrumentation system vital to our tool’s success
  - Source instrumentation necessary until we can get binary and library/compiler instrumentation going at full speed
- Writing source instrumentation system is challenging
  - Must work on any C99 or UPC code (compiler extensions nice)
  - Must not have bugs
    - Bugs == aggravated users
  - Empirical evidence shows we should not take this task lightly
- Should reuse existing systems as much as possible
  - Don’t want to waste all our time writing a good source parser and analyzer!
  - CIL looks like best bet right now
  - EDG is a good fallback option if CIL gives us problems (but preliminary experiences have been very positive)