Advanced Compiler Design and Implementation
Introduction to Advanced Topics - Chapter 1
Eran Yahav

Course textbook: Advanced Compiler Design and Implementation, Steven S. Muchnick, ISBN 1558603204. Two copies are available in the TAU library. To purchase your own copy, send e-mail to yahave@math.tau.ac.il.

Outline
- Review of compiler structure
- Advanced issues in elementary topics
- The importance of optimizations
- Structure of optimizing compilers
- Placement of optimizations
- Tentative schedule
- Course requirements

Compiler Structure

Figure 1 - General structure of a non-optimizing one-pass compiler (character string → Scanner → tokens → Parser → AST → Semantic Analyzer → IR → Code Generator → object code, with the symbol table and access routines and the OS interface shared by all phases)

This section gives a short reminder of basic compiler structure and basic terminology. If you are not familiar with the terms, you can find a more elaborate description in [1] or [2]. Let's start by describing the structure of a non-optimizing one-pass compiler, shown in detail in Figure 1. A simple non-optimizing one-pass compiler consists of the following components/phases:

1. Scanner (also known as Lexical Analyzer) - the input to this phase is the stream of characters representing the source program. During scanning, the stream is read sequentially and characters are grouped into meaningful tokens. The output of this phase is a token stream, to be analyzed by the parser.
2. Parser (also known as Syntax Analyzer) - the input to this phase is a stream of tokens (usually produced by the lexical analyzer). During parsing, tokens are grouped into grammatical phrases. The parsing phase produces an abstract syntax tree (AST) representation of the program.
3. Semantic Analyzer - the input to this phase is the AST produced by the parser. The semantic analysis phase checks the program for semantic errors and collects type information later used by the code generator. Semantic analysis usually produces some form of intermediate representation (IR) of the program.
4. Code Generator - the input to this phase is the intermediate representation (IR) produced by the semantic analyzer. The code generator produces target code from the intermediate representation.

The other two components in Figure 1 - the symbol table with its access routines, and the OS interface - are used by most compiler phases. The symbol table and access routines record the identifiers used in the source program and collect information about the various attributes of each identifier. A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. A naive implementation of a symbol table is quite trivial, but, as we shall see later, efficient symbol table management is not at all simple. Note that the symbol table (and its access routines) is usually used by all compilation phases; for example, the lexical analyzer detects the occurrence of an identifier in the source program and creates the corresponding record in the symbol table, but most of the identifier's attributes will only be determined by later phases. The OS interface is used by the compiler to handle files, error handling at the OS level, etc. A more sophisticated compiler would probably contain additional components, such as a code optimizer.
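As a concrete illustration of the naive approach just mentioned, the following sketch shows one possible shape of a symbol table built as a chained hash table indexed by identifier name. This is only a minimal sketch, not the textbook's implementation; the names (Symbol, symtab_lookup, symtab_insert), the attribute fields and the fixed table size are assumptions made for the example.

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211            /* a small prime, chosen arbitrarily */

typedef struct Symbol {
    char          *name;          /* identifier as it appears in the source */
    int            type;          /* attribute filled in by later phases    */
    int            scope_level;   /* attribute filled in by later phases    */
    struct Symbol *next;          /* chaining for hash collisions           */
} Symbol;

static Symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Return the entry for name, or NULL if it has not been seen yet. */
Symbol *symtab_lookup(const char *name) {
    for (Symbol *p = table[hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Insert a new, mostly empty record; the scanner can call this when it
 * first meets an identifier, and later phases fill in the attributes. */
Symbol *symtab_insert(const char *name) {
    unsigned i = hash(name);
    Symbol *p = malloc(sizeof *p);
    p->name = strdup(name);
    p->type = 0;
    p->scope_level = 0;
    p->next = table[i];
    table[i] = p;
    return p;
}

Whether this naive organization is good enough, and how it must be extended to support scoping and overloading, is exactly the question raised in the next section.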
Advanced Issues in Elementary Topics

Although the compiler structure described in the previous section is quite simple, there are some advanced issues to be investigated (textbook chapter numbers are given in parentheses):

1. Symbol table management (3)
   - Efficiency - the symbol table can be trivially implemented as a hash table with identifier names used as the hashing index. Is that the best implementation? Can we use another data structure to achieve better performance?
   - Overloading - how do we handle overloading? (The problem is demonstrated in the following section.)
2. Type Inference - all compilers use some degree of type inference. The problem of type inference can be described simply as: determine the type of an expression based on the types of the operands participating in the expression. Although many cases seem straightforward, in object-oriented languages type inference can get very complicated.
3. Intermediate Language Selection (4) - the choice of intermediate representation affects the quality of the generated target code and the quality of the optimizations performed.
4. Run-Time Support (5) - the data structures used by the compiler to support the program at runtime, the representation of objects at run-time, etc.
5. Producing Code Generators (6)

Symbol table management

procedure BACH is
  procedure put (x: boolean) is begin null; end;
  procedure put (x: float)   is begin null; end;
  procedure put (x: integer) is begin null; end;

  package x is
    type boolean is (false, true);
    function f return boolean;                            -- (D1)
  end x;

  package body x is
    function f return boolean is begin null; end;
  end x;

  function f return float is begin null; end;             -- (D2)

  use x;
begin
  put (f);                                                 -- (A1)
  A: declare
       f: integer;                                         -- (D3)
     begin
       put (f);                                            -- (A2)
       B: declare
            function f return integer is begin null; end;  -- (D4)
          begin
            put (f);                                       -- (A3)
          end B;
     end A;
end BACH;

Figure 2 - sample program - is symbol table management trivial?

Symbol table management must take overloading into account. Figure 2 shows a sample program in which a number of identifiers named "f" are defined. At some program points f is a function; at others, it is a simple integer variable. At each program point, however, the compiler must resolve f to a single entry in the symbol table according to the language semantics.

For example, let's examine the call site denoted "A1". Which f is invoked at "A1" - the function declared at "D1" or the one defined at "D2"? Note that the put procedure is overloaded to handle the types boolean, float and integer. f[D1] is declared with a boolean return type, so is it f[D1] that will be invoked, matched with the put(boolean) procedure? A careful look at f[D1] reveals that the boolean type it returns is a distinct type defined inside package x, rather than the "global" boolean defined by Ada. Since f[D1] does not return a "global" boolean, there is no implementation of put that can handle it. Therefore, the site "A1" invokes put(float) with f[D2].

What about "A2"? The declaration at "D3" added yet another symbol f to the symbol table, this time as an integer variable. According to Ada's semantics, a locally defined symbol "masks" other identifiers with identical names. In short, the site "A2" invokes put(integer) on the integer variable f declared at "D3".

Finally, let's take a look at "A3". Now we have four different definitions of f in our program. Which f will be used at "A3"? Again, the language definition comes to our aid: locality of definition resolves f to f[D4] and put to put(integer).

As can be seen from our simple example, symbol table management can be quite tricky. This subject is covered in detail in chapter 3 of the textbook.
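The masking behavior used to resolve f at sites A2 and A3 suggests that a flat hash table is not enough: the symbol table must also be aware of nested scopes. The following is a minimal sketch of one common arrangement, a stack of per-scope tables searched from the innermost scope outward. The names (Scope, Binding, declare, lookup) are illustrative and do not come from the textbook; full overload resolution, as in the put example above, would further require each name to map to a set of candidate entries rather than a single one.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One binding of a name inside a single scope. */
typedef struct Binding {
    const char     *name;
    const char     *kind;        /* e.g. "function returning float" */
    struct Binding *next;
} Binding;

/* A scope holds its bindings and points to the enclosing scope. */
typedef struct Scope {
    Binding      *bindings;
    struct Scope *parent;
} Scope;

static Scope *open_scope(Scope *parent) {
    Scope *s = malloc(sizeof *s);
    s->bindings = NULL;
    s->parent = parent;
    return s;
}

static void declare(Scope *s, const char *name, const char *kind) {
    Binding *b = malloc(sizeof *b);
    b->name = name;
    b->kind = kind;
    b->next = s->bindings;
    s->bindings = b;
}

/* The innermost declaration wins: this is exactly the "masking" rule. */
static const Binding *lookup(const Scope *s, const char *name) {
    for (; s != NULL; s = s->parent)
        for (const Binding *b = s->bindings; b != NULL; b = b->next)
            if (strcmp(b->name, name) == 0)
                return b;
    return NULL;
}

int main(void) {
    Scope *outer = open_scope(NULL);                 /* BACH's declarative part */
    declare(outer, "f", "function returning float"); /* (D2) */

    Scope *blockA = open_scope(outer);               /* block A */
    declare(blockA, "f", "integer variable");        /* (D3) */

    printf("at A1, f resolves to: %s\n", lookup(outer,  "f")->kind);
    printf("at A2, f resolves to: %s\n", lookup(blockA, "f")->kind);
    return 0;
}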
Type Inference

Type inference is the problem of determining the type of a language construct from the way it is used. The term is often applied to the problem of inferring the type of an expression from its arguments, or the type of a function from its body.

type link = ^cell;
procedure mlist (lptr: link; procedure p);
begin
  while lptr <> nil do
  begin
    p(lptr);
    lptr := lptr^.next
  end
end;

Figure 3 - type inference example - procedure mlist with procedure parameter p (dragon book)

Figure 3 is an example of type inference in Pascal; the example is taken from [2]. The procedure mlist takes as a parameter another procedure p. By looking at mlist's header, we do not know the number or types of the parameters taken by procedure p. This kind of incomplete specification of the type of procedure p is allowed by C and by the Pascal reference manual. Since mlist invokes p, we can infer the type of p from the use of p inside mlist: p is used in the expression p(lptr), therefore the type of p must be link → void, i.e., p maps values of type link to type void (p is a procedure, and therefore its return type is void).

Intermediate Language Selection

- Low vs. high level control flow structures
- Flat vs. hierarchical (tree) structure
- Machine level vs. high level instructions
- (Symbolic) registers vs. stack
- Normal forms (SSA)
- Intermediate forms: control flow graph, call graph, program dependence graph
- Issues: engineering, efficiency, portability, optimization level

IRs in the book:

High Intermediate Representation (HIR):
      for v ← v1 by v2 to v3 do
        a[i] := 2
      endfor

Medium Intermediate Representation (MIR):
      v  ← v1
      t2 ← v2
      t3 ← v3
  L1: if v > t3 goto L2
      t4 ← addr a
      t5 ← 4 * i
      t6 ← t4 + t5
      *t6 ← 2
      v  ← v + t2
      goto L1
  L2:

Low Intermediate Representation (LIR):
      s2  ← s1
      s4  ← s3
      s6  ← s5
  L1: if s2 > s6 goto L2
      s7  ← addr a
      s8  ← 4 * s9
      s10 ← s7 + s8
      [s10] ← 2
      s2  ← s2 + s4
      goto L1
  L2:

Figure 4 - different levels of intermediate representation

Figure 4 shows an example of three different levels of intermediate representation, as described in the textbook. The high-level intermediate representation (HIR) uses high-level structure, with loops and array subscripting. The medium-level representation (MIR) retains variables but uses low-level control flow: note that the loop control structure is no longer available at the MIR level, and the array access is now expressed in terms of address arithmetic (pointer offsets). The low-level representation (LIR) lowers the code further by assigning the variables to symbolic registers.

Static Single Assignment Form (SSA)

Static single assignment form, as its name may hint, is a form of representation in which every variable (or symbolic register) is assigned a value only once. This representation is easy for the compiler to create, and it makes it easy to determine, for each variable, the site at which it was assigned its value. The key idea in this form is to use a non-deterministic selection function (the φ-function) wherever more than one value is available for assignment.

(a) LIR:
      s2  ← s1
      s4  ← s3
      s6  ← s5
  L1: if s2 > s6 goto L2
      s7  ← addr a
      s8  ← 4 * s9
      s10 ← s7 + s8
      [s10] ← 2
      s2  ← s2 + s4
      goto L1
  L2:

(b) SSA form:
      s21 ← s1
      s4  ← s3
      s6  ← s5
  L1: s22 ← φ(s21, s23)
      if s22 > s6 goto L2
      s7  ← addr a
      s8  ← 4 * s9
      s10 ← s7 + s8
      [s10] ← 2
      s23 ← s22 + s4
      goto L1
  L2:

Figure 5 - static single assignment form

The figure above shows an example of the SSA representation of a LIR program. The SSA form given in (b) represents the LIR given in (a). Note the use of the non-deterministic selection function (φ) to choose between s21, the value assigned before entering the loop, and s23, the value assigned in the loop body.
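To make the single-definition property concrete, the following sketch shows one way the loop of Figure 5(b) might be held inside a compiler, with each instruction recording the one SSA name it defines. This is an illustration only, not the textbook's data structures: the type and field names are invented, and the φ operand list is simply kept as text.

#include <stdio.h>

/* Each entry defines at most one SSA name; a name is never redefined.
 * The phi at the loop header merges the value flowing in from before
 * the loop (s21) with the value arriving on the back edge (s23). */
typedef struct {
    const char *def;   /* SSA name defined here, or NULL for stores/branches */
    const char *rhs;   /* right-hand side, written informally                */
} SsaInstr;

static const SsaInstr header_and_body[] = {
    { "s21",  "s1" },
    { "s4",   "s3" },
    { "s6",   "s5" },
    { "s22",  "phi(s21, s23)" },        /* L1: selection at the loop header */
    { NULL,   "if s22 > s6 goto L2" },
    { "s7",   "addr a" },
    { "s8",   "4 * s9" },
    { "s10",  "s7 + s8" },
    { NULL,   "[s10] <- 2" },           /* store defines no SSA name */
    { "s23",  "s22 + s4" },             /* fresh name for the updated value */
    { NULL,   "goto L1" },
};

int main(void) {
    size_t n = sizeof header_and_body / sizeof header_and_body[0];
    for (size_t i = 0; i < n; i++) {
        const SsaInstr *ins = &header_and_body[i];
        if (ins->def)
            printf("%-4s <- %s\n", ins->def, ins->rhs);
        else
            printf("        %s\n", ins->rhs);
    }
    return 0;
}

A real SSA constructor would, of course, compute where φ-functions are needed from the control flow graph rather than writing the names by hand; here they are spelled out only to mirror Figure 5(b).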
Run-Time Support

- Data representation and instructions
- Register usage
- Stack frames (activation records)
- Parameter passing disciplines
- Symbolic and polymorphic language support

The Importance of Optimizations

Using a non-optimizing one-pass compiler as described in Figure 1 will usually result in inefficient code compared to more sophisticated compilers. Figure 6 shows the code generated for a simple C program on an expression-by-expression basis. As can easily be seen, the non-optimized SPARC code uses 7 instructions, 5 of which are memory access instructions. The non-optimized code does not take advantage of already-computed values (the subexpression for c) or of efficient register allocation. The optimized code is much more efficient and uses only 2 instructions. In terms of run-time improvement, the non-optimized code takes 10 cycles while the optimized code takes only 2 cycles!

C code:
  int a, b, c, d;
  c = a + b;
  d = c + 1;

SPARC code (10 cycles):
  ldw a, r1
  ldw b, r2
  add r1, r2, r3
  stw r3, c
  ldw c, r3
  add r3, 1, r4
  stw r4, d

Optimized SPARC code (2 cycles):
  add r1, r2, r3
  add r3, 1, r4

Figure 6 - optimization benefits

In some compilers (or programming languages), it is common for the programmer to aid the compiler by giving optimization "hints". Examples of such hints are the with clause in Pascal, or the fact that loop index variables are not guaranteed to exist after the loop (which allows the compiler to keep the index in a register whose value need not be stored back to memory when the loop ends). Optimizations based on programmer hints are no longer popular: modern programming styles encourage simple and readable code rather than optimized, unreadable code full of compiler hints or constructs chosen to aid the compiler. Furthermore, as the complexity of programs constantly increases, it is not likely that programmers will keep optimizing their programs manually. Letting the compiler work harder (rather than the programmer) and apply optimizations to a "readable" program may open the way for simpler machines and more efficient use of the target machine (RISC, VLIW).

The modularity of modern compilers gives another motivation for compiler optimizations. Modern compilers often generate non-optimized code to allow reuse of compiler components. If a certain component applies an optimization, it may discard information needed by the components that follow it. Furthermore, since the compiler needs to be modular, it consists of distinct modules, each handling specific tasks, which can cause code generation to be local (similar to the code generation shown in Figure 6).

Application Dependent Optimizations

Different applications/programs may require different optimizations. Often, one wants to develop an application dependent optimization that is targeted at a specific class of applications. For example, functional programs can be optimized by replacing (expensive) recursion with loops (tail-call elimination, see the sketch below), and by replacing heap allocation with stack allocation (heap allocation and access is typically more expensive than stack access). The performance of object-oriented programs can be improved by dead member elimination (for example, see [3]) and by replacing virtual function calls with static calls (for example, see [4]). In addition, specific optimizations can be applied to specific types of applications, such as numeric code or database access code.
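As a small, hedged illustration of the tail-call point above (this is not an example from the textbook), the two C functions below compute the same sum; the second shows the shape the first takes once the recursive tail call is turned into a branch back to the top of the function.

#include <stdio.h>

/* Recursive version: the call to sum_rec is the last action performed,
 * i.e., a tail call, so no work remains after it returns. */
static long sum_rec(long n, long acc) {
    if (n == 0)
        return acc;
    return sum_rec(n - 1, acc + n);
}

/* The same function after tail-call elimination: the tail call becomes
 * a jump back to the top, written here as a loop, so no new stack frame
 * is created per "call". */
static long sum_loop(long n, long acc) {
    while (n != 0) {
        acc += n;
        n -= 1;
    }
    return acc;
}

int main(void) {
    printf("%ld %ld\n", sum_rec(10, 0), sum_loop(10, 0));  /* both print 55 */
    return 0;
}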
Mixed vs. Low Level Optimizers

The location of the optimization phase is another interesting issue. Should we place the optimizer before code generation, after code generation, or both? There are two common approaches to placing the optimization phase(s):

1. The optimizer can be placed after code generation (a low-level optimizer). Low-level optimization is used by the HP PA-RISC and IBM PowerPC compilers. It is considered to yield more efficient code and to be conceptually simpler (for RISC targets). Since optimization is done at the lowest level (after code generation), programming-language-specific details are no longer present, and the optimizer can be used for multiple programming languages.
2. The optimizer can be divided into two phases, one preceding code generation and the other following it (see Figure 7). Such an optimizer performs both high-level (usually architecture independent) and low-level (architecture dependent) optimization, and is therefore called a mixed optimizer. Mixed optimizers are used by the Sun SPARC, DEC Alpha, SGI MIPS and Intel 386 compilers. They are considered easier to port, since at least part of the optimizer is architecture independent, and since part of the optimization is performed at a high level, it can take advantage of high-level information. Mixed optimization also supports CISC targets.

The two approaches are shown in Figure 7. In both approaches, the optimizer analyzes the intermediate code and tries to take advantage of the specific cases it finds. For example, consider a loop invariant - an expression inside a loop whose result is independent of the loop iteration. Such an expression can be extracted from the loop and evaluated only once. In a loop with a large number of iterations, evaluating any expression, simple as it may be, can have a considerable effect on performance. Elimination of such redundant recalculation is only one example of an optimization that is both simple and effective.

Figure 7 - location of the optimization phase (low-level model: LIR generator → LIR → optimizer → final assembly → object code; mixed model: IR generator → MIR → optimizer → code generator → LIR → post optimizer → object code)

Translation by Preprocessing

Source-to-source translation of one programming language into another can produce "cheap" compilation solutions by combining a translator with an existing compiler for the target language. For example, some programming languages are translated into C and then compiled by a standard C compiler. This allows the language developer to write a translator rather than a whole compiler package. Real-world examples are the elimination of includes by the C preprocessor, the translation of C++ to C (cfront), Haskell, the translation of Fortran into "vector" Fortran, etc. C is very convenient as an "intermediate language" due to its support for a form of indirect source-level debugging (through the #line directive). Indirect source-level debugging allows showing the original source program, rather than the source of the C program produced by translating it. By using compiler directives such as #line, we maintain a relationship between the translation (the C program) and the original source program files. This relationship can later be used to identify which source program statements are related to which statements of the target (C) program.
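The following fragment sketches what a translator's output might look like. The source file name "queens.orig" and the line numbers are invented for the illustration, but the #line directive itself is standard C: it tells the C compiler (and, through the debug information, the debugger) to attribute the statements that follow to the named original file and line rather than to the generated .c file.

#include <stdio.h>

/* Hypothetical C code emitted by a source-to-source translator. */
int main(void) {
#line 12 "queens.orig"      /* the next statement is reported as queens.orig:12 */
    int boardsize = 8;
#line 13 "queens.orig"
    printf("board size = %d\n", boardsize);
    return 0;
}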
Data-Cache Optimizations (20)

Processors are getting faster all the time, while memory is lagging behind; the difference between main-memory and cache access times is dramatic. By causing data (or instructions) to be read from the cache rather than from main memory, one can significantly improve performance. Data-cache optimizations are usually most effective when applied to high-level intermediate code (or to source code).

Figure 8 shows the phases of an optimizing compiler in which data-cache optimization is applied to high-level code. The IBM PowerPC compiler uses a de-compiler to translate its low-level intermediate representation into a higher-level intermediate representation, and then uses the higher-level IR for data-cache optimization. This allows IBM to place data-cache optimization after the low-level IR generator, with the rest of the compiler's optimizations applied at that level. The IBM PowerPC compiler is sketched in Figure 9.

Figure 8 - data cache optimization applied on high-level IR (scanner → parser → semantic analyzer → HIR → data-cache optimizer → MIR → …)

Figure 9 - IBM PowerPC compiler (the LIR, called XIL, is produced by the LIR generator and optimized directly; a "low to high" pass de-compiles XIL into a higher-level IR called YIL, the data-cache optimizer works on YIL, and a "high to low" pass translates the result back to LIR before final assembly)

Placement of optimizations

It is clear that some optimizations depend on each other. One optimization can create optimization opportunities for another, but it can just as well destroy them. We want to find an ordering of the applied optimizations in which the combined optimizations yield better-optimized code and do not clash with each other.

A:
- Scalar replacement of array references
- Data-cache optimizations

B:
- Procedure integration
- Tail-call elimination
- Scalar replacement of aggregates
- Sparse conditional constant propagation
- Interprocedural constant propagation
- Procedure specialization and cloning
- Sparse conditional constant propagation

C:
- Global value numbering
- Local and global copy propagation
- Sparse conditional constant propagation
- Dead code elimination
- Common subexpression elimination
- Loop-invariant code motion (or partial redundancy elimination)

D:
- Inline expansion
- Leaf-routine optimizations
- Instruction scheduling 1
- Register allocation
- Instruction scheduling 2
- Intraprocedural I-cache optimizations
- Instruction prefetching
- Data prefetching
- Branch prediction

E:
- Interprocedural register allocation
- Interprocedural I-cache optimization

Figure 10 - placement of optimizations example (groups are ordered from high level at the top to low level at the bottom)

Figure 10 gives an example of the placement of optimizations in a compiler. Optimizations are applied starting with the high-level ones and going down to the lower-level ones. Note that some optimizations are applied more than once (such as sparse conditional constant propagation). The optimizations in Figure 10 are labeled according to the type of code to which they apply:

A - high-level optimizations, applied to high-level IR or to source code, usually during the early compilation stages.
B, C - medium-level or low-level optimizations.
D - optimizations usually applied to low-level IR.
E - optimizations performed at link time, applied to object code.
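A tiny illustration (not from the textbook; the variable names are invented) of one optimization opening an opportunity for another: once constant propagation replaces debug with 0 in the test below, the conditional becomes if (0), and a later dead-code elimination pass can remove the whole branch.

#include <stdio.h>

int main(void) {
    const int debug = 0;            /* known constant                          */
    int x = 5;

    if (debug)                      /* constant propagation turns this into    */
        printf("x = %d\n", x);      /* if (0), so dead-code elimination may    */
                                    /* delete the call entirely                */

    printf("%d\n", x + 1);          /* prints 6 */
    return 0;
}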
Figure 11 and Figure 12 demonstrate the significance of combining optimizations, and the significance of optimization order. Figure 11 (a) shows a simple program after constant propagation. Note that the evaluation path (marked with thick edges in the figure) is determined by constant propagation: the constant assignment a ← 1 is propagated to the condition a = 1. However, constant propagation (usually) does not work on aggregate types, so the value of the assignment B.x ← 1.0 is not propagated to the following conditional (B.x = 1.0).

Part (b) of the figure shows the same program after scalar replacement (applied after constant propagation). Scalar replacement replaces the term B.x by c. Since c is now a scalar value, it is subject to a second pass of constant propagation, and the condition c = 1.0 is evaluated to true using the value c ← 1.0 assigned in the previous block.

Figure 11 - scalar replacement after constant propagation ((a) the flowgraph after constant propagation, with the test B.x = 1.0 unresolved; (b) the flowgraph after scalar replacement, where B.x is replaced by the scalar c and the test becomes c = 1.0)

Figure 12 - scalar replacement before constant propagation ((a) the flowgraph before any optimization; (b) the flowgraph after scalar replacement and constant propagation, where B.x is replaced by the scalar d and the test d = 1.0 folds to true)

Figure 12 (a) shows a simple program before any optimization is applied. Part (b) of the figure shows the result of applying scalar replacement and constant propagation, in that order. First, the term B.x is replaced by a scalar d. Constant propagation is then applied and propagates the value 1.0 of d (assigned by d ← 1.0) to the condition d = 1.0, which then evaluates to true.

Theoretically Open Questions

- Picking the "right" order
- Combining optimizations
- Proving correctness

Tentative Schedule

- 30/3 Introduction (1)
- 13/4 No class (read chapter 2 and solve homework #1)
- 20/4 Yom Hazikaron
- 24/4 9am, Intermediate Representations (4)
- 27/4 Control Flow Analysis (7)
- 4/5 Data Flow Analysis (8)
- 11/5 Dependence Analysis and Dependence Graphs (9)
- 18/5 Introduction to Optimization (11)
- 25/5 Early Optimizations (12)
- 1/6 Redundancy Elimination (13)
- 8/6 Loop Optimizations (14)
- 7/6 Interprocedural Analysis and Optimizations (19)
- 14/6 Optimizations for the Memory Hierarchy (20)
- 21/6 Case Studies of Compilers and Future Trends (21)

Uncovered Chapters

- Symbol tables (3)
- Run-time support (5)
- Producing code generators automatically (6)
- Alias analysis (10)
- Procedure optimizations (15)
- Register allocation (16)
- Code scheduling (17)
- Control-flow and low-level optimizations (18)

Uncovered Topics

- Code profiling
- Source level debugging
- Parallelization and vectorization
- Just-in-time compilation

Course Requirements

- Prepare course notes: 10%
- Theoretical assignments: 30%
- Final exam: 60%

References (for the first lecture notes)

1. S.S. Muchnick. Advanced Compiler Design and Implementation.
2. A.V. Aho, R. Sethi and J.D. Ullman. Compilers - Principles, Techniques and Tools.
3. P.F. Sweeney and F. Tip. A Study of Dead Data Members in C++ Applications. In Proceedings of the 1998 Conference on Programming Language Design and Implementation, Montreal, Canada, June 1998.
4. G. Aigner and U. Holzle. Eliminating Virtual Function Calls in C++ Programs. Technical Report TRCS 95-22, Department of Computer Science, University of California, Santa Barbara, December 1995.