Chapter 1. Introduction
J. H. Wang
Sep. 10, 2008

Outline
• Language Processors
• The Structure of a Compiler
• The Evolution of Programming Languages
• The Science of Building a Compiler
• Applications of Compiler Technology
• Programming Language Basics

Language Processors
• A compiler
  – source program → Compiler → target program
• Running the target program
  – input → Target Program → output
• An interpreter
  – source program, input → Interpreter → output
  – Usually much slower program execution than compiled code
  – Usually better error diagnostics
• A hybrid compiler, e.g. Java
  – source program → Translator → intermediate program
  – intermediate program, input → Virtual Machine → output

A Language Processing System
• source program → Preprocessor → modified source program
• modified source program → Compiler → target assembly program
• target assembly program → Assembler → relocatable machine code
• relocatable machine code, library files, relocatable object files → Linker/Loader → target machine code

The Structure of a Compiler
• Analysis
  – Front end
  – Using a grammatical structure to create an intermediate representation
  – Collecting information about the source program in a symbol table
• Synthesis
  – Back end
  – Constructing the target program from the intermediate representation and the symbol table

Phases of a Compiler
• character stream → Lexical Analyzer → token stream
• token stream → Syntax Analyzer → syntax tree
• syntax tree → Semantic Analyzer → syntax tree
• syntax tree → Intermediate Code Generator → intermediate representation
• intermediate representation → Machine-Independent Code Optimization (optional) → intermediate representation
• intermediate representation → Code Generator → target machine code
• target machine code → Machine-Dependent Code Optimization (optional) → target machine code
• Symbol Table: shared by all phases

Lexical Analysis (Scanning)
• Grouping characters into lexemes
• Producing tokens
  – (token-name, attribute-value)
• E.g.
  – position = initial + rate * 60
  – <id,1> <=> <id,2> <+> <id,3> <*> <60>

Syntax Analysis (Parsing)
• Creating a tree-like intermediate representation (e.g. a syntax tree) that depicts the grammatical structure of the token stream
  – E.g. the token stream <id,1> <=> <id,2> <+> <id,3> <*> <60> becomes the syntax tree

          =
         / \
   <id,1>   +
           / \
     <id,2>   *
             / \
       <id,3>   60

Semantic Analysis
• Type checking
• Type conversions or coercions
• E.g.
  – Syntax tree with the coercion int2float applied to 60:

          =
         / \
   <id,1>   +
           / \
     <id,2>   *
             / \
       <id,3>   int2float
                    |
                    60

Intermediate Code Generation
• Generating a low-level intermediate representation
  – It should be easy to produce
  – It should be easy to translate into the target machine language
  – E.g. three-address code (in Chap. 6)
    • t1 = int2float(60)
      t2 = id3 * t1
      t3 = id2 + t2
      id1 = t3

Code Optimization
• Attempts to improve the intermediate code
  – "Better" usually means faster, but can also mean shorter code or code that consumes less power (Chap. 8 onward)
  – E.g.
    • t1 = id3 * 60.0
      id1 = id2 + t1

Code Generation
• Mapping the intermediate representation of the source program into the target language (Chap. 8)
  – Machine code: register/memory location assignments
  – E.g.
    • LDF  R2, id3
      MULF R2, R2, #60.0
      LDF  R1, id2
      ADDF R1, R1, R2
      STF  id1, R1

Symbol Table Management
• To record the variable names and collect information about various attributes of each name
  – Storage, type, scope
  – For procedure names: number and types of arguments, method of argument passing, and the type returned
• (Chap. 2)

Grouping of Phases into Passes
• Front-end pass
  – Lexical analysis, syntax analysis, semantic analysis, intermediate code generation
• (Optional) Code optimization pass
• Back-end pass
  – Code generation

Compiler-Construction Tools
• Parser generators
• Scanner generators
• Syntax-directed translation engines
• Code-generator generators
• Data-flow analysis engines
  – A key part of code optimization
• Compiler construction toolkits

The Evolution of Programming Languages
• Machine language: 1940s
• Assembly language: early 1950s
• Higher-level languages: late 1950s
  – Fortran: scientific computation
  – Cobol: business data processing
  – Lisp: symbolic computation
• Today: thousands of programming languages

Classification of Programming Languages – by Generation
• First generation: machine languages
• Second generation: assembly languages
• Third generation: high-level languages
  – Fortran, Cobol, Lisp, C, C++, C#, Java
• Fourth generation: languages designed for specific applications
  – NOMAD, SQL, PostScript
• Fifth generation: logic- and
constraint-based languages
  – Prolog, OPS5

Classification of Programming Languages – by Functions
• Imperative: specifies how a computation is to be done
  – C, C++, C#, Java
• Declarative: specifies what computation is to be done
  – ML, Haskell, Prolog
• von Neumann language
  – Fortran, C
• Object-oriented language
  – Simula 67, Smalltalk, C++, C#, Java, Ruby
• Scripting languages
  – Awk, JavaScript, Perl, PHP, Python, Ruby, Tcl

Impacts on Compilers
• To translate and support new language features
• To take advantage of new hardware capabilities
• To promote the use of high-level languages by minimizing the execution overhead
• To make high-performance computer architectures effective on users’ applications
• To evaluate architectural concepts

The Science of Building a Compiler
• How abstractions can be used to solve problems
  – Take a problem
  – Formulate a mathematical abstraction that captures the key characteristics
  – Solve it using mathematical techniques

Modeling in Compiler Design and Implementation
• To design the right mathematical models and choose the right algorithms
  – Finite-state machines and regular expressions (Chap. 3)
  – Context-free grammars (Chap. 4)
  – Trees (Chap. 5)

The Science of Code Optimization
• "Optimization": attempts to produce code that is more efficient than the obvious code
  – Complex processor architectures
  – Parallel computers
  – Multicore, multiprocessor machines
• Theory vs. practice
  – Graphs, matrices, linear programs (Chap.
9 onward)
  – Generating optimal code is undecidable in general
• Design objectives for compiler optimizations
  – Correctness: the optimization must preserve the meaning of the program
  – Performance improvement
    • Speed, size, power consumption
  – Reasonable compilation time
    • For a rapid development and debugging cycle
  – Manageable engineering effort
    • Prioritize optimizations

Applications of Compiler Technology
• Implementation of high-level programming languages
• Optimizations for computer architectures
• Design of new computer architectures
• Program translations
• Software productivity tools

Implementation of high-level programming languages
• Example: the register keyword in the C programming language
  – May lose efficiency, because programmers are often not the best judges of very low-level matters such as register allocation
• Increased level of abstraction
  – User-defined aggregate data types: arrays, structures
  – High-level control flow: loops, procedure invocations
• Object orientation
  – Data abstraction
  – Inheritance of properties
• Java
  – Type-safe
  – Range checks for arrays
  – Garbage collection
  – Portable and mobile code

Optimizations for computer architectures
• Parallelism
  – Instruction-level
    • Explicit: VLIW machines such as Intel IA-64
  – Processor-level
• Memory hierarchies
  – Registers, caches, physical memory, secondary storage

Design of new computer architectures
• RISC (Reduced Instruction-Set Computer)
  – PowerPC, SPARC, MIPS, Alpha, PA-RISC
• CISC (Complex Instruction-Set Computer)
  – x86
• Specialized architectures
  – Data flow machines
  – VLIW machines
  – SIMD arrays of processors
  – Systolic arrays
  – Multiprocessors with shared memory
  – Multiprocessors with distributed memory

Program translations
• Binary translation
  – To translate the binary code for one machine to that of another
  – To provide backward compatibility
• Hardware synthesis
  – Hardware description languages: Verilog and VHDL
• Database query interpreters
  – Query languages: SQL (Structured Query Language)
• Compiled simulation

Software productivity tools
• Type checking
  – Wrong type in an operation, or in the parameters passed to a
procedure
• Bounds checking
  – Buffer overflow in C
• Memory-management tools
  – Purify: a widely used tool to find memory-management errors such as memory leaks in C or C++

Programming Language Basics
• The static/dynamic distinction
  – Static policy: the issue can be decided at compile time
  – Dynamic policy: the issue can only be decided at run time
  – Scope of declarations
    • Static scope
    • Dynamic scope
  – Ex: in a Java class,
    • public static int x;

Environments and states
• Environment: a mapping from names to locations
• State: a mapping from locations to values
• Ex:

    int i;              /* global i */
    void f(…) {
        int i;          /* local i */
        …
        i = 3;          /* use of local i */
        …
    }
    …
    x = i + 1;          /* use of global i */

• Names, Identifiers, and Variables
• The environment and state mappings are dynamic, with a few exceptions:
  – Static binding of names to locations
    • E.g. global variables
  – Static binding of locations to values
    • E.g. declared constants
    • #define ARRAYSIZE 1000

Static Scope and Block Structure
• Block: a grouping of declarations and statements
  – C: { }
  – Pascal: begin end
• Ex: blocks in a C++ program

    #include <iostream>
    using namespace std;

    int main() {
        int a = 1;
        int b = 1;
        {
            int b = 2;
            {
                int a = 3;
                cout << a << b;   // prints 32
            }
            {
                int b = 4;
                cout << a << b;   // prints 14
            }
            cout << a << b;       // prints 12
        }
        cout << a << b;           // prints 11
    }

Explicit Access Control
• Keywords like public, private, protected in object-oriented languages such as C++ or Java
• Procedures, functions, methods

Dynamic Scope
• A use of a name x refers to the declaration of x in the most recently called procedure with such a declaration
• Declarations and definitions
• Ex: macro expansion in the C preprocessor

    #include <stdio.h>
    #define a (x+1)
    int x = 2;
    void b() { int x = 1; printf("%d\n", a); }   /* a expands to (x+1) with the local x: prints 2 */
    void c() { printf("%d\n", a); }              /* a expands to (x+1) with the global x: prints 3 */
    int main() { b(); c(); return 0; }

• Ex: method resolution in object-oriented programming
  – Class C has a method named m()
  – D is a subclass of C with its own method named m()
  – In the call x.m(), where x is declared as an object of class C, whether C's or D's m() runs is decided at run time

Parameter Passing Mechanisms
• Call-by-value
  – The actual parameter is evaluated (if it is an expression) or copied (if it is a variable)
• Call-by-reference
  – The address of the actual parameter is passed to the callee as the value of the corresponding formal
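parameter
• Call-by-name
  – In Algol 60, behaves like a macro

Aliasing
• Ex:
  – a is an array belonging to a procedure p
  – p calls another procedure q(x, y) with a call q(a, a)
  – Parameters are passed by value, but an array name is really a reference to the location where the array is stored (as in C)
  – x and y become aliases of each other

End of Chapter 1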