Compiler Design
1. Overview
CIS 631, CSE 691, CIS 400, CSE 400
Kanat Bolazar
January 19, 2010

Compilers
• Compilers translate from a source language (typically a high-level language) to a functionally equivalent target language (typically the machine code of a particular machine, or of a machine-independent virtual machine).
• Compilers for high-level programming languages are among the larger and more complex pieces of software
  – The original languages included Fortran and Cobol
  – They were often multi-pass compilers (to facilitate memory reuse)
• Compiler development helped drive better programming language design
• Early development focused on syntactic analysis and optimization
• Commercially, compilers are developed by very large software groups
• The current focus is on optimization and smart use of resources

Why Study Compilers?
• General background information for good software engineers
  – Increases understanding of language semantics
  – Seeing the machine code generated for language constructs helps in understanding performance issues for languages
  – Teaches good language design
  – New devices may need device-specific languages
  – New business fields may need domain-specific languages

Applications of Compiler Technology & Tools
• Processing XML or other markup to generate documents, code, etc.
• Processing domain-specific and device-specific languages
• Implementing a server that uses a protocol such as HTTP or IMAP
• Natural language processing: for example, spam filtering, search, document comprehension, summary generation
• Translating from a hardware description language to the schematic of a circuit
• Automatic graph layout (Graphviz, for example)
• Extending an existing programming language
• Program analysis and improvement tools

Dynamic Structure of a Compiler
[Figure: the front end (analysis). The character stream "val = 10 * val + i" passes through lexical analysis (scanning) to become a token stream – ident "val", assign, number 10, times, ident "val", plus, ident "i" – where each token carries a token number (type) and a token value. Syntax analysis (parsing) then builds a syntax tree of Statement, Expression, and Term nodes.]

Dynamic Structure of a Compiler (cont'd)
[Figure: the syntax tree from the front end passes through semantic analysis (type checking, ...) to an intermediate representation – a syntax tree, symbol table, or three-address code (TAC) – and then through optimization and code generation in the back end (synthesis) to machine code.]
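To make the token stream in the figure concrete: a scanner can be thought of as turning characters into (type, value) pairs. A minimal sketch in Python; the numeric type codes are the ones shown in the figure, and the representation is an illustrative assumption, not the course's implementation:

    # Token stream for the input "val = 10 * val + i", as (type, value) pairs.
    # Type codes follow the figure: 1 = ident, 2 = number, 3 = assign,
    # 4 = times, 5 = plus.
    tokens = [
        (1, "val"),   # ident "val"
        (3, None),    # assign
        (2, 10),      # number 10
        (4, None),    # times
        (1, "val"),   # ident "val"
        (5, None),    # plus
        (1, "i"),     # ident "i"
    ]

    for ttype, value in tokens:
        print(ttype, value)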
Compiler versus Interpreter
• A compiler translates to machine code:
    source code → scanner → parser → ... → code generator → machine code → loader → execution
• An interpreter executes the source code "directly":
    source code → scanner → parser → interpretation
  – Statements in a loop are scanned and parsed again and again
• Variant: interpretation of intermediate code:
    source code → compiler → intermediate code → VM
  – The source code is translated into the code of a virtual machine (VM)
  – The VM then interprets that code

Static Structure of a Compiler
[Figure: the parser & semantic analysis form the "main program" that directs the whole compilation. The scanner provides tokens from the source code; code generation generates machine code; the symbol table maintains information about declared names and types. The arrows distinguish "uses" relationships from data flow.]

Lexical Analysis
• The stream of characters is grouped into tokens
• Examples of tokens are identifiers, reserved words, integers, doubles or floats, delimiters, operators, and special symbols

    int a; a = a + 2;

  is tokenized as:

    int    reserved word
    a      identifier
    ;      special symbol
    a      identifier
    =      operator
    a      identifier
    +      operator
    2      integer constant
    ;      special symbol

Syntax Analysis or Parsing
• Parsing uses a context-free grammar of valid programming language structures to find the structure of the input
• The result of parsing is usually represented by a syntax tree

Example grammar rules:
    expression → expression + expression | variable | constant
    variable → identifier
    constant → intconstant | doubleconstant | …

Semantic Analysis
• The parse tree is checked for constructs that violate the semantic rules of the language
  – Semantic rules may be written with an attribute grammar
• Examples:
  – Using undeclared variables
  – Functions called with improper arguments (number and type of arguments)
  – Array variables used without array syntax
  – Type checking of operator arguments
  – The left-hand side of an assignment must be a variable (sometimes called an L-value)
  – ...

Intermediate Code Generation
• An intermediate code representation often helps contain the complexity of the compiler and discover code optimizations.
• Typical choices include:
  – Annotated parse trees
  – Three-address code (TAC), an abstract machine language
  – Bytecode, as in Java bytecode

Example statements:             Resulting TAC:
    if (a <= b) {                   _t1 = a > b
        a = a - c;                  if _t1 goto L0
    }                               _t2 = a - c
    c = b * c;                      a = _t2
                                L0: _t3 = b * c
                                    c = _t3

Intermediate Code Generation (cont'd)
Example statements:
    if (a <= b) { a = a - c; }
    c = b * c;

Java bytecode (javap -c):           Postfix/Polish/Stack:
    55: iload_1                         v1 v2 JumpIf(>)
    56: iload_2                         v1 v3 - store(v1)
    57: if_icmpgt 64                    v2 v3 * store(v3)
    60: iload_1
    61: iload_3
    62: isub
    63: istore_1
    64: iload_2      // c = b * c
    65: iload_3
    66: imul
    67: istore_3

Code Optimization
• The compiler converts the intermediate representation to another one that attempts to be smaller and faster.
• Typical optimizations:
  – Inhibit code generation for unreachable segments
  – Getting rid of unused variables
  – Eliminating multiplication by 1 and addition by 0
  – Loop optimization: e.g. moving statements whose operands are not modified in the loop out of the loop
  – Common sub-expression elimination
  – ...

Object Code Generation
• The target program is generated in the machine language of the target architecture.
  – Memory locations are selected for each variable
  – Instructions are chosen for each operation
  – Individual tree nodes or TAC instructions are translated into a sequence of machine language instructions that perform the same task
• Typical machine language instructions include things like:
  – Load register
  – Add register to memory location
  – Store register to memory
  – ...

Object Code Optimization
• It is possible to have another code optimization phase that transforms the object code into more efficient object code.
• These optimizations use features of the hardware itself to make efficient use of processors and registers:
  – Specialized instructions
  – Pipelining
  – Branch prediction and other peephole optimizations
• JIT (just-in-time) compilation of intermediate code (e.g. Java bytecode) can discover more context-specific optimizations that were not available earlier.

Symbol Table
• Symbol table management is a part of the compiler that interacts with several of the phases
  – Identifiers are found in lexical analysis and placed in the symbol table
  – During syntactic and semantic analysis, type and scope information is added
  – During code generation, type information is used to determine what instructions to use
  – During optimization, the "live analysis" may be kept in the symbol table
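A minimal sketch of the kind of symbol table just described, in Python. The structure and method names are illustrative assumptions, not the course's actual implementation:

    class SymbolTable:
        """Maps declared names to their attributes (type, line, ...)."""
        def __init__(self):
            self.entries = {}

        def declare(self, name, typ, line):
            # Lexical/semantic analysis: record the identifier and its type.
            self.entries[name] = {"type": typ, "line": line}

        def lookup(self, name):
            # Code generation: type information selects the instructions.
            if name not in self.entries:
                raise KeyError("undeclared variable: " + name)
            return self.entries[name]

    table = SymbolTable()
    table.declare("a", "int", line=1)
    print(table.lookup("a"))   # {'type': 'int', 'line': 1}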
Error Handling
• Error handling and reporting also occur across many phases
  – The lexical analyzer reports invalid character sequences
  – The syntactic analyzer reports invalid token sequences
  – The semantic analyzer reports type and scope errors, and the like
• The compiler may be able to continue past some errors, but other errors may stop the process

Compiler / Translator Design Decisions
• Choose a source language
  – Large enough to have many interesting language features
  – Small enough to implement in a reasonable amount of time
  – Examples for us: MicroJava, Decaf, MiniJava
• Choose a target language
  – Either a real assembly language for a machine with an assembler
  – Or a virtual machine language with an interpreter
  – Examples for us: MicroJava VM (μJVM), MIPS (a popular RISC architecture, for which there is a "SPIM" simulator)
• Choose an approach for implementation:
  – Either use an existing scanner and parser/compiler generator
    • lex/flex, yacc/bison/byacc, ANTLR/JavaCC/SableCC/byaccj/Coco/R
  – Or write the scanner and parser by hand (e.g., a recursive descent parser)

Example MicroJava Program
    program P                    // main program; no separate compilation
      final int size = 10;
      class Table {              // classes (without methods)
        int[] pos;
        int[] neg;
      }
      Table val;                 // global variables
    {
      void main()
        int x, i;                // local variables
      {
        //---------- initialize val ----------
        val = new Table;
        val.pos = new int[size];
        val.neg = new int[size];
        i = 0;
        while (i < size) {
          val.pos[i] = 0; val.neg[i] = 0;
          i = i + 1;
        }
        //---------- read values ----------
        read(x);
        while (x != 0) {
          if (x > 0) val.pos[x] = val.pos[x] + 1;
          else if (x < 0) val.neg[-x] = val.neg[-x] + 1;
          read(x);
        }
      }
    }

References
• Original slides: Nancy McCracken.
• Niklaus Wirth, Compiler Construction, chapters 1 and 2.
• Course notes from H. Mössenböck, System Specification and Compiler Construction, http://www.ssw.uni-linz.ac.at/Misc/CC/
  – Also notes on MicroJava
• Course notes from Jerry Cain, Compilers, http://www.stanford.edu/class/cs143/
• General references:
  – Aho, A., Lam, M., Sethi, R., Ullman, J., Compilers: Principles, Techniques and Tools, 2nd Edition, Addison-Wesley, 2006.
  – Steven Muchnick, Advanced Compiler Design and Implementation, Morgan Kaufmann, 1997.
  – Keith Cooper and Linda Torczon, Engineering a Compiler, Morgan Kaufmann.

Compiler Design
2. Regular Expressions & Finite State Automata (FSA)
Kanat Bolazar
January 21, 2010

Contents
In these slides we will see:
1. Introduction, Concepts and Notations
2. Regular Expressions, Regular Languages
3. RegExp Examples
4. Finite-State Automata (FSA/FSM)
Introduction
• Regular expressions are equivalent to finite-state automata in recognizing regular languages, the first step in the Chomsky hierarchy of formal languages
• The term regular expressions is also used to mean the extended set of string-matching expressions used in many modern languages
  – Some people use the term regexp to distinguish this use
• Some parts of regexps are just syntactic extensions of regular expressions and can be implemented as a regular expression – other parts are significant extensions of the power of the language and are not equivalent to finite automata

Concepts and Notations
• Set: an unordered collection of unique elements
    S1 = { a, b, c }        S2 = { 0, 1, …, 19 }
    empty set: ∅            membership: x ∈ S
    union: S1 ∪ S2 = { a, b, c, 0, 1, …, 19 }
    universe of discourse: U        subset: S1 ⊆ U
    complement: if U = { a, b, …, z }, then S1' = { d, e, …, z } = U - S1
• Alphabet: a finite set of symbols
  – Examples:
    • Character sets: ASCII, ISO-8859-1, Unicode
    • Σ1 = { a, b }        Σ2 = { Spring, Summer, Autumn, Winter }
• String: a sequence of zero or more symbols from an alphabet
  – The empty string: ε

Concepts and Notations
• Language: a set of strings over an alphabet
  – Also known as a formal language; it may not bear any resemblance to a natural language, but could model a subset of one.
  – The language comprising all strings over an alphabet Σ is written as: Σ*
• Graph: a set of nodes (or vertices), some or all of which may be connected by edges.
  [Figure: an example graph on nodes 1, 2, 3, and a directed graph example on nodes a, b, c.]

Regular Expressions
• A regular expression defines a regular language over an alphabet Σ:
  – The empty string ε is a regular language: //
  – Any symbol from Σ is a regular language: Σ = { a, b, c }: /a/ /b/ /c/
  – The concatenation of two regular languages is a regular language: Σ = { a, b, c }: /ab/ /bc/ /ca/

Regular Expressions
• Regular language (continued):
  – The union (or disjunction) of two regular languages is a regular language: Σ = { a, b, c }: /ab|bc/ /ca|bb/
  – The Kleene closure (denoted by the Kleene star: *) of a regular language is a regular language: Σ = { a, b, c }: /a*/ /(ab|ca)*/
  – Parentheses group a sub-language to override operator precedence (and, we'll see later, for "memory").
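These three operations – concatenation, union, and Kleene star – carry over directly into the regexp syntax of modern languages. A quick sketch with Python's standard re module (an illustration, not part of the original slides):

    import re

    # Sigma = { a, b, c }
    print(re.fullmatch(r"ab", "ab"))            # concatenation: matches
    print(re.fullmatch(r"ab|bc", "bc"))         # union (disjunction): matches
    print(re.fullmatch(r"(ab|ca)*", ""))        # Kleene star: matches the empty string
    print(re.fullmatch(r"(ab|ca)*", "abcaab"))  # matches: ab . ca . ab
    print(re.fullmatch(r"(ab|ca)*", "abb"))     # None: not in the language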
RegExps
• The extended use of regular expressions appears in many modern languages:
  – Perl, PHP, Java, Python, …
• You can use regexps to specify the rules for any set of possible strings you want to match
  – Sentences, e-mail addresses, ads, dialogs, etc.
• "Does this string match the pattern?", or "Is there a match for the pattern anywhere in this string?"
• You can also define operations to do something with the matched string, such as extract the text or substitute for it
• Regular expression patterns are compiled into executable code within the language

Regular Expressions: Basics
Some examples and shortcuts:
    /[abc]/       = /a|b|c/            Character class; disjunction
    /[b-e]/       = /b|c|d|e/          Range in a character class
    /[\012\015]/  = /\n|\r/            Octal characters; special escapes
    /./           = /[\x00-\xFF]/      Wildcard; hexadecimal characters
    /[^b-e]/      = /[\x00-af-\xFF]/   Complement of a character class
    /a*/  /[af]*/                      Kleene star: zero or more
    /a?/          = /a|/               Zero or one
    /a+/                               Kleene plus: one or more
    /a{8}/  /b{1,2}/  /c{3,}/          Counters: repetition quantification
    /(abc)*/  /(ab|ca)?/  /([a-zA-Z]1|ca)+/   Groups combined with quantifiers

Regular Expressions: Anchors
• Anchors constrain the position(s) at which a pattern may match.
• Think of them as "extra" alphabet symbols, though they always consume/match ε (the zero-length string):
    /^a/         Pattern must match at the beginning of the string
    /a$/         Pattern must match at the end of the string
    /\bword\b/   "Word" boundary: a position between a word character [a-zA-Z0-9_] and a non-word character [^a-zA-Z0-9_] (as in 'x ' or '0%'), or the reverse (as in ' x' or '%0')
    /\B23\B/     "Word" non-boundary

Regular Expressions: Escapes
• There are six classes of escape sequences ('\XYZ'):
  1. Numeric character representation: the octal or hexadecimal position in a character set: "\012" = "\xA"
  2. Meta-characters: the characters which are syntactically meaningful to regular expressions, and therefore must be escaped in order to represent themselves in the alphabet of the regular expression: "[](){}|^$.?+*\" (note the inclusion of the backslash).
  3. "Special" escapes (from the C language):
       newline: "\n" = "\xA"        carriage return: "\r" = "\xD"
       tab: "\t" = "\x9"            formfeed: "\f" = "\xC"

Regular Expressions: Escapes (cont'd)
  4. Aliases: shortcuts for commonly used character classes. (Note that the capitalized version of each alias refers to the complement of the alias's character class):
       whitespace:     "\s" = "[ \t\r\n\f\v]"
       digit:          "\d" = "[0-9]"
       word:           "\w" = "[a-zA-Z0-9_]"
       non-whitespace: "\S" = "[^ \t\r\n\f\v]"
       non-digit:      "\D" = "[^0-9]"
       non-word:       "\W" = "[^a-zA-Z0-9_]"
  5. Memory/registers/back-references: "\1", "\2", etc.
  6. Self-escapes: any character other than those which have special meaning can be escaped, but the escaping has no effect: the character still represents the regular language of the character itself.

Regular Expressions: Back References
• Memory/registers/back-references
  – Many regular expression languages include a memory/register/back-reference feature, in which sub-matches may be referred to later in the regular expression, and/or, when performing replacement, in the replacement string
    • Perl: /(\w+)\s+\1\b/ matches a repeated word
    • Python: re.sub(r"(the\s+)the(\s+|\b)", r"\1", string) removes the second of a pair of 'the's
  – Note: finite automata cannot be used to implement the memory feature
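The Perl pattern above carries over almost unchanged to Python. A small sketch using the standard re module (illustrative, not from the original slides):

    import re

    text = "Paris in the the spring"
    # \1 refers back to whatever the first group (\w+) matched,
    # so the pattern as a whole matches a repeated word.
    match = re.search(r"\b(\w+)\s+\1\b", text)
    print(match.group(1))                          # "the"

    # Replace the pair with a single copy of the word.
    print(re.sub(r"\b(\w+)\s+\1\b", r"\1", text))  # "Paris in the spring"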
Regular Expression Examples
Character classes and Kleene symbols:
    [A-Z]   = one capital letter
    [0-9]   = one numerical digit
    [st@!9] = s, t, @, !, or 9
• [A-Z] matches G or W or E; it does not match GW or FA or h or fun
• [A-Z]+ = one or more consecutive capital letters; matches GW or FA or CRASH
• [A-Z]? = zero or one capital letter
• [A-Z]* = zero, one or more consecutive capital letters; matches (with a zero-length match) in eat or EAT or I
• So:
    [A-Z]ate   matches Gate, Late, Pate, Fate, but not GATE or gate
    [A-Z]+ate  matches Gate, GRate, HEate, but not Grate or grate or STATE
    [A-Z]*ate  matches Gate, GRate, and ate, but not STATE, grate, or Plate

Regular Expression Examples (cont'd)
• [A-Za-z] = any single letter
  – So [A-Za-z]+ matches any word composed of only letters, but will not match on "words" like bi-weekly, yes@SU, or IBM325
  – It will match on bi, weekly, yes, SU, and IBM
• The related shortcut \w is slightly broader: it matches [A-Za-z0-9_] (letters, digits, and in Perl also _)
  – So (\w)+ will match on Information, ZANY, rattskellar, and jeuvbaew
• \s will match whitespace
  – So (\w)+(\s)(\w+) will match real estate or Gen Xers

Regular Expression Examples (cont'd)
Some longer examples:
    ([A-Z][a-z]+)\s([a-z0-9]+)
      matches: Intel c09yt745, but not IBM series5000
    [A-Z]\w+\s\w+\s\w+[!]
      matches: The dog died!  It also matches that portion of: he said, "The dog died!"
    [A-Z]\w+\s\w+\s\w+[!]$
      matches: The dog died!  But it does not match: he said, "The dog died!"  because the $ indicates the end of the line, and there is a quotation mark before the end of the line
    (\w+ats?\s)+
      parentheses define a pattern as a unit, so the above expression will match: Fat cats eat Bats that Splat

Finite State Automata
• Finite State Automaton, a.k.a. Finite Automaton, Finite State Machine, FSA or FSM
  – An abstract machine which can be used to implement regular expressions (etc.).
  – It has a finite number of states, and a finite amount of memory (i.e., the current state).
  – It can be represented by directed graphs or transition tables.

Finite-State Automata
• Representation
  – An FSA may be represented as a directed graph; each node (or vertex) represents a state, and the edges (or arcs) connecting the nodes represent transitions.
  – Each state is labelled.
  – Each transition is labelled with a symbol from the alphabet over which the regular language represented by the FSA is defined, or with ε, the empty string.
  – Among the FSA's states, there is a start state and at least one final state (or accepting state).

Finite-State Automata
• Σ = { a, b, c }; an example FSA:
    q0 -a-> q1 -b-> q2 -c-> q3 -a-> q4
  with start state q0 and final state q4.
• Representation (continued)
  – An FSA may also be represented with a state-transition table. The table for the above FSA (a blank entry means the symbol cannot be consumed):

            Input
    State   a   b   c
    0       1
    1           2
    2               3
    3       4
    4

Finite-State Automata
• Given an input string, an FSA will either accept or reject the input.
  – If the FSA is in a final (or accepting) state after all input symbols have been consumed, then the string is accepted (or recognized).
  – Otherwise (including the case in which an input symbol cannot be consumed), the string is rejected.
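The transition-table view translates directly into code. A minimal sketch in Python for the FSA above, with the table held as a dictionary (a missing entry means the symbol cannot be consumed); this is an illustration, not the course's implementation:

    # Transition table for the FSA: q0 -a-> q1 -b-> q2 -c-> q3 -a-> q4
    delta = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3, (3, "a"): 4}
    start, finals = 0, {4}

    def accepts(s):
        state = start
        for ch in s:
            if (state, ch) not in delta:
                return False          # input symbol cannot be consumed
            state = delta[(state, ch)]
        return state in finals        # accept only in a final state

    print(accepts("abca"))   # True
    print(accepts("abc"))    # False: ends in a non-final state
    print(accepts("accba"))  # False: 'c' cannot be consumed in state 1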
Finite-State Automata: Tracing Recognition
Σ = { a, b, c }; using the FSA q0 -a-> q1 -b-> q2 -c-> q3 -a-> q4 (final) and its transition table, three input strings are traced:
• IS1 = a b c a: q0 → q1 → q2 → q3 → q4; all input is consumed in a final state, so the string is accepted.
• IS2 = a c c b a: after q0 -a-> q1, the next symbol c cannot be consumed in q1, so the string is rejected.
• IS3 = a b c a c: the FSA reaches q4 after a b c a, but the trailing c cannot be consumed, so the string is rejected.

Finite-State Automata
• An FSA defines a regular language over an alphabet Σ:
  – ε is a regular language: a start state that is also final, with no transitions
  – Any symbol from Σ is a regular language: Σ = { a, b, c }: q0 -b-> q1
  – The concatenation of two regular languages is a regular language: q0 -b-> q1 -c-> q2

Finite-State Automata
• Regular language (continued):
  – The union (or disjunction) of two regular languages is a regular language: from a common start state, one branch recognizes the first language (q0 -b-> q1) and another the second (q0 -c-> q3)
  – The Kleene closure (denoted by the Kleene star: *) of a regular language is a regular language: accept the empty string and loop back from the final state to repeat the machine (q0 -b-> q1, with a way back to q0)

Finite-State Automata
• Determinism
  – An FSA may be either deterministic (DFSA or DFA) or non-deterministic (NFSA or NFA).
    • An FSA is deterministic if its behavior during recognition is fully determined by the state it is in and the symbol to be consumed.
      – I.e., given an input string, only one path may be taken through the FSA.
    • Conversely, an FSA is non-deterministic if, given an input string, more than one path may be taken through the FSA.
      – One type of non-determinism is ε-transitions, i.e. transitions which consume the empty string (no symbols).

NFA: Nondeterministic FSA
• An example NFA over Σ = { a, b, c }, equivalent to the regular expression /ab*ca?/:

            Input
    State   a    b    c
    0       1
    1            2    {3,4}
    2            2    {3,4}
    3       4
    4 (F)

  – State 4 is final; the transition to {3,4} on c is the non-deterministic choice.

Nondeterministic FSA
• String recognition with an NFA:
  – Backup (or backtracking): remember choice points and revisit choices upon failure
  – Look-ahead: choose a path based on foreknowledge about the input string and the available paths
  – Parallelism: examine all choices simultaneously

Finite-State Automata
• Conversion of NFAs to DFAs
  – Every NFA can be expressed as a DFA.
  – Subset construction: each new DFA state corresponds to the set of NFA states the NFA could be in. For the /ab*ca?/ NFA:

    New state        a    b    c
    0' = {0}         1'
    1' = {1}              2'   3'
    2' = {2}              2'   3'
    3' = {3,4} (F)   4'
    4' = {4} (F)

  – Missing entries go to a non-final "failure" state 5, which loops to itself on a, b, and c.
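A compact sketch of the subset construction in Python, using the /ab*ca?/ NFA above. Frozensets of NFA states become DFA states; the encoding is an illustrative assumption:

    from collections import deque

    # NFA for /ab*ca?/ (no epsilon-transitions): state -> {symbol: set of states}
    nfa = {0: {"a": {1}}, 1: {"b": {2}, "c": {3, 4}},
           2: {"b": {2}, "c": {3, 4}}, 3: {"a": {4}}, 4: {}}
    nfa_finals = {4}

    def subset_construction(start):
        start_set = frozenset({start})
        dfa, todo = {}, deque([start_set])
        while todo:
            S = todo.popleft()
            if S in dfa:
                continue
            dfa[S] = {}
            for sym in "abc":
                T = frozenset(t for s in S for t in nfa[s].get(sym, ()))
                if T:                      # omit the empty (failure) state
                    dfa[S][sym] = T
                    todo.append(T)
        return dfa

    dfa = subset_construction(0)
    for S, moves in dfa.items():
        mark = "F" if S & nfa_finals else " "   # final if it contains an NFA final
        print(set(S), mark, {k: set(v) for k, v in moves.items()})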
Finite-State Automata
• DFA minimization
  – Every regular language has a unique minimum-state DFA.
  – The basic idea: two states s and t are equivalent if, for every string w, the transitions T(s, w) and T(t, w) are both final or both non-final.
  – An algorithm:
    • Begin by enumerating all possible pairs of states that are both final or both non-final, then iteratively remove those pairs whose transition pairs (for some symbol) are either not equal or not on the list. The list is complete when an iteration does not remove any pairs from the list.
    • The minimum set of states is the partition resulting from the unions of the remaining members of the list, along with any original states not on the list.

Finite-State Automata
• The minimum-state DFA for the DFA converted from the NFA for /ab*ca?/, without the "failure" state (labeled "5"), and with the states relabeled to the set Q = { q0", q1", q2", q3" }:
    q0" -a-> q1"
    q1" -b-> q1" (loop); q1" -c-> q2"
    q2" (final) -a-> q3"
    q3" (final)

FSA Recognition As Search
• Recognition as search
  – Recognition can be viewed as selection of the correct path from all possible paths through an NFA (this set of paths is called the state-space)
  – The search strategy can affect efficiency: in what order should the paths be searched?
    • Depth-first (LIFO [last in, first out]; a stack)
    • Breadth-first (FIFO [first in, first out]; a queue)
  – Depth-first uses memory more efficiently, but may enter an infinite loop under some circumstances

Finite-State Automata with Output
• Finite state automata may also have an output alphabet and an action at every state that may output an item from that alphabet
• This is useful for lexical analyzers
  – As the FSA recognizes a token, it outputs the characters
  – When the FSA reaches a final state and the token is complete, the lexical analyzer can use:
    • Token value – the output so far
    • Token type – the label of the output state

Conclusion
• Both regular expressions and finite-state automata represent regular languages.
• The basic regular expression operations are: concatenation, union/disjunction, and Kleene closure.
• The regular expression language is a powerful pattern-matching tool.
• Any regular expression can be automatically compiled into an NFA, then to a DFA, and then to a unique minimum-state DFA.
• An FSA can use any set of symbols for its alphabet, including letters and words.

References
• Original slides:
  – Steve Rowe at the Center for NLP
  – Nancy McCracken

Compiler Design
3. Lexical Analyzer, Flex
Kanat Bolazar
January 26, 2010

Lexical Analyzer
• The main task of the lexical analyzer is to read the input source program, scanning the characters, and produce a sequence of tokens that the parser can use for syntactic analysis.
• The interface may be for the parser to call the lexical analyzer to produce one token at a time
  – Maintain the internal state of reading the input program (with lines)
  – Have a function "getNextToken" that will read some characters at the current state of the input and return a token to the parser
• Other tasks of the lexical analyzer include:
  – Skipping or hiding whitespace and comments
  – Keeping track of line numbers for error reporting
    • Sometimes it can also produce the annotated lines for error reports
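A minimal getNextToken-style scanner sketch in Python, using regular expressions for the token classes. The token set and names are illustrative assumptions, not the course's scanner:

    import re

    TOKEN_SPEC = [
        ("NUMBER",  r"\d+"),
        ("IDENT",   r"[A-Za-z_]\w*"),
        ("OP",      r"[+\-*/=;]"),
        ("SKIP",    r"[ \t]+"),     # whitespace: skipped, not returned
        ("NEWLINE", r"\n"),         # tracked for error reporting
    ]
    MASTER = re.compile("|".join("(?P<%s>%s)" % (n, p) for n, p in TOKEN_SPEC))

    def tokens(source):
        line = 1
        for m in MASTER.finditer(source):
            kind = m.lastgroup
            if kind == "NEWLINE":
                line += 1
            elif kind != "SKIP":
                yield kind, m.group(), line
        yield "EOF", "", line        # always produce an end-of-file token

    # A real scanner would also report invalid characters; finditer
    # silently skips anything that matches no pattern.
    for tok in tokens("a = a + 2;\n"):
        print(tok)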
Character Level Scanning
• The lexical analyzer needs to have a well-defined valid character set
  – Produce invalid-character errors
  – Delete invalid characters from the token stream so they are not used in the parser's analysis
    • E.g. we don't want invisible characters in error messages
• For every end-of-line, keep track of line numbers for error reporting
• Skip over or hide whitespace and comments
  – If comments are nested (not common), the scanner must keep track of nesting to find the end of a comment
  – It may produce hidden tokens, for convenience of scanner structure
• Always produce an end-of-file token

Tokens, Token Types and Values
• The set of tokens is typically something like the following table
  – Or there may be separate token types for different operators or reserved words
  – We may want to keep a line number with each token

    Token Type           Token Value           Informal Description
    Integer constant     Numeric value         Numbers like 3, -5, 12 without decimal points
    Floating constant    Numeric value         Numbers like 3.0, -5.1, 12.2456789
    Reserved word        Word string           Words like if, then, class, …
    Identifier           Symbol table index    Words that are not reserved, starting with a letter or _ and containing only letters, _, and digits
    Relation             Operator string       <, <=, ==, …
    Operator             Operator string       =, +, -, ++, …
    Char constant        Char value            'A', …
    String               String                "this is a string", …
    Hidden: end-of-line
    Hidden: comment

Token Actions
• Each token recognized can have an action function
  – Many token types produce a value
    • In the case of numeric values, make sure proper numeric errors are produced, e.g. integer overflow
  – Put identifiers in the symbol table
    • Note that at this time, no effort is made to distinguish scope; there will be one symbol table entry for each identifier
    • Later, separate scope instances will be produced
• Other types of actions
  – End-of-line (can be treated as a token type that is not output to the parser)
    • Increment the line number
    • Get the next line of input to scan

Testing
• Execute the lexical analyzer with test cases and compare the results with the expected results
• Test cases should:
  – Exercise every part of the lexical analyzer code
  – Produce every error message
  – They don't have to be valid programs – just valid sequences of tokens

Lex and Yacc
• Two classical tools for compilers:
  – Lex: A Lexical Analyzer Generator
  – Yacc: "Yet Another Compiler Compiler"
• Lex creates programs that scan your tokens one by one.
• Yacc takes a grammar (sentence structure) and generates a parser.
[Figure: lexical rules are fed to Lex, which generates yylex(); grammar rules are fed to Yacc, which generates yyparse(); the input passes through yylex() and yyparse() to become parsed input.]

Flex: A Fast Scanner Generator
• Often, instead of the standard Lex and Yacc, Flex and Bison are used:
  – Flex: A fast lexical analyzer
  – (GNU) Bison: A drop-in replacement for (backwards compatible with) Yacc
• Resources:
  – http://en.wikipedia.org/wiki/Flex_lexical_analyser
  – http://en.wikipedia.org/wiki/GNU_Bison
  – http://dinosaur.compilertools.net/ (the Lex & Yacc page)

Flex Example 1: Delete This
• The shortest Flex example, "deletethis.l":
    %%
    deletethis
• This scanner will match the word "deletethis" and not echo it (unmatched text is echoed by default; matched text with an empty action is discarded).
• Compile and run it:
    $ flex deletethis.l            # creates lex.yy.c
    $ gcc -o scan lex.yy.c -lfl    # fl: flex library
    $ ./scan
    This deletethis is not deletethis useful.
    This is not useful.
    ^D
Flex Example 2: Replace This
• Another very short Flex example, "replacer.l":
    %%
    replacethis printf("replaced");
• This scanner will match "replacethis" and replace it with "replaced".
• Compile and run it:
    $ flex -o replacer.yy.c replacer.l
    $ gcc -o replacer replacer.yy.c -lfl
    $ ./replacer
    This replacethis is not very replacethis useful.
    This replaced is not very replaced useful.
    Please dontreplacethisatall.
    Please dontreplacedatall.

Flex Example 3: Common Errors
• Let's replace "the the" with "the":
    %%
    the the printf("the");
    uhh
• Unfortunately, this does not work: the second "the" is considered part of the C action code:
    %%
    the    the printf("the");
• Also, the curly open and close matching double quotes used in word-processing documents will give errors, so you must always replace: “the” → "the"

Flex Example 3: Common Errors, cont'd
• You discover such errors when you compile the C code, not when you run flex:
    $ flex -o errors.yy.c errors.l
    $ gcc -o errors errors.yy.c -lfl
    errors.l: In function ‘yylex’:
    errors.l:2: error: ‘the’ undeclared
    ...
• The error is reported back in our errors.l file, but we can also find it in errors.yy.c:
    case 1:
    YY_RULE_SETUP
    #line 2 "errors.l"        <-- for error reporting
    the printf("the");        <-- "the ?" here is not C code

Flex Example 4: Replace Duplicate
• Let's replace "the the" with "the":
    %%
    "the the" printf("the");
• This time, it works:
    $ flex -o duplicate.yy.c duplicate.l
    $ gcc -o duplicate duplicate.yy.c -lfl
    $ ./duplicate
    This is the the file.
    This is the file.
    This is the the the file.
    This is the the file.
    Lathe theory
    Latheory

Flex Example 4 (cont'd): Replace And Delete
• Let's replace "the the" with "the" and delete "uhh":
    %%
    "the the" printf("the");
    uhh
• Run as before:
    This uhh is the the uhhh file.
    This is the h file.
• Generally, lexical rules are pattern-action pairs:
    %%
    pattern1 action1 (C code)
    pattern2 action2
    ...

Flex File Structure
• In Lex and Flex, the general rule file structure is:
    definitions
    %%
    rules
    %%
    user code
• Definitions:
    DIGIT [0-9]
    ID    [a-z][a-z0-9]*
  can be used later in rules with {DIGIT}, {ID}, etc.:
    {DIGIT}+"."{DIGIT}*
  This is the same as:
    [0-9]+"."[0-9]*

Flex Example 5: Count Lines
    int num_lines = 0, num_chars = 0;
    %%
    \n    ++num_lines; ++num_chars;
    .     ++num_chars;
    %%
    main() {
        yylex();
        printf("# of lines = %d, # of chars = %d\n",
               num_lines, num_chars);
    }

Some Regular Expressions for Flex
• \"[^"]*\"                     string
• " "|"\t"|"\n"                 whitespace (most common forms)
• [a-zA-Z]                      letter
• [a-zA-Z_][a-zA-Z0-9_]*        identifier: allows a, aX, a45__
• [0-9]*"."[0-9]+               allows .5 but not 5.
• [0-9]+"."[0-9]*               allows 5. but not .5
• [0-9]*"."[0-9]*               allows . by itself!

Resources
• Aho, Lam, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed., Addison-Wesley, 2006. (The "purple dragon book")
• Flex Manual. Available as a single postscript file at the Lex and Yacc page online:
  – http://dinosaur.compilertools.net/#flex
  – http://en.wikipedia.org/wiki/Flex_lexical_analyser

Compiler Design
4. Language Grammars
Kanat Bolazar
January 28, 2010

Introduction to Parsing: Language Grammars
• Programming language grammars are usually written as some variation of Context-Free Grammars (CFGs)
• The notation used is often BNF (Backus-Naur Form):
    <block> -> { <statementlist> }
    <statementlist> -> <statement> ; <statementlist>
    <statement> -> <assignment> ;
                 | if ( <expr> ) <block> else <block>
                 | while ( <expr> ) <block>
    ...
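A grammar like the BNF above is just data to a parser or a parser generator. One way it might be held as a Python structure (the encoding is an assumption for illustration; angle-bracketed names are nonterminals, everything else is a terminal):

    # Each nonterminal maps to a list of alternatives;
    # each alternative is a list of symbols.
    grammar = {
        "<block>": [["{", "<statementlist>", "}"]],
        "<statementlist>": [["<statement>", ";", "<statementlist>"]],
        "<statement>": [
            ["<assignment>", ";"],
            ["if", "(", "<expr>", ")", "<block>", "else", "<block>"],
            ["while", "(", "<expr>", ")", "<block>"],
        ],
    }

    for lhs, alts in grammar.items():
        for rhs in alts:
            print(lhs, "->", " ".join(rhs))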
Example Grammar: Language 0+0
• A language that we'll call "Language 0+0":
    E -> E + E | 0
• Equivalently:
    E -> E + E
    E -> 0
• Note that if there are multiple rules for the same left-hand side, they are alternatives.
• This language only contains sentences of the form:
    0
    0+0
    0+0+0
    0+0+0+0
    ...
• Derivation for 0+0+0:
    E -> E + E -> E + E + E -> 0 + 0 + 0
• Note: This language is ambiguous: in the second step, either the left or the right E could have been expanded to E + E, giving two different parse trees for the same sentence.

Example Grammar: Arithmetic, Ambiguous
• Arithmetic expressions:
    Exp -> num | Exp Op Exp
    Op  -> + | - | * | / | %
• The "num" here represents a token. What it corresponds to is defined in the lexical analyzer with a regular expression:
    num    [0-9]+
• This language allows:
    45
    35 + 257 * 5 - 2
    ...
• This language as defined here is ambiguous:
    2+5*7    Is this (Exp) * 7 or 2 + (Exp)?
• Depending on the tools you use, you may be able to keep the ambiguous grammar and resolve the ambiguity separately, e.g. with operator precedence declarations.

Example Language: Arithmetic, Factored
• Arithmetic expressions grammar, factored for operator precedence:
    Exp    -> Factor | Factor Addop Exp
    Factor -> num | num Multop Factor
    Addop  -> + | -
    Multop -> * | / | %
• This language also allows the same sentences:
    45
    35 + 257 * 5 - 2
    ...
• This language is not ambiguous; it first groups factors:
    2+5*7 is derived as:
    Exp -> Factor Addop Exp -> num + Exp -> num + Factor -> num + num Multop Factor -> num + num * num

Grammar Definitions
• The grammar is a set of rules, sometimes called productions, that construct valid sentences in the language.
• Nonterminal symbols represent constructs in the language. These would be the phrases in a natural language.
• Terminal symbols are the actual words of the language. These are the tokens produced by the lexical analyzer. In a natural language, these would be the words, symbols, and spaces.
• A sentence in the language only contains terminal symbols.

Rules, Nonterminal and Terminal Symbols
• Arithmetic expressions grammar, using multiplicative factors for operator precedence:
    Exp    -> Factor | Factor Addop Exp
    Factor -> num | num Multop Factor
    Addop  -> + | -
    Multop -> * | / | %
• This language has four rules as written here. If we expand each option, we would have 2 + 2 + 2 + 3 = 9 rules.
• There are four nonterminals: Exp, Factor, Addop, Multop
• There are six terminals (tokens): num + - * / %

Grammar Definitions: Rules
• The production rules are rewrite rules. The basic CFG rule form is:
    X -> Y1 Y2 Y3 … Yn
  where X is a nonterminal and the Y's may be nonterminals or terminals.
• There is a special nonterminal called the start symbol.
• The language is defined to be all the strings that can be generated by starting with the start symbol and repeatedly replacing nonterminals by the right-hand side of one of their rules, until there are no more nonterminals.

Larger Grammar Examples
• We'll look at language grammar examples for MicroJava and Decaf.
• Note: Decaf extends the standard notation; the very useful { X }, meaning X | X, X | X, X, X | ..., is not standard.

Parse Trees
• The derivation of a sentence by the language rules can be used to construct a parse tree.
• We expect parse trees to correspond to meaningful semantic phrases of the programming language.
• Each node of the parse tree will represent some portion that can be implemented as one section of code.
• The nonterminals expanded during the derivation are the trunk and branches of the parse tree.
• The terminals at the ends of branches are the leaves of the parse tree.
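Returning to the ambiguous arithmetic example above: the two readings of 2+5*7 can be written as nested tuples standing in for parse trees. A small Python sketch (the tuple encoding is an illustrative assumption):

    # Two parse trees for 2 + 5 * 7 under the ambiguous grammar:
    t1 = ("+", 2, ("*", 5, 7))   # 2 + (5 * 7)
    t2 = ("*", ("+", 2, 5), 7)   # (2 + 5) * 7

    def evaluate(t):
        if isinstance(t, int):
            return t
        op, left, right = t
        ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
        return ops[op](evaluate(left), evaluate(right))

    print(evaluate(t1), evaluate(t2))   # 37 49: same sentence, two meanings
    # The factored grammar permits only the first tree, because Multop
    # binds factors together before Addop combines them.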
Parsing
• A parser:
  – Uses the grammar to check whether a sentence (a program, for us) is in the language or not.
  – Gives a syntax error if it is not a proper sentence/program.
  – Constructs a parse tree from the derivation of a correct program from the grammar rules.
• Top-down parsing:
  – Starts with the start symbol and applies rules until it gets the desired input program.
• Bottom-up parsing:
  – Starts with the input program and applies rules in reverse until it can get back to the start symbol.
  – Looks at the left part of the input program to see if it matches the rhs of a rule.

Parsing Issues
• Derivation paths = choices
  – Naïve top-down and bottom-up parsing may require backtracking to find a correct parse.
  – Restrictions on the form of grammar rules can make parsing deterministic.
• Ambiguity
  – One program may have two different correct derivations from the grammar.
  – This is a problem if it implies two different semantic interpretations.
  – Famous examples are arithmetic operators and the dangling-else problem.

Ambiguity: Dangling Else Problem
• Which if does this else associate with?
    if X if Y find() else getConfused()
• The corresponding ambiguous grammar may be:
    IfStmt -> if Cond Action
            | if Cond Action else Action
• The two derivations at the top (associated with the outer "if") are:
    if Cond Action
    if Cond Action else Action
• Programming languages often associate the else with the nearest enclosing if, and restrict the grammar accordingly.

Resources
• Aho, Lam, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed., Addison-Wesley, 2006.
• Compiler Construction course notes at Linz: http://www.ssw.uni-linz.ac.at/Misc/CC/
• CS 143 compiler course at Stanford: http://www.stanford.edu/class/cs143/

Compiler Design
5. Top-Down Parsing with a Recursive Descent Parser
Kanat Bolazar
February 2, 2010

Parsing
• The lexical analyzer has translated the source program into a sequence of tokens
• The parser must translate the sequence of tokens into an intermediate representation
  – Assume that the interface is that the parser can call getNextToken to get the next token from the lexical analyzer
  – And that the parser can call a function called emit that will put out intermediate representations, currently unspecified
• The parser outputs error messages if the syntax of the source program is wrong

Parsing: Top-Down, Bottom-Up
• Given a grammar such as:
    E -> 0 | E + E
  and a string to parse such as "0 + 0":
• A parser can parse top-down, from the start symbol (E above):
    E -> E + E -> 0 + E -> 0 + 0
• Or parse bottom-up, grouping terminals into the RHS of rules:
    0 + 0 <- E + 0 <- E + E <- E
• Usually, parsing is done as tokens are read in:
  – Top-down: after seeing 0, we don't yet know which rule to use (E -> 0 or E -> E + E)
  – Bottom-up: we can shift 0 and defer the decision until more input has been seen

Parsing: Top-Down, Bottom-Up
• Generally:
  – Top-down is easier to understand and implement directly
  – Bottom-up is more powerful, allowing more complicated grammars
  – Top-down parsing may require changes to the grammar
• Top-down parsing can be done:
  – programmatically (recursive descent)
  – by table lookup and transitions
• Bottom-up parsing requires table-driven parsing
• If the grammar is not complicated, the simplest approach is to implement a recursive descent parser.
• A recursive descent parser (for a suitably restricted grammar) does not require backtracking.

Recursive Descent Parsing
• For every BNF rule (production) of the form
    <phrase1> → E
  the parser defines a function to parse phrase1 whose body parses the rule's right-hand side E:
    void parsePhrase1( ) { /* parse the rule E */ }
  where E consists of a sequence of nonterminal and terminal symbols.
• This requires that there be no left recursion in the grammar.
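As a preview, here is a skeletal recursive descent parser in Python, one function per nonterminal, for the small expression grammar used later in these notes (<expr> → <term> { ('+'|'-') <term> }*, <term> → 'const' | '(' <expr> ')'). A sketch under those assumptions, not the course's parser:

    class Parser:
        def __init__(self, toks):
            self.toks = toks + ["EOT"]
            self.pos = 0

        @property
        def current(self):
            return self.toks[self.pos]

        def expect(self, sym):
            if self.current != sym:
                raise SyntaxError("expected %s, got %s" % (sym, self.current))
            self.pos += 1

        # <expr> -> <term> { ('+'|'-') <term> }*
        def parse_expr(self):
            self.parse_term()
            while self.current in ("+", "-"):   # First(<op>)
                self.pos += 1
                self.parse_term()

        # <term> -> 'const' | '(' <expr> ')'
        def parse_term(self):
            if self.current == "const":
                self.pos += 1
            elif self.current == "(":
                self.pos += 1
                self.parse_expr()
                self.expect(")")
            else:
                raise SyntaxError("unexpected %s" % self.current)

    p = Parser(["const", "+", "(", "const", "-", "const", ")"])
    p.parse_expr(); p.expect("EOT"); print("parsed OK")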
Parsing a Rule
• A sequence of nonterminal and terminal symbols, Y1 Y2 Y3 … Yn, is recognized by parsing each symbol in turn
• For each nonterminal symbol Y, call the corresponding parse function parseY
• For each terminal symbol y, call a function expect(y) that will check whether y is the next symbol in the source program
  – The terminal symbols are the token types from the lexical analyzer
  – If the variable currentsymbol always contains the next token:
    expect(y):
        if (currentsymbol == y)
        then currentsymbol = getNextToken()
        else syntaxError

Simple Parse Function Example
• Suppose that there is a grammar rule:
    <program> → 'class' <classname> '{' <field-decl> <method-decl> '}'
• Then:
    parseProgram():
        expect('class');
        parseClassname();
        expect('{');
        parseFieldDecl();
        parseMethodDecl();
        expect('}');

Look-Ahead
• In general, one nonterminal may have more than one production, so more than one function body could be written to parse that nonterminal.
• Instead, we insist that we can decide which rule to parse just by looking ahead one symbol in the input:
    <sentence> -> 'if' '(' <expr> ')' <block>
                | 'while' '(' <expr> ')' <block>
    ...
• Then parseSentence can have the form:
    if (currentsymbol == "if") ...       // parse first rule
    elsif (currentsymbol == "while") ... // parse second rule

First and Follow Sets
• First(E) is the set of terminal symbols that may appear at the beginning of a sentence derived from E
  – It may also include ε if E can generate the empty string
• Follow(<N>), where <N> is a nonterminal symbol of the grammar, is the set of terminal symbols that can follow immediately after any sentence derived from any rule of N
• In this grammar:
    E -> 0 | E + E
  First(0) = {0}; First(E + E) = {0}; First(E) = {0}
  Follow(E) = {+, EOF}

Grammar Restriction 1
• Grammar Restriction 1 (for top-down parsing): the First sets of alternative rules for the same LHS must be disjoint (so we know which path to take upon seeing the first terminal symbol/token).
• Notice: this is not true in the grammar above. Upon seeing 0, we don't know whether to take the 0 path or the E + E path.

Recognizing Possibly Empty Sentences
• In a strict context-free grammar, there may be rules in which the rhs is ε, the empty string
• Or, in an extended BNF grammar, there may be the specification that some part of the rhs of the rule occurs 0 or 1 times:
    <phrase1> → … [ <phrase2> ] …
• Then we recognize the possibly empty occurrence of phrase2 by:
    if (currentsymbol is in First(<phrase2>)) then parsePhrase2()

Recognizing Sequences
• In a context-free grammar, you often have rules that specify that any number of occurrences of a phrase can occur:
    <arglist> → <arg> <arglist> | ε
• In extended BNF, we replace this with * to indicate 0 or more occurrences:
    <arglist> → <arg>*
• We can recognize these sequences by using iteration. If there is a rule of the form
    <phrase1> → … <phrase2>* …
  we can recognize the phrase2 occurrences by:
    while (currentsymbol is in First(<phrase2>)) do parsePhrase2()

Grammar Restriction 2
• In either of the previous cases, where a grammar symbol may generate empty sentences, the grammar must be restricted:
  – Suppose that <phrase2> is the symbol that can occur 0 times
  – Require that the sets First(<phrase2>) and Follow(<phrase2>) be disjoint
• Grammar Restriction 2: if a nonterminal may occur 0 times, its First and Follow sets must be disjoint (so we know whether to parse it or skip it on seeing a terminal symbol/token).

Multiple Rules
• Suppose that there is a nonterminal symbol with multiple rules, where each rhs is nonempty:
    <phrase1> → E1 | E2 | E3 | . . . | En
  Then we can write parsePhrase1 as follows:
    if (currentsymbol is in First(E1)) then ParseE1
    elsif (currentsymbol is in First(E2)) then ParseE2
    ...
    elsif (currentsymbol is in First(En)) then ParseEn
    else SyntaxError
• If any rhs can be empty, then don't give the syntax error in the last branch
• Remember the first grammar restriction: the sets First(E1), … , First(En) must be disjoint
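First sets can be computed by iterating to a fixed point. A small sketch in Python, using the same grammar-as-dictionary encoding as before (EPS marks ε; the encoding is an illustrative assumption):

    EPS = "eps"
    # E -> 0 | E + E    (the "Language 0+0" grammar; terminals are '0', '+')
    grammar = {"E": [["0"], ["E", "+", "E"]]}

    def first_sets(grammar):
        first = {n: set() for n in grammar}
        changed = True
        while changed:
            changed = False
            for n, alts in grammar.items():
                for rhs in alts:
                    add = set()
                    if not rhs:
                        add.add(EPS)           # empty alternative
                    for sym in rhs:
                        if sym not in grammar:  # terminal
                            add.add(sym)
                            break
                        add |= first[sym] - {EPS}
                        if EPS not in first[sym]:
                            break
                    else:
                        if rhs:                 # all symbols nullable
                            add.add(EPS)
                    if not add <= first[n]:
                        first[n] |= add
                        changed = True
        return first

    print(first_sets(grammar))   # {'E': {'0'}}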
Example Expression Grammar
• Suppose that we have the grammar:
    <expr> → <term> { <op> <term> }*
    <term> → 'const' | '(' <expr> ')'
    <op>   → '+' | '-'
• Parsing functions:
    void parseExpr( ) {
        parseTerm();
        while (cursym in First(<op>)) {
            getNextToken();
            parseTerm();
        }
    }

    void parseTerm( ) {
        if (cursym == 'const') then getNextToken();
        else if (cursym == '(') then {
            getNextToken();
            parseExpr();
            expect(')');
        }
    }

First Sets
• Here we give a more formal, and more detailed, definition of a First set, starting with any nonterminal.
  – If we have a set of rules for a nonterminal <phrase1>:
      <phrase1> → E1 | E2 | E3 | . . . | En
    then First(<phrase1>) = First(E1) + . . . + First(En)   (set union)
  – For any right-hand side Y1 Y2 Y3 … Yn, we make cases on the form of the rule:
    • First(a Y2 Y3 … Yn) = {a}, for any terminal symbol a
    • First(N Y2 Y3 … Yn) = First(N), for any nonterminal N that does not generate the empty string
    • First([N] M) = First(N) + First(M)    (0 or 1 occurrences of N)
    • First({N}* M) = First(N) + First(M)   (0 or more occurrences of N)

Follow Sets
• To define the set Follow(T), examine the cases of where the nonterminal T may appear on the rhs of a rule in the grammar:
  – N → S T U   or   N → S [T] U   or   N → S {T}* U
    • If U never generates an empty string, then Follow(T) includes First(U)
    • If U can generate an empty string, then Follow(T) includes First(U) and Follow(N)
  – N → S T   or   N → S [ T ]   or   N → S { T }*
    • Follow(T) includes Follow(N)
  – The Follow set of the start symbol should contain EOT, the end-of-text marker
• Include the Follow sets from all occurrences of T on the rhs of rules to make the set Follow(T)

Simple Error Recovery
• To enable the parser to keep parsing after a syntax error, the parser should be able to skip symbols until it finds a "synchronizing symbol".
  – E.g., in parsing a sequence of declarations or statements, skipping to a ';' should enable the parser to start parsing the next declaration or statement

General Error Recovery
• A more general technique allows the syntax error routine to be given a list of symbols that it should skip to:
    void syntaxError(String msg, Symbols stopSymbols) {
        give error with msg;
        while (! currentsymbol in stopSymbols) {
            getNextSymbol();
        }
    }
  – assuming that there is a type called Symbols for sets of terminal symbols
  – we may want to pass an error code instead of a message
• Each recursive descent procedure should also take stopSymbols as a parameter, and may modify the set before passing it to any procedure that it calls

Stop Symbols
• If the parser is trying to parse the rhs E of a nonterminal N → E, then the stop symbols are those symbols which the parser is prepared to recognize after a sentence generated by E
  – Remove anything ambiguous from Follow(N)
• The stop symbols should always also contain the end-of-text symbol, EOT, so that the syntax error routine never tries to skip past the end of the program.
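A sketch of this panic-mode recovery in Python, mirroring the pseudocode above (the token list and names are illustrative assumptions):

    EOT = "EOT"

    def syntax_error(msg, stop_symbols, toks, pos):
        """Report an error, then skip input until a synchronizing symbol."""
        print("syntax error:", msg)
        # stop_symbols always contains EOT, so we never run past the end.
        while toks[pos] not in stop_symbols:
            pos += 1                   # getNextSymbol
        return pos

    # Skipping to ';' lets the parser resume at the next statement.
    toks = ["int", "@", "@", ";", "x", "=", "1", ";", EOT]
    pos = syntax_error("invalid declaration", {";", EOT}, toks, 1)
    print("resuming at:", toks[pos])   # ';'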
Compiler Design
7. Top-Down Table-Driven Parsing
Kanat Bolazar
February 9, 2010

Table Driven Parsers
• Both top-down and bottom-up parsers can be written to explicitly manage a stack while scanning the input, to determine whether the input can be correctly generated from the grammar productions
  – In top-down parsers, the stack holds nonterminals, which are expanded by replacing them with the right-hand side of a production
  – In bottom-up parsers, the stack holds sequences of terminals and nonterminals, which are reduced by replacing the rhs of a production with its nonterminal
• Both techniques use a table to guide the parser in deciding what production to apply, given the top of the stack and the next input

Top-Down and Bottom-Up Parsers
• Predictive parsers are top-down, non-backtracking
  – Sometimes called LL(k):
    • Scan the input from Left to right
    • Generate a Leftmost derivation from the grammar
    • k is the number of lookahead symbols needed to make parsing deterministic
  – If a grammar is not in an LL(k) form, removing left recursion and doing left-factoring may produce one
  – Not all context-free languages can have an LL(k) grammar
• Shift-reduce parsers are bottom-up parsers, sometimes called LR(k)
  – Scan the input from Left to right
  – Produce a Rightmost derivation from the grammar
  – Not all context-free languages have LR grammars

(Non-recursive) Predictive Parser
• Replaces nonterminals on the top of the stack with the rhs of a production that can match the next input.
[Figure: the predictive parsing program reads input a + b <eot>, maintains a stack X Y Z <eot>, consults the parsing table M, and emits output.]

Parsing Algorithm
• The parser starts in a configuration with S <eot> on the stack:
    repeat:
        let X be the top-of-stack symbol, a the next input symbol
        if X is a terminal symbol or <eot>:
            if X = a, then pop X from the stack and getnextsym
            else error
        else:   // X is a nonterminal
            if M[X, a] = X → Y1 Y2 … Yk:
                pop X from the stack
                push Y1 Y2 … Yk on the stack, with Y1 on top
                output the production
            else error
    until the stack is empty

Example from the Dragon Book (Aho et al.)
• The expression grammar:
    E → E + T | T
    T → T * F | F
    F → ( E ) | id
  can be rewritten to eliminate the left recursion:
    E  → T E'
    E' → + T E' | ε
    T  → F T'
    T' → * F T' | ε
    F  → ( E ) | id

Parsing Table
• The table is indexed by the set of nonterminals in one direction and the set of terminals in the other
• Any blank entries represent error states
• A non-LL(1) grammar would have more than one rule in a table entry

          id          +             *             (           )           eot
    E     E → T E'                                E → T E'
    E'                E' → + T E'                             E' → ε      E' → ε
    T     T → F T'                                T → F T'
    T'                T' → ε        T' → * F T'               T' → ε      T' → ε
    F     F → id                                  F → ( E )

• Trace for the input id + id * id:

    Stack (top at right)   Input             Output
    $ E                    id + id * id $    E → T E'
    $ E' T                 id + id * id $    T → F T'
    $ E' T' F              id + id * id $    F → id
    $ E' T' id             id + id * id $    (match id)
    $ E' T'                + id * id $       T' → ε
    $ E'                   + id * id $       E' → + T E'
    $ E' T +               + id * id $       (match +)
    $ E' T                 id * id $         T → F T'
    $ E' T' F              id * id $         F → id
    $ E' T' id             id * id $         (match id)
    $ E' T'                * id $            T' → * F T'
    $ E' T' F *            * id $            (match *)
    $ E' T' F              id $              F → id
    $ E' T' id             id $              (match id)
    $ E' T'                $                 T' → ε
    $ E'                   $                 E' → ε
    $                      $                 (accept)

Constructing the LL Parsing Table
• For each production of the form N → E in the grammar:
  – For each terminal a in First(E), add N → E to M[N, a]
  – If ε is in First(E), then for each terminal b in Follow(N), add N → E to M[N, b]
  – If ε is in First(E) and eot is in Follow(N), add N → E to M[N, eot]
  – All other entries are errors
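Once the table exists, the predictive parsing algorithm is a short loop. A sketch in Python for the dragon-book expression grammar, with the table entries transcribed from the slide above (the dictionary encoding is an assumption):

    # LL(1) table: M[nonterminal][lookahead] = right-hand side (list of symbols)
    M = {
        "E":  {"id": ["T", "E'"], "(": ["T", "E'"]},
        "E'": {"+": ["+", "T", "E'"], ")": [], "eot": []},
        "T":  {"id": ["F", "T'"], "(": ["F", "T'"]},
        "T'": {"+": [], "*": ["*", "F", "T'"], ")": [], "eot": []},
        "F":  {"id": ["id"], "(": ["(", "E", ")"]},
    }

    def ll1_parse(tokens):
        input_ = tokens + ["eot"]
        stack = ["eot", "E"]            # start symbol on top
        i = 0
        while stack:
            X = stack.pop()
            a = input_[i]
            if X not in M:              # X is a terminal or eot
                if X != a:
                    raise SyntaxError("expected %s, got %s" % (X, a))
                i += 1                  # match: consume the input symbol
            elif a in M[X]:
                rhs = M[X][a]
                print(X, "->", " ".join(rhs) or "eps")
                stack.extend(reversed(rhs))
            else:
                raise SyntaxError("no table entry for M[%s, %s]" % (X, a))
        print("accepted")

    ll1_parse(["id", "+", "id", "*", "id"])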
Compiler Design
9. Table-Driven Bottom-Up Parsing: LR(0), SLR, LR(1), LALR
Kanat Bolazar
February 16, 2010

Table Driven Parsers; Top-Down and Bottom-Up Parsers
• (Review of Lecture 7: both top-down and bottom-up parsers explicitly manage a stack, guided by a table indexed by the top of the stack and the next input. Predictive LL(k) parsers expand nonterminals top-down and produce a leftmost derivation; shift-reduce LR(k) parsers reduce handles bottom-up and produce a rightmost derivation. Not all context-free languages have LL or LR grammars.)

Bottom-Up (Shift-Reduce) Parsers
• Also called a shift-reduce parser because it will either:
  – Reduce a sequence of symbols on the stack that is the rhs of a production to its nonterminal, or
  – Shift an input symbol to the top of the stack
[Figure: the shift-reduce parser reads input a + b <eot>, maintains a stack X Y Z <eot>, consults the parsing table M, and emits output.]

Shift Reduce Parser Actions
• During the parse:
  – The stack has a sequence of terminal and nonterminal symbols representing the part of the input worked on so far, and
  – The input has the remaining symbols
• Parser actions:
  – Reduce: if the stack has a sequence F E and there is a production N → E, we can replace E by N to get F N on the stack.
  – Shift: if there is no possible reduction, transfer the next input symbol to the top of the stack.
  – Error: otherwise, it is an error.
• If, after a reduce, we get the start symbol on the top of the stack and there is no more input, then we have succeeded.

Handles
• During the parse, the term handle refers to a sequence of symbols on the stack that:
  – Matches the rhs of a production, and
  – Will be a step along the path of producing a correct parse tree
• Finding the handle, i.e. identifying when to reduce, is the central problem of bottom-up parsing
• Note that ambiguous grammars do not fit (as they didn't for top-down parsing, either), because there may not be a unique handle at a given step
  – E.g., the dangling-else problem
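A hand-driven shift-reduce trace for the grammar E → E - 1 | 1 (the grammar of the LR(0) example below), sketched in Python. The handle tests are hard-coded for this tiny grammar; a real parser gets these decisions from a table:

    def reduce_step(stack):
        """Return the rule applied, or None if no handle is on top."""
        if stack[-3:] == ["E", "-", "1"]:
            del stack[-3:]; stack.append("E"); return "E -> E - 1"
        if stack[-1:] == ["1"]:
            stack[-1] = "E"; return "E -> 1"
        return None

    def shift_reduce(tokens):
        stack, rest = [], list(tokens)
        while True:
            rule = reduce_step(stack)
            if rule:
                print("reduce", rule, "  stack:", stack)
            elif rest:
                stack.append(rest.pop(0))      # shift
                print("shift            stack:", stack)
            else:
                break
        return stack == ["E"]                  # success: start symbol alone

    print(shift_reduce(["1", "-", "1", "-", "1"]))  # True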
LR Parsing
• A specific way to implement a shift-reduce parser is an LR parser.
• This parser represents the state of the stack by a single state symbol on the top of the stack
• It uses two parsing tables, action and goto:
  – For any parsing state and input symbol, the action table tells what action to take:
    • Sn, meaning shift and go to state n
    • Rn, meaning reduce by rule n
    • Accept
    • Error
  – For any parsing state and nonterminal symbol N, the goto table gives the next state when a reduce to the nonterminal symbol N has been performed

LR Parser
• A shift-reduce parser that encodes the stack contents with a state on the top of the stack
  – The top-of-stack state and the next input symbol are used to look up the parser's action and goto entries in the table
[Figure: the LR parser reads input a + b <eot>; its stack interleaves grammar symbols and states, X s1 Y s2 Z s3; it consults the parsing table M and emits output.]

Types of LR Parsers
• LR parsers can work on more general grammars than LL parsers
  – They have more history on the stack to make decisions with than top-down parsers
• LR parser variants differ in how they generate the action and goto tables
• Types of parsers, listed in order of increasing power (ability to handle grammars) and decreasing efficiency (the size of the parsing tables becomes very large):
  – LR(0): standard/general LR, with 0 lookahead
  – SLR(1): "Simple LR", with 1 lookahead
  – LALR(1): "Lookahead LR", with 1 lookahead
  – LR(1): standard LR, with 1 lookahead

Types of LR Parsers: Comparisons
• Here is a subjective (personal) comparison of the parser classes
• The LALR class of grammars is the most useful, and the most complicated to construct

    Name      Power           Table size (+: small)  Conceptual complexity  Utility / popularity
    LR(0)     -- too weak     +                      + simple               -- never used
    SLR(1)    - weak          +                      =                      - (was popular before LALR)
    LALR(1)   + balanced      = or ~= SLR            -- complicated         ++
    LR(1)     ++              -- 10x, too large      -                      --

LR(0) Parsing Tables
• Although not used in practice, LR(0) table construction illustrates the key ideas
• An item (or configuration) is a production with a dot in the middle; e.g., there are three items from A → XY:
    A → •XY    X will be parsed next
    A → X•Y    X parsed; Y will be parsed next
    A → XY•    X and Y parsed; we can reduce to A
• The item represents how much of the production we have seen so far in the parsing process.

LR(0): Closure and Goto Operations
• Closure is used to construct a configurating set for each item. For the starting item N → W•Y:
  – N → W•Y is in the set
  – If Y begins with a terminal, we are done
  – If Y begins with a nonterminal N', add all N' productions with the dot at the start of the rhs: N' → •Z
• For each configurating set and grammar symbol, the goto operation gives another configurating set:
  – If a set of items I contains N → W • x Y, where W and Y are sequences but x is a single grammar symbol, then goto(I, x) contains N → W x • Y
• To create the family of configurating sets for a grammar, add an initial production S' → S, and construct the sets starting from S' → •S
• Use the sets as parser states – items with the dot at the end call for a reduce
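A sketch of the closure operation in Python, representing an item as a (lhs, rhs, dot) triple and using the grammar of the example below (the encoding is an illustrative assumption):

    # Grammar: S -> E, E -> E - 1 | 1   (the running example below)
    grammar = {"S": [["E"]], "E": [["E", "-", "1"], ["1"]]}

    def closure(items):
        """Items are (lhs, rhs, dot) triples; rhs is a tuple of symbols."""
        result = set(items)
        changed = True
        while changed:
            changed = False
            for lhs, rhs, dot in list(result):
                if dot < len(rhs) and rhs[dot] in grammar:  # dot before a nonterminal
                    for prod in grammar[rhs[dot]]:
                        item = (rhs[dot], tuple(prod), 0)
                        if item not in result:
                            result.add(item)
                            changed = True
        return result

    s1 = closure({("S", ("E",), 0)})
    for lhs, rhs, dot in sorted(s1):
        print(lhs, "->", " ".join(rhs[:dot]) + "•" + " ".join(rhs[dot:]))
    # Prints the three items of state s1: S -> •E, E -> •E - 1, E -> •1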
LR(0) Example
• Consider a simple grammar, E → E - 1 | 1, and add an initial rule:
    rule 1: E → E - 1
    rule 2: E → 1
    S → E        (start symbol added for LR(0))
• The states, with their closures, actions and gotos, are:
    s1: S → •E , E → •E - 1 , E → •1
        action(s1, '1') = shift 2          goto(s1, E) = s3
    s2: E → 1•
        action(s2, on any token) = reduce by rule 2
    s3: S → E• , E → E• - 1
        action(s3, EOT) = accept           action(s3, '-') = shift 4
    s4: E → E - •1
        action(s4, '1') = shift 5
    s5: E → E - 1•
        action(s5, on any token) = reduce by rule 1
LR(0) Example: Table
• Collecting the states and actions above into the LR(0) parsing table (a sketch of the table-driven parse loop follows):

State   Action                        Goto
        -        1        EOT         E
s1      err      s2       err         s3
s2      r2       r2       r2
s3      s4       err      accept
s4      err      s5       err
s5      r1       r1       r1
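To make the table concrete, here is a minimal sketch of the table-driven parse loop for this grammar, with the action and goto entries above hard-coded. It is an illustration, not the lecture's implementation; since the only E-goto in this grammar is from s1, the goto table collapses to one constant.

    // A minimal table-driven LR(0) parse loop for E -> E - 1 | 1.
    import java.util.*;

    class LR0Driver {
        // action[state][token]: "sN" shift to state N, "rN" reduce by rule N,
        // "acc" accept, missing entry = error; '$' stands for EOT
        static final Map<Integer, Map<Character, String>> ACTION = Map.of(
            1, Map.of('1', "s2"),
            2, Map.of('-', "r2", '1', "r2", '$', "r2"),
            3, Map.of('-', "s4", '$', "acc"),
            4, Map.of('1', "s5"),
            5, Map.of('-', "r1", '1', "r1", '$', "r1"));
        static final int GOTO_E_FROM_S1 = 3;           // goto(s1, E) = s3
        static final int[] RULE_RHS_LEN = {0, 3, 1};   // rule 1: E - 1; rule 2: 1

        static boolean parse(String input) {            // e.g. "1-1$"
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(1);
            int i = 0;
            while (true) {
                String a = ACTION.getOrDefault(stack.peek(), Map.of())
                                 .get(input.charAt(i));
                if (a == null) return false;             // error entry
                if (a.equals("acc")) return true;
                int n = a.charAt(1) - '0';
                if (a.charAt(0) == 's') { stack.push(n); i++; }  // shift
                else {                                   // reduce by rule n (lhs E)
                    for (int k = 0; k < RULE_RHS_LEN[n]; k++) stack.pop();
                    stack.push(GOTO_E_FROM_S1);
                }
            }
        }
        public static void main(String[] args) {
            System.out.println(parse("1-1$"));           // true
            System.out.println(parse("1--1$"));          // false
        }
    }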
Limitations of LR(0)
• Since there is no look-ahead, the parser must know whether to shift or reduce based on the parsing stack alone
– A configurating set can have only shifts, or only a single reduce; it cannot choose between them based on the input (e.g., it can't shift for '-' and reduce for '1' in the same state)
• Problematic examples
– Epsilon rules create a shift/reduce conflict whenever there are other rules
– Items like these have a shift/reduce conflict:
T → id•          reduce?
T → id• [ E ]    shift?
– Two completed items in the same set (e.g., T → id• and V → id•) have a reduce/reduce conflict

SLR(1) Parsing
• SLR(1), simple LR, uses the same configurating sets, table structures and parser operations
• When assigning table actions, don't assume that any completed item should be reduced
– Look ahead by using the Follow set of the item's left-hand side
– Reduce an item N → Y• only if the next input symbol is in Follow(N)
• A configurating set may now contain shift and reduce items together, but the tokens shifted and the Follow sets involved must be pairwise disjoint
– In particular, there must be no reduce/reduce conflicts in the state

SLR(1) Table: Reduce Depends on Token
s1: S → •E , E → •E - 1 , E → •1
    action(s1, '1') = shift 2        goto(s1, E) = s3
s2: E → 1•
    action(s2, {-, EOT}) = reduce by rule 2
s3: S → E• , E → E• - 1
    action(s3, EOT) = accept         action(s3, '-') = shift 4
s4: E → E - •1
    action(s4, '1') = shift 5
s5: E → E - 1•
    action(s5, {-, EOT}) = reduce by rule 1

State   Action                        Goto
        -        1        EOT         E
s1               s2                   s3
s2      r2                r2
s3      s4                accept
s4               s5
s5      r1                r1

• Compare with the LR(0) table shown earlier: there, states s2 and s5 reduce on any token; here they reduce only when the lookahead is in Follow(E) = {-, EOT}

LR(1) Parsing
• Although SLR(1) uses 1 lookahead symbol, it still does not use all of the information that could be kept in a parsing state by tracking which path led to each item
• Not every token in Follow(X) is possible after every rule for X
• In LR(1) parsing tables, we keep the lookahead in the parsing state and split states accordingly, so that they can have more detailed successor states:
– A -> B C • D E F , a/b/c
– A will eventually be reduced if the lookahead token after F is one of {a, b, c}
– If any other token is seen, some other action may be taken
– If there is no action, it's an error
• Leads to much larger numbers of states (in the thousands, instead of hundreds)
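Returning to the SLR rule above (reduce N → Y• only on tokens in Follow(N)), here is a minimal sketch of how table actions for one state could be assigned. Everything here is an assumption for illustration: the Item record is reused from the earlier closure sketch, the follow map is taken as precomputed, the rule-number keys are a made-up encoding, and the shift target ("s?") is left abstract since it comes from the goto construction.

    // Sketch: assigning SLR(1) actions for one state from its items.
    import java.util.*;

    class SLRActions {
        static Map<String, String> actions(Set<Item> state,
                                           Map<String, Set<String>> follow,
                                           Map<String, Integer> ruleNumber,
                                           Set<String> terminals) {
            Map<String, String> action = new HashMap<>();  // token -> "sN"/"rN"
            for (Item it : state) {
                String x = it.symbolAfterDot();
                if (x == null) {
                    // completed item: reduce only on tokens in Follow(lhs)
                    String r = "r" + ruleNumber.get(it.lhs() + "->" + it.rhs());
                    for (String tok : follow.get(it.lhs())) {
                        String prev = action.put(tok, r);
                        if (prev != null && !prev.equals(r))
                            throw new IllegalStateException(
                                "not SLR(1): conflict on " + tok);
                    }
                } else if (terminals.contains(x)) {
                    // shift entry; the target state comes from goto (not shown)
                    String prev = action.put(x, "s?");
                    if (prev != null && prev.startsWith("r"))
                        throw new IllegalStateException(
                            "not SLR(1): shift/reduce conflict on " + x);
                }
            }
            return action;
        }
    }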
LALR(1) Parsing
• Compromises between the simplicity of SLR and the power of LR(1) by merging similar LR(1) states
– Identify a core of configurating sets and merge states that differ only by lookahead
– This is not just SLR, because LALR will have fewer reduce actions; but merging may introduce reduce/reduce conflicts that LR(1) did not have
• Constructing LALR(1) parsing tables
– Is not usually done by brute force (constructing LR(1) and then merging sets)
– Instead, as configurating sets are generated, each new set is examined to see if it can be merged with an existing one

More on LR Parsing
• Almost all shift-reduce parsing is done with automatically generated parser tables
• Look at the types of parsers in available parser generators
– http://en.wikipedia.org/wiki/Category:Parsing_algorithms
– Note the types of parsers (but not types of trees)
• Bison (yacc)
• ANTLR
• JavaCC
• Coco/R
• Elkhound

One More Type of SR Parsing
• Operator precedence parsing
– Useful for expression grammars and other types of ambiguities
– Doesn't use a full table; uses operator precedence rules to resolve conflicts
– Fits in with the various types of LR parsers
• In addition to the action table, the parsing algorithm can appeal to an operator precedence table

General Context-Free Parsers
• All of the table-driven parsers work on grammars in particular forms and may not work for arbitrary CFGs, including ambiguous ones
• General backtracking parsers, O(n^3)
– CYK (Cocke, Younger, Kasami) algorithm
• Produces a forest of parse trees
– Earley's algorithm
• Notable for carrying along partial parses (subtrees); the first of the chart parsers
• General parallel parser, can be O(n^3)
– GLR copies the relevant parts of the LR stack and parses in parallel whenever there is a conflict; otherwise the same as LALR

Compiler Design
11. Table-Driven Bottom-Up Parsing: LALR
More Examples for LR(0), SLR, LR(1), LALR
Kanat Bolazar
January 19, 2010

Bottom-Up Parsers
• We have been looking at bottom-up parsers. They are also called shift-reduce parsers
– shift: put the next token on the stack, move on
– reduce: the tokens on top of the stack match the RHS of a rule; reduce them to the non-terminal on the left
• LR: Scan the input from Left to right, producing a Rightmost derivation (in reverse)
• Not all context-free languages have LR grammars

Shift-Reduce Parsing
• Example grammar: E → 1 | E - 1 (rules 1 and 2)
• Input: 1 - 1 $   ($: end of file / tape)
• Steps (stack contents, then the action taken):
1          shift
E          reduce by rule 1
E -        shift
E - 1      shift
E          reduce by rule 2

• Recap from the last lecture: the LR(0) table for this grammar reduces on any token in states s2 and s5, while the SLR(1) table reduces only when the lookahead is in Follow(E) = {-, EOT}; LR(1) additionally records, in each state, the lookaheads for which a reduce will eventually happen, and LALR(1) merges LR(1) states that differ only in lookahead (see the tables and discussion above)

Recap: LR(0), SLR, LR(1), LALR
• LR(0): Don't look ahead when reducing according to a rule. When we reach the end of a RHS, we reduce.
• SLR = SLR(1): Use the Follow set of the non-terminal on the left. If the lookahead is in the Follow set, we reduce.
• LR(1): Add the expected lookahead for which we will eventually reduce. Produces very large tables.
• LALR = LALR(1): Use LR(1), but combine states that differ only in lookahead.
• Note: LALR is not SLR. For S → V = V | V = V + V, after the first V, SLR would reduce if the next token is '+'; LR(1) and LALR wouldn't.

Example 1.0
• Let's start with a simple grammar:
1  S → B b
2  S → a a
3  B → a
• What strings are allowed in this grammar?  ab (from B b) and aa
• Consider seeing a string that starts with a: "a ..."
– Should we shift a, or reduce a to B according to rule 3?
– What would LR(0) parsing do?  Shift/reduce conflict: it can't parse this grammar!
Example 1.1
• The original simple grammar allows only "a a" and "a b":
1  S → B b
2  S → a a
3  B → a
• SLR: Follow(B) = {b}; look ahead: shift on a, reduce on b
• Can we make the grammar harder for SLR? Of course! Just add 'a' to Follow(B) somehow: add S → b B a
– The grammar also allows "b a a" now. This should be irrelevant for "a a", but SLR can't decide: Follow(B) = {a, b}: conflict for "a a"!

Example 1.1 (continued)
• Modified grammar, reorganized:
1  S → B b
2  S → a a
3  S → b B a
4  B → a
• Input: "a ..."
• SLR: Follow(B) = {a, b}; shift/reduce conflict on a, reduce on b
• LR(1): State 0:
S' → . S , $
S  → . B b , $
S  → . a a , $
B  → . a , b
• After seeing a, transition to State 1:
S → a . a , $
B → a . , b
– Reduce? Shift? Which lookahead?
– Reduce for b, shift for a: no conflict in LR(1).
LR(1) vs LALR
• We went through the previous example with LALR in class
• The states and transitions were mostly the same as those of LR(1)

Example 2
• Assignment statement with variables:
1  S → V = V
2  S → V = V + V
3  V → id
• Use LR(0), SLR, LR(1), LALR. Shown in class.
• SLR doesn't know that an initial V can't be followed by '+' or $ (EOF)
• LR(1) knows it; the '=' that must follow is attached to the V rule:
s0: S → . V = V , $
    S → . V = V + V , $
    V → . id , =        (due to the previous two lines)

Example 3
• Another grammar (for the regular expression a*ba*b):
1  S → X X
2  X → a X
3  X → b
• Create the LR(0), SLR and LR(1) tables for table-driven parsing
• Draw the states and state transitions for one of these tables
• Compare it to the minimal FSM for a*ba*b: three states s0, s1, s2, with 'a' self-loops on s0 and s1 and 'b' transitions s0 → s1 → s2; it accepts bb, abb, bab, baab, abaaab, ...
• In LR(0), not looking ahead before reducing adds extra states

Example 4
• A harder grammar:
1  S → a X c
2  S → b Y c
3  S → a Y d
4  S → b X d
5  X → e
6  Y → e
• Use LR(0), SLR, LR(1), LALR
• We did not yet do this example in class. We will, later, to review how to do table-driven parsing.

Compiler Design
13. Symbol Tables
Kanat Bolazar
March 4, 2010

Symbol Tables
• The job of the symbol table is to store all the names of the program and information about each name
• In block-structured languages, roughly speaking, the symbol table collects information from declarations and uses that information whenever a name is used later in the program
– This information could be part of the syntax tree, but is put into a table for efficient access to names
• If there are different occurrences of the same name, the symbol table assists in name resolution
• Either the parser or the lexical analyzer can do the job of inserting names into the symbol table (as long as scope information is given to the lexer)

Symbol Table Entries: Simple Variables, Basic Information
• Variables (identifiers)
– Character string (lexeme); may have limits on the number of characters
– Data type
– Storage class (if not already implied by the data type)
– Name and lexical level of the block in which it is declared
– Other access information, if necessary, such as modifiability constraints
– Base address and memory offset, after allocation

Symbol Table Entries: Beyond Simple Variables
• Arrays
– Also need the number of dimensions
– Upper and lower bounds of each dimension
• Records and structures
– List of fields
– Information about each field
• Functions and procedures
– Number and types of parameters
– Type of return value
• Function pointers?
Symbol Table Representation
• The two main operations are
– insert(name): makes an entry for this name
– lookup(name): finds the relevant occurrence of the name by searching the table
• Lookups occur a lot more often than inserts
• Hash tables are commonly used, because of the good average time complexity for lookup (O(1))
(figure: a hash table whose buckets hold entries var1, class1, fn1, var2, fn2, var3)

Scope Analysis
• The scope of a name is tied to the idea of a block in the programming language
• Names must be unique within the block in which they are declared (no two objects with the same name in one block)
– Standard blocks (statement sequences, sometimes the if statement)
– Procedures and functions
– Program (global program level)
– Universe (predefined functions, etc.)
• There are some languages with exceptions for different types (a function and a variable may have the same name)
• Name resolution: a use of a name should refer to the most local enclosing block that has a declaration for that name

Declaration Before Use?
• We are dealing primarily with languages in which declarations of names are required
– Names of variables, constants, arrays, etc. must be declared before use
– Names of functions and procedures vary
• C requires functions and procedures to also be declared before use, or at least given a prototype
• Java does not require this for methods (you can call first and define later in the *.java file)
• Scope of a name (in a statically scoped language):
– The scope of a constant, variable, array, etc. is from the end of its definition to the end of the block in which it is declared
– The scope of a function or procedure name

Further Structure of the Symbol Table
• For nested scopes, we may use a list of hash tables, with one element of the list for each scope
• The lookup function will first search the current lexical level's table and then continue up the list, using the first occurrence of the name that it finds
• Parts of the table not currently active may be kept for future semantic analysis
(figure: Table A for Scope A, which declares float x, y; Table B for nested Scope B, which declares int x, z. B.x shadows A.x; lookup finds B.x first)

More Symbol Table Functions
• In addition to lookup and insert, the symbol table will also need
– initializeScope(level), when a block is entered, to create a new hash table entry in the symbol table list
– finalizeScope(level), on block exit, to put the current hash table into a background list
• Essentially makes a tree structure (scope A may contain scopes B1, B2, B3, ...), where one child may be distinguished as the active block
• The symbol tables shown so far are all for the program being compiled; also needed is a way to look up names in the "universe"
– System-defined names (predefined types, functions, values)

Example: Predeclared Names in MicroJava
• Predeclared in MicroJava:
– Types: int, char
– Constants: null
– Methods: ord(ch), chr(i), len(arr)
• We can put these in the symbol table as well, e.g. as the list:
Type int → Type char → Const null → Method ord (param: Var ch) → Method chr (param: Var i) → Method len (param: Var arr)
• Shown as a list here; the symbol table is probably a hash table instead. A sketch of the scoped table follows.
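Here is a minimal sketch of the list-of-hash-tables structure just described. All names (SymbolTable, Entry, the String type field) are illustrative assumptions, not the lecture's code; entries are simplified to a name and a type.

    // Minimal nested-scope symbol table: a stack (list) of hash tables.
    import java.util.*;

    class SymbolTable {
        static class Entry {                   // basic info per name (simplified)
            final String name, type;
            Entry(String name, String type) { this.name = name; this.type = type; }
        }
        private final Deque<Map<String, Entry>> scopes = new ArrayDeque<>();
        private final List<Map<String, Entry>> closed = new ArrayList<>();

        SymbolTable() { initializeScope(); }   // the "universe"/global scope

        void initializeScope() { scopes.push(new HashMap<>()); }

        void finalizeScope() {                 // keep closed scopes for later passes
            closed.add(scopes.pop());
        }

        boolean insert(Entry e) {              // false: duplicate in this block
            return scopes.peek().putIfAbsent(e.name, e) == null;
        }

        Entry lookup(String name) {            // innermost declaration wins
            for (Map<String, Entry> scope : scopes) {  // Deque iterates top-down
                Entry e = scope.get(name);
                if (e != null) return e;
            }
            return null;                       // undeclared
        }
    }

With this sketch, the shadowing example above works as described: after x is inserted in both Scope A and nested Scope B, lookup("x") returns B's entry until finalizeScope() closes B.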
Alternate Representation
• The list of hash tables can be inefficient for lookup, since the system has to search up the list of lexical levels
– More names tend to be declared at level 0 (globals), making the most common lookups the most expensive
• An optimization is to keep one giant hash table
– Within that table, each name has a list of occurrences identified by lexical level
– This representation keeps the (essentially) constant-time lookup
– But it makes leaving a block more expensive, as the hash table must be searched to find all entries that need to be removed and stored elsewhere
• Comparison:
– A single symbol table: faster lookup, slow scope close (e.g., entries c1, c2 must be removed after scope C ends)
– Hierarchical symbol tables: faster scope close, slow lookup (of globals, especially)

Static Scope
• The scoping system described so far assumes that the scope rules are for static scoping
– The static layout of enclosing blocks determines the scoping of a name
• There are also languages with dynamic scoping
– The scoping of a name depends on the call structure of the program at run time
– The name resolution will be to the closest block on the call stack that has a declaration of that name: the most recently called function or block

Object-Oriented Scoping
• Languages like Java must keep symbol tables for
– The code being compiled
– Any external classes that are known and referenced inside the code
– The inheritance hierarchy above the class containing the code
• One method of implementation is to attach a symbol table to each class, with two nesting hierarchies
– One for lexical scoping inside individual methods
– One to follow the inheritance hierarchy of the class
• Resolving a name
– First consult the lexically scoped symbol table
– If not found, search the classes in the inheritance hierarchy
– If not found, search the global name space

Testing and Error Recovery
• If a name is used but the lookup fails to find any definition
– Give an error, but enter the name with dummy type information so that further uses do not also trigger errors
• If a name is defined twice
– Give an ambiguity error; choose which type to use in later analysis, usually the first
• Testing cases
– Include all types of correct declarations
– Incorrect cases may include
• Definition of an ambiguous name
• Definition without a name
• Meaningless recursive definitions (in some types of structures)

References
• Nancy McCracken's original slides.
• Linz University compiler course materials (MicroJava).
• Keith Cooper and Linda Torczon, Engineering a Compiler, Elsevier, 2004.
• Kenneth C. Louden, Compiler Construction: Principles and Practices, PWS Publishing, 1997.
• Per Brinch Hansen, On Pascal Compilers, Prentice-Hall, 1985. Out of print.
• Aho, Lam, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 2006. (The purple dragon book)

Compiler Design
14. AST (Abstract Syntax Tree) and Syntax-Directed Translation
Kanat Bolazar
March 9, 2010

Abstract Syntax Tree (AST)
• The parse tree
– contains too much detail, e.g. unnecessary terminals such as parentheses
– depends heavily on the structure of the grammar, e.g. intermediate non-terminals
• Idea:
– strip the unnecessary parts of the tree, simplify it
– keep track only of important information
• AST
– Conveys the syntactic structure of the program while providing abstraction
– Can be easily annotated with semantic information

(figure: a parse tree for "IF cond THEN statement" becomes an if-statement AST node with children cond and statement; a parse tree for id + id * num with intermediate add_expr/mul_expr non-terminals becomes an AST with + and * as interior nodes over the identifiers and the number)

Lexical, Parse, Semantic
• Ultimate goal: generate machine code
• Before we generate code, we must collect information about the program
• After lexical analysis and parsing, we are at semantic analysis (recognizing meaning)
• There are issues deeper than structure. Consider:
int func (int x, int y);
int main () {
    int list[5], i, j;
    char *str;
    j = 10 + 'b';
    str = 8;
    m = func("aa", j, list[12]);
    return 0;
}

Beyond Syntax Analysis
• An identifier named x has been recognized
– Is x a scalar, array or function?
– How big is x?
– If x is a function, how many and what type of arguments does it take?
– Is x declared before being used?
– Where can x be stored?
– Is the expression x+y type-consistent?
• Semantic analysis is the phase where we collect information about the types of expressions and check for type-related errors
• The more information we can collect at compile time, the less overhead we have at run time

Semantic Analysis
• Collecting type information may involve "computations"
– What is the type of x+y given the types of x and y?
• Tool: attribute grammars
– CFG
– Each grammar symbol has associated attributes
– The grammar is augmented by rules (semantic actions) that specify how the values of attributes are computed from other attributes
– The process of using semantic actions to evaluate attributes is called syntax-directed translation
– Examples: grammar of declarations; grammar of signed binary numbers

Attribute Grammars
• Example 1: Grammar of declarations

Production      Semantic rule
D → T L         L.in = T.type
T → int         T.type = integer
T → char        T.type = character
L → L1 , id     L1.in = L.in; addtype(id.index, L.in)
L → id          addtype(id.index, L.in)

• Example 2: Grammar of signed binary numbers

Production      Semantic rule
N → S L         if (S.neg) print('-'); else print('+'); print(L.val);
S → +           S.neg = 0
S → -           S.neg = 1
L → L1 B        L.val = 2*L1.val + B.val
L → B           L.val = B.val
B → 0           B.val = 0*2^0
B → 1           B.val = 1*2^0

Attributes
• Attributed parse tree = parse tree annotated with attribute rules
• Each rule implicitly defines a set of dependences
– Each attribute's value depends on the values of other attributes
• These dependences form an attribute-dependence graph
• Note:
– Some dependences flow upward: the attributes of a node depend on those of its children; we call those synthesized attributes
– Some dependences flow downward: the attributes of a node depend on those of its parent or siblings; we call those inherited attributes
• How do we handle non-local information?
– Use copy rules to "transfer" information to other parts of the tree

Attribute Grammars

Production      Semantic rule
E → E1 + E2     E.val = E1.val + E2.val
E → num         E.val = num.yylval
E → ( E1 )      E.val = E1.val

(figure: attribute-dependence graph for the annotated parse tree of 2 + (7 + 3); the inner E.val is 10 and the root E.val is 12, with values flowing up from the num leaves)

Attribute Grammars
• We can use an attribute grammar to construct an AST
• The attribute for each non-terminal is a node of the tree
• Example (a parser sketch using these rules follows):

Production      Semantic rule
E → E1 + E2     E.node = new PlusNode(E1.node, E2.node)
E → num         E.node = num.yylval
E → ( E1 )      E.node = E1.node
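A minimal sketch of how these rules could build an AST during top-down parsing. PlusNode is the slide's name; everything else (Node, NumNode, the tokenizer) is an assumption, and the left-recursive E → E1 + E2 rule is rewritten as a loop, as a top-down parser would require.

    // Sketch: building an AST for E -> E + E | num | ( E ) while parsing.
    interface Node {}
    record NumNode(int value) implements Node {}
    record PlusNode(Node left, Node right) implements Node {}

    class ExprParser {
        private final String[] tokens;   // e.g. {"2", "+", "(", "7", "+", "3", ")"}
        private int pos = 0;
        ExprParser(String[] tokens) { this.tokens = tokens; }

        // E -> T ('+' T)* ; E.node is built incrementally, left-associative
        Node parseE() {
            Node node = parseT();
            while (pos < tokens.length && tokens[pos].equals("+")) {
                pos++;                                 // consume '+'
                node = new PlusNode(node, parseT());   // E.node = new PlusNode(...)
            }
            return node;
        }

        Node parseT() {
            if (tokens[pos].equals("(")) {             // E -> ( E1 ): E.node = E1.node
                pos++;
                Node inner = parseE();
                pos++;                                 // consume ')'
                return inner;
            }
            return new NumNode(Integer.parseInt(tokens[pos++]));  // E -> num
        }
    }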
• Notes:
– yylval is assumed to be a node (leaf) created during scanning
– The production E → ( E1 ) does not create a new node, as the parentheses are not needed in the AST

Evaluating Attributes
• Evaluation Method 1: dynamic, dependence-based (at compile time)
– Build the dependence graph
– Topsort the dependence graph
– Evaluate attributes in topological order
• This can only work when attribute dependencies are not circular
– It is possible to test for that
– Circular dependencies show up in data flow analysis (optimization) or may appear due to features such as goto
• Other evaluation methods
– Method 2: oblivious
• Ignore rules and parse tree; determine an order at design time
– Method 3: static, rule-based
• At compiler construction time, analyze the rules
• Determine an ordering based on grammatical structure (parse tree)

Attribute Grammars
• We are interested in two kinds of attribute grammars:
– S-attributed grammars
• All attributes are synthesized (flow up)
– L-attributed grammars
• Attributes may be synthesized or inherited, AND
• Inherited attributes of a non-terminal only depend on the parent or the siblings to the left of that non-terminal
– This way it is easy to evaluate the attributes by doing a depth-first traversal of the parse tree
• Idea (useful for rule-based evaluation)
– Embed the semantic actions within the productions to impose an evaluation order

Embedding Rules in Productions
• Synthesized attributes depend on the children of a non-terminal, so they should be evaluated after the children have been parsed
• Inherited attributes that depend on the left siblings of a non-terminal should be evaluated right after the siblings have been parsed
• Inherited attributes that depend on the parent of a non-terminal are typically passed along through copy rules (more later)
• Example: T.type is synthesized and evaluated after parsing int; L.in is inherited and evaluated after parsing T but before L:

D → T {L.in = T.type} L
T → int {T.type = integer}
T → char {T.type = character}
L → {L1.in = L.in} L1 , id {L.action = addtype(id.index, L.in)}
L → id {L.action = addtype(id.index, L.in)}

Rule Evaluation in Top-Down Parsing
• Recall that a predictive parser is implemented as follows:
– There is a routine to recognize each rhs, containing calls to routines that recognize the non-terminals or match the terminals of the production
– We can pass the attributes as parameters (for inherited) or return values (for synthesized), as in the sketch below
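A minimal sketch of that convention for the declaration grammar above (D → T L; T → int | char; L → L, id | id). The symbol-table plumbing is a stand-in, not the lecture's code, and the left-recursive L rule is implemented as a loop.

    // Sketch: inherited attributes as parameters, synthesized as returns.
    import java.util.*;

    class DeclParser {
        private final List<String> tokens;     // e.g. ["int", "a", ",", "b"]
        private int pos = 0;
        final Map<String, String> symtab = new HashMap<>();
        DeclParser(List<String> tokens) { this.tokens = tokens; }

        void parseD() {
            String type = parseT();            // T.type: synthesized, returned
            parseL(type);                      // L.in: inherited, passed down
        }

        String parseT() {                      // returns T.type
            return tokens.get(pos++);          // "int" or "char"
        }

        void parseL(String in) {               // parameter in = L.in
            symtab.put(tokens.get(pos++), in); // addtype(id.index, L.in)
            while (pos < tokens.size() && tokens.get(pos).equals(",")) {
                pos++;                          // consume ','
                symtab.put(tokens.get(pos++), in);
            }
        }
    }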
– Example: D → T {L.in = T.type} L ;  T → int {T.type = integer}
• The routine for T will return the value T.type
• The routine for L will have a parameter L.in
• The routine for D will call T(), get its value, and pass it into L()

Rule Evaluation in Bottom-Up Parsing
• S-attributed grammars
– All attributes are synthesized
– Rules can be evaluated bottom-up
• Keep the values on the stack
• Whenever a reduction is made, pop the corresponding attributes, compute the new ones, push them onto the stack
• Example: implement a desk calculator using an LR parser

Production      Semantic rule               Stack operation
L → E \n        print(E.val)
E → E1 + T      E.val = E1.val + T.val      val[newtop] = val[top-2] + val[top]
E → T           E.val = T.val
T → T1 * F      T.val = T1.val * F.val      val[newtop] = val[top-2] * val[top]
T → F           T.val = F.val
F → ( E )       F.val = E.val               val[newtop] = val[top-1]
F → digit       F.val = yylval

Attribute Grammars
• Attribute grammars have several problems
– Non-local information needs to be explicitly passed down with copy rules, which makes the process more complex
– In practice there are large numbers of attributes, and often the attributes themselves are large; storage management becomes an important issue
– The compiler must traverse the attribute tree whenever it needs information (e.g. during a later pass)
• However, our discussion of rule evaluation gives us an idea for a simplified (albeit limited) approach:
– Have actions organized around the structure of the grammar
– Constrain attribute flow to one direction
– Allow only one attribute per grammar symbol

In Practice: Yacc/Bison
• In Yacc/Bison, $$ is used for the lhs non-terminal; $1, $2, $3, ... are used for the symbols on the rhs (in left-to-right order)
• Example:
– Expr : Expr TPLUS Expr { $$ = $1 + $3; }
• Example:
– Expr : Expr TPLUS Expr { $$ = new ExprNode($1, $3); }

Compiler Design
15. ANTLR, ANTLRWorks
Lexer and Parser Generator
Kanat Bolazar
March 11, 2010

ANTLR
• ANTLR is a popular lexer and parser generator in Java
• It allows LL(*) grammars and does top-down parsing
• Similarities with LL(1) grammars:
– Does top-down parsing
– The grammar has to be fixed to remove left recursion
– Uses lookahead tokens to decide which path to take
– You can think of it as recursive-descent parsing
• Differences:
– How far we can look ahead is not constrained
– CommonTokenStream defines LA(k) and LT(k): both look ahead to the k-th next token

ANTLRWorks
• ANTLRWorks is the ANTLR IDE (integrated development environment)
• It has many nice features:
– Automatically fills in common token definitions
– Has standard IDE features like syntax highlighting
– Shows the regexp FSM (lexer machine) for tokens
– Has a very nice debugger which can show:
• input and output
• parse tree and AST (abstract syntax tree)
• call (rule) stack and events
• the grammar rule that is being executed

Running ANTLR: Inputs, Steps
• You need three files before you run ANTLR:
– a grammar file, Xyz.g (Microjava.g)
– a Java test runner, Test.java
– a test input file, such as sample.mj
• There are three steps to running ANTLR:
– antlr: generate the lexer and parser classes (XyzLexer.java, XyzParser.java)
– javac: compile these two and Test.java (XyzLexer.class, XyzParser.class, Test.class)
– java: run the test
Step 1. ANTLR
• You may have an antlr executable: antlr Xyz.g
• Make sure you save a "grammar Xyz" in file Xyz.g
• If you only have a JAR file instead, use: java -jar antlr-3.2.jar Xyz.g
• This creates two Java class source code files: XyzLexer.java, XyzParser.java
• By default, these files go in the current directory
• You can instead state where the *.java files should go: antlr -o src Xyz.g

Step 2. Compile with javac
• To the lexer and parser, you need to add your runner: Test.java
• See the ANTLR examples online for runner examples
• Before javac, set the CLASSPATH environment variable to include . (the current directory) and antlrworks-1.3.1.jar
• In Linux/Unix, under bash, you may do: export CLASSPATH=.:antlrworks-1.3.1.jar
• Unlike this example, give the full path to the antlrworks JAR file

Step 3. Run with java
• Again, set the CLASSPATH environment variable as before
• Go under src if needed (if you used the -o option)
• Run your test, giving your input file:
java Test < input.txt
java Test < input.txt > output.txt
java Microjava < sample.mj
• A grammar with no evaluation:
– will be quiet if everything is OK
– will only give syntax errors if the input is not good
• A grammar with output will display the output
• ANTLR doesn't allow running interactively

ANTLRWorks, Other Java IDEs
• Instead of these steps, you can use ANTLRWorks
• To run under ANTLRWorks, just use its debugger
• It has ANTLR inside, and knows how to set the CLASSPATH for compiling and running
• The *.java files produced by ANTLRWorks will be different, as they contain debugger commands
• To run ANTLR under a Java IDE, you may be able to define custom build rules for *.g files
• You should add the antlrworks JAR file to your project, to have the ANTLR runtime libraries
• Make sure the libraries are used during both compilation and running

Next Steps
• We will next see:
– A demonstration of using ANTLR (three steps)
– ANTLRWorks screenshots
• We will also look at some grammar examples:
– Calculator without evaluation
– Calculator with evaluation
– Calculator with AST
– MicroJava lexer
– Starting steps for the MicroJava parser

Compiler Design
16. Type Checking
Kanat Bolazar
March 23, 2010

Type Checking
• The general topic of type checking includes two parts
– Type synthesis: assigning a type to each expression in the language
– Type checking: making sure that these types are used in contexts where they are legal; catching type-related errors
• Strongly typed languages are ones in which every expression can be assigned an unambiguous type
– Weakly typed languages can have run-time errors due to type incompatibility
• Statically typed languages vs. dynamically typed languages
– Capable of being checked at compile time vs. not
• Static type checking vs. dynamic type checking
– Actually checking types at compile time vs. at run time

Dynamic Typing Example: Duck Typing
• Rule: "If it walks like a duck and quacks like a duck, call it a duck"
• No need to declare ahead of time as a subtype of duck
• Just define the operations. Python example:

class Person:
    def quack(self):
        print "The person imitates a duck."

def in_the_forest(duck):
    duck.quack()

def game():
    in_the_forest(Person())

Base Types
• Numbers
– integer
• C specifies lengths in relative terms, short and long; they are OS- and machine-dependent, which makes porting programs to other OSes and machines harder
• Java specifies exact lengths: byte (8), short (16), int (32) and long (64) bits respectively
– floating point numbers
• Many languages have two sizes
• Can use the IEEE representation standards
• Java float and double are 32 and 64 bits
• Characters
– A single letter, digit, symbol, etc.; used to be the 8-bit ASCII standard
– Now can also be 16-bit Unicode
• Booleans

Java Example: Strings are Objects
• Some languages have strings as a base type with catenation operators
• In Java, strings are objects of class String:
– "this".length() returns 4
• In Java, variables of an object type hold a reference only:
– String a = "this";
– String b = a;       // reference to the same string object, ref count = 2
– if (a == b) ...     // reference comparison: returns true
• This last check is not how you want to compare Strings; the references may differ while the values are equal:
– if (a.equals(b)) ... // true if a and b have the same string value
– if (a == b) ...      // true only if a and b point to the same object in the heap

Compound Types: Arrays and Strings
• Arrays: an aggregate of values of the same type
– Arrays have a base type for elements and may have an indexing range for each dimension: int a[100][25] in C
– If the indexing range is known, the compiler can compute space allocation; otherwise indexing is relative and space is allocated by a run-time allocator
– The main operation is indexing; some languages allow whole-array operations
• Strings: a sequence of characters
– Can also have bit strings
– C treats strings as arrays of characters
– Strings can have comparison operators that use lexicographic order

More Compound Types
• Records or structures: components may have different types and may be indexed by names
struct { double r; int i; }
– Representation as an ordered product of elements
• Variant records or unions: a component may be one of a choice of types
union { double r; int i; }
– The representation can save space for the largest element and may have a tag to indicate which type the value is
– Take care not to have run-time (value) errors

Other Types
• Enumerated types: the programmer can create a type name for a specific set of constant values
enum WeekDay {Sunday, Monday, ..., Saturday}
– Represented as for a small set
• Pointers: an abstraction of addresses
– Can create a reference to, or dereference, an object
– Distinguish between "pointer to integer" and "pointer to boolean", etc.
– C allows arithmetic on pointers
• Void
• Classes: may or may not create new types
– Classes can be represented by an extended type of record

Function Types, New Type Names
• Procedure and function types are sometimes called signatures
– Give the number and types of parameters
• May include parameter-passing information, such as by value or by reference
– Give the type of the result (or indicate no result)
strlength : String -> unsigned int
• Type declarations or type definitions allow the programmer to assign new type names to a type expression
– These names should also be stored in the symbol table
– They will have scope and other attributes
– Definitions may be recursive in some languages; if so, the size of the object will be unknown

Representing Types
• Types can be represented as expressions, or quite often as trees; for example, an array type node (such as array(9)) with its base type as a child, or a function node with children for the argument-type list (arg1 type ... argn type) and the result type. A sketch of such type trees follows.
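A minimal sketch of type trees, with a structural-equivalence check that anticipates the next section. All names here (Type, BasicType, ArrayType, FunctionType) are illustrative assumptions, not the lecture's code.

    // Sketch: types as trees, with a structural-equivalence check.
    import java.util.List;

    interface Type {}
    record BasicType(String name) implements Type {}           // int, char, ...
    record ArrayType(int size, Type base) implements Type {}   // array(size) of base
    record FunctionType(List<Type> args, Type result) implements Type {}

    class Types {
        // structural equivalence: same shape, equivalent components
        static boolean typeEqual(Type a, Type b) {
            if (a instanceof BasicType x && b instanceof BasicType y)
                return x.name().equals(y.name());
            if (a instanceof ArrayType x && b instanceof ArrayType y)
                return x.size() == y.size() && typeEqual(x.base(), y.base());
            if (a instanceof FunctionType x && b instanceof FunctionType y) {
                if (x.args().size() != y.args().size()) return false;
                for (int i = 0; i < x.args().size(); i++)
                    if (!typeEqual(x.args().get(i), y.args().get(i))) return false;
                return typeEqual(x.result(), y.result());
            }
            return false;
        }
    }

With this representation, int x[10][10] and int y[10][10] both become ArrayType(10, ArrayType(10, BasicType("int"))) and compare equal, matching the example in the next section.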
Type Equivalence
• Name equivalence: two types must have the same name
– Assumes that the programmer will introduce new names exactly when they want types to be different
– If you say t1 = int and t2 = int, then t1 and t2 are different types
• Structural equivalence: two objects are interchangeable if their types have the same structure, with equivalent components
– int x[10][10] and int y[10][10]: x and y have equivalent types, the 10 by 10 arrays
– More complex situations arise in structural equivalence of other compound types; e.g., there may be mutually recursive type definitions
• Type checking rules extend this to a notion of type compatibility

Type Synthesis and Type Checking
• Assigning types to language expressions (type synthesis) can be done by a traversal of the abstract syntax tree
• At each node, a type checking rule says which types are allowed
• The description here is for languages with type declarations
– Constants are assigned a type
• If there is no single known type, there will be a set of possibles
– Variables are looked up in the symbol table
• Note that we are assuming an L-attributed grammar, so that declarations are processed first
– Assignment
• The type of the assignable entity on the left must be typeEqual to the type of the expression on the right

Types for Expressions
• Arithmetic and other operators have result types defined in terms of the types of the subnodes in the tree
• Statements have substructures that need to be checked for type correctness
– The condition of if and while statements must have type boolean
• Array reference
– Suppose we have exp1 → exp2[exp3]; then an ad hoc SDT rule is:
if isArrayType(exp2.type) and typeEqual(exp3.type, integer)
then exp1.type = exp2.type.basetype   // get the basetype child
else type-error(exp1)
• Function calls have similar rules, checking the parameters against the signature of the function name

Issues for typeEqual
• This is sometimes called type compatibility
• Overloading
– Arithmetic operators: 2 + 3 means integer addition; 2.0 + 3.0 means floating point addition
– The language may have an arithmetic operator table telling which type of operation will be used, based on the types of the operands a and b:

a + b     int      float    double
int       int      float    double
float     float    float    double
double    double   double   double

Overloading Functions
• Can declare the same function (or method) name with different numbers and types of parameters
int max(int x, int y)
double max(double x, double y)
– Java and C++ allow such overloaded declarations
• Need to augment the symbol table functionality to allow a name to have multiple signatures
– The lookup procedure is always given a name to look up; we can add a typelist argument, and the lookup can tell us if there is a function declared with that signature
– Or the lookup procedure is given only the name and, in the case of a method, returns the set of allowable signatures

Conversion and Coercion
• The typeEqual comparison is commonly extended to allow arithmetic expressions of mixed type, and other cases of types that are compatible but not equal
– If mixed types are allowed in an arithmetic expression, then a conversion should be inserted (into the AST or the code):
2 * 3.14 becomes code like
t1 = float(2)
t2 = t1 * 3.14
• Conversion from one type to another is said to be implicit if it is done automatically by the compiler; it is also called coercion
• Conversion is said to be explicit if the programmer must write the conversion
– Called casts in C and Java; a sketch of inserting such conversions follows
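A minimal sketch of how a checker could insert such conversions using a widening order. The numeric ranks, names, and TAC-like output format are illustrative assumptions; only the int → float → double chain is modeled.

    // Sketch: coercion insertion for e1 = e2 op e3 using a widening order.
    import java.util.Map;

    class Coerce {
        // widening rank: a type widens to any type of higher rank
        static final Map<String, Integer> RANK = Map.of(
            "int", 0, "float", 1, "double", 2);

        static String wider(String t1, String t2) {
            return RANK.get(t1) >= RANK.get(t2) ? t1 : t2;
        }

        // returns TAC-like code for e2 op e3 with coercions inserted
        static String emit(String e2, String t2, String e3, String t3, String op) {
            String t = wider(t2, t3);                 // the result type, e1.type
            StringBuilder code = new StringBuilder();
            if (!t2.equals(t)) { code.append("t1 = " + t + "(" + e2 + ")\n"); e2 = "t1"; }
            if (!t3.equals(t)) { code.append("t2 = " + t + "(" + e3 + ")\n"); e3 = "t2"; }
            code.append("t3 = " + e2 + " " + op + " " + e3);
            return code.toString();
        }

        public static void main(String[] args) {
            // 2 * 3.14: the int operand is widened, as on the slide
            System.out.println(emit("2", "int", "3.14", "double", "*"));
            // t1 = double(2)
            // t3 = t1 * 3.14
        }
    }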
Widening and Narrowing
• The rules for Java type conversion distinguish between
– Widening conversions, which preserve information
– Narrowing conversions, which can lose information (usually require a cast)
• There is also widening and narrowing for references, i.e. objects
• Widening between primitive types in Java follows the chain byte → short → int → long → float → double, with char → int; narrowing runs in the reverse direction

Generating Type Conversions
• For an arithmetic expression e1 = e2 op e3, the algorithm can generally use widening, possibly to a third type that is wider than both the types of e2 and e3 in the widening tree:
– Let the new type of e1 be the max of e2.type and e3.type
– Generate a widening conversion of e2 if necessary
– Generate a widening conversion of e3 if necessary
– Set the type (and later the code) of e1
• Type conversions and coercions also apply to assignment: if r is a double and i is an int, allow r = i
– C also allows i = r, with the corresponding loss of information
– For classes, we have the subtype principle: objects of subclasses may be used where the superclass is expected

Continuing Type Synthesis and Checking Rules
• Expressions: the two expressions involved with boolean operators, such as &&, must both be boolean
• Functions: the type of each actual parameter must be typeEqual to its formal parameter
• Classes:
– If specified, the parent of the class must be a properly declared class
– If a class says that it implements an interface, then all methods of the interface must be implemented

Polymorphic Typing
• A language is polymorphic if language constructs can have more than one type
procedure swap(anytype x, y)
where anytype is considered to be a type variable
• Polymorphic functions have type patterns or type schemes, instead of actual type expressions
• The type checker must check that the types of the actual parameters fit the pattern
– Technically, the type checker must find a substitution of actual types for type variables that satisfies the type equivalence between the formal type pattern and the actual type expression
– In complex cases with recursion, it may need to do unification to solve the substitution problem
• Most notably in the language ML

References
• Original slides by Nancy McCracken.
• Keith Cooper and Linda Torczon, Engineering a Compiler, Elsevier, 2004.
• Kenneth C. Louden, Compiler Construction: Principles and Practices, PWS Publishing, 1997.
• Aho, Lam, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 2006. (The purple dragon book)
• Charles Fischer and Richard LeBlanc, Jr., Crafting a Compiler with C, Benjamin Cummings, 1991.
Compiler Design
18. Object-Oriented Semantic Analysis (Symbol Tables, Type Checking)
Kanat Bolazar
March 30, 2010

Object-Oriented Symbol Tables and Type Checking
• In object-oriented languages, scope changes can occur at a finer-grained level than just block or procedure definition
– Each class has its own scope
– Variables may be declared at any time
• Notation: represent symbol tables as a combination of environments
– An environment maps an identifier to its symbol table entry, which we abbreviate to just its type here for brevity:
e1 = { g → string, a → int }
– We indicate adding to the symbol table with a +, where the addition carries the meaning of scope (the right side wins):
e2 = e1 + { a → float }

Example Scope in Java
(environment e0 already given for predefined identifiers)
1  class C {
2    int a, b, c;                 e1 = e0 + { a → int, b → int, c → int }
3    public void m ( ) {
4      System.out.println(a + c);
5      int j = a + b;             e2 = e1 + { j → int }
6      String a = "hello";
7      System.out.println(a);     e3 = e2 + { a → string }
8      System.out.println(j);
9      System.out.println(b);     back to e1
10   }
11 }                              back to e0
• Note: all examples in these slides are from Andrew Appel, "Modern Compiler Implementation in Java" (available online through the SU library)

Each Class Has an Environment
• There may be several active environments at once (multiple symbol tables)
• Class names are mapped to environments (each one added to environment e0):
package M;
class E { static int a = 5; }            e1 = { a → int };           e2 = { E → e1 }
class N { static int b = 10;
          static int a = E.a + b; }      e3 = { b → int, a → int };  e4 = { N → e3 }
class D { static int d = E.a + N.a; }    e5 = { d → int };           e6 = { D → e5 }
e7 = e2 + e4 + e6
• Classes E, N and D are all compiled in environment e7:  M → e7

Symbol Table
• Each variable and formal parameter name has a type
• Each method name has its signature
• Each class name has its variable and method declarations
• Example:
class B {
    C f; int[] j; int q;
    public int start (int p, int q) {
        int ret, a;
        /* . . . */
        return ret;
    }
    public boolean stop (int p) {
        /* . . . */
        return false;
    }
}
• Symbol table for B:
– Fields: f → C, j → int[], q → int
– Methods: start → int, with params p → int, q → int and locals ret → int, a → int;
           stop → bool, with param p → int and no locals

Typechecking Rules
• Additional rules include
– The new keyword: C e = new C ( )
• Gives type C to e (as usual)
– Method calls of the form e.m ( <paramlist> )
• Suppose e has type C
• Look up the definition of m in class C
• Appel recommends the two-pass semantic analysis strategy
– The first pass adds identifiers to the symbol table
– The second pass looks up identifiers and does the typechecking

Example of Inheritance
• Note the variables in scope of the await definition: passengers, pos, v, this
• In c.await(t), in the body of await, v.move will be the move method from Truck

class Vehicle {
    int pos;  // position of Vehicle
    void move (int x) { pos += x; }
}
class Car extends Vehicle {
    int passengers;
    void await(Vehicle v) {
        // if ahead, ask the other to catch up
        if (v.pos < pos) v.move(pos - v.pos);
        // if behind, catch up with +10 moves
        else this.move(10);
    }
}
class Truck extends Vehicle {
    void move(int x) {  // max move: +55
        if (x <= 55) pos += x;
    }
}
class Main {
    public static void main(...) {
        Truck t = new Truck();
        Car c = new Car();
        Vehicle v = c;
        c.passengers = 2;
        c.move(60);
        v.move(70);
        c.await(t);
    }
}

Single Inheritance of Data Fields (Locations)
• To generate code for v.pos, the compiler must generate code to fetch the field pos from the object that v points to
– v may actually be a Car or a Truck
• Prefixing fields:
– When B extends A, the fields of B that are inherited from A are laid out in a record at the beginning, in the order they appear; all children of A have field a as field[0]
– Fields not inherited are laid out in order afterwards
class A { int a = 0; }
class B extends A { int b = 0; int c = 0; }
class C extends A { int d = 0; }
class D extends B { int e = 0; }
Layouts:  A: [a]   B: [a, b, c]   C: [a, d]   D: [a, b, c, e]
(a layout sketch follows)
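A minimal sketch of how a compiler could assign field offsets with this prefixing scheme; ClassInfo and the offset map are illustrative assumptions.

    // Sketch: single-inheritance field layout by prefixing the parent's fields.
    import java.util.*;

    class ClassInfo {
        final ClassInfo parent;                    // null for a root class
        final Map<String, Integer> offsets = new LinkedHashMap<>();

        ClassInfo(ClassInfo parent, String... ownFields) {
            this.parent = parent;
            int next = 0;
            if (parent != null) {                  // inherited fields come first,
                offsets.putAll(parent.offsets);    // at the same offsets
                next = parent.offsets.size();
            }
            for (String f : ownFields) offsets.put(f, next++);
        }
    }

    class LayoutDemo {
        public static void main(String[] args) {
            ClassInfo a = new ClassInfo(null, "a");
            ClassInfo b = new ClassInfo(a, "b", "c");
            ClassInfo d = new ClassInfo(b, "e");
            System.out.println(d.offsets);  // {a=0, b=1, c=2, e=3}
        }
    }

Every descendant of A keeps a at offset 0, so code that fetches field[0] from an A-typed reference works no matter which subclass the object actually is.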
Single Inheritance for Methods (Locations)
• A method instance is compiled into code that resides at a particular address. In the semantic analysis phase:
– Each variable's symbol table entry has a pointer to its class descriptor
– Each class descriptor contains a pointer to its parent class and a list of method instances
– Each method instance has a location
• Static methods: a method call of the form c.f ( )
– The code for a method declared static depends on the type of the variable c, not the type of the object that c holds
– Get the class of c, call it C
• In Java syntax, the method call is C.f ( ), making this clear
– Search class C for method f
– If not found, search the parent for f, then its parent, and so on

Single Inheritance for Dynamic Methods
• Dynamic method lookup needs class descriptors, since a method may be overridden in a subclass
• To execute c.f(), the compiled code must:
– Fetch the class descriptor d from object c at offset 0
– Fetch the method-instance pointer p from the f offset of d
– Call p
• Example (A_f denotes the instance of method f declared in class A):
class A { int x = 0; int f () { ... } }
class B extends A { int g () { ... } }
class C extends B { int g () { ... } }
class D extends C { int y = 0; int f () { ... } }
Descriptors:  A: [f = A_f]   B: [f = A_f, g = B_g]   C: [f = A_f, g = C_g]   D: [f = D_f, g = C_g]
Instances of A, B, C and D all have field x first; D adds y.

Multiple Inheritance
• In languages that permit multiple inheritance, a class can extend several different parent classes
– We cannot put all fields of all parents at fixed prefix positions in every class
• Analyze all classes to assign one offset location for every field name, usable in every record with that field
– Use a graph coloring algorithm; this still yields large offset numbers with sparse use of offsets
class A { int a = 0; }
class B { int b = 0; int c = 0; }
class C { int d = 0; }
class D extends A, B, C { int e = 0; }
With coloring a → 0, b → 1, c → 2, d → 3, e → 4: A uses only slot 0, B slots 1 and 2, C slot 3, and D uses all five.

Multiple Inheritance Solutions
• After graph coloring, assign offset locations and use a sparse representation that records which fields are in each record
– Leads to another level of indirection in accessing fields and methods:
• Fetch the class descriptor d from object c
• Fetch the field-offset value from the descriptor
• Fetch the method or data from the appropriate offset
• The coloring of fields is done at link time; there can still be problems with dynamic linking, where a new class can be loaded at run time
– Solved with hash tables of field names and access algorithms, with additional overhead

Type Coercions
• Given a variable c of type C, it is always legal to treat c as if it were any supertype of C
– If C extends B, and b has type B, then the assignment "b = c;" is safe
• The reverse is not true: the assignment "c = b;" is safe only if b is really (at run time) an instance of C
– Safe object-oriented languages (Modula-3 and Java) add, to any coercion from a superclass to a subclass, a run-time typecheck that raises an exception unless b really is an instance of C
– C++ is unsafe in this respect: it allows a static cast mechanism; the dynamic cast mechanism does add the run-time check
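Tying the dynamic-method scheme above to code: a minimal sketch of descriptor-based dispatch, with all names (Descriptor, MethodImpl, Obj) invented for illustration. Inherited methods are copied into the child descriptor, so overriding is just replacing an entry.

    // Sketch: dynamic dispatch through a class descriptor ("vtable").
    import java.util.*;

    interface MethodImpl { void run(Obj self); }    // a compiled method instance

    class Descriptor {
        final Descriptor parent;
        final Map<String, MethodImpl> methods = new HashMap<>();
        Descriptor(Descriptor parent) {
            this.parent = parent;
            if (parent != null) methods.putAll(parent.methods); // inherit first
        }
        void define(String name, MethodImpl m) { methods.put(name, m); } // override
    }

    class Obj {
        final Descriptor descriptor;                // "at offset 0" of the object
        Obj(Descriptor d) { descriptor = d; }
        void call(String name) {                    // c.f(): fetch d, fetch p, call p
            descriptor.methods.get(name).run(this);
        }
    }

    class DispatchDemo {
        public static void main(String[] args) {
            Descriptor a = new Descriptor(null);
            a.define("f", self -> System.out.println("A_f"));
            Descriptor d = new Descriptor(a);
            d.define("f", self -> System.out.println("D_f"));  // D overrides f
            new Obj(a).call("f");  // prints A_f
            new Obj(d).call("f");  // prints D_f
        }
    }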
Private Fields and Methods
• In the symbol table for every class C, for all the fields and methods, keep a flag indicating whether that member is private
– When typechecking c.v or c.f ( ), the compiler will check the symbol table flag and must use the context (i.e. whether the compiler is inside the declaration of the object's class) to decide whether to allow the access
– This is another example of inherited attributes in the typechecking system
• Additional flags can be kept for access at the subclass or package level
– And additional context must be kept by the typechecking algorithm

Main Reference
• Slides were prepared by Nancy McCracken, using examples from:
• Andrew Appel, Modern Compiler Implementation in Java, second edition, Cambridge University Press, 2002.
– Available at the SU library as an online resource, viewable when you log in to the SU library
– You can read the whole book chapter by chapter, but not download it as a PDF

Compiler Design
20. ANTLR AST (Abstract Syntax Tree)
Kanat Bolazar
April 6, 2010

ANTLR AST (Abstract Syntax Tree) Generation
• ANTLR allows creation and manipulation of ASTs
• You start with these options in your grammar:
options {
    output = AST;
    ASTLabelType = CommonTree;
}
• If you skip the second line, you may have problems later
• Instead of CommonTree, where tokens are the nodes, you can create and use your own tree structure
– If you do, you also have to tell ANTLR how to convert a token to a node in your tree structure

Imaginary Tokens
• CommonTree only allows Tokens as tree nodes
• You may want to use nodes that never appear in your input stream as tokens
• Declare "imaginary tokens" like this at the top of your grammar:
tokens {
    // Not needed here (because tokens exist):
    // CLASS    -- use 'class' instead, one of our keywords (a token)
    // PROGRAM  -- use 'program' instead
    // CONSTANT -- use 'final' instead
    VAR;    // variable declaration (including arguments, fields, globals)
    TYP;    // simple type such as int, char, void, or a class name
    ARRAY;  // array type
}

Default AST Output: Flat List
• By default, AST output will be a flat list of all tokens
• The parse tree is ignored; non-terminals of the grammar can't appear in the AST:
program : 'program' ID decl* '{' methodDecl* '}' ;
• Tokens 'program', ID, '{', '}' can be used as AST nodes
• Non-terminals decl, methodDecl expand to their tokens:
(nil 'program' 'P' 'int' 'a' ';' '{' 'void' 'main' '(' ')' '{' '}' '}')

Rewrite Rules
• Rewrite rules allow you to define your tree nodes per grammar rule, and for each alternative
• For any occurrence of a non-terminal (such as decl below), an implicit list of all decl nodes is created and can be used
• Whatever is ignored in the rewrite rule is removed (ID, '{' and '}' below):
program : 'program' ID decl* '{' methodDecl* '}'
          -> ^('program' decl+ methodDecl+) ;
• We don't have a flat list anymore:
('program' ('int' 'a') ('int' 'b') ('void' 'main'))

Inlined Rules
• Inlined rules are very useful in expressions
– ^ makes this token the root
– ! ignores this token
• This works very well in nested expressions; we'll see examples with the calculator AST grammar:
expr : multExpr '+'^ multExpr ;
// the looping version keeps creating new parent nodes:
expr : multExpr (('+'^|'-'^) multExpr)* ;
• The second, looping version creates, for a + b - 1:
('-' ('+' 'a' 'b') '1')
• Equivalent rewrite rule: expr : (multExpr -> multExpr) ...

Multiple Subtrees vs. Lists
• Note the difference in using implicit lists here (see the varDecl and formPars rules below):
– ^(VAR type ID)+ generates many VAR nodes, one for each ID:
(VAR int x) (VAR int y) (VAR int z)
– whereas ^(VAR ID+) would generate one VAR node with all IDs as children:
(VAR int x y z)
– Any ID seen in the rule is added to an implicit list of IDs, as used here
Multiple Subtrees vs. Lists
Note the difference in using implicit lists here:
// ^(VAR type ID)+
// generates many VAR nodes, one for each ID:
// (VAR int x) (VAR int y) (VAR int z)
// whereas:
// ^(VAR ID+)
// would generate one VAR node, with all IDs as children:
// (VAR int x y z)
// Any ID seen in the rule is added to an implicit
// list of IDs, as used here.
varDecl : type ID ( ',' ID )* ';' -> ^(VAR type ID)+ ;
• Similar but different (pairwise matched):
formPars : type ID ( ',' type ID )* -> ^(VARLIST ^(VAR type ID)+) ;
Compiler Design
21. Intermediate Code Generation
Kanat Bolazar
April 8, 2010
Intermediate Code Generation
• Forms of intermediate code vary from high level ...
– Annotated abstract syntax trees
– Directed acyclic graphs (common subexpressions are coalesced)
• ... to the low level Three Address Code
– Each instruction has, at most, one binary operation
– More abstract than machine instructions
• No explicit memory allocation
• No specific hardware architecture assumptions
– Lower level than syntax trees
• Control structures are spelled out in terms of instruction jumps
– Suitable for many types of code optimization
Three Address Code
• Consists of a sequence of instructions; each instruction may have up to three addresses, prototypically
t1 = t2 op t3
• Addresses may be one of:
– A name. Each name is a symbol table index. For convenience, we write the names as the identifier.
– A constant.
– A compiler-generated temporary. Each time a temporary address is needed, the compiler generates another name from the stream t1, t2, t3, etc.
• Temporary names allow code optimization to easily move instructions
• At target-code generation time, these names will be allocated to registers or memory locations
Three Address Code Instructions
• Symbolic labels will be used as instruction addresses for instructions that alter the flow of control. The instruction addresses of labels will be filled in later.
L: t1 = t2 op t3
• Assignment instructions: x = y op z
– Includes binary arithmetic and logical operations
• Unary assignments: x = op y
– Includes unary arithmetic op (-) and logical op (!) and type conversion
• Copy instructions: x = y
– These may be optimized later.
Three Address Code Instructions
• Unconditional jump: goto L
– L is a symbolic label of an instruction
• Conditional jumps: if x goto L and ifFalse x goto L
– Left: If x is true, execute instruction L next
– Right: If x is false, execute instruction L next
• Conditional jumps: if x relop y goto L
• Procedure calls. For a procedure call p(x1, …, xn):
param x1
…
param xn
Three Address Code Instructions
• Indexed copy instructions: x = y[i] and x[i] = y
– Left: sets x to the value in the location [i memory units beyond y] (as in C)
– Right: sets the contents of the location [i memory units beyond x] to y
• Address and pointer instructions:
– x = &y sets the value of x to be the location (address) of y.
– x = *y: presumably y is a pointer or temporary whose value is a location. The value of x is set to the contents of that location.
– *x = y sets the value of the object pointed to by x to the value of y.
• In Java, all object variables store references (pointers), and Strings and arrays are implicit objects:
– Object o = "some string object" sets the reference o to hold the address of the string object
Three Address Code Representation
• Representations include quadruples (used here), triples and indirect triples.
• In the quadruple representation, there are four fields for each instruction: op, arg1, arg2 and result.
– Binary ops have the obvious representation
– Unary ops don't use arg2
– Operators like param don't use either arg2 or result
– Jumps put the target label into result
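A minimal sketch of the quadruple representation in Java, with a temporary-name generator; class and method names here are illustrative, not from the lecture:

    import java.util.ArrayList;
    import java.util.List;

    class Quad {
        final String op, arg1, arg2, result;    // the four quadruple fields
        Quad(String op, String arg1, String arg2, String result) {
            this.op = op; this.arg1 = arg1; this.arg2 = arg2; this.result = result;
        }
        public String toString() {
            if (op.equals("="))  return result + " = " + arg1;            // copy
            if (arg2 == null)    return result + " = " + op + " " + arg1; // unary
            return result + " = " + arg1 + " " + op + " " + arg2;         // binary
        }
    }

    public class TacDemo {
        static int tempCount = 0;
        static String newTemp() { return "_t" + (++tempCount); }

        public static void main(String[] args) {
            // a = b * c + 1, broken into one binary operation per instruction:
            List<Quad> code = new ArrayList<>();
            String t1 = newTemp();
            code.add(new Quad("*", "b", "c", t1));
            String t2 = newTemp();
            code.add(new Quad("+", t1, "1", t2));
            code.add(new Quad("=", t2, null, "a"));
            code.forEach(System.out::println);
            // _t1 = b * c
            // _t2 = _t1 + 1
            // a = _t2
        }
    }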
Syntax-Directed Translation of Intermediate Code
• Incremental Translation
– Instead of using an attribute to keep the generated code, we assume that we can generate instructions into a stream of instructions
• gen(<three address instruction>) generates an instruction
• new Temp() generates a new temporary
• lookup(top, id) returns the symbol table entry for id at the topmost (innermost) lexical level
• newlabel() generates a new abstract label name
Translation of Expressions
• Uses the attribute addr to keep the address computed for that nonterminal symbol.
S → id = E ;   gen(lookup(top, id.text) = E.addr)
E → E1 + E2    E.addr = new Temp(); gen(E.addr = E1.addr + E2.addr)
  | - E1       E.addr = new Temp(); gen(E.addr = minus E1.addr)
  | ( E1 )     E.addr = E1.addr
  | id         E.addr = lookup(top, id.text)
Boolean Expressions
• Boolean expressions have different translations depending on their context
– Compute logical values – code can be generated in analogy to arithmetic expressions for the logical operators
– Alter the flow of control – boolean expressions can be used as conditional expressions in statements: if, for and while.
• Control-flow boolean expressions have two inherited attributes:
– B.true, the label to which control flows if B is true
– B.false, the label to which control flows if B is false
– B.false = S.next means: if B is false, goto whatever address comes after instruction S is completed. This would be used for the S → if (B) S1 expansion.
Short-Circuit Boolean Expressions
• Some language semantics decree that boolean expressions have so-called short-circuit semantics.
– In this case, computing boolean operations may also have flow-of-control semantics.
Example:
if ( x < 100 || x > 200 && x != y ) x = 0;
Translation:
    if x < 100 goto L2
    ifFalse x > 200 goto L1
    ifFalse x != y goto L1
L2: x = 0
L1: …
Flow-of-Control Statements
S → if ( B ) S1
  | if ( B ) S1 else S2
  | while ( B ) S1
[The original slide diagrams the generated code layouts: for if, B.code (jumping to B.true or B.false) is followed by label B.true and S1.code, with B.false = S.next; for if-else, S1.code is followed by goto S.next and label B.false before S2.code; for while, label begin precedes B.code, and S1.code ends with goto begin, with B.false = S.next.]
Flow-of-Control Translations
(|| is the code concatenation operator)
P → S:                   S.next = newlabel(); P.code = S.code || label(S.next)
S → assign:              S.code = assign.code
S → if ( B ) S1:         B.true = newlabel(); B.false = S1.next = S.next;
                         S.code = B.code || label(B.true) || S1.code
S → if ( B ) S1 else S2: B.true = newlabel(); B.false = newlabel(); S1.next = S2.next = S.next;
                         S.code = B.code || label(B.true) || S1.code || gen(goto S.next) || label(B.false) || S2.code
S → while ( B ) S1:      begin = newlabel(); B.true = newlabel(); B.false = S.next; S1.next = begin;
                         S.code = label(begin) || B.code || label(B.true) || S1.code || gen(goto begin)
S → S1 S2:               S1.next = newlabel(); S2.next = S.next;
                         S.code = S1.code || label(S1.next) || S2.code
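The while rule above can be exercised directly. This is a hand-rolled sketch, not generated by the attribute grammar: newlabel/gen mirror the helper names in the slides, the condition and body are hard-coded, and the code stream is just a list of strings:

    import java.util.ArrayList;
    import java.util.List;

    public class WhileTranslation {
        static int labelCount = 0;
        static final List<String> code = new ArrayList<>();
        static String newlabel() { return "L" + (++labelCount); }
        static void gen(String instr) { code.add(instr); }
        static void label(String l) { code.add(l + ":"); }

        public static void main(String[] args) {
            // while (x < 10) x = x + 1;
            String begin = newlabel(), bTrue = newlabel(), sNext = newlabel();
            label(begin);
            gen("if x < 10 goto " + bTrue);   // B.code jumps to B.true ...
            gen("goto " + sNext);             // ... or to B.false = S.next
            label(bTrue);
            gen("x = x + 1");                 // S1.code
            gen("goto " + begin);             // back edge of the loop
            label(sNext);
            code.forEach(System.out::println);
        }
    }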
Control-Flow Boolean Expressions
B → B1 || B2:  B1.true = B.true; B1.false = newlabel(); B2.true = B.true; B2.false = B.false;
               B.code = B1.code || label(B1.false) || B2.code
B → B1 && B2:  B1.true = newlabel(); B1.false = B.false; B2.true = B.true; B2.false = B.false;
               B.code = B1.code || label(B1.true) || B2.code
B → ! B1:      B1.true = B.false; B1.false = B.true; B.code = B1.code
B → E1 rel E2: B.code = E1.code || E2.code || gen(if E1.addr relop E2.addr goto B.true) || gen(goto B.false)
B → true:      B.code = gen(goto B.true)
B → false:     B.code = gen(goto B.false)
Avoiding Redundant Gotos, Backpatching
• Use ifFalse instructions where necessary
• Also use the attribute value "fall" to mean fall through where possible, instead of generating a goto to the next expression
• The abstract labels require a two-pass scheme to later fill in the addresses
• This can be avoided by instead passing a list of addresses that need to be filled in, and filling them as it becomes possible. This is called backpatching.
Java Bytecode, Virtual Machine Instructions
• Java bytecode is an intermediate representation.
• It uses a stack machine, which is generally at a lower level than a three-address code.
• But it also has some conceptually high-level instructions that need table lookups for method names, etc.
• The lookups are needed due to dynamic class loading in Java:
– If class A uses class B, the reference can only compile if you have access to B.class (or if your IDE can compile B.java to its B.class).
– At runtime, A.class and B.class hold bytecode for classes A and B.
– Loading A does not automatically load B. B is loaded only if it is needed.
– Before B is loaded, its method signatures (interfaces) are known, but the class itself is not yet in memory.
Displaying Bytecode
• From the command line, you can use this command to see the bytecode:
javap -private -c MyClass
• You need to have access to the MyClass.class file
• There are many options to see more information about local variables, where they are accessed in bytecode, etc.
• Important: the stack machine's operand stack is empty after each full (source-level) statement.
• Example: d = a + b * c (a runnable sketch of this evaluation appears at the end of this section)
instruction   stack     description
iload_1       a         get local var #1, a, and push it onto the stack
iload_2       a, b      push b onto the stack
iload_3       a, b, c   push c onto the stack (now, c is on top of stack)
Method Call in Java Bytecode
• Method calls need symbol lookup
• Example: System.out.println(d);
18: getstatic #2; //Field java/lang/System.out:Ljava/io/PrintStream;
21: iload 4
23: invokevirtual #3; //Method java/io/PrintStream.println:(I)V
• Java internal signature Lmypkg.MyClass: object of MyClass, defined in package mypkg
• Java internal signature (I)V: takes integer, returns void
• We will be focusing on MicroJava virtual machine instructions
– Few instructions compared to full Java VM instructions
– Simpler language features, less complicated
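A toy operand-stack machine for the bytecode fragment above, evaluating d = a + b * c. Locals are an int array, as in the JVM; the opcode names in the comments follow the javap output, but the machine itself is only a sketch:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class StackMachineDemo {
        public static void main(String[] args) {
            int[] locals = {0, 2, 3, 4, 0};      // slot 1 = a, 2 = b, 3 = c, 4 = d
            Deque<Integer> stack = new ArrayDeque<>();

            stack.push(locals[1]);               // iload_1  -> stack: a
            stack.push(locals[2]);               // iload_2  -> stack: a, b
            stack.push(locals[3]);               // iload_3  -> stack: a, b, c
            int r = stack.pop() * stack.pop();   // imul     -> stack: a, b*c
            stack.push(r);
            r = stack.pop() + stack.pop();       // iadd     -> stack: a+b*c
            stack.push(r);
            locals[4] = stack.pop();             // istore 4 -> stack empty again
            System.out.println("d = " + locals[4]);   // d = 14
        }
    }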
References
• Aho, Lam, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools. Addison-Wesley, 2006. (The purple dragon book)
Compiler Design
22. ANTLR AST Traversal (AST as Input, AST Grammars)
Kanat Bolazar
April 13, 2010
Chars → Tokens → AST → ....
Lexer → Parser → Tree Parser
ANTLR Syntax
grammar file, name.g:
/** doc comment */
kind grammar name;
options {…} tokens {…} scopes… @header {…} @members {…} rules…
one rule:
/** doc comment */
rule[String s, int z] returns [int x, int y] throws E
options {…} scopes @init {…} @after {…}
: … | … ;
catch [Exception e] {…} finally {…}
Trees: ^(root child1 … childN)
What is LL(*)?
• Natural extension to LL(k): a lookahead DFA. Allow a cyclic DFA that can skip ahead past common prefixes to see what follows.
• Analogy: like trying to decide which line to get in at the movies: the line is long and you can't see the sign from the back, so you run ahead to see the sign.
ticket_line : PEOPLE+ STAR WARS 9
            | PEOPLE+ AVATAR 2
            ;
• Predict, then proceed normally with the LL parse
• No need to specify k a priori
• Weakness: can't deal with recursive left-prefixes
LL(*) Example
s : ID+ ':' 'x'
  | ID+ '.' 'y'
  ;
void s() {
  int alt = 0;
  while (LA(1)==ID) consume();
  if ( LA(1)==':' ) alt = 1;
  if ( LA(1)=='.' ) alt = 2;
  switch (alt) {
    case 1 : … case 2 : … default : error;
  }
}
Note: 'x', 'y' are not in the prediction DFA
Tree Rewrite Rules
• Maps an input grammar fragment to an output tree grammar fragment:
grammar T;
options {output=AST;}
stat : 'return' expr ';' -> ^('return' expr) ;
decl : 'int' ID (',' ID)* -> ^('int' ID+) ;
decl : 'int' ID (',' ID)* -> ^('int' ID)+ ;
Template Rewrite Rules
• Reference a template name with attribute assignments as args:
grammar T;
options {output=template;}
s : ID '=' INT ';' -> assign(x={$ID.text}, y={$INT.text}) ;
assign is defined like this:
group T;
assign(x,y) ::= "<x> := <y>;"
ANTLR AST (Abstract Syntax Tree) Processing
• ANTLR allows creation and manipulation of ASTs
1. Generate an AST (file.mj → AST in memory):
grammar MyLanguage;
options { output = AST; ASTLabelType = CommonTree; }
2. Traverse, process AST → AST:
tree grammar TypeChecker;
options { tokenVocab = MyLanguage; output = AST; ASTLabelType = CommonTree; }
3. AST → action (Java):
tree grammar Interpreter;
options { tokenVocab = MyLanguage; }
AST Processing: Calculator 2, 3
• ANTLR expression evaluator (calculator) examples:
http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator
• We are interested in the examples that build an AST, and evaluate (interpret) the language AST. These are in calculator.zip, as examples 2 and 3.
Expr → AST → Eval
grammar Expr;
options { output=AST; ASTLabelType=CommonTree; }
prog: ( stat {System.out.println($stat.tree.toStringTree());} )+ ;
stat: expr NEWLINE -> expr
    | ID '=' expr NEWLINE -> ^('=' ID expr)
    | NEWLINE ->
    ;
expr: multExpr (('+'^|'-'^) multExpr)* ;
multExpr : atom ('*'^ atom)* ;
atom: INT | ID | '('! expr ')'! ;
tree grammar Eval;
options { tokenVocab=Expr; ASTLabelType=CommonTree; }
@header { import java.util.HashMap; }
@members { HashMap memory = new HashMap(); }
prog: stat+ ;
stat: expr {System.out.println($expr.value);}
    | ^('=' ID expr) {memory.put($ID.text, new Integer($expr.value));}
    ;
expr returns [int value]
    : ^('+' a=expr b=expr) {$value = a+b;}
    | ^('-' a=expr b=expr) {$value = a-b;}
    | ^('*' a=expr b=expr) {$value = a*b;}
    | ID
      { Integer v = (Integer)memory.get($ID.text);
        if ( v!=null ) $value = v.intValue();
        else System.err.println("undefined var "+$ID.text); }
    | INT {$value = Integer.parseInt($INT.text);}
    ;
AST → AST, AST → Template
• The ANTLR Tree construction page has examples of processing ASTs:
– AST → AST: can be used for typechecking and processing (e.g. taking the derivative of polynomials/formulas)
– AST → Java (action): often the final step, where the AST is needed no more
– AST → Template: can simplify Java/action code when the output is templatized
• Please see the Calculator examples as well. They show which files have to be shared so tree grammars can be used.
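The actions in the Eval tree grammar boil down to a recursive walk over the AST with a memory map for variables. A self-contained, hand-written stand-in (the node class T and its shape are illustrative; this is not ANTLR-generated code):

    import java.util.HashMap;
    import java.util.Map;

    public class EvalSketch {
        // Minimal AST node, mirroring the ^(op left right) tree shape.
        static class T {
            final String text; final T[] kids;
            T(String text, T... kids) { this.text = text; this.kids = kids; }
        }
        static final Map<String, Integer> memory = new HashMap<>();

        static int eval(T t) {
            switch (t.text) {
                case "+": return eval(t.kids[0]) + eval(t.kids[1]);
                case "-": return eval(t.kids[0]) - eval(t.kids[1]);
                case "*": return eval(t.kids[0]) * eval(t.kids[1]);
                default:
                    if (Character.isDigit(t.text.charAt(0)))
                        return Integer.parseInt(t.text);      // INT leaf
                    Integer v = memory.get(t.text);           // ID leaf
                    if (v == null) { System.err.println("undefined var " + t.text); return 0; }
                    return v;
            }
        }

        public static void main(String[] args) {
            memory.put("a", 2);
            memory.put("b", 3);
            // The AST for "a + b - 1", i.e. ('-' ('+' a b) 1):
            T tree = new T("-", new T("+", new T("a"), new T("b")), new T("1"));
            System.out.println(eval(tree));   // prints 4
        }
    }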
Our Tree Grammar
• Look at sample output from our AST generator (syntax_test_ast.txt); source lines on the left, AST fragments on the right:
9.  program X27                (program X27
10.
11. // constants
12. final int CONST = 25;        (final (TYP int) CONST 25)
13. final char CH = '\n';        (final (TYP char) CH '\n')
14. final notype[] B3 = 35;      (final (ARRAY notype) B3 35)
15.
16. // classes (types)
17. class Helper {               (class Helper
18.   // only variable declarations...
19.   int x;                       (VARLIST (VAR (TYP int) x)
20.   char y;                               (VAR (TYP char) y)
Compiler Design
27. Runtime Environments: Activation Records, Heap Management
Kanat Bolazar
April 29, 2010
Run-time Environments
• The compiler creates and manages a run-time environment in which it assumes the target program will be executed
• Issues for the run-time environment:
– Layout and allocation of storage locations for named program objects
– Mechanism for the target program to access variables
– Linkages between procedures
– Mechanisms for passing parameters
– Interfaces to the operating system for I/O and other programs
Storage Organization
• Assumes a logical address space
– The operating system will later map it to physical addresses, decide how to use cache memory, etc.
• Memory is typically divided into areas for:
– Program code
– Other static data storage, including global constants and compiler-generated data
– Stack to support the call/return policy for procedures
– Heap to store data that can outlive a call to a procedure
Layout: Code | Static | Heap | Free Memory | Stack
Run-time stack
• Each time a procedure is called (or a block entered), space for its local variables is pushed onto the stack.
• When the procedure terminates, that space is popped off the stack.
• Procedure activations are nested in time
– If procedure p calls procedure q, then even in cases of exceptions and errors, q will always terminate before p.
• Activations of procedures during the running of a program can be represented by an activation tree
• Each procedure activation has an activation record (aka frame) on the run-time stack.
– The run-time stack consists of the activation records, at any point in time during the running of the program, for all procedures that have been called but have not yet returned.
Procedure Example: quicksort
int a[11];
void readArray( ) /* Reads 9 integers into a[1] through a[9] */
{ int i; … }
int partition ( int m, int n ) {
  /* picks a separator v and partitions a[m .. n] so that a[m .. p-1] are
     less than v, a[p] = v, and a[p+1 .. n] are equal to or greater than v.
     Returns p. */
}
void quicksort ( int m, int n ) {
  int i;
  if ( n > m ) {
    i = partition(m, n);
    quicksort(m, i-1);
    quicksort(i+1, n);
  }
}
main ( ) {
  readArray( );
  a[0] = -9999;
  a[10] = 9999;
  quicksort(1, 9);
}
Activation Records
• Elements in the activation record:
– temporary values that could not fit into registers
– local variables of the procedure
– saved machine status for the point at which this procedure was called; includes the return address and contents of registers to be restored
– access link to the activation record of the previous block or procedure in the lexical scope chain
– control link pointing to the activation record of the caller
– space for the return value of the function, if any
– actual parameters (or they may be placed in registers, when possible)
Typical layout, top to bottom: actual params | return value | control link | access link | saved machine state | local data | temporaries
(A toy model of this stack of frames follows below.)
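A toy model of the run-time stack: one frame is pushed per call and popped on return, in LIFO order, matching the quicksort activations above. Field names follow the layout just given; the class names are illustrative only, and most fields (access link, saved machine state) are omitted:

    import java.util.ArrayDeque;
    import java.util.Deque;

    class Frame {
        final String procName;
        final int[] actualParams;
        final Frame controlLink;      // the caller's frame
        final int[] localData;
        Frame(String procName, int[] params, Frame caller, int numLocals) {
            this.procName = procName;
            this.actualParams = params;
            this.controlLink = caller;
            this.localData = new int[numLocals];
        }
    }

    public class RuntimeStackDemo {
        static final Deque<Frame> stack = new ArrayDeque<>();

        static void call(String proc, int... params) {
            stack.push(new Frame(proc, params, stack.peek(), 4));
        }
        static void ret() { stack.pop(); }

        public static void main(String[] args) {
            call("main");
            call("quicksort", 1, 9);      // main calls quicksort(1, 9)
            call("partition", 1, 9);      //   which calls partition(1, 9)
            System.out.println("depth = " + stack.size());   // depth = 3
            ret(); ret(); ret();          // activations end in LIFO order
        }
    }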
Procedure Linkage
• The standardized code to call a procedure, the calling sequence, and the return sequence may be divided between the caller and the callee.
– parameters and return value should come first in the new activation record, so that the caller can easily compute the actual parameters and get the return value as an extension of its own activation record
• this also allows procedures with a variable number of parameters
– fixed-length items are placed in the middle of the activation record
• the saved machine state is standardized
– local variables and temporaries are placed at the end; this is especially good when the size is not known until run-time, such as with dynamic arrays
– the top-of-stack pointer (TOP-SP) is commonly located at the end of the fixed-length fields
• fixed-length data can be accessed by local offsets, known to the intermediate code generator, relative to TOP-SP (negative offsets)
Activation Record Example
• Showing one way to divide responsibility between the caller and the callee. Each activation record holds, in order: actual params and return value; control link, access link, and saved machine state; local data and temporaries. TOP-SP points at the link and machine-state fields of the callee's record, and the actual top of stack lies past the callee's temporaries. The params/return value and link/status fields are the caller's responsibility; the local data and temporaries are the callee's.
Calling Sequence
• A possible calling sequence matching the previous diagram:
– caller evaluates the actual parameters
– caller stores the return address and the old value of TOP-SP in the callee's AR. The caller then increments TOP-SP to the callee's AR. (The caller knows the size of the caller's local data and temps, and of the callee's parameters and status fields.) The caller jumps to the callee's code.
– callee saves the register values and other status fields
– callee initializes local data and begins execution
Return Sequence
• Corresponding return sequence:
– callee places the return value next to the parameters
– using information in the status fields, the callee restores TOP-SP and the other registers, then jumps to the return address that the caller placed in the status field
– Although TOP-SP has been restored to the caller's AR, the caller knows where the return value is, relative to the current TOP-SP
Variable-length data on the stack
• It is possible to allocate objects, arrays or other structures of unknown size on the stack, as long as they are local to a procedure and become inaccessible when the procedure ends
• For example, represent a dynamic array in the activation record by a pointer: in the original diagram, procedure p's record holds a pointer to array a, the array itself sits on the stack after p's record, and procedure q's record follows it
Access to Nonlocal Data on the Stack
• The simplest case is languages without nested procedures or classes
– C and many C-based languages
• All variables are defined either within a single procedure (function) or outside of any procedure, at the global level
• Allocation of variables and access to variables:
– Global variables are allocated static storage. Locations are fixed at compile time.
– All other variables must be local to the activation on the top of the stack. These variables are allocated when the procedure is called and accessed via the TOP-SP pointer.
• Nested procedures will instead use a set of access links to access nonlocal data
Nested Procedure Example Outline in ML
fun sort(inputFile, outputFile) =
  let val a = array(11, 0);
      fun readArray(inputFile) = ... a ... ;  (* body of readArray accesses a *)
      fun exchange(i, j) = ... a ... ;        (* so does exchange *)
      fun quicksort(m, n) =
        let val v = ... ;
            fun partition(y, z) = ... a ... v ... exchange ...
        in ... a ... v ... partition ... quicksort ... end
  in ... a ... readArray ... quicksort ... end;
  (* the function sort accesses a and calls readArray and quicksort *)
Access Links
• Access links allow implementation of the normal static scope rule
– if procedure p is nested immediately within q in the source code, then the access link of an activation of p points to the most recent activation of q
• Access links form a chain – one link for each lexical level – allowing access to all data and procedures accessible to the currently executing procedure (a sketch of following this chain appears at the end of this section)
• Look at the example of access links from the quicksort program in ML (previous slide)
Defining Access Links for Direct Procedure Calls
• Procedure q calls procedure p explicitly:
– case 1: procedure p is at a nesting depth 1 higher than q (it can't be more than 1, to follow scope rules). Then the access link in the new activation record for p points to the immediately preceding activation record, that of q (example: quicksort calls partition)
– case 2: recursive call, i.e. q is p itself. The access link in the new activation record for q is the same as in the preceding activation record for q (example: quicksort calls quicksort)
– case 3: procedure p is at a lower nesting depth than q. Then procedure p must be immediately nested in some procedure r (defined in r), and there must be an activation record for r in the access chain of q. Follow the access links of q to find the activation record of r, and set the access link of p to point to that activation record of r. (partition calls exchange, which is defined in sort)
Defining Access Links for Parameter Procedures
• Suppose that procedure p is passed to q as a parameter. When q calls its parameter, which may be named r, it is not actually known which procedure to call until run-time.
• When a procedure is passed as a parameter, the caller must also pass, along with the name of the procedure, the proper access link for that parameter.
• When q calls the procedure parameter, it sets up that access link, thus enabling the procedure parameter to run in the environment of the caller procedure.
Displays
• If the nesting depth gets large, access to nonlocal variables becomes inefficient, because the chain of access links must be followed.
• The solution is to keep an auxiliary array – the display – in which each element d[l] is the highest activation record on the stack for the procedure at nesting depth l.
• Whenever a new activation record is created at level l, it saves the old value of display[l], to restore when it is done.
• Example from the quicksort program: the stack holds sort, then q(1,9) (saved d[2]), q(1,3) (saved d[2]), p(1,3) (saved d[3]), e(1,3) (saved d[2]).
Dangling pointers in the stack
• In a stack-based environment, typically used for parameters and local variables, variables local to a procedure are removed from the stack when the procedure exits
• There should not be any pointers still in use to such variables
• Example from C:
int* dangle(void) {
  int x;
  return &x;
}
• An assignment "addr = dangle();" causes addr to point to stack space that will be reused by the next procedure call
Organization of Memory for Arrays
• C/C++ arrays and Java arrays are stored very differently in memory
• A 2x3 C int array only needs space for 6 ints in the heap:
ar[0][0], ar[0][1], ar[0][2], ar[1][0], ar[1][1], ar[1][2]
• The same array can be accessed as an int[6] array.
• Java is type-safe; in Java, you can't access an int[2][3] as if it were an int[6].
• Java also stores the array length and other Object information, including reference counts for garbage collection. All arrays are objects in the heap.
• A local "array" variable is just a pointer/reference. This line creates three array objects in Java (ar itself, plus ar[0] and ar[1]):
int[][] ar = new int[2][3];
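The access-link sketch promised above: to read a variable declared at nesting depth d from code running at depth n, hop n − d links. This extends the earlier toy Frame model with an access link; names and the (depth, offset) addressing are illustrative:

    class ScopedFrame {
        final String procName;
        final int nestingDepth;
        final ScopedFrame accessLink;   // frame of the enclosing procedure
        final int[] locals;
        ScopedFrame(String name, int depth, ScopedFrame link, int numLocals) {
            procName = name; nestingDepth = depth; accessLink = link;
            locals = new int[numLocals];
        }

        // Resolve a variable declared at (declaredDepth, offset) from this frame.
        int fetchNonlocal(int declaredDepth, int offset) {
            ScopedFrame f = this;
            for (int i = nestingDepth; i > declaredDepth; i--) f = f.accessLink;
            return f.locals[offset];
        }
    }

    public class AccessLinkDemo {
        public static void main(String[] args) {
            ScopedFrame sort = new ScopedFrame("sort", 1, null, 2);
            sort.locals[0] = 42;                          // stands in for the array a
            ScopedFrame quicksort = new ScopedFrame("quicksort", 2, sort, 2);
            ScopedFrame partition = new ScopedFrame("partition", 3, quicksort, 2);
            // partition reads a, declared in sort (depth 1), via two link hops:
            System.out.println(partition.fetchNonlocal(1, 0));   // 42
        }
    }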
Heap Management
• The heap stores data that lives indefinitely, or until the program explicitly deletes it
• The memory manager allocates and deallocates memory in the heap
– serves as the interface between application programs, generated by the compiler, and the operating system
– calls to free and delete can be generated by the compiler, or in some languages, written explicitly by the programmer
• Garbage Collection is an important subsystem of the memory manager that finds spaces within the heap that are no longer used and can be returned to free storage
– the language Java uses the garbage collector as the deallocation operation
Memory Manager
• The memory manager has one large chunk of memory from the operating system that it can manage for the application program
• Allocation
– when a program requests memory for a variable or an object (anything requiring space), the memory manager gives it the address of a chunk of contiguous heap memory
– if there is no space big enough, it can request virtual space from the operating system
– if out of space, it informs the program
• Deallocation
– returns deallocated space to the pool of free space
– doesn't reduce the size of the space or return it to the operating system
Properties of Memory Manager
• Space efficiency
– minimize the total heap size needed by the program
– accomplished by minimizing fragmentation
• Program efficiency
– make good use of the memory subsystem to allow programs to run faster
– locality of placement of objects
• Low overhead
– it is important for memory allocation and deallocation to be as efficient as possible, as they are frequent operations in many programs
Memory Hierarchy
• Registers are scarce – explicitly managed by the code generated by the compiler
• Other memory levels are automatically handled by the operating system
• Chunks of memory are copied from lower levels to higher levels as necessary:
Level               Typical size      Typical access time
Virtual memory      > 40GB            3–15 ms (disk)
Physical memory     512MB – 4GB       100–150 ns
2nd-level cache     125KB – 4MB       40–60 ns
1st-level cache     16 – 64KB         5–10 ns
Registers           32 words          1 ns
Taking Advantage of Locality
• Programs often exhibit both
– temporal locality – accessed memory locations are likely to be accessed again soon
– spatial locality – memory close to locations that have been accessed is also likely to be accessed
• The compiler can place basic blocks (sequential instructions) on the same cache page, or even the same cache line
• Instructions belonging to the same loop or function can also be placed together
Placing objects in the heap
• As heap memory is allocated and deallocated, it is broken into free spaces, the holes, and the used spaces.
– on allocation, a hole must be split into a free part and a used part
• Best Fit placement is deemed the best strategy (a sketch follows below)
– uses the smallest available hole that is large enough
– this strategy saves larger holes for later, possibly larger, requests
• Contrast with the First Fit strategy, which uses the first hole on the list that is large enough
– it has a shorter allocation time, but is a worse overall strategy
• Some managers use the "bin" approach to keeping track of free space
– for many standard sizes, keep a list of free spaces of that size
– keep more bins for smaller sizes, as they are more common
Coalescing Free Space
• When an object is freed, fragmentation is reduced if we can combine the deallocated space with any adjacent free spaces
• Data structures to support coalescing:
– boundary tags – at each end of the chunk, keep a bit indicating whether the chunk is free, and keep its size
– doubly linked, embedded free list – pointers to the next free chunks are kept at each end, next to the boundary tags
• When B is deallocated, it can check whether its neighbors A and C are free and, if so, coalesce the blocks and adjust the links of the free list
• In the original diagram, chunks A (size 200), B (size 100) and C (size 80) each begin and end with a free bit and size tag (e.g. 0:200: … :200:0 for A); the pointers doubly link the free chunks, not necessarily in physical order
Problems with Manual Deallocation
• It is a notoriously difficult task for programmers, or compilers, to correctly decide when an object will never be referenced again
• If you use caution in deallocation, then you may get chunks of memory that are marked in use but are never used again – memory leaks
• If you deallocate incorrectly, so that at a later time a reference is used to an object that was deallocated, then an error occurs – dangling pointers
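The Best Fit sketch promised above: holes are (start, size) pairs on a free list, and the chosen hole is split into a used part and a remaining free part. Illustrative only; there is no coalescing or binning here:

    import java.util.ArrayList;
    import java.util.List;

    public class BestFitDemo {
        static class Hole {
            int start, size;
            Hole(int start, int size) { this.start = start; this.size = size; }
        }
        static final List<Hole> freeList = new ArrayList<>();

        static int allocate(int size) {
            Hole best = null;
            for (Hole h : freeList)                  // smallest hole that fits
                if (h.size >= size && (best == null || h.size < best.size))
                    best = h;
            if (best == null) return -1;             // out of space
            int addr = best.start;                   // split the hole:
            best.start += size;
            best.size -= size;
            if (best.size == 0) freeList.remove(best);
            return addr;
        }

        public static void main(String[] args) {
            freeList.add(new Hole(0, 200));
            freeList.add(new Hole(300, 80));
            System.out.println(allocate(60));   // 300 (the 80-byte hole fits best)
            System.out.println(allocate(150));  // 0   (only the 200-byte hole fits)
        }
    }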
Garbage Collection
• In many languages, program variables have pointers to objects in the heap, e.g. through the use of new
• These objects can have pointers to other objects
• Everything reachable through a program variable is in use, and everything else in the heap is garbage
– in an assignment "x = y", the object formerly pointed to by x is now garbage if x was the last pointer to it
• A requirement for being a garbage-collectible language is to be type safe:
– we must be able to tell whether any data element or component of a heap chunk is a pointer
Performance Metrics
• Overall execution time – garbage collection touches a lot of data, and it is important that it not substantially increase the total run time of an application
• Space usage – the garbage collector should not increase fragmentation
• Pause time – garbage collectors are notorious for causing the application to pause suddenly for a very long time, as garbage collection kicks in
– as a special case, real-time applications must be assured that they can achieve certain computations within a time limit
• Program locality – the garbage collector also controls the placement of data, particularly when it relocates data
Reachability
• The data that can be accessed directly by the program, without any dereferencing, is the root set, and its elements are all reachable
– the compiler may have placed elements of the root set in registers or on the stack
• Any object with a reference stored in the field members or array elements of any reachable object is also a reachable object
• The program (sometimes called the mutator) can change the reachable set:
– object allocation by the memory manager
– parameter passing and return values – objects pointed to by actual parameters and by return results remain reachable
Reference Counting Garbage Collectors
• Keep a count of the number of references to any object, and when the count drops to 0, the object can be returned to free storage
• Every object keeps a field for the reference count, which is maintained as follows (a sketch follows below):
– object allocation – the count of a new object is 1
– parameter passing – the reference count of an actual parameter object is increased by 1
– reference assignments "x = y": the reference count of the object referred to by y goes up by 1; the reference count of the old object pointed to by x is decreased by 1
– procedure returns – objects pointed to by local variables have their counts decremented
– transitive loss of reachability – whenever the count of an object goes to 0, we must decrement by 1 the count of each object pointed to by a reference within that object
• Simple, but imperfect: it cannot reclaim circular structures
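Reference counting in miniature: each object carries a count, and when it drops to zero the object dies and transitively decrements whatever it points to. Names are hypothetical; a real collector works on raw heap chunks, not Java objects:

    import java.util.ArrayList;
    import java.util.List;

    class RcObject {
        final String name;
        int refCount = 1;                   // count is 1 at allocation
        final List<RcObject> fields = new ArrayList<>();
        RcObject(String name) { this.name = name; }

        static void assign(RcObject target) {       // a new reference is created
            if (target != null) target.refCount++;
        }
        static void release(RcObject target) {      // a reference goes away
            if (target == null || --target.refCount > 0) return;
            System.out.println(target.name + " reclaimed");
            for (RcObject f : target.fields) release(f);  // transitive loss
        }
    }

    public class RefCountDemo {
        public static void main(String[] args) {
            RcObject row = new RcObject("int[] row");
            RcObject outer = new RcObject("int[][] ar");
            outer.fields.add(row);          // ar[0] points at row
            RcObject.assign(row);           // row now has refCount = 2

            RcObject.release(row);          // local reference dropped: count = 1
            RcObject.release(outer);        // ar = null: reclaims ar, then row
        }
    }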
Java Array Example
• Recall that this line creates three array objects:
int[][] ar = new int[2][3]; // creates ar, ar[0] and ar[1] arrays
• The local variable ar stores the address of the int[][] object
• Its elements store the addresses of the two int[] objects, one per row.
int[] row1 = ar[0]; // the first int[] object now has ref count = 2
ar[0] = null;       // the same object now has ref count = 1, from row1
                    // the first int[] object is not reachable from ar anymore
ar = null;          // the int[][] object now has ref count = 0
• transitive loss of reachability: ar is not reachable anymore
• anything reachable from it should have its refCount decremented: here, the second int[] object (the other row) drops to 0 and is reclaimed as well
Basic Mark and Sweep Garbage Collection
• Trace-based algorithms recycle memory as follows:
– the program runs and makes allocation requests
– the garbage collector discovers reachability by tracing
– unreachable objects are reclaimed for storage
• The Mark and Sweep algorithms use four states for chunks of memory:
– Free – ready to be allocated, at any time
– Unreached – reachability has not been established by the gc; when a chunk is allocated, it is set to "unreached"
– Unscanned – chunks that are known to be reachable are either scanned or unscanned; an unscanned object has itself been reached, but its pointers have not been scanned
– Scanned – the object is reachable and all its pointers have been scanned
Mark and Sweep Algorithm
• Stop the program and start the garbage collector (a worklist sketch of the marking phase follows at the end of this section)
• Marking phase:
set the Free list to be empty
set the reached bit of the root set to 1 and add the root set to the list Unscanned
loop over the Unscanned list:
  remove object o from the Unscanned list
  for each pointer p in object o:
    if p is unreached (bit is 0), set the bit to 1 and put p in the Unscanned list
• Sweeping phase:
for each chunk of memory o in the heap:
  if o is unreached, add o to the Free list
Baker's Mark and Sweep Algorithm
• The basic algorithm is expensive because it examines every chunk in the heap
• Baker's optimization keeps a list of allocated objects, used as the Unreached list in the algorithm:
Scanned = empty set
Unscanned = root set
loop over the Unscanned set:
  move object o from Unscanned to Scanned
  for each pointer p in o:
    if p is Unreached, move p from Unreached to Unscanned
Free = Free + Unreached
Unreached = Unscanned
Copying Collectors (Relocating)
• While identifying the Free set, the garbage collector can relocate all reachable objects into one end of the heap
– while analyzing every reference, the gc can update them to point to the new location, and also update the root set
• Mark and compact moves objects to one end of the heap after the marking phase
• A copying collector moves the objects from one region of memory to another as it marks
– extra space is reserved for relocation
– this separates the tasks of finding free space and updating the new memory locations of the objects
• the gc copies objects as it traces out the reachable set
Short-Pause Garbage Collection
• Incremental garbage collection – interleaves garbage collection with the mutator
– incremental gc is conservative during reachability tracing and only traces out objects which were allocated at the time it begins
– not all garbage is found during the sweep (floating garbage), but it will be collected the next time
• Partial collection – the garbage collector divides the work by dividing the space into subsets
– Usually between 80–98% of newly allocated objects "die young", i.e. die within a few million instructions, and it is cost effective to garbage collect these objects often
– Generational garbage collection separates the heap into "young" and "mature" areas. If an object survives some number of "young" collections, it is promoted to the "mature" area.
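The marking-phase sketch promised above, as a worklist algorithm: start from the root set and move objects from Unscanned to Scanned while tracing their pointers. A sketch over toy objects rather than raw heap chunks:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    class GcNode {
        final String name;
        boolean reached = false;                // the "reached" mark bit
        final List<GcNode> pointers = new ArrayList<>();
        GcNode(String name) { this.name = name; }
    }

    public class MarkSweepDemo {
        static void mark(List<GcNode> rootSet) {
            Deque<GcNode> unscanned = new ArrayDeque<>(rootSet);
            rootSet.forEach(n -> n.reached = true);
            while (!unscanned.isEmpty()) {
                GcNode o = unscanned.pop();     // o is now Scanned
                for (GcNode p : o.pointers)
                    if (!p.reached) { p.reached = true; unscanned.push(p); }
            }
        }

        public static void main(String[] args) {
            GcNode a = new GcNode("a"), b = new GcNode("b"), c = new GcNode("c");
            a.pointers.add(b);                  // c is unreachable garbage
            mark(List.of(a));
            for (GcNode n : List.of(a, b, c))   // sweep: reclaim unreached nodes
                System.out.println(n.name + (n.reached ? " reachable" : " reclaimed"));
        }
    }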
Parallel and Concurrent Garbage Collection
• A garbage collector is parallel if it uses multiple threads, and it is concurrent if it runs in parallel with the mutator
– based on Dijkstra's "on-the-fly" garbage collection, coloring the reachable nodes white, black or gray
• This version partially overlaps gc with mutation, and the mutation helps the gc:
– Find the root set (with the mutator stopped)
– Interleave the tracing of reachable objects with the mutator(s)
• whenever the mutator writes a reference that points from a Scanned object to an Unreached object, we remember it (these are called the dirty objects)
– Stop the mutator(s) to rescan all the dirty objects, which will be quick because most of the tracing has been done already
Cost of Basic Garbage Collection
• Mark phase: depth-first search takes time proportional to the number of nodes that it marks, i.e. the number of reachable chunks
• Sweep phase: time proportional to the size of the heap
• Amortize the collection: divide the time spent collecting by the amount of garbage reclaimed:
– R chunks of reachable data
– H is the heap size
– c1 is the time for each marked node and c2 the time to sweep each chunk
cost per reclaimed chunk = (c1·R + c2·H) / (H – R)
• If R is close to H, this cost gets high (for example, with H = 100 and R = 90, the denominator H – R is only 10), and the collector could increase H by asking the operating system for more memory
References
• Dr. Nancy McCracken, Syracuse University.
• Aho, Lam, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools. Addison-Wesley, 2006. (The purple dragon book)
• Keith Cooper and Linda Torczon, Engineering a Compiler, Elsevier, 2004.
Compiler Design
Yacc Example
"Yet Another Compiler Compiler"
Kanat Bolazar
Lex and Yacc
• Two classical tools for compilers:
– Lex: A Lexical Analyzer Generator
– Yacc: "Yet Another Compiler Compiler" (Parser Generator)
• Lex creates programs that scan your tokens one by one.
• Yacc takes a grammar (sentence structure) and generates a parser.
Pipeline: Lexical Rules → Lex → yylex(); Grammar Rules → Yacc → yyparse(); Input → yylex() → yyparse() → Parsed Input
Lex and Yacc
• Lex and Yacc generate C code for your analyzer and parser:
– Lex turns the lexical rules into C code for yylex(), the Lexical Analyzer (Tokenizer), which turns a char stream into a token stream
– Yacc turns the grammar rules into C code for yyparse(), the Parser, which turns the token stream into parsed input
Flex, Yacc, Bison, Byacc
• Often, instead of the standard Lex and Yacc, Flex and Bison are used:
– Flex: A fast lexical analyzer
– (GNU) Bison: A drop-in replacement for (backwards compatible with) Yacc
• Byacc is the Berkeley implementation of Yacc (so it is Yacc).
• Resources:
http://en.wikipedia.org/wiki/Flex_lexical_analyser
http://en.wikipedia.org/wiki/GNU_Bison
• The Lex & Yacc Page (manuals, links):
http://dinosaur.compilertools.net/
Yacc: A Standard Parser Generator
• Yacc is not a new tool, and yet it is still used in many projects
• Yacc syntax is similar to Lex/Flex at the top level.
• Lex/Flex rules were regular expression – action pairs.
• Yacc rules are grammar rule – action pairs.
declarations
%%
rules
%%
programs
Yacc Examples: Calculator
• A standard Yacc example is the int-valued calculator.
• Appendix A of the Yacc manual at the Lex and Yacc Page shows such a calculator.
• We'll examine this example in parts. Let's start with four operations:
E -> E + E
   | E – E
   | E * E
   | E / E
• Note that this grammar is ambiguous, because 2 + 5 * 7 could be parsed 2 + 5 first or 5 * 7 first.
Yacc Calculator Example: Declarations
%{
/* directly included C code */
# include <stdio.h>
# include <ctype.h>
int regs[26];
int base;
%}
%start list            /* list is our start symbol: a list of one-line statements / expressions */
%token DIGIT LETTER    /* DIGIT & LETTER are tokens; other tokens use ASCII codes, as in '+', '=', etc. */
%left '+' '-'          /* precedence and associativity (left) of operators: + and - have lowest precedence */
%left '*' '/' '%'
%left UMINUS           /* precedence for unary minus */
Yacc Calculator Example: Rules
%% /* begin rules section */
/* list: a list of one-line statements / expressions.
   Error handling allows a statement to be corrupt; the list continues with the next statement. */
list : /* empty */
     | list stat '\n'
     | list error '\n'  { yyerrok; }
     ;
/* statement: expression to calculate, or assignment */
stat : expr             { printf( "%d\n", $1 ); }
     | LETTER '=' expr  { regs[$1] = $3; }
     ;
/* number: made up of digits (the tokenizer would normally handle this, but this is a simple example) */
number : DIGIT          { $$ = $1; base = ($1==0) ? 8 : 10; }
       | number DIGIT   { $$ = base * $1 + $2; }
       ;
Yacc Calculator Example: Rules, cont'd
expr : '(' expr ')'           { $$ = $2; }
     | expr '+' expr          { $$ = $1 + $3; }
     | expr '-' expr          { $$ = $1 - $3; }
     | expr '*' expr          { $$ = $1 * $3; }
     | expr '/' expr          { $$ = $1 / $3; }
     | '-' expr %prec UMINUS  { $$ = - $2; }     /* unary minus */
     | LETTER                 { $$ = regs[$1]; } /* letter: register/variable */
     | number
     ;
Yacc Calculator Example: Programs (C Code)
%% /* start of programs */
yylex() {
    /* lexical analysis routine */
    /* returns LETTER for a lower case letter, yylval = 0 through 25 */
    /* returns DIGIT for a digit, yylval = 0 through 9 */
    /* all other characters are returned immediately */
    int c;
    while( (c=getchar()) == ' ' ) { /* skip blanks */ }
    /* c is now nonblank */
    if( islower( c ) ) {
        yylval = c - 'a';
        return ( LETTER );
    }
    if( isdigit( c ) ) {
        yylval = c - '0';
        return( DIGIT );
    }
    return( c );
}