Outline Introduction to Parsing Ambiguity and Syntax Errors • Regular languages revisited • Parser overview • Context-free grammars (CFG’s) • Derivations • Ambiguity • Syntax errors Compiler Design 1 (2011) Languages and Automata Limitations of Regular Languages • Formal languages are very important in CS Intuition: A finite automaton that runs long enough must repeat states • A finite automaton cannot remember # of times it has visited a particular state • because a finite automaton has finite memory – Especially in programming languages • Regular languages – The weakest formal languages widely used – Many applications – Only enough to store in which state it is – Cannot count, except up to a finite limit • Many languages are not regular • E.g., language of balanced parentheses is not regular: { (i )i | i ≥ 0} • We will also study context-free languages Compiler Design 1 (2011) 2 3 Compiler Design 1 (2011) 4 The Functionality of the Parser Example • Input: sequence of tokens from lexer • If-then-else statement if (x == y) then z =1; else z = 2; • Parser input IF (ID == ID) THEN ID = INT; ELSE ID = INT; • Possible parser output • Output: parse tree of the program IF-THEN-ELSE == ID Compiler Design 1 (2011) 5 Comparison with Lexical Analysis = ID ID = INT ID INT Compiler Design 1 (2011) The Role of the Parser Phase Input Output • Not all sequences of tokens are programs . . . • . . . Parser must distinguish between valid and invalid sequences of tokens Lexer Sequence of characters Sequence of tokens • We need Parser Sequence of tokens Parse tree Compiler Design 1 (2011) 6 – A language for describing valid sequences of tokens – A method for distinguishing valid from invalid sequences of tokens 7 Compiler Design 1 (2011) 8 Context-Free Grammars CFGs (Cont.) • Many programming language constructs have a recursive structure • A CFG consists of – – – – • A STMT is of the form if COND then STMT else STMT while COND do STMT … , or , or Assuming X ∈ N the productions are of the form X→ε , or X → Y1 Y2 ... Yn where Yi ∈ N ∪T • Context-free grammars are a natural notation for this recursive structure Compiler Design 1 (2011) A set of terminals T A set of non-terminals N A start symbol S (a non-terminal) A set of productions 9 Compiler Design 1 (2011) Notational Conventions Examples of CFGs • In these lecture notes A fragment of our example language (simplified): – Non-terminals are written upper-case – Terminals are written lower-case – The start symbol is the left-hand side of the first production Compiler Design 1 (2011) 10 STMT → if COND then STMT else STMT ⏐ while COND do STMT ⏐ id = int 11 Compiler Design 1 (2011) 12 Examples of CFGs (cont.) The Language of a CFG Grammar for simple arithmetic expressions: Read productions as replacement rules: E→ ⏐ ⏐ ⏐ X → Y1 ... Yn E*E E+E (E) id Compiler Design 1 (2011) Means X can be replaced by Y1 ... Yn X→ε Means X can be erased (replaced with empty string) 13 Compiler Design 1 (2011) 14 Key Idea The Language of a CFG (Cont.) (1) Begin with a string consisting of the start symbol “S” More formally, we write X1 L X i L X n → X1 L X i−1Y1 LYm X i+1 L X n (2) Replace any non-terminal X in the string by a right-hand side of some production if there is a production X → Y1 LYn X i → Y1 LYm (3) Repeat (2) until there are no non-terminals in the string Compiler Design 1 (2011) 15 Compiler Design 1 (2011) 16 The Language of a CFG (Cont.) Write The Language of a CFG ∗ X 1 L X n → Y1 LYm if { Let G be a context-free grammar with start symbol S. Then the language of G is: } ∗ a1 K an | S → a1 K an and every ai is a terminal X 1 L X n → L → L → Y1 LYm in 0 or more steps Compiler Design 1 (2011) 17 Compiler Design 1 (2011) 18 Terminals Examples • Terminals are called so because there are no rules for replacing them L(G) is the language of the CFG G Strings of balanced parentheses • Once generated, terminals are permanent i i Two grammars: • Terminals ought to be tokens of the language Compiler Design 1 (2011) {( ) | i ≥ 0} S S 19 → (S ) → ε Compiler Design 1 (2011) OR S → (S ) | ε 20 Example Example (Cont.) A fragment of our example language (simplified): Some elements of the our language STMT → if COND then STMT ⏐ if COND then STMT else STMT ⏐ while COND do STMT ⏐ id = int COND → (id == id) ⏐ (id != id) Compiler Design 1 (2011) id = int if (id == id) then id = int else id = int while (id != id) do id = int while (id == id) do while (id != id) do id = int if (id != id) then if (id == id) then id = int else id = int 21 Compiler Design 1 (2011) Arithmetic Example Notes Simple arithmetic expressions: The idea of a CFG is a big step. But: E → E+E | E ∗ E | (E) | id Some elements of the language: id • Membership in a language is just “yes” or “no”; we also need the parse tree of the input id + id (id) id ∗ id (id) ∗ id id ∗ (id) Compiler Design 1 (2011) 22 • Must handle errors gracefully • Need an implementation of CFG’s (e.g., yacc) 23 Compiler Design 1 (2011) 24 More Notes Derivations and Parse Trees • Form of the grammar is important A derivation is a sequence of productions – Many grammars generate the same language – Parsing tools are sensitive to the grammar S →L →L →L A derivation can be drawn as a tree Note: Tools for regular languages (e.g., lex/ML-Lex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice Compiler Design 1 (2011) – Start symbol is the tree’s root – For a production X → Y1 LYn add children to node X 25 Derivation Example Y1 LYn Compiler Design 1 (2011) 26 Derivation Example (Cont.) • Grammar E → E+E | E ∗ E | (E) | id • String id ∗ id + id Compiler Design 1 (2011) E E → E+E → E ∗ E+E → id ∗ E + E → id ∗ id + E → id ∗ id + id 27 Compiler Design 1 (2011) + E E id * E E id id 28 Derivation in Detail (1) Derivation in Detail (2) E E E + E → E+E E Compiler Design 1 (2011) 29 Derivation in Detail (3) Compiler Design 1 (2011) 30 Derivation in Detail (4) E E → E+E → E ∗ E+E Compiler Design 1 (2011) E + E E * E E → E+E E → E ∗ E+E → id ∗ E + E E 31 Compiler Design 1 (2011) + E E * E E id 32 Derivation in Detail (5) Derivation in Detail (6) E E → E+E → E ∗ E+E → id ∗ E + E → id ∗ id + E + E E id * → → → → → E E id Compiler Design 1 (2011) E E 33 E+E E ∗ E+E id ∗ E + E id ∗ id + E id ∗ id + id + E E id * E id id Compiler Design 1 (2011) 34 Notes on Derivations Left-most and Right-most Derivations • A parse tree has • The example is a left-most derivation – Terminals at the leaves – Non-terminals at the interior nodes – At each step, replace the left-most non-terminal • An in-order traversal of the leaves is the original input • There is an equivalent notion of a right-most derivation • The parse tree shows the association of operations, the input string does not Compiler Design 1 (2011) E E → → → → E+E E+id E ∗ E + id E ∗ id + id → id ∗ id + id 35 Compiler Design 1 (2011) 36 Right-most Derivation in Detail (1) Right-most Derivation in Detail (2) E E E + E → E+E E Compiler Design 1 (2011) 37 Right-most Derivation in Detail (3) Compiler Design 1 (2011) 38 Right-most Derivation in Detail (4) E E → E+E → E+id Compiler Design 1 (2011) E E + E E E → E+E → E+id → E ∗ E + id id 39 Compiler Design 1 (2011) + E E * E E id 40 Right-most Derivation in Detail (5) Right-most Derivation in Detail (6) E E → E+E → E+id → E ∗ E + id + E E * → E ∗ id + id E E E → → → → → E id id Compiler Design 1 (2011) 41 E+E E+id E ∗ E + id E ∗ id + id id ∗ id + id + E E id * E E id id Compiler Design 1 (2011) Derivations and Parse Trees Summary of Derivations • Note that right-most and left-most derivations have the same parse tree • We are not just interested in whether s ∈ L(G) 42 – We need a parse tree for s • The difference is just in the order in which branches are added • A derivation defines a parse tree – But one parse tree may have many derivations • Left-most and right-most derivations are important in parser implementation Compiler Design 1 (2011) 43 Compiler Design 1 (2011) 44 Ambiguity Ambiguity (Cont.) • Grammar This string has two parse trees E → E + E | E * E | ( E ) | int E • String int * int + int E E + E E * E * E int int E + int Compiler Design 1 (2011) 45 E int int E int Compiler Design 1 (2011) Ambiguity (Cont.) Dealing with Ambiguity • A grammar is ambiguous if it has more than one parse tree for some string • There are several ways to handle ambiguity – Equivalently, there is more than one right-most or left-most derivation for some string • Most direct method is to rewrite grammar unambiguously • Ambiguity is bad – Leaves meaning of some programs ill-defined E→T+E|T T → int * T | int | ( E ) • Ambiguity is common in programming languages – Arithmetic expressions – IF-THEN-ELSE Compiler Design 1 (2011) 46 • Enforces precedence of * over + 47 Compiler Design 1 (2011) 48 Ambiguity: The Dangling Else The Dangling Else: Example • Consider the following grammar • The expression if C1 then if C2 then S3 else S4 has two parse trees S → if C then S | if C then S else S | OTHER if C1 • This grammar is also ambiguous if C2 if C1 S4 if S3 C2 S3 S4 • Typically we want the second form Compiler Design 1 (2011) 49 Compiler Design 1 (2011) 50 The Dangling Else: A Fix The Dangling Else: Example Revisited • else matches the closest unmatched then • We can describe this in the grammar • The expression if C1 then if C2 then S3 else S4 S → MIF | UIF C1 MIF → if C then MIF else MIF | OTHER UIF → if C then S | if C then MIF else UIF if C2 S3 C1 S4 • A valid parse tree (for a UIF) • Describes the same set of strings Compiler Design 1 (2011) if if /* all then are matched */ /* some then are unmatched */ 51 Compiler Design 1 (2011) S4 if C2 S3 • Not valid because the then expression is not a MIF 52 Ambiguity Precedence and Associativity Declarations • No general techniques for handling ambiguity • Instead of rewriting the grammar – Use the more natural (ambiguous) grammar – Along with disambiguating declarations • Impossible to convert automatically an ambiguous grammar to an unambiguous one • Most tools allow precedence and associativity declarations to disambiguate grammars • Used with care, ambiguity can simplify the grammar • Examples … – Sometimes allows more natural definitions – We need disambiguation mechanisms Compiler Design 1 (2011) 53 Compiler Design 1 (2011) 54 Associativity Declarations Precedence Declarations • Consider the grammar E → E + E | int • Ambiguous: two parse trees of int + int + int • Consider the grammar E → E + E | E * E | int E E E + + E int int – And the string int + int * int E E E E int int E + int E E E + E int E E int int E + int • Precedence declarations: %left + %left * • Left associativity declaration: %left + Compiler Design 1 (2011) + E int int * E 55 Compiler Design 1 (2011) E * E int 56 Error Handling Syntax Error Handling • Purpose of the compiler is • Error handler should – To detect non-valid programs – To translate the valid ones – Report errors accurately and clearly – Recover from an error quickly – Not slow down compilation of valid code • Many kinds of possible errors (e.g. in C) Error kind Lexical Syntax Semantic Correctness Example Detected by … …$… … x *% … … int x; y = x(3); … your favorite program • Good error handling is not easy to achieve Lexer Parser Type checker Tester/User Compiler Design 1 (2011) 57 Compiler Design 1 (2011) Approaches to Syntax Error Recovery Error Recovery: Panic Mode • From simple to complex • Simplest, most popular method – Panic mode – Error productions – Automatic local or global correction 58 • When an error is detected: – Discard tokens until one with a clear role is found – Continue from there • Not all are supported by all parser generators • Such tokens are called synchronizing tokens – Typically the statement or expression terminators Compiler Design 1 (2011) 59 Compiler Design 1 (2011) 60 Syntax Error Recovery: Panic Mode (Cont.) Syntax Error Recovery: Error Productions • Consider the erroneous expression • Idea: specify in the grammar known common mistakes • Essentially promotes common errors to alternative syntax • Example: (1 + + 2) + 3 • Panic-mode recovery: – Skip ahead to next integer and then continue – Write 5 x instead of 5 * x – Add the production E → … | E E • (ML)-Yacc: use the special terminal error to describe how much input to skip • Disadvantage E → int | E + E | ( E ) | error int | ( error ) – Complicates the grammar Compiler Design 1 (2011) 61 Syntax Error Recovery: Past and Present • Past – Slow recompilation cycle (even once a day) – Find as many errors in one cycle as possible – Researchers could not let go of the topic • Present – – – – Quick recompilation cycle Users tend to correct one error/cycle Complex error recovery is needed less Panic-mode seems enough Compiler Design 1 (2011) 63 Compiler Design 1 (2011) 62