Fall 2015-2016
Compiler Principles
Lecture 3: Parsing part 2
Roman Manevich
Ben-Gurion University

Tentative syllabus
• Front End: Scanning; Top-down Parsing (LL); Bottom-up Parsing (LR)
• Intermediate Representation: Operational Semantics; Lowering
• Optimizations: Dataflow Analysis; Loop Optimizations
• Code Generation: Register Allocation; Instruction Selection

Previously
• Role of syntax analysis
• Context-free grammars refresher
• Top-down (predictive) parsing
  – Recursive descent

Functions for nonterminals
E → LIT | (E OP E) | not E
LIT → true | false
OP → and | or | xor

E() {
  if (current ∈ {TRUE, FALSE}) LIT();
  else if (current == LPAREN) { match(LPAREN); E(); OP(); E(); match(RPAREN); }
  else if (current == NOT) { match(NOT); E(); }
  else error;
}

LIT() {
  if (current == TRUE) match(TRUE);
  else if (current == FALSE) match(FALSE);
  else error;
}

OP() {
  if (current == AND) match(AND);
  else if (current == OR) match(OR);
  else if (current == XOR) match(XOR);
  else error;
}

Technical challenges with recursive descent

Recursive descent: problem 1
term → ID | indexed_elem
indexed_elem → ID [ expr ]
• With lookahead 1, the function for indexed_elem will never be tried…
  – What happens for input of the form ID[expr]?

Recursive descent: problem 2
S → A a b
A → a | ε

int S() { return A() && match(token('a')) && match(token('b')); }
int A() { return match(token('a')) || 1; }

What happens for input “ab”?
What happens if you flip the order of the alternatives and try “aab”?

Recursive descent: problem 3 (p. 127)
E → E - term | term

int E() { return E() && match(token('-')) && term(); }

What happens when we execute this procedure?
Recursive descent parsers cannot handle left-recursive grammars

Agenda
• Predicting productions via FIRST/FOLLOW/NULLABLE sets
• Handling conflicts
• LL(k) via pushdown automata

How do we predict?
E → LIT | (E OP E) | not E
LIT → true | false
OP → and | or | xor
• How can we decide which production of ‘E’ to take?
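For this Boolean grammar every alternative of E starts with a different token, so the functions for nonterminals shown earlier can choose a production by inspecting only the current token. The following is a minimal Python sketch of that idea, not the lecture's code: the Parser class, the parse helper, the lowercase token spellings, and the "$" end marker are all assumptions made for illustration.

```python
class ParseError(Exception):
    """Raised when the input is not a sentence of the grammar."""


class Parser:
    # Grammar from the slides:
    #   E   -> LIT | ( E OP E ) | not E
    #   LIT -> true | false
    #   OP  -> and | or | xor
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" is an assumed end marker
        self.pos = 0

    @property
    def current(self):
        return self.tokens[self.pos]

    def match(self, expected):
        # Consume the current token if it is the expected one.
        if self.current != expected:
            raise ParseError(f"expected {expected!r}, found {self.current!r}")
        self.pos += 1

    def E(self):
        # The current token alone selects the alternative of E.
        if self.current in ("true", "false"):
            self.LIT()
        elif self.current == "(":
            self.match("(")
            self.E()
            self.OP()
            self.E()
            self.match(")")
        elif self.current == "not":
            self.match("not")
            self.E()
        else:
            raise ParseError(f"unexpected token {self.current!r}")

    def LIT(self):
        if self.current in ("true", "false"):
            self.match(self.current)
        else:
            raise ParseError(f"expected a literal, found {self.current!r}")

    def OP(self):
        if self.current in ("and", "or", "xor"):
            self.match(self.current)
        else:
            raise ParseError(f"expected an operator, found {self.current!r}")


def parse(text):
    """Return True if the whitespace-separated tokens form a sentence of E."""
    parser = Parser(text.split())
    try:
        parser.E()
        parser.match("$")  # the whole input must be consumed
        return True
    except ParseError:
        return False
```

For example, parse("( true and not false )") accepts, while parse("( true and )") rejects because no alternative of E starts with ")".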
FIRST sets
• For a nonterminal A, FIRST(A) is the set of terminals that can start a sentence derived from A
  – Formally: FIRST(A) = {t | A ⇒* t ω}
• For a sentential form α, FIRST(α) is the set of terminals that can start a sentence derived from α
  – Formally: FIRST(α) = {t | α ⇒* t ω}

FIRST sets example
E → LIT | (E OP E) | not E
LIT → true | false
OP → and | or | xor
• FIRST(E) = …?
• FIRST(LIT) = …?
• FIRST(OP) = …?

FIRST sets example (solution)
• FIRST(E) = FIRST(LIT) ∪ FIRST(( E OP E )) ∪ FIRST(not E)
• FIRST(LIT) = { true, false }
• FIRST(OP) = { and, or, xor }
• A set of recursive equations
• How do we solve them?

Computing FIRST sets
Assume no null productions (A → ε)
1. Initially, for all nonterminals A, set FIRST(A) = { t | A → t ω for some ω }
2. Repeat the following until no changes occur:
   for each nonterminal A
     for each production A → α1 | … | αk
       FIRST(A) = FIRST(α1) ∪ … ∪ FIRST(αk)
• This is known as a fixed-point algorithm
• We will see such iterative methods later in the course and learn to reason about them

Exercise: compute FIRST
STMT → if EXPR then STMT | while EXPR do STMT | EXPR ;
EXPR → TERM -> id | zero? TERM | not EXPR | ++ id | -- id
TERM → id | constant

1. Initialization
   FIRST(STMT) = { if, while }
   FIRST(EXPR) = { zero?, not, ++, -- }
   FIRST(TERM) = { id, constant }
2. Iterate 1
   FIRST(STMT) = { if, while, zero?, not, ++, -- }
   FIRST(EXPR) = { zero?, not, ++, -- }
   FIRST(TERM) = { id, constant }
2. Iterate 2
   FIRST(STMT) = { if, while, zero?, not, ++, -- }
   FIRST(EXPR) = { zero?, not, ++, --, id, constant }
   FIRST(TERM) = { id, constant }
2. Iterate 3 – fixed-point
   FIRST(STMT) = { if, while, zero?, not, ++, --, id, constant }
   FIRST(EXPR) = { zero?, not, ++, --, id, constant }
   FIRST(TERM) = { id, constant }

Reasoning about the algorithm
• Is the algorithm correct?
• Does it terminate? (complexity)
• Termination: every iteration can only add terminals to the FIRST sets, and there are finitely many terminals and nonterminals, so the sets must stop changing after finitely many iterations
• Correctness: the sets at the fixed point are the least solution of the FIRST equations; a terminal is added only when some derivation justifies it

LL(1) Parsing of grammars without epsilon productions

Using FIRST sets
• Assume G has no epsilon productions and for every non-terminal X and every pair of productions X → α and X → β we have that FIRST(α) ∩ FIRST(β) = {}
• No intersection between FIRST sets => can always pick a single rule

Using FIRST sets
• In our Boolean expressions example
  – FIRST( LIT ) = { true, false }
  – FIRST( ( E OP E ) ) = { '(' }
  – FIRST( not E ) = { not }
• If the FIRST sets intersect, may need longer lookahead
  – LL(k) = class of grammars in which the production rule can be determined using a lookahead of k tokens
  – LL(1) is an important and useful class
• What if there are epsilon productions?

Extending LL(1) Parsing for epsilon productions

FIRST, FOLLOW, NULLABLE sets
• For each non-terminal X
• FIRST(X) = set of terminals that can start a sentence derived from X
  – FIRST(X) = {t | X ⇒* t ω}
• NULLABLE(X) if X ⇒* ε
• FOLLOW(X) = set of terminals that can follow X in some derivation
  – FOLLOW(X) = {t | S ⇒* α X t β}

Computing the NULLABLE set
• Lemma: NULLABLE(α1 … αk) = NULLABLE(α1) ∧ … ∧ NULLABLE(αk)
1. Initially NULLABLE(X) = false
2. For each non-terminal X
   if there exists a production X → ε then NULLABLE(X) = true
3. Repeat
     for each production Y → α1 … αk
       if NULLABLE(α1 … αk) then NULLABLE(Y) = true
   until NULLABLE stabilizes

Exercise: compute NULLABLE
S → A a b
A → a | ε
B → A B | C
C → b | ε

NULLABLE(S) = NULLABLE(A) ∧ NULLABLE(a) ∧ NULLABLE(b)
NULLABLE(A) = NULLABLE(a) ∨ NULLABLE(ε)
NULLABLE(B) = (NULLABLE(A) ∧ NULLABLE(B)) ∨ NULLABLE(C)
NULLABLE(C) = NULLABLE(b) ∨ NULLABLE(ε)

FIRST with epsilon productions
• How do we compute FIRST(α1 … αk) when epsilon productions are allowed?
  – FIRST(α1 … αk) =
      if not NULLABLE(α1) then FIRST(α1)
      else FIRST(α1) ∪ FIRST(α2 … αk)

Exercise: compute FIRST
S → A c b
A → a | ε

NULLABLE(S) = NULLABLE(A) ∧ NULLABLE(c) ∧ NULLABLE(b)
NULLABLE(A) = NULLABLE(a) ∨ NULLABLE(ε)
FIRST(S) = FIRST(A) ∪ FIRST(cb)
FIRST(A) = FIRST(a) ∪ FIRST(ε)

FIRST(S) = FIRST(A) ∪ {c} = {a, c}
FIRST(A) = {a}

FOLLOW sets (p. 189)
• if X → α Y β then
    FOLLOW(Y) ⊇ FIRST(β)
    if NULLABLE(β) or β = ε then FOLLOW(Y) ⊇ FOLLOW(X)
• Allows predicting epsilon productions: X → ε when the lookahead token is in FOLLOW(X)

S → A c b
A → a | ε
What should we predict for input “cb”?
What should we predict for input “acb”?
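The NULLABLE, FIRST, and FOLLOW rules above can be combined into one fixed-point loop. The following is a minimal Python sketch under stated assumptions: the grammar encoding (a dict from nonterminal to a list of tuples, with the empty tuple standing for an epsilon production) and the function name compute_sets are illustrative choices, not part of the lecture.

```python
def compute_sets(grammar):
    """Compute NULLABLE, FIRST and FOLLOW by iterating until nothing changes.

    grammar maps each nonterminal to a list of right-hand sides; each
    right-hand side is a tuple of symbols, and () is an epsilon production.
    Any symbol that is not a key of grammar is treated as a terminal.
    """
    nonterminals = set(grammar)
    nullable = {n: False for n in grammar}
    first = {n: set() for n in grammar}
    follow = {n: set() for n in grammar}

    def nullable_seq(symbols):
        # Lemma from the slides: a sequence is nullable iff every symbol is.
        return all(s in nonterminals and nullable[s] for s in symbols)

    def first_of(symbols):
        # FIRST of a sequence, using the current approximations.
        result = set()
        for s in symbols:
            if s not in nonterminals:
                result.add(s)  # a terminal starts the sentence
                return result
            result |= first[s]
            if not nullable[s]:
                return result
        return result

    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                if nullable_seq(rhs) and not nullable[lhs]:
                    nullable[lhs] = True
                    changed = True
                new_first = first_of(rhs) - first[lhs]
                if new_first:
                    first[lhs] |= new_first
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym not in nonterminals:
                        continue
                    rest = rhs[i + 1:]
                    candidates = first_of(rest)            # FOLLOW(Y) ⊇ FIRST(β)
                    if nullable_seq(rest):                 # β nullable or empty
                        candidates = candidates | follow[lhs]
                    new_follow = candidates - follow[sym]
                    if new_follow:
                        follow[sym] |= new_follow
                        changed = True
    return nullable, first, follow


# The exercise grammar: S -> A c b, A -> a | epsilon
nullable, first, follow = compute_sets({
    "S": [("A", "c", "b")],
    "A": [("a",), ()],
})
```

On the exercise grammar this yields FIRST(S) = {a, c}, FIRST(A) = {a}, and FOLLOW(A) = {c}, so on lookahead c the parser predicts A → ε and on lookahead a it predicts A → a. No end-of-input marker is added here, so FOLLOW(S) stays empty.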
LL(k) grammars

Conflicts
• FIRST-FIRST conflict
  – X → α and X → β and
  – If FIRST(α) ∩ FIRST(β) ≠ {}
• FIRST-FOLLOW conflict
  – NULLABLE(X)
  – If FIRST(X) ∩ FOLLOW(X) ≠ {}

LL(1) grammars
• A grammar is in the class LL(1) when it can be parsed via:
  – Top-down derivation
  – Scanning the input from left to right (L)
  – Producing the leftmost derivation (L)
  – With lookahead of one token
  – For every two productions A → α and A → β we have FIRST(α) ∩ FIRST(β) = {} and if NULLABLE(A) then FIRST(A) ∩ FOLLOW(A) = {}
• A language is said to be LL(1) when it has an LL(1) grammar

LL(k) grammars
• Generalizes LL(1) for k lookahead tokens
• Need to generalize FIRST and FOLLOW for k lookahead tokens

Handling conflicts

Back to problem 1
term → ID | indexed_elem
indexed_elem → ID [ expr ]
• FIRST(term) = { ID }
• FIRST(indexed_elem) = { ID }
• FIRST-FIRST conflict

Solution: left factoring
• Rewrite the grammar to be in LL(1)
term → ID | indexed_elem
indexed_elem → ID [ expr ]
becomes
term → ID after_ID
after_ID → [ expr ] | ε
New grammar is more complex – has epsilon production
Intuition: just like factoring in algebra: x*y + x*z into x*(y+z)

Exercise: apply left factoring
S → if E then S else S | if E then S | T
becomes
S → if E then S S' | T
S' → else S | ε

Back to problem 2
S → A a b
A → a | ε
• FIRST(S) = { a }   FOLLOW(S) = { }
• FIRST(A) = { a }   FOLLOW(A) = { a }
• FIRST-FOLLOW conflict

Solution: substitution
S → A a b
A → a | ε
Substitute A in S:
S → a a b | a b
Left factoring:
S → a after_A
after_A → a b | b

Back to problem 3
E → E - term | term
• Left recursion cannot be handled with a bounded lookahead
• What can we do?

Left recursion removal (p. 130)
G1: N → Nα | β
G2: N → βN'
    N' → αN' | ε
• L(G1) = β, βα, βαα, βααα, …
• L(G2) = same
For our 3rd example:
E → E - term | term
becomes
E → term TE | term
TE → - term TE | ε
Can be done algorithmically.
Problem 1: grammar becomes mangled beyond recognition
Problem 2: grammar may not be LL(1)

Recap
• Given a grammar
• Compute for each non-terminal
  – NULLABLE
  – FIRST using NULLABLE
  – FOLLOW using FIRST and NULLABLE
• Compute FIRST for each sentential form appearing on the right-hand side of a production
• Check for conflicts
  – If any exist: attempt to remove conflicts by rewriting the grammar

LL(1) parsing: the automata approach
[Figure omitted. Credit: By MG (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons]

Marking “end-of-file”
• Sometimes it will be useful to transform a grammar G with start non-terminal S into a grammar G' with a new start non-terminal S' and a new production rule
    S' → S $
  where $ is not part of the set of tokens
• To parse an input α with G' we change it into α$
• Simplifies top-down parsing with null productions and LR parsing

Another convention
• We will assume that all productions have been consecutively numbered
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

LL(1) Parsers
• Recursive Descent
  – Manual construction (parsing combinators make this easier, but…)
  – Uses recursion
• Wanted
  – A parser that can be generated automatically
  – Does not use recursion

LL(1) parsing via pushdown automata
• Pushdown automaton uses
  – Input stream
  – Prediction stack
  – Parsing table
    • Nonterminal × token → production rule
    • Entry indexed by nonterminal N and token t contains the alternative of N that must be predicted when the current input starts with t
• Essentially, classic conversion from CFG to PDA
  – The only difference is that we replace nondeterministic choice with the parsing table

Model of non-recursive predictive parser
[Figure: the predictive parsing program reads the input (a + b $), uses a stack (X Y Z $) and the parsing table, and produces the output]

LL(1) parsing algorithm
• Set stack = S$
• While true
  – Prediction
    • When top of stack is nonterminal N
    • pop N, lookup table[N,t]
    • If table[N,t] is not empty, push table[N,t] on prediction stack
    • Otherwise: return syntax error
  – Match
    • When top of prediction stack is a terminal t, it must be equal to the next input token t'. If (t = t'), pop t and consume t'. If (t ≠ t'): return syntax error
  – End
    • When prediction stack is empty
    • If input is empty at that point: return success
    • Otherwise: return syntax error

Example transition table
(1) E → LIT
(2) E → ( E OP E )
(3) E → not E
(4) LIT → true
(5) LIT → false
(6) OP → and
(7) OP → or
(8) OP → xor

Rows are nonterminals, columns are input tokens, and each entry is the rule that should be used (for example, '(' ∈ FIRST(E)).

        (    )    not   true   false   and   or   xor   $
E       2         3     1      1
LIT                     4      5
OP                                     6     7    8

Running parser example
Grammar: A → aAb | c
Input: aacbb$

Input suffix   Stack content   Move
aacbb$         A$              predict(A,a) = A → aAb
aacbb$         aAb$            match(a,a)
acbb$          Ab$             predict(A,a) = A → aAb
acbb$          aAbb$           match(a,a)
cbb$           Abb$            predict(A,c) = A → c
cbb$           cbb$            match(c,c)
bb$            bb$             match(b,b)
b$             b$              match(b,b)
$              $               match($,$) – success

Parsing table:
     a           b     c
A    A → aAb           A → c

Illegal input example
Grammar: A → aAb | c
Input: abcbb$

Input suffix   Stack content   Move
abcbb$         A$              predict(A,a) = A → aAb
abcbb$         aAb$            match(a,a)
bcbb$          Ab$             predict(A,b) = ERROR

Parsing table:
     a           b     c
A    A → aAb           A → c

Creating the prediction table
• Let G be an LL(1) grammar
• Compute FIRST/NULLABLE/FOLLOW
• Check for conflicts
• For non-terminal N and token t predict: …

Top-down parsing summary
• Recursive descent
• LL(k) grammars
• LL(k) parsing with pushdown automata
• Cannot deal with left recursion
  – Left-recursion removal might result in a complicated grammar

Next lecture: Bottom-up parsing
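
As a supplement to the LL(1) parsing algorithm and the running parser example, the prediction/match loop can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's implementation: the function name ll1_parse, the ParseError class, and the encoding of the parsing table as a dict keyed by (nonterminal, token) are assumptions. The table is the one from the running example for A → aAb | c.

```python
class ParseError(Exception):
    """Raised when the parser reports a syntax error."""


def ll1_parse(table, start, tokens):
    """Run the table-driven LL(1) algorithm and return the list of moves.

    table maps (nonterminal, lookahead token) to the predicted right-hand
    side, given as a tuple of symbols. "$" is the end-of-input marker.
    """
    nonterminals = {n for (n, _) in table}
    stack = ["$", start]            # the top of the stack is the end of the list
    tokens = list(tokens) + ["$"]   # append the end-of-input marker
    pos = 0
    moves = []
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in nonterminals:
            # Prediction: replace the nonterminal by the table entry.
            rhs = table.get((top, lookahead))
            if rhs is None:
                raise ParseError(f"predict({top},{lookahead}) = ERROR")
            moves.append(f"predict({top},{lookahead}) = {top} -> {''.join(rhs)}")
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
        elif top == lookahead:
            # Match: pop the terminal and consume the input token.
            moves.append(f"match({top},{lookahead})")
            pos += 1
        else:
            raise ParseError(f"expected {top}, found {lookahead}")
    if pos != len(tokens):
        raise ParseError("input remains after the prediction stack emptied")
    return moves


# Parsing table for A -> aAb | c (from the running example).
TABLE = {
    ("A", "a"): ("a", "A", "b"),
    ("A", "c"): ("c",),
}

moves = ll1_parse(TABLE, "A", "aacbb")
```

Running it on "aacbb" reproduces the nine moves of the running example, and running it on "abcbb" raises ParseError at predict(A,b), as in the illegal input example. Using an explicit list as the prediction stack is what removes the recursion that recursive descent relies on.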