Compilation 0368-3133 Lecture 3: Syntax Analysis Top Down parsing Bottom Up parsing Noam Rinetzky 1 The Real Anatomy of a Compiler txt Source text Process text input characters Lexical Analysis tokens Syntax Analysis AST Sem. Analysis Annotated AST Intermediate code generation IR Intermediate code optimization IR Code generation Symbolic Instructions Target code optimization SI Machine code generation MI Write executable output exe Executable code 2 Context free grammars (CFG) G = (V,T,P,S) • V – non terminals (syntactic variables) • T – terminals (tokens) • P – derivation rules – Each rule of the form V (T ∪ V)* • S – start symbol 3 Pushdown Automata (PDA) • Nondeterministic PDAs define all CFLs • Deterministic PDAs model parsers. – Most programming languages have a deterministic PDA – Efficient implementation 4 CFG terminology SS;S S id := E E id E num EE+E Symbols: Terminals (tokens): ; := ( ) id num print Non-terminals: S E L Start non-terminal: S Convention: the non-terminal appearing in the first derivation rule Grammar productions (rules) Nμ 5 CFG terminology • Derivation - a sequence of replacements of non-terminals using the derivation rules • Language - the set of strings of terminals derivable from the start symbol • Sentential form - the result of a partial derivation – May contain non-terminals 6 Derivations • Show that a sentence ω is in a grammar G – Start with the start symbol – Repeatedly replace one of the non-terminals by a right-hand side of a production – Stop when the sentence contains only terminals • Given a sentence αNβ and rule Nµ αNβ => αµβ • ω is in L(G) if S =>* ω 7 Parse trees: Traces of Derivations • Tree nodes are symbols, children ordered left-to-right • Each internal node is non-terminal – Children correspond to one of its productions N N µ1 … µk µ1 … µk • Root is start non-terminal • Leaves are tokens • Yield of parse tree: left-to-right walk over leaves 8 Parse Tree S S S ; S id := E ; S S id := id ; S id := id ;id := E S E id := E id := id ;id := E + E id := id ;id := E + id E id := id ;id := id + id x:= z ; y := x E + z id := id ; id + id 9 Ambiguity S S ; S S id := E | … E id | E + E | E * E | … x := y+z*w S id S := E E + E id E * id id := E * E E E id id + E E id id 10 Leftmost/rightmost Derivation • Leftmost derivation – always expand leftmost non-terminal • Rightmost derivation – Always expand rightmost non-terminal • Allows us to describe derivation by listing the sequence of rules – always know what a rule is applied to 11 Broad kinds of parsers • Parsers for arbitrary grammars – Earley’s method, CYK method – Usually, not used in practice (though might change) • Top-down parsers – Construct parse tree in a top-down matter – Find the leftmost derivation • Bottom-up parsers – Construct parse tree in a bottom-up manner – Find the rightmost derivation in a reverse order 12 Top-Down Parsing: Predictive parsing • Recursive descent • LL(k) grammars 13 Predictive parsing • Given a grammar G and a word w attempt to derive w using G • Idea – Apply production to leftmost nonterminal – Pick production rule based on next input token • General grammar – More than one option for choosing the next production based on a token • Restricted grammars (LL) – Know exactly which single rule to apply – May require some lookahead to decide 14 Recursive descent parsing • Define a function for every nonterminal • Every function work as follows – Find applicable production rule – Terminal function checks match with next input token – Nonterminal function calls (recursively) other functions • If there are several applicable productions for a nonterminal, use lookahead 15 Recursive descent void A() { choose an A-production, A X1X2…Xk; for (i=1; i≤ k; i++) { if (Xi is a nonterminal) call procedure Xi(); elseif (Xi == current) advance input; else report error; } } • How do you pick the right A-production? • Generally – try them all and use backtracking • In our case – use lookahead 16 Problem 1: productions with common prefix term ID | indexed_elem indexed_elem ID [ expr ] • The function for indexed_elem will never be tried… – What happens for input of the form ID[expr] 17 Problem 2: null productions SAab Aa| int S() { return A() && match(token(‘a’)) && match(token(‘b’)); } int A() { return match(token(‘a’)) || 1; } What happens for input “ab”? What happens if you flip order of alternatives and try “aab”? 18 Problem 3: left recursion p. 127 E E - term | term int E() { return E() && match(token(‘-’)) && term(); } What happens with this procedure? Recursive descent parsers cannot handle left-recursive grammars 19 Constructing an LL(1) Parser 20 FIRST sets X YY | Z Z | Y Z | 1 Y Y4|ℇ Z2 L(Z) = {2} L(Y) = {4, ℇ} L(Y) = {44, 4, ℇ, 22, 42, 2, 14, 1} 21 FIRST sets X YY | Z Z | Y Z | 1 Y Y4|ℇ Z2 L(Z) = {2} L(Y) = {4, ℇ} L(Y) = {44, 4, ℇ, 22, 42, 2, 14, 1} 22 FIRST sets • FIRST(X) = { t | X * t β} ∪{ℇ | X * ℇ} – FIRST(X) = all terminals that α can appear as first in some derivation for X • + ℇ if can be derived from X • Example: – FIRST( LIT ) = { true, false } – FIRST( ( E OP E ) ) = { ‘(‘ } – FIRST( not E ) = { not } E LIT | (E OP E) | not E LIT true | false OP and | or | xor 23 Computing FIRST sets • FIRST (t) = { t } // “t” non terminal • ℇ∈ FIRST(X) if – X ℇ or – X A1 .. Ak and ℇ∈ FIRST(Ai) i=1…k • FIRST(α) ⊆ FIRST(X) if – X A1 .. Ak α and ℇ∈ FIRST(Ai) i=1…k 25 Computing FIRST sets • Assume no null productions A 1. 2. Initially, for all nonterminals A, set FIRST(A) = { t | A tω for some ω } Repeat the following until no changes occur: for each nonterminal A for each production A Bω set FIRST(A) = FIRST(A) ∪ FIRST(B) • This is known as fixed-point computation 26 FIRST sets computation example STMT if EXPR then STMT | while EXPR do STMT | EXPR ; EXPR TERM -> id | zero? TERM | not EXPR | ++ id | -- id TERM id | constant STMT EXPR TERM 27 1. Initialization STMT if EXPR then STMT | while EXPR do STMT | EXPR ; EXPR TERM -> id | zero? TERM | not EXPR | ++ id | -- id TERM id | constant STMT EXPR TERM if while zero? Not ++ -- id constant 28 2. Iterate 1 STMT if EXPR then STMT | while EXPR do STMT | EXPR ; EXPR TERM -> id | zero? TERM | not EXPR | ++ id | -- id TERM id | constant STMT EXPR TERM if while zero? Not ++ -- id constant zero? Not ++ -- 29 2. Iterate 2 STMT if EXPR then STMT | while EXPR do STMT | EXPR ; EXPR TERM -> id | zero? TERM | not EXPR | ++ id | -- id TERM id | constant STMT EXPR TERM if while zero? Not ++ -- id constant zero? Not ++ -- id constant 30 2. Iterate 3 – fixed-point STMT if EXPR then STMT | while EXPR do STMT | EXPR ; EXPR TERM -> id | zero? TERM | not EXPR | ++ id | -- id TERM id | constant STMT EXPR TERM if while zero? Not ++ -- id constant zero? Not ++ -- id constant id constant 31 FOLLOW sets p. 189 • What do we do with nullable () productions? – ABCD BC – Use what comes afterwards to predict the right production • For every production rule A α – FOLLOW(A) = set of tokens that can immediately follow A • Can predict the alternative Ak for a non-terminal N when the lookahead token is in the set – FIRST(Ak) (if Ak is nullable then FOLLOW(N)) 32 FOLLOW sets: Constraints • $ ∈ FOLLOW(S) • FIRST(β) – {ℇ} ⊆ FOLLOW(X) – For each A α X β • FOLLOW(A) ⊆ FOLLOW(X) – For each A α X β and ℇ ∈ FIRST(β) 33 Example: FOLLOW sets • E TX • T (E) | int Y X+ E | ℇ Y*T|ℇ Terminal + ( * ) int FOLLOW int, ( int, ( int, ( _, ), $ *, ), +, $ Non. E Term. FOLLOW ), $ T X Y +, ), $ $, ) _, ), $ 34 Prediction Table • Aα • T[A,t] = α if t ∈FIRST(α) • T[A,t] = α if ℇ ∈ FIRST(α) and t ∈ FOLLOW(A) – t can also be $ • T is not well defined the grammar is not LL(1) 35 LL(k) grammars • A grammar is in the class LL(K) when it can be derived via: – – – – Top-down derivation Scanning the input from left to right (L) Producing the leftmost derivation (L) With lookahead of k tokens (k) • A language is said to be LL(k) when it has an LL(k) grammar 36 LL(1) grammars • A grammar is in the class LL(K) iff – For every two productions A α and A β we have • FIRST(α) ∩ FIRST(β) = {} // including • If ∈ FIRST(α) then FIRST(β) ∩ FOLLOW(A) = {} • If ∈ FIRST(β) then FIRST(α) ∩ FOLLOW(A) = {} 37 Problem: Non LL Grammars SAab Aa| bool S() { return A() && match(token(‘a’)) && match(token(‘b’)); } bool A() { return match(token(‘a’)) || true; } What happens for input “ab”? What happens if you flip order of alternatives and try “aab”? 38 Problem: Non LL Grammars SAab Aa| • FIRST(S) = { a } • FIRST(A) = { a, } FOLLOW(S) = { $ } FOLLOW(A) = { a } • FIRST/FOLLOW conflict 39 Back to problem 1 term ID | indexed_elem indexed_elem ID [ expr ] • FIRST(term) = { ID } • FIRST(indexed_elem) = { ID } • FIRST/FIRST conflict 40 Solution: left factoring • Rewrite the grammar to be in LL(1) term ID | indexed_elem indexed_elem ID [ expr ] term ID after_ID After_ID [ expr ] | Intuition: just like factoring x*y + x*z into x*(y+z) 41 Left factoring – another example S if E then S else S | if E then S |T S if E then S S’ |T S’ else S | 42 Back to problem 2 SAab Aa| • FIRST(S) = { a } • FIRST(A) = { a , } FOLLOW(S) = { } FOLLOW(A) = { a } • FIRST/FOLLOW conflict 43 Solution: substitution SAab Aa| Substitute A in S Saab|ab Left factoring S a after_A after_A a b | b 44 Back to problem 3 E E - term | term • Left recursion cannot be handled with a bounded lookahead • What can we do? 45 Left recursion removal N Nα | β p. 130 N βN’ N’ αN’ | G1 • L(G1) = β, βα, βαα, βααα, … • L(G2) = same G2 Can be done algorithmically. Problem: grammar becomes mangled beyond recognition For our 3rd example: E E - term | term E term TE | term TE - term TE | 46 LL(k) Parsers • Recursive Descent – Manual construction – Uses recursion • Wanted – A parser that can be generated automatically – Does not use recursion 47 LL(k) parsing via pushdown automata • Pushdown automaton uses – Prediction stack – Input stream – Transition table • nonterminals x tokens -> production alternative • Entry indexed by nonterminal N and token t contains the alternative of N that must be predicated when current input starts with t 48 LL(k) parsing via pushdown automata • Two possible moves – Prediction • When top of stack is nonterminal N, pop N, lookup table[N,t]. If table[N,t] is not empty, push table[N,t] on prediction stack, otherwise – syntax error – Match • When top of prediction stack is a terminal T, must be equal to next input token t. If (t == T), pop T and consume t. If (t ≠ T) syntax error • Parsing terminates when prediction stack is empty – If input is empty at that point, success. Otherwise, syntax error 49 Example transition table Nonterminals (1) E → LIT (2) E → ( E OP E ) (3) E → not E (4) LIT → true (5) LIT → false (6) OP → and (7) OP → or (8) OP → xor ( E LIT OP 2 ) Which rule should be used Input tokens not true false 3 1 1 4 5 and or xor 6 7 8 $ 50 Model of non-recursive predictive parser a Stack X Y + b Predictive Parsing program $ Output Z $ Parsing Table 51 Running parser example A aAb | c aacbb$ Input suffix Stack content Move aacbb$ A$ predict(A,a) = A aAb aacbb$ aAb$ match(a,a) acbb$ Ab$ predict(A,a) = A aAb acbb$ aAbb$ match(a,a) cbb$ Abb$ predict(A,c) = A c cbb$ cbb$ match(c,c) bb$ bb$ match(b,b) b$ b$ match(b,b) $ $ match($,$) – success a A A aAb b c Ac 52 Erorrs 53 Handling Syntax Errors • • • • Report and locate the error Diagnose the error Correct the error Recover from the error in order to discover more errors – without reporting too many “strange” errors 54 Error Diagnosis • Line number – may be far from the actual error • The current token • The expected tokens • Parser configuration 55 Error Recovery • Becomes less important in interactive environments • Example heuristics: – Search for a semi-column and ignore the statement – Try to “replace” tokens for common errors – Refrain from reporting 3 subsequent errors • Globally optimal solutions – For every input w, find a valid program w’ with a “minimal-distance” from w 56 Illegal input example A aAb | c abcbb$ Input suffix Stack content Move abcbb$ A$ predict(A,a) = A aAb abcbb$ aAb$ match(a,a) bcbb$ Ab$ predict(A,b) = ERROR a A A aAb b c Ac 57 Error handling in LL parsers S a c | b S c$ Input suffix Stack content Move c$ S$ predict(S,c) = ERROR • Now what? – Predict b S anyway “missing token b inserted in line XXX” S a b Sac SbS c 58 Error handling in LL parsers S a c | b S c$ Input suffix Stack content Move bc$ S$ predict(b,c) = S bS bc$ bS$ match(b,b) c$ S$ Looks familiar? • Result: infinite loop S a b Sac SbS c 59 Error handling and recovery • x = a * (p+q * ( -b * (r-s); • Where should we report the error? • The valid prefix property 60 The Valid Prefix Property • For every prefix tokens – t1, t2, …, ti that the parser identifies as legal: • there exists tokens ti+1, ti+2, …, tn such that t1, t2, …, tn is a syntactically valid program • If every token is considered as single character: – For every prefix word u that the parser identifies as legal there exists w such that u.w is a valid program 61 Recovery is tricky • Heuristics for dropping tokens, skipping to semicolon, etc. 62 Building the Parse Tree 63 Adding semantic actions • Can add an action to perform on each production rule • Can build the parse tree – Every function returns an object of type Node – Every Node maintains a list of children – Function calls can add new children 64 Building the parse tree Node E() { result = new Node(); result.name = “E”; if (current {TRUE, FALSE}) // E LIT result.addChild(LIT()); else if (current == LPAREN) // E ( E OP E ) result.addChild(match(LPAREN)); result.addChild(E()); result.addChild(OP()); result.addChild(E()); result.addChild(match(RPAREN)); else if (current == NOT) // E not E result.addChild(match(NOT)); result.addChild(E()); else error; return result; } 65 Bottom-up parsing 67 Intuition: Bottom-Up Parsing • Begin with the user's program • Guess parse (sub)trees • Check if root is the start symbol 68 Bottom-up parsing Unambiguous grammar EE*T ET TT+F TF F id F num F(E) 1 + 2 * 3 69 Bottom-up parsing Unambiguous grammar EE*T ET TT+F TF F id F num F(E) F 1 + 2 * 3 70 Unambiguous grammar EE*T ET TT+F TF F id F num F(E) Bottom-up parsing T T F 1 F + 2 F * 3 71 Top-Down vs Bottom-Up • Top-down (predict match/scan-complete ) already read… to be read… A aAb|c aacbb$ A a A b a A b c 72 Top-Down vs Bottom-Up • Top-down (predict match/scan-complete ) already read… to be read… A aAb|c aacbb$ A a A b a A b c Bottom-up (shift reduce) A a A b a A b c 73 Bottom-up parsing: LR(k) Grammars • A grammar is in the class LR(K) when it can be derived via: – – – – Bottom-up derivation Scanning the input from left to right (L) Producing the rightmost derivation (R) With lookahead of k tokens (k) • A language is said to be LR(k) if it has an LR(k) grammar • The simplest case is LR(0), which we will discuss 74 Terminology: Reductions & Handles • The opposite of derivation is called reduction – Let A α be a production rule – Derivation: βAµ βαµ – Reduction: βαµ βAµ • A handle is the reduced substring – α is the handles for βαµ 75 Goal: Reduce the Input to the Start Symbol E→E*B|E+B|B B→0|1 Example: 0+0*1 B+0*1 E+0*1 E+B*1 E*1 E*B E E * B E E + B B 1 0 0 Go over the input so far, and upon seeing a right-hand side of a rule, “invoke” the rule and replace the right-hand side with the left-hand side (reduce) 76 Use Shift & Reduce In each stage, we shift a symbol from the input to the stack, or reduce according to one of the rules. 77 Use Shift & Reduce Example: “0+0*1” In each stage, we shift a symbol from the input to the stack, or E → E * B | E + B | B B→0|1 reduce according to one of the rules. Stack E * E E B 0 + B 0 B 1 0 B E E+ E+0 E+B E E* E*1 E*B E Input 0+0*1$ +0*1$ +0*1$ +0*1$ 0*1$ *1$ *1$ *1$ 1$ $ $ $ action shift reduce reduce shift shift reduce reduce shift shift reduce reduce accept 78 How does the parser know what to do? token stream ( ( 23 + 7 ) * x ) LP LP Num OP Num RP OP Id RP Input Action Table Parser Stack Output Goto table Op(*) Op(+) Id(b) 79 Num(23) Num(7) How does the parser know what to do? • A state will keep the info gathered on handle(s) – A state in the “control” of the PDA – Also (part of) the stack alpha beit Set of LR(0) items • A table will tell it “what to do” based on current state and next token – The transition function of the PDA • A stack will records the “nesting level” – Prefixes of handles 80 LR item Already matched To be matched Input N αβ Hypothesis about αβ being a possible handle, so far we’ve matched α, expecting to see β 81 Example: LR(0) Items • All items can be obtained by placing a dot at every position for every production: Grammar (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) LR(0) items 1: S E$ 2: S E $ 3: S E $ 4: E T 5: E T 6: E E + T 7: E E + T 8: E E + T 9: E E + T 10: T i 11: T i 12: T (E) 13: T ( E) 14: T (E ) 15: T (E) 82 LR(0) items N αβ Shift Item N αβ Reduce Item 83 States and LR(0) Items E→E*B|E+B|B B→0|1 • The state will “remember” the potential derivation rules given the part that was already identified • For example, if we have already identified E then the state will remember the two alternatives: (1) E → E * B, (2) E → E + B • Actually, we will also remember where we are in each of them: (1) E → E ● * B, (2) E → E ● + B • A derivation rule with a location marker is called LR(0) item. • The state is actually a set of LR(0) items. E.g., q13 = { E → E ● * B , E → E ● + B} 84 Intuition • Gather input token by token until we find a right-hand side of a rule and then replace it with the non-terminal on the left hand side – Going over a token and remembering it in the stack is a shift • Each shift moves to a state that remembers what we’ve seen so far – A reduce replaces a string in the stack with the non-terminal that derives it 85 Model of an LR parser Input Stack state 0 symbol T id + id + id LR Parsing program $ Output 2 Terminals and Non-terminals + 7 action goto id 5 86 LR parser stack • Sequence made of state, symbol pairs • For instance a possible stack for the grammar SE$ ET EE+T T id T(E) could be: 0 T 2 + 7 id 5 Stack grows this way 87 Form of LR parsing table state 0 1 . . . non-terminals terminals Shift/Reduce actions Goto part acc rk sn shift state n gm error reduce by rule k accept goto state m 88 LR parser table example STATE action id 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) 2 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 89 Shift move Input Stack q . . . … a … LR Parsing program action $ Output goto • If action[q, a] = sn 90 Result of shift Input Stack n a … a … LR Parsing program $ Output q . . . action goto • If action[q, a] = sn 91 Reduce move Input Stack qn 2*|β| … q … a … LR Parsing program action $ Output goto … • If action[qn, a] = rk • Production: (k) A β • If β= σ1… σn Top of stack looks like q1 σ1… qn σn • goto[q, A] = qm 92 Result of reduce move Input Stack 2*|β| qm … a … LR Parsing program $ Output A q action goto … • If action[qn, a] = rk • Production: (k) A β • If β= σ1… σn Top of stack looks like q1 σ1… qn σn • goto[q, A] = qm 93 Accept move Input … Stack a $ LR Parsing program q . . . action Output goto If action[q, a] = accept parsing completed 94 Error move Input Stack q . . . … a … LR Parsing program action $ Output goto If action[q, a] = error (usually empty) parsing discovered a syntactic error 95 Example Z E $ E T | E + T T i | ( E ) 96 Example: parsing with LR items Z E $ E T | E + T T i | ( E ) Z E E T T E $ T E + T i ( E ) i + i $ Why do we need these additional LR items? Where do they come from? What do they mean? 97 -closure • Given a set S of LR(0) items • If P αNβ is in S • then for each rule N α in the grammar S must also contain N α -closure({Z E $}) = { Z E E Z E $ E T | E + T T T i | ( E ) T E $, T, E + T, i , ( E ) } 98 Example: parsing with LR items i + i $ Remember position from which we’re trying to reduce ZE$ ET|E+T Ti|(E) Items denote possible future handles Z E $ E T E E + T T i T ( E ) 99 Example: parsing with LR items i + i $ ZE$ ET|E+T Ti|(E) Match items with current token Z E $ E T E E + T T i T ( E ) T i Reduce item! 100 Example: parsing with LR items T + i $ ZE$ ET|E+T Ti|(E) i Z E $ E T E E + T T i T ( E ) E T Reduce item! 101 Example: parsing with LR items E + i $ ZE$ ET|E+T Ti|(E) T i Z E $ E T E E + T T i T ( E ) E T Reduce item! 102 Example: parsing with LR items E + i $ ZE$ ET|E+T Ti|(E) T i Z E $ E T E E + T T i T ( E ) Z E$ E E+ T 103 Example: parsing with LR items E + i $ ZE$ ET|E+T Ti|(E) T i Z E $ E T E E + T T i T ( E ) Z E$ E E+ T E E+T T i T ( E ) 104 Example: parsing with LR items E + T T $ i ZE$ ET|E+T Ti|(E) i Z E $ E T E E + T T i T ( E ) Z E$ E E+ T E E+T T i T ( E ) 105 Example: parsing with LR items E T + T $ ZE$ ET|E+T Ti|(E) i i Reduce item! Z E $ E T E E + T T i T ( E ) Z E$ E E+ T E E+T E E+T T i T ( E ) 106 Example: parsing with LR items E ZE$ ET|E+T Ti|(E) $ E+T T i i Z E $ E T E E + T T i T ( E ) Z E$ E E+ T 107 Example: parsing with LR items E ZE$ ET|E+T Ti|(E) $ E+T T i Reduce item! i Z E $ E T E E + T T i T ( E ) Z E$ Z E$ E E+ T 108 Example: parsing with LR items Z ZE$ ET|E+T Ti|(E) E E $ E+T T Z E $ E T E E + T T i T ( E ) i Reduce item! i Z E$ Z E$ E E+ T 109 GOTO/ACTION tables empty – error move GOTO Table State i q0 q5 q1 + ( ) $ q7 q3 E T action q1 q6 shift q2 shift q2 q3 ACTION Table Z E$ q5 q7 q4 Shift q4 E E+T q5 T i q6 E T q7 q8 q9 q5 q7 q3 q8 q9 q6 shift shift T E 110 LR(0) parser tables • Two types of rows: – Shift row – tells which state to GOTO for current token – Reduce row – tells which rule to reduce (independent of current token) • GOTO entries are blank 111 LR parser data structures • Input – remainder of text to be processed • Stack – sequence of pairs N, qi – N – symbol (terminal or non-terminal) – qi – state at which decisions are made + input stack q0 i i $ q5 • Initial stack contains q0 112 LR(0) pushdown automaton • Two moves: shift and reduce • Shift move – – – – – Remove first token from input Push it on the stack Compute next state based on GOTO table Push new state on the stack If new state is error – report error input i stack q0 + i shift $ stack State i q0 q5 + ( q7 ) $ + input E T action q1 q6 shift q0 i i $ q5 113 LR(0) pushdown automaton • Reduce move – Using a rule N α – Symbols in α and their following states are removed from stack – New state computed based on GOTO table (using top of stack, before pushing N) – N is pushed on the stack – New state pushed on top of N + input stack q0 i i Reduce Ti $ q5 + input q0 stack State i q0 q5 + ( q7 ) $ E T action q1 q6 shift T i $ q6 114 GOTO/ACTION table State i q0 s5 q1 + ( ) $ s7 s3 T s1 s6 r1 r1 s2 q2 r1 q3 s5 q4 r3 r3 r3 r3 r3 r3 r3 q5 r4 r4 r4 r4 r4 r4 r4 q6 r2 r2 r2 r2 r2 r2 r2 q7 s5 s8 s6 r5 r5 q8 q9 (1) Z (2) E (3) E (4) T (5) T r1 E r1 E $ T E + T i ( E ) r1 s7 s4 s7 s3 r5 r1 r5 s9 r5 r5 r5 Warning: numbers mean different things! rn = reduce using rule number n sm = shift to state m 115 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 Input Action S action id id + id $ s5 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 Initialize with state 0 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 116 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 Input Action S action id id + id $ s5 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 Initialize with state 0 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 117 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 0 id 5 Input Action S id id + id $ s5 + id $ r4 action 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 118 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 Input Action S id id + id $ s5 0 id 5 + id $ r4 action 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 pop id 5 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 119 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 Input Action S id id + id $ s5 0 id 5 + id $ r4 action 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 push T 6 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 120 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 Input Action S id id + id $ s5 0 id 5 + id $ r4 0T6 + id $ r2 action 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 121 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 Input Action S id id + id $ s5 0 id 5 + id $ r4 0T6 + id $ r2 0E1 + id $ s3 action 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 122 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 Input Action + id $ r4 0T6 + id $ r2 0E1 + id $ s3 id $ s5 action id id + id $ s5 0 id 5 0E1+3 S 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 123 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 Input Action + id $ r4 0T6 + id $ r2 0E1 + id $ s3 0 E 1 + 3 id 5 id $ s5 $ r4 action id id + id $ s5 0 id 5 0E1+3 S 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 124 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 Input Action + id $ r4 0T6 + id $ r2 0E1 + id $ s3 id $ s5 0 E 1 + 3 id 5 $ r4 0E1+3T4 $ r3 action id id + id $ s5 0 id 5 0E1+3 S 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 6 r2 r2 r2 r2 r2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 125 (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Parsing id+id$ Stack 0 Input Action + id $ r4 0T6 + id $ r2 0E1 + id $ s3 id $ s5 action id id + id $ s5 0 id 5 0E1+3 S 0 + s5 1 ( goto ) $ s7 s3 E T g1 g6 acc 2 3 s5 4 r3 r3 r3 r3 r3 5 r4 r4 r4 r4 r4 r2 r2 r2 r2 0 E 1 + 3 id 5 $ r4 0E1+3T4 $ r3 6 r2 0E1 $ s2 7 s5 8 9 s7 s7 s3 r5 g4 r5 g8 g6 s9 r5 r5 r5 126 Constructing an LR parsing table • Construct a (determinized) transition diagram from LR items • If there are conflicts – stop • Fill table entries from diagram 127 LR item Already matched To be matched Input N αβ Hypothesis about αβ being a possible handle, so far we’ve matched α, expecting to see β 128 Types of LR(0) items N αβ Shift Item N αβ Reduce Item 129 LR(0) automaton example shift state E T T q0 Z E$ E T E E + T T i T (E) Z E$ E E+ T $ q5 i q7 T (E) E T E E + T T i T (E) i T i + q3 q4 E E+T T i T (E) T E ( i q2 Z E$ T ( E q1 reduce state q6 + T (E) E E+T ) ( q8 q9 T (E) E E + T 130 Computing item sets • Initial set – Z is in the start symbol – -closure({ Z α | Z α is in the grammar } ) • Next set from a set S and the next symbol X – step(S,X) = { N αXβ | N αXβ in the item set S} – nextSet(S,X) = -closure(step(S,X)) 131 Operations for transition diagram construction • Initial = {S’ S$} • For an item set I Closure(I) = Closure(I) ∪ {X µ is in grammar| N αXβ in I} • Goto(I, X) = { N αXβ | N αXβ in I} 132 Initial example • Initial = {S E $} Grammar (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) 133 Closure example • Initial = {S E $} • Closure({S E $}) = { S E $ E T E E + T T id T ( E ) } Grammar (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) 134 Goto example Grammar (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) • Initial = {S E $} • Closure({S E $}) = { S E $ E T E E + T T id T ( E ) } • Goto({S E $ , E E + T, T id}, E) = {S E $, E E + T} 135 Constructing the transition diagram • Start with state 0 containing item Closure({S E $}) • Repeat until no new states are discovered – For every state p containing item set Ip, and symbol N, compute state q containing item set Iq = Closure(goto(Ip, N)) 136 LR(0) automaton example shift state E T T q0 Z E$ E T E E + T T i T (E) Z E$ E E+ T $ q5 i q7 T (E) E T E E + T T i T (E) i T i + q3 q4 E E+T T i T (E) T E ( i q2 Z E$ T ( E q1 reduce state q6 + T (E) E E+T ) ( q8 q9 T (E) E E + T 137 q0 Automaton construction example S E$ (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) Initialize 138 Automaton construction example (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) q0 S E$ E T E E + T T i T (E) apply Closure 139 Automaton construction example (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) q6 q0 E T T S E$ E T E E + T T i T (E) ( q5 i T i T (E) E T E E + T T i T (E) E q1 S E$ E E+ T 140 Automaton construction example (1) S E $ (2) E T (3) E E + T (4) T id (5) T ( E ) q6 q0 S E$ E T E E + T T i T (E) T E T ( non-terminal transition corresponds to goto action in parse table i q1 terminal Z transition E$ corresponds to shift E E+ T action in parse table i T i q3 q4 T E E + T E ( i E E+T T i T (E) + $ S E$ T (E) E T E E + T T i T (E) q5 E q2 q7 T + T (E) E E+T ) ( q8 q9 T (E) a single reduce item corresponds to reduce action 141 Are we done? • Can make a transition diagram for any grammar • Can make a GOTO table for every grammar • Cannot make a deterministic ACTION table for every grammar 142 LR(0) conflicts T q0 Z E$ E T E E + T T i T (E) T i[E] … ( i E … … q5 T i T i[E] Shift/reduce conflict ZE$ ET EE+T Ti T(E) T i[E] 143 LR(0) conflicts T q0 Z E$ E T E E + T T i T (E) T i[E] … ( i E … … q5 T i V i reduce/reduce conflict ZE$ ET EE+T Ti Vi T(E) 144 LR(0) conflicts • Any grammar with an -rule cannot be LR(0) • Inherent shift/reduce conflict – A – reduce item – P αAβ – shift item – A can always be predicted from P αAβ 145 Conflicts • Can construct a diagram for every grammar but some may introduce conflicts • shift-reduce conflict: an item set contains at least one shift item and one reduce item • reduce-reduce conflict: an item set contains two reduce items 146 LR variants • LR(0) – what we’ve seen so far • SLR(0) – Removes infeasible reduce actions via FOLLOW set reasoning • LR(1) – LR(0) with one lookahead token in items • LALR(0) – LR(1) with merging of states with same LR(0) component 147 LR (0) GOTO/ACTIONS tables GOTO table is indexed by state and a grammar symbol from the stack State i q0 q5 q1 ACTION Table GOTO Table + ( ) $ q7 q3 E T action q1 q6 shift q2 shift q2 q3 Z E$ q5 q7 q4 Shift q4 E E+T q5 T i q6 E T q7 q8 q5 q7 q3 q8 q9 q9 q6 shift shift T E ACTION table determined only by state, ignores input 148 SLR parsing • A handle should not be reduced to a non-terminal N if the lookahead is a token that cannot follow N • A reduce item N α is applicable only when the lookahead is in FOLLOW(N) – If b is not in FOLLOW(N) we just proved there is no derivation S * βNb. – Thus, it is safe to remove the reduce item from the conflicted state • Differs from LR(0) only on the ACTION table – Now a row in the parsing table may contain both shift actions and reduce actions and we need to consult the current token to decide which one to take 149 Lookahead token from the input SLR action table State i 0 shift 1 + ( ) [ ] $ state shift shift accept 2 3 q0 shift q1 shift q2 shift shift 4 E E+T E E+T 5 T i T i 6 E T E T 7 action shift E E+T shift T i E T shift 8 shift shift 9 T (E) T (E) SLR – use 1 token look-ahead … as before… T i T i[E] T (E) vs. q3 shift q4 E E+T q5 T i q6 E T q7 shift q8 shift q9 T E LR(0) – no look-ahead 150 LR(1) grammars • In SLR: a reduce item N α is applicable only when the lookahead is in FOLLOW(N) • But FOLLOW(N) merges lookahead for all alternatives for N – Insensitive to the context of a given production • LR(1) keeps lookahead with each LR item • Idea: a more refined notion of follows computed per item 151 LR(1) items • LR(1) item is a pair – LR(0) item – Lookahead token • Meaning – We matched the part left of the dot, looking to match the part on the right of the dot, followed by the lookahead token • Example – The production L id yields the following LR(1) items LR(1) items [L → ● id, *] (0) S’ → S LR(0) items [L → ● id, =] (1) S → L = R [L → ● id, id] [L → ● id] (2) S → R [L → ● id, $] [L → id ●] (3) L → * R [L → id ●, *] (4) L → id [L → id ●, =] (5) R → L [L → id ●, id] [L → id ●, $] 152 LALR(1) • LR(1) tables have huge number of entries • Often don’t need such refined observation (and cost) • Idea: find states with the same LR(0) component and merge their lookaheads component as long as there are no conflicts • LALR(1) not as powerful as LR(1) in theory but works quite well in practice – Merging may not introduce new shift-reduce conflicts, only reduce-reduce, which is unlikely in practice 153 Summary 154 LR is More Powerful than LL • Any LL(k) language is also in LR(k), i.e., LL(k) ⊂ LR(k). – LR is more popular in automatic tools • But less intuitive • Also, the lookahead is counted differently in the two cases – In an LL(k) derivation the algorithm sees the left-hand side of the rule + k input tokens and then must select the derivation rule – In LR(k), the algorithm “sees” all right-hand side of the derivation rule + k input tokens and then reduces • LR(0) sees the entire right-side, but no input token 155 Grammar Hierarchy Non-ambiguous CFG LR(1) LALR(1) LL(1) SLR(1) LR(0) 156 Earley Parsing Jay Earley, PhD 157 Earley Parsing • Invented by Jay Earley [PhD. 1968] • Handles arbitrary context free grammars – Can handle ambiguous grammars • Complexity O(N3) when N = |input| • Uses dynamic programming – Compactly encodes ambiguity 158 Dynamic programming • Break a problem P into subproblems P1…Pk – Solve P by combining solutions for P1…Pk – Memo-ize (store) solutions to subproblems instead of re-computation • Bellman-Ford shortest path algorithm – Sol(x,y,i) = minimum of • Sol(x,y,i-1) • Sol(t,y,i-1) + weight(x,t) for edges (x,t) 159 Earley Parsing • Dynamic programming implementation of a recursive descent parser – S[N+1] Sequence of sets of “Earley states” • N = |INPUT| • Earley state (item) s is a sentential form + aux info – S[i] All parse tree that can be produced (by a RDP) after reading the first i tokens • S[i+1] built using S[0] … S[i] 160 Earley Parsing • Parse arbitrary grammars in O(|input|3) – O(|input|2) for unambigous grammer – Linear for most LR(k) langaues (next lesson) • Dynamic programming implementation of a recursive descent parser – S[N+1] Sequence of sets of “Earley states” • N = |INPUT| • Earley states is a sentential form + aux info – S[i] All parse tree that can be produced (by an RDP) after reading the first i tokens • S[i+1] built using S[0] … S[i] 161 Earley States • s = < constituent, back > – constituent (dotted rule) for Aαβ A•αβ predicated constituents Aα•β in-progress constituents Aαβ• completed constituents – back previous Early state in derivation 162 Earley States • s = < constituent, back > – constituent (dotted rule) for Aαβ A•αβ predicated constituents Aα•β in-progress constituents Aαβ• completed constituents – back previous Early state in derivation 163 Earley Parser Input = x[1…N] S[0] = <E’ •E, 0>; S[1] = … S[N] = {} for i = 0 ... N do until S[i] does not change do foreach s ∈ S[i] if s = <A…•a…, b> and a=x[i+1] then S[i+1] = S[i+1] ∪ {<A…a•…, b> } if s = <A … •X …, b> and Xα then S[i] = S[i] ∪ {<X•α, i > } if s = < A … • , b> and <X…•A…, k> ∈ S[b] then S[i] = S[i] ∪{<X…A•…, k> } // scan // predict // complete 164 Example P RACTICAL E ARL S0 S1 S → •E E → •E + E E → •n ,0 ,0 ,0 n E → n• S → E• E → E • +E ,0 ,0 ,0 E → E + •E E → •E + E E → •n ,0 ,2 ,2 E → n• E → E + E• E → E • +E S → E• ,2 ,0 ,2 ,0 S2 + if s = <A…•a…, b> and a=x[i+1] then S[i+1] = S[i+1] ∪ {<A…a•…, b> } if s = <A … •X …, b> and Xα then S[i] = S[i] ∪ {<X•α, i > } if s = < A … • , b> and <X…•A…, k> ∈ S[b] then S[i] = S[i] ∪{<X…A•…, k> } // scan S3 // predict // complete n FI GURE 1. Earley sets for the grammar E → E + E | n and the input n + n. Items in bold are ones which correspond to the input’s derivation. 165 166