CIS 461 Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module #13 Parsing 1 Top-down Parsing S A fringe of the parse tree start symbol D B ? C S left-to-right scan ? left-most derivation lookahead Bottom-up Parsing lookahead S input string upper fringe of the parse tree ? A D right-most derivation in reverse C lookahead 2 LR Parsers • LR(k) parsers are table-driven, bottom-up, shift-reduce parsers that use a limited right context (k-token lookahead) for handle recognition • LR(k): Left-to-right scan of the input, Rightmost derivation in reverse with k token lookahead A grammar is LR(k) if, given a rightmost derivation S 0 1 2 … n-1 n sentence We can 1. isolate the handle of each right-sentential form i , and 2. determine the production by which to reduce, by scanning i from left-to-right, going at most k symbols beyond the right end of the handle of i 3 LR Parsers A table-driven LR parser looks like Stack source code grammar Scanner Table-driven Parser Parser Generator ACTION & GOTO Tables IR 4 LR Shift-Reduce Parsers push($); // $ is the end-of-file symbol push(s0); // s0 is the start state of the DFA that recognizes handles lookahead = get_next_token(); repeat forever s = top_of_stack(); if ( ACTION[s,lookahead] == reduce ) then pop 2*|| symbols; s = top_of_stack(); push(); push(GOTO[s,]); else if ( ACTION[s,lookahead] == shift si ) then push(lookahead); push(si); lookahead = get_next_token(); else if ( ACTION[s,lookahead] == accept and lookahead == $ ) then return success; else error(); The skeleton parser •uses ACTION & GOTO • does |words| shifts • does |derivation| reductions • does 1 accept 5 LR Parsers How does this LR stuff work? • Unambiguous grammar unique rightmost derivation • Keep upper fringe on a stack – All active handles include TOS – Shift inputs until TOS is right end of a handle • Language of handles is regular – Build a handle-recognizing DFA – ACTION & GOTO tables encode the DFA • To match subterms, recurse & leave DFA’s state on stack • Final states of the DFA correspond to reduce actions – New state is GOTO[lhs , state at TOS] 6 Building LR Parsers How do we generate the ACTION and GOTO tables? • Use the grammar to build a model of the handle recognizing DFA • Use the DFA model to build ACTION & GOTO tables • If construction succeeds, the grammar is LR How do we build the handle recognizing DFA ? • Encode the set of productions that can be used as handles in the DFA state: Use LR(k) items • Use two functions goto( s, ) and closure( s ) – goto() is analogous to move() in the DFA to NFA conversion – closure() is analogous to -closure • Build up the states and transition functions of the DFA • Use this information to fill in the ACTION and GOTO tables 7 LR(k) items An LR(k) item is a pair [A , B], where A is a production with a • at some position in the rhs B is a lookahead string of length ≤ k (terminal symbols or $) Examples: [• , a], [• , a], [• , a], & [• , a] The • in an item indicates the position of the top of the stack • LR(0) items [ • ] (no lookahead symbol) • LR(1) items [ • , a ] (one token lookahead) • LR(2) items [ • , a b ] (two token lookahead) ... 8 LR(1) Items The production •, with lookahead a, generates 4 items [• , a], [• , a], [• , a], & [• , a] The set of LR(1) items for a grammar is finite What’s the point of all these lookahead symbols? • Carry them along to choose correct reduction • Lookaheads are bookkeeping, unless item has • at right end – Has no direct use in [• , a] – In [• , a], a lookahead of a implies a reduction by – For { [• , a],[• , b] } lookahead = a reduce to ; lookahead FIRST() shift Limited right context is enough to pick the actions 9 LR(1) Table Construction High-level overview Build the handle recognizing DFA (aka Canonical Collection of sets of LR(1) items), C = { I0 , I1 , ... , In } a Introduce a new start symbol S’ which has only one production S’ S b Initial state, I0 should include • [S’ •S, $], along with any equivalent items • Derive equivalent items as closure( I0 ) c Repeatedly compute, for each Ik , and each grammar symbol , goto(Ik , ) • If the set is not already in the collection, add it • Record all the transitions created by goto( ) This eventually reaches a fixed point 2 Fill in the ACTION and GOTO tables using the DFA The canonical collection completely encodes the transition diagram for the handle-finding DFA 10 Computing Closures closure(I) adds all the items implied by items already in I • Any item [ , a] implies [ , x] for each production with on the lhs, and x FIRST(a) • Since is valid, any way to derive is valid, too The algorithm Closure( I ) while ( I is still changing ) for each item [ • , a] I for each production P for each terminal b FIRST(a) if [ • , b] I then add [ • , b] to I Fixpoint computation 11 Computing Gotos goto(I , x) computes the state that the parser would reach if it recognized an x while in state I • goto( { [ , a] }, ) produces [ , a] • It also includes closure( [ , a] ) to fill out the state The algorithm Goto( I, x ) new = Ø for each [ • x , a] I new = new [ x • , a] • Not a fixpoint method • Uses closure return closure(new) 12 Building the Canonical Collection of LR(1) Items This is where we build the handle recognizing DFA! Start from I0 = closure( [S’ • S , $] ) Repeatedly construct new states, until ano new states are generated The algorithm I0 = closure( [S’ • S , $] ) C = { I0 } while ( C is still changing ) for each Ii C and for each x ( TNT ) Inew = goto(Ii , x) if Inew C then C = C Inew record transition Ii Inew on x • Fixed-point computation • Loop adds to C • C 2ITEMS, so C is finite 13 Algorithms for LR(1) Parsers Computing closure of set of LR(1) items: Computing goto for set of LR(1) items: Closure( I ) while ( I is still changing ) for each item [ • , a] I for each production P for each terminal b FIRST(a) if [ • , b] I then add [ • , b] to I Goto( I, x ) new = Ø for each [ • x , a] I new = new [ x • , a] return closure(new) Constructing canonical collection of LR(1) items: I0 = closure( [S’ • S , $] ) C = { I0 } while ( C is still changing ) for each Ii C and for each x ( TNT ) Inew = goto(Ii , x) if Inew C then C = C Inew record transition Ii Inew on x • Canonical collection construction algorithm is the algorithm for constructing handle recognizing DFA • Uses Closure to compute the states of the DFA • Uses Goto to compute the transitions of the DFA 14 Example Simplified, right recursive expression grammar S Expr Expr Term - Expr Expr Term Term Factor * Term Term Factor Factor id Symbol S Expr Term Factor * id FIRST { id } { id} { id } { id} {-} {*} { id} 15 I1= {[S Expr •, $]} Expr I0 = {[S •Expr ,$], [Expr •Term - Expr , $], [Expr •Term , $], [Term •Factor * Term , {$,-}], [Term •Factor , {$,-}], [Factor • id ,{$,-,*}]} Term id Factor I4 = { [Factor id •, {$,-,*}] } id Factor I2 = { [Expr Term • - Expr , $], [Expr Term •, $] } I3 = {[Term Factor •* Term , {$,-}], [Term Factor •, {$,-}]} * I6 ={[Term Factor * • Term , {$,-}], [Term • Factor * Term , {$,-}], [Term • Factor , {$,-}], [Factor • id , {$, -, *}]} Factor - Term id I5 = {[Expr Term - •Expr , $], [Expr •Term - Expr , $], [Expr •Term , $], [Term •Factor * Term , {$,-}], [Term •Factor , {$,-}], [Factor • id , {$,-,*}] } I8 = { [Term Factor * Term •, {$,-}] } Expr I7 = { [Expr Term - Expr •, $] } 16 Constructing the ACTION and GOTO Tables The algorithm for each set of items Ix C for each item Ix if item is [ •a,b] and a T and goto(Ix,a) = Ik , then ACTION[x,a] “shift k” else if item is [S’S •,$] then ACTION[x ,$] “accept” else if item is [ •,a] then ACTION[x,a] “reduce ” x is the state number Each state corresponds to a set of LR(1) items for each n NT if goto(Ix ,n) = Ik then GOTO[x,n] k 17 Example (Constructing the LR(1) tables) The algorithm produces the following table A C T IO N id 0 - GOTO * $ s 4 1 T e rm F a c to r 1 2 3 7 2 3 8 3 acc 2 s 5 3 r 5 s 6 r 5 4 r 6 r 6 r 6 5 s 4 6 s 4 7 8 Expr r 3 r 2 r 4 r 4 18 What can go wrong in LR(1) parsing? What if state s contains [ • a , b] and [ • , a] ? • First item generates “shift”, second generates “reduce” • Both define ACTION[s,a] — cannot do both actions • This is called a shift/reduce conflict • Modify the grammar to eliminate it • Shifting will often resolve it correctly (remember the danglin else problem?) What if set s contains [ • , a] and [ • , a] ? • Each generates “reduce”, but with a different production • Both define ACTION[s,a] — cannot do both reductions • This is called a reduce/reduce conflict • Modify the grammar to eliminate it In either case, the grammar is not LR(1) 19 Algorithms for LR(0) Parsers Computing closure of set of LR(0) items: Closure( I ) while ( I is still changing ) for each item [ • ] I for each production P if [ • ] I then add [ • ] to I Constructing canonical collection of LR(0) items: I0 = closure( [S’ • S ] ) C = { I0 } while ( C is still changing ) for each Ii C and for each x ( TNT ) Inew = goto(Ii , x) if Inew C then C = C Inew record transition Ii Inew on x Computing goto for set of LR(0) items: Goto( I, x ) new = Ø for each [ • x ] I new = new [ x • ] return closure(new) • Basically the same algorithms we use for LR(1) • The only difference is there is no lookahead symbol in LR(0) items and therefore there is no lookahead propagation • LR(0) parsers generate less states • They cannot recognize all the grammars that LR(1) parsers recognize • They are more likely to get conflicts in the parse table since there is no 20 lookahead SLR (Simple LR) Table Construction • SLR(1) algorithm uses canonical collection of LR(0) items • It also uses the FOLLOW sets to decide when to do a reduction for each set of items Ix C , for each item Ix if item is [ •a] and a T and goto(Ix,a) = Ik then ACTION[x,a] “shift k” else if item is [S’S •] then ACTION[x ,$] “accept” else if item is [ •] then for each a FOLLOW() then ACTION[x,a] “reduce ” for each n NT if goto(Ix ,n) = Ik then GOTO[x,n] k 21 LALR (LookAhead LR) Table Construction • • LR(1) parsers have many more states than SLR(1) parsers Idea: Merge the LR(1) states – Look at the LR(0) core of the LR(1) items (meaning ignore the lookahead) – If two sets of LR(1) items have the same core, merge them and change the ACTION and GOTO tables accordingly • LALR(1) parsers can be constructed in two ways 1. Build the canonical collection of LR(1) items and then merge the sets with the same core ( you should be able to do this) 2. Ignore the items with the dot at the beginning of the rhs and construct kernels of sets of LR(0) items. Use a lookahead propagation algorithm to compute lookahead symbols. The second approach is more efficient since it avoids the construction of the large LR(1) table. 22 Properties of LALR(1) Parsers • An LALR(1) parser for a grammar G has same number of states as the SLR(1) parser for that grammar • If an LR(1) parser for a grammar G does not have any shift/reduce conflicts, then the LALR(1) parser for that grammar will not have any shift/reduce conflicts • It is possible for LALR(1) parser for a grammar G to have a reduce/reduce conflict while an LR(1) parser for that grammar does not have any conflicts 23 Error recovery • Panic-mode recovery: On discovering an error discard input symbols one at a time until one synchronizing token is found – For example delimiters such as “;” or “}” can be used as synchronizing tokens • Phrase-level recovery: On discovering an error make local corrections to the input – For example replace “,” with “;” • Error-productions: If we have a good idea about what type of errors occur we can augment the grammar with error productions and generate appropriate error messages when an error production is used • Global correction: Given an incorrect input string try to find a correct string which will require minimum changes to the input string – In general too costly 24 Direct Encoding of Parse Tables Rather than using a table-driven interpreter … • Generate spaghetti code that implements the logic • Each state becomes a small case statement or if-then-else • Analogous to direct coding a scanner Advantages • No table lookups and address calculations • No representation for don’t care states • No outer loop —it is implicit in the code for the states This produces a faster parser with more code but no table 25 LR(k) versus LL(k) (Top-down Recursive Descent ) Finding Reductions LR(k) Each reduction in the parse is detectable with the complete left context, the reducible phrase, itself, and the k terminal symbols to its right LL(k) Parser must select the reduction based on The complete left context The next k terminals Thus, LR(k) examines more context 26 Left Recursion versus Right Recursion Right recursion • Required for termination in top-down parsers • Uses (on average) more stack space • Produces right-associative operators Left recursion • More efficient bottom-up parsers • Limits required stack space • Produces left-associative operators * * w * x z y w*(x*(y*z)) * * * w z y x Rule of thumb ( (w * x ) * y ) * z • Left recursion for bottom-up parsers (Bottom-up parsers can also deal with right-recursion but left recursion is more efficient) • Right recursion for top-down parsers 27 Hierarchy of Context-Free Languages Context-free languages LR(k) LR(1) Deterministic languages (LR(k)) LL(k) languages LL(1) languages Simple precedence languages The inclusion hierarchy for context-free languages Operator precedence languages 28 Hierarchy of Context-Free Grammars Context-free grammars Unambiguous CFGs Operator Precedence LR(k) LR(1) Inclusion hierarchy for context-free grammars LL(k) • Operator precedence LALR(1) includes some ambiguous grammars • SLR(1) LL(1) is a subset of SLR(1) LL(1) LR(0) 29 Summary Top-down recursive descent LR(1) Advantages Disadvantages Fast Hand-coded Simplicity High maintenance Good error detection Right associativity Fast Large working sets Deterministic langs. Poor error messages Automatable Large table sizes Left associativity 30