Chapter 4: Syntax Analysis Part 2: Top-Down Parsing
CSE4100

Prof. Steven A. Demurjian
Computer Science & Engineering Department
The University of Connecticut
371 Fairfield Way, Unit 2155
Storrs, CT 06269-3155
steve@engr.uconn.edu
http://www.engr.uconn.edu/~steve
(860) 486 - 4818

Material for course thanks to: Laurent Michel, Aggelos Kiayias, Robert LeBarre

Motivation
Source: We have a grammar
Product: We want a parsing algorithm
Idea: Synthesize the algorithm from the grammar
Problem: How are we going to use the grammar?
Hint: We can look at the beginning of the input...

Basic intuition
Use a sliding window over the input stream
Benefit: Reveal the input slowly, a little bit at a time
– 1 token at a time
– Maybe 2 at a time?
But systematically
– Left to right
What kind of derivation can use tokens in this way?

Lookahead
Technique: Use a window of lookahead
Guide a LEFTMOST derivation with the window content
So... how to guide?

Predictive Parsing
Lookahead: a predictive tool! It helps to select the right production
Question: How? Answered with an example...

Example
Consider the grammar
  stmt   → if expr then stmt else stmt
         | while expr do stmt
         | for ( expr ; expr ; expr ) stmt
         | { stmtList }
         | lvalue = expr
  lvalue → id
  expr   → id | integer
And the input fragment
  while x do { if x then ....
Which production should we start with? Why?

Key Idea
Use the production body to know what can start a production
Selecting a production: Pick the rule that can start with the symbol in the lookahead window!
Using the production: What should we do with the chosen production?
  Production: while expr do stmt
  Input:      while x do { if x then ....
  (Matching)

Matching
When input matches
• Eat the matched symbols
  – In the production
  – In the input
• Move the window further down
• Deal with the rest of the production. How?
  Production: while expr do stmt
  Input:      while x do { if x then ....

No Matching?
What does it mean? Example below
  In production: NON-TERMINAL    expr
  In window:     TERMINAL        x [an identifier token]
What should we do?
  Production: while expr do stmt
  Input:      while x do { if x then ....

Dealing with non-terminals
Simple...
• The non-terminal is defined somewhere
• The non-terminal is defined by a set of productions
Corollary
• Pop the non-terminal
• Use the lookahead to choose a production for the non-terminal
• Push and recur

On the Example
Recall the grammar and the current state
  stmt   → if expr then stmt else stmt | ....
  lvalue → id
  expr   → id | integer
  Production: while expr do stmt
  Input:      while x do { if x then ....
1. Pop: result: expr
2. Choose production for expr: result: expr → id
3. Push & recur: result:
  Production: while id do stmt
  Input:      while x do { if x then ....

Gain?
Recall
• The topmost symbol ( id )
• The lookahead x [an instance of id]
Gain? The top symbol and lookahead match!
So.... Recur
• Recursive call will match them and pop them
• Keep on going

Final outcome
What can happen in the end?
• We "eat" (match) the entire input. Meaning?
• We get "stuck" at some point. What does it mean (to be stuck)? How can we get stuck?

Negative outcome
We can get stuck
• Lookahead window and topmost non-terminal yield... an empty prediction!
• This is the "classic" syntax error:
    Syntax error at file:line. Expecting xyz got abc
• Expecting xyz – a list of tokens predicting the productions of the current non-terminal
• Got abc – "abc" is the actual content of the lookahead window

Overall Top-Down Algorithm
Data structures
• A stack: holds the symbols to treat
• A lookahead window to choose a prediction
Algorithm
• Startup
  – Initialize stack to start symbol (a non-terminal)
  – Initialize lookahead window at start of token stream
• Recursive process
  – Find out if front of window and top of stack match
  – If match – consume the symbols
  – No match – Pop / Select / Push

Are We Done?
Almost....
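The stack-and-lookahead procedure outlined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PREDICT` dictionary is hand-written for a subset of the slide grammar (how to obtain predictions from the grammar is one of the open issues), tokens are given as token kinds such as `id` rather than lexemes such as `x`, and the names `PREDICT`, `NONTERMINALS`, and `parse` are hypothetical.

```python
# Hand-written predictions for a subset of the slide grammar (an assumption;
# the slides derive predictions from the grammar later, via First and Follow).
PREDICT = {
    ("stmt", "while"): ["while", "expr", "do", "stmt"],
    ("stmt", "if"): ["if", "expr", "then", "stmt", "else", "stmt"],
    ("stmt", "id"): ["lvalue", "=", "expr"],
    ("lvalue", "id"): ["id"],
    ("expr", "id"): ["id"],
    ("expr", "integer"): ["integer"],
}
NONTERMINALS = {"stmt", "lvalue", "expr"}


def parse(tokens, start="stmt"):
    """Return True if the list of token kinds can be derived from `start`."""
    stack = [start]  # symbols still to treat; the top is the end of the list
    pos = 0          # front of the one-token lookahead window
    while stack:
        top = stack.pop()
        look = tokens[pos] if pos < len(tokens) else "<eof>"
        if top in NONTERMINALS:
            # No match possible: pop / select / push.
            body = PREDICT.get((top, look))
            if body is None:  # empty prediction: the classic syntax error
                expected = sorted(t for (nt, t) in PREDICT if nt == top)
                raise SyntaxError(f"expecting one of {expected}, got {look!r}")
            stack.extend(reversed(body))  # leftmost body symbol ends on top
        elif top == look:
            pos += 1  # match: consume the symbol on both sides
        else:
            raise SyntaxError(f"expecting {top!r}, got {look!r}")
    if pos != len(tokens):
        raise SyntaxError(f"unexpected trailing token {tokens[pos]!r}")
    return True
```

For example, `parse(["while", "id", "do", "id", "=", "integer"])` succeeds, while `parse(["while", "do"])` raises a `SyntaxError` whose message lists `id` and `integer` as the expected tokens, mirroring the "Expecting xyz got abc" message.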
A few remaining issues...
• How do we get the predictions from the grammar?
• What should we do if the same symbol predicts more than one rule?
• How can we implement the algorithm above?
• How large should the lookahead window be?
• Why is this called "top-down"?
• What can be automated about all this?

The Lookahead Window
How large? Very good question.
The art of counting: 1, 2, a lot.
Conclusion
• Usually, 1 token is plenty
• Occasionally, 2 tokens may be needed
• Beyond 2 is wild
In theory
• Any finite constant will do: 1, 2, 3, ..., k

Lookahead Size and Languages
With k tokens of lookahead
• Some languages can be parsed (this way)
• Some languages cannot be parsed
What about k+1 tokens?
• More languages can be parsed....
• So the set of languages recognized with k is a subset of the set of languages recognized with k+1!
• We have a hierarchy!
Still... in practice LL(1) should be enough.

Top-Down Parsing
Identify a leftmost derivation for an input string.
Why? By always replacing the leftmost non-terminal symbol via a production rule, we are guaranteed to develop the parse tree in a left-to-right fashion consistent with scanning the input.
  A ⇒ aBc ⇒ adDc ⇒ adec   (scan a, scan d, scan e, scan c - accept!)
Topics:
• Recursive-descent parsing concepts
• Predictive parsing
  – Recursive / brute force technique
  – Non-recursive / table driven
• Error recovery
• Implementation
Recursive Descent Parsing Concepts
• General category of top-down parsing
• Choose production rule based on input symbol
• May require backtracking to correct a wrong choice
• Example: S → cAd, A → ab | a, input: cad
[Figure: parse trees for input cad. S is expanded to c A d and c is matched. A → ab is tried first: a matches, but b does not match d. Problem: backtrack. A → a is then tried, a and d match, and the parse succeeds.]

Predictive Parsing: Recursive
• To eliminate backtracking, the grammar must have:
  – no left recursion
  – left factoring applied
  – ε-moves removed
• If so, we can use the current input symbol, together with the non-terminal to be expanded, to uniquely determine the next action
• Utilize transition diagrams (TDs). For each non-terminal of the grammar:
  – Create an initial and a final state
  – If A → X1X2…Xn is a production, add a path with edges X1, X2, …, Xn
• TDs can be turned into a program

Transition Diagrams (TDs)
• Unlike lexical equivalents, each edge represents a token
• Transition implies: if token, choose edge, call proc
• Recall earlier grammar and its associated TDs
  E  → TE’
  E’ → +TE’ | ε
  T  → FT’
  T’ → *FT’ | ε
  F  → ( E ) | id

  E:   0 –T→ 1 –E’→ 2
  E’:  3 –+→ 4 –T→ 5 –E’→ 6,   3 –ε→ 6
  T:   7 –F→ 8 –T’→ 9
  T’:  10 –*→ 11 –F→ 12 –T’→ 13,   10 –ε→ 13
  F:   14 –(→ 15 –E→ 16 –)→ 17,   14 –id→ 17
How are transition diagrams used?
Are ε-moves a problem?
Can we simplify transition diagrams?
Why is simplification critical?

How are Transition Diagrams Used?
  main() { TD_E(); }
  TD_E() { TD_T(); TD_E’(); }
  TD_T() { TD_F(); TD_T’(); }
  TD_E’() {
    token = get_token();
    if token = ‘+’ then { TD_T(); TD_E’(); }
  }
  TD_T’() {
    token = get_token();
    if token = ‘*’ then { TD_F(); TD_T’(); }
  }
  TD_F() {
    token = get_token();
    if token = ‘(’ then { TD_E(); match(‘)’); }
    else if token.value <> id then { error + EXIT }
    else ...
  }
What happened to ε-moves?
NOTE: not all error conditions have been represented.

How can Transition Diagrams be Simplified?
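In code, the simplification worked through in the diagrams that follow amounts to turning the tail-recursive E’ and T’ procedures into loops inside E and T. The sketch below is a minimal Python rendering under stated assumptions: tokens arrive as a list of strings, `$` is appended as the terminator, and the names `Parser`, `peek`, `match`, and `accepts` are illustrative rather than taken from the slides. Peeking at the token instead of consuming it is how this sketch takes the ε-edge.

```python
class Parser:
    """Recursive-descent recognizer for E -> T (+ T)*, T -> F (* F)*, F -> ( E ) | id."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # '$' marks the end of input
        self.pos = 0

    def peek(self):
        # Look at the current token without consuming it.
        return self.tokens[self.pos]

    def match(self, expected):
        # Consume the current token if it is the expected one.
        if self.peek() != expected:
            raise SyntaxError(f"expecting {expected!r}, got {self.peek()!r}")
        self.pos += 1

    def parse_E(self):
        # Simplified diagram for E: a T, then loop back on every '+'.
        self.parse_T()
        while self.peek() == "+":
            self.match("+")
            self.parse_T()
        # Falling out of the loop is the epsilon edge: nothing is consumed.

    def parse_T(self):
        # Simplified diagram for T: an F, then loop back on every '*'.
        self.parse_F()
        while self.peek() == "*":
            self.match("*")
            self.parse_F()

    def parse_F(self):
        if self.peek() == "(":
            self.match("(")
            self.parse_E()
            self.match(")")
        elif self.peek() == "id":
            self.match("id")
        else:
            raise SyntaxError(f"expecting '(' or 'id', got {self.peek()!r}")


def accepts(tokens):
    """Return True if the token list is a complete expression."""
    parser = Parser(tokens)
    try:
        parser.parse_E()
        parser.match("$")  # the whole input must have been consumed
    except SyntaxError:
        return False
    return True
```

With this sketch, `accepts(["id", "+", "id", "*", "id"])` and `accepts(["(", "id", ")", "*", "id"])` hold, while `accepts(["id", "+"])` does not.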
Step 1 (start):  E’: 3 –+→ 4 –T→ 5 –E’→ 6,   3 –ε→ 6
Step 2: the E’ edge leaving state 5 is a tail call, so replace it with an ε-edge back to state 3:
                 E’: 3 –+→ 4 –T→ 5 –ε→ 3,   3 –ε→ 6
Step 3: state 5 is now redundant; send the T edge straight back to state 3:
                 E’: 3 –+→ 4 –T→ 3,   3 –ε→ 6
Step 4: substitute the E’ diagram into E (E: 0 –T→ 1 –E’→ 2):
                 E:  0 –T→ 3 –+→ 4 –T→ 3,   3 –ε→ 6
Step 5: states 0 and 4 both leave on T into state 3, so merge them:
                 E:  0 –T→ 3 –+→ 0,   3 –ε→ 6
How?

Additional Transition Diagram Simplifications
• Similar steps for T and T’
• Simplified transition diagrams:
  T:   7 –F→ 10 –*→ 7,   10 –ε→ 13
  T’:  10 –*→ 11 –F→ 10,   10 –ε→ 13
  F:   14 –(→ 15 –E→ 16 –)→ 17,   14 –id→ 17
Why is simplification important?
How does code change?

Motivating Table-Driven Parsing
1. Left to right scan of input
2. Find leftmost derivation
Grammar:
  E  → TE’
  E’ → +TE’ | ε
  T  → id
Input: id + id $   ($ is the terminator)
Derivation: E
Processing stack:

Non-Recursive / Table Driven
[Figure: the input (string + terminator, e.g. a + b $), a stack X Y Z $ holding non-terminal and terminal symbols of the CFG with $ as the empty-stack symbol, the predictive parsing program, its output, and the parsing table M[A,a], which says what action the parser should take based on stack and input.]
General parser behavior (X: top of stack, a: current input):
1. When X = a = $: halt, accept, success
2. When X = a ≠ $: POP X off stack, advance input, go to 1
3. When X is a non-terminal, examine M[X,a]:
   – if it is an error, call recovery routine
   – if M[X,a] = {X → UVW}, POP X, PUSH W, V, U; DO NOT expend any input

Algorithm for Non-Recursive Parsing
  Set ip to point to the first symbol of w$;     /* ip is the input pointer */
  repeat
    let X be the top stack symbol and a the symbol pointed to by ip;
    if X is terminal or $ then
      if X = a then
        pop X from the stack and advance ip
      else error()
    else /* X is a non-terminal */
      if M[X,a] = X → Y1Y2…Yk then begin
        pop X from stack;
        push Yk, Yk-1, …, Y1 onto stack, with Y1 on top
        output the production X → Y1Y2…Yk   /* may also execute other code based on the production used */
      end
      else error()
  until X = $ /* stack is empty */

Example
  E  → TE’
  E’ → +TE’ | ε
  T  → FT’
  T’ → *FT’ | ε
  F  → ( E ) | id
Our well-worn example!
Table M:
  Nonterminal |                     INPUT SYMBOL
              | id       +          *          (        )        $
  E           | E→TE’                          E→TE’
  E’          |          E’→+TE’                        E’→ε     E’→ε
  T           | T→FT’                          T→FT’
  T’          |          T’→ε       T’→*FT’             T’→ε     T’→ε
  F           | F→id                           F→(E)

Trace of Example
  STACK       INPUT             OUTPUT
  $E          id + id * id$
  $E’T        id + id * id$     E → TE’
  $E’T’F      id + id * id$     T → FT’
  $E’T’id     id + id * id$     F → id
  $E’T’       + id * id$
  $E’         + id * id$        T’ → ε
  $E’T+       + id * id$        E’ → +TE’
  $E’T        id * id$
  $E’T’F      id * id$          T → FT’
  $E’T’id     id * id$          F → id
  $E’T’       * id$
  $E’T’F*     * id$             T’ → *FT’
  $E’T’F      id$
  $E’T’id     id$               F → id
  $E’T’       $
  $E’         $                 T’ → ε
  $           $                 E’ → ε
Rows with no output follow a terminal match that expended input.

Leftmost Derivation for the Example
The leftmost derivation for the example is as follows:
  E ⇒ TE’ ⇒ FT’E’ ⇒ id T’E’ ⇒ id E’ ⇒ id + TE’ ⇒ id + FT’E’
    ⇒ id + id T’E’ ⇒ id + id * FT’E’ ⇒ id + id * id T’E’
    ⇒ id + id * id E’ ⇒ id + id * id

What’s the Missing Puzzle Piece?
Constructing the parsing table M!
1st: Calculate First & Follow for the grammar
2nd: Apply the construction algorithm for the parsing table
Conceptual perspective:
First: Let α be a string of grammar symbols. First(α) is the set of terminals that can begin a string derived from α. NOTE: If α ⇒* ε, then ε is in First(α).
Follow: Let A be a non-terminal. Follow(A) is the set of terminals that can appear directly to the right of A in some sentential form.
(That is, S ⇒* αAaβ for some α and β.)
NOTE: If S ⇒* αA, then $ is in Follow(A).

Computing First(X): All Grammar Symbols
1. If X is a terminal, First(X) = {X}
2. If X → ε is a production rule, add ε to First(X)
3. If X is a non-terminal, and X → Y1Y2…Yk is a production rule:
   Place First(Y1) in First(X)
   if Y1 ⇒* ε, place First(Y2) in First(X)
   if Y2 ⇒* ε, place First(Y3) in First(X)
   …
   if Yk-1 ⇒* ε, place First(Yk) in First(X)
   NOTE: As soon as some Yi does not derive ε, stop.
May repeat 1, 2, and 3 above for each Yj.

Computing First(X): All Grammar Symbols - continued
Informally, suppose we want to compute
  First(X1 X2 … Xn) = First(X1)
                      "+" First(X2) if ε is in First(X1)
                      "+" First(X3) if ε is in First(X2)
                      "+" …
                      "+" First(Xn) if ε is in First(Xn-1)
Each term is added only while every earlier Xi can derive ε.
Note 1: Only add ε to First(X1 X2 … Xn) if ε is in First(Xi) for all i
Note 2: For First(X1), if X1 → Z1 Z2 … Zm, then we need to compute First(Z1 Z2 … Zm)!

Conceptually: What is First (E, T, …) in Derivation?
The leftmost derivation for the example is as follows (INPUT: id + id * id $):
  E ⇒ TE’ ⇒ FT’E’ ⇒ id T’E’ ⇒ id E’ ⇒ id + TE’ ⇒ id + FT’E’
    ⇒ id + id T’E’ ⇒ id + id * FT’E’ ⇒ id + id * id T’E’
    ⇒ id + id * id E’ ⇒ id + id * id

Example
Computing First for:
  E  → TE’
  E’ → +TE’ | ε
  T  → FT’
  T’ → *FT’ | ε
  F  → ( E ) | id
First(E):  First(TE’) = First(T). Not "+" First(E’), since T does not derive ε.
First(T):  First(FT’) = First(F). Not "+" First(T’), since F does not derive ε.
First(F):  First((E)) "+" First(id) = "(" and "id"
Overall:
  First(E)  = First(T) = First(F) = { (, id }
  First(E’) = { +, ε }
  First(T’) = { *, ε }

Example 2
Given the production rules:
  S  → i E t S S’ | a
  S’ → e S | ε
  E  → b
Verify that
  First(S)  = { i, a }
  First(S’) = { e, ε }
  First(E)  = { b }

Computing Follow(A): All Non-Terminals
1. Place $ in Follow(S), where S is the start symbol and $ signals end of input
2. If there is a production A → αBβ, then everything in First(β) is in Follow(B) except for ε
3. If A → αB is a production, or A → αBβ and β ⇒* ε (First(β) contains ε), then everything in Follow(A) is in Follow(B)
   (Whatever followed A must follow B, since nothing follows B from the production rule)
We’ll calculate Follow for two grammars.

Conceptually: What is Follow in Derivation?
Look again at the leftmost derivation of id + id * id $ shown above: which terminals appear directly to the right of each non-terminal?

Example
Compute Follow for:
  E  → TE’
  E’ → +TE’ | ε
  T  → FT’
  T’ → *FT’ | ε
  F  → ( E ) | id
• Follow(E): contains $ since E is the start symbol. Also, since F → (E), First(")") is in Follow(E). Thus Follow(E) = { ), $ }
• Follow(E’): E → TE’ implies Follow(E) is in Follow(E’), and Follow(E’) = { ), $ }
• Follow(T): E → TE’ implies put in First(E’) (except ε). Since E’ ⇒* ε, put in Follow(E). Since E’ → +TE’, put in First(E’), and since E’ ⇒* ε, put in Follow(E’). Thus Follow(T) = { +, ), $ }
• Follow(T’)
• Follow(F)
You do these!

Computing Follow: 2nd Example
Recall:
  S  → i E t S S’ | a     First(S)  = { i, a }
  S’ → e S | ε            First(S’) = { e, ε }
  E  → b                  First(E)  = { b }
Follow(S) – contains $, since S is the start symbol
  Since S → i E t S S’, put in First(S’), but not ε
  Since S’ ⇒* ε, put in Follow(S)
  Since S’ → eS, put in Follow(S’)
  So…. Follow(S) = { e, $ }
Follow(S’) = Follow(S). HOW?
Follow(E) = { t }

First & Follow – One More Look
Consider the following derivation:
  E ⇒ TE’ ⇒ FT’E’ ⇒ ( E ) T’E’ ⇒ ( TE’ ) T’E’ ⇒ ( FT’E’ ) T’E’
    ⇒ ( id T’E’ ) T’E’ ⇒ ( id E’ ) T’E’ ⇒ ( id ) T’E’ ⇒ ( id ) * FT’E’
    ⇒ ( id ) * id T’E’ ⇒ ( id ) * id E’ ⇒ ( id ) * id + TE’
    ⇒ ( id ) * id + FT’E’ ⇒ ( id ) * id + id T’E’ ⇒* ( id ) * id + id $
What’s First for each non-terminal?
  First(E)  = { (, id }
  First(T)  = { (, id }
  First(T’) = { *, ε }
  First(E’) = { +, ε }
  First(F)  = { (, id }       (You do First(F)!)
What’s Follow for each non-terminal?
  Follow(E)  = { ), $ }
  Follow(T)  = { +, ), $ }    (this "+" in Follow(T) comes from First(E’))
  Follow(T’) = { +, ), $ }
  Follow(E’) = { ), $ }
  Follow(F)  = { +, *, ), $ } (You do Follow(F)!)

First & Follow – One More Look: What are the implications?
For the same derivation, with input ( id ) * id + id $, the M-table entries used are:
1. M[E, ( ]:  E → TE’, and ( is in First(E)
2. M[T, ( ]:  T → FT’, and ( is in First(T)
3. M[F, ( ]:  F → (E), and ( is in First(F)
4. M[E’, ) ]: E’ → ε, and ) is in Follow(E’)
5. M[T’, $ ]: since $ is in Follow(T’), T’ → ε
6. M[E’, $ ]: since $ is in Follow(E’), E’ → ε

Motivation Behind First & Follow
First: Is used to indicate the relationship between non-terminals (in the stack) and input symbols (in the input stream).
Example: If A → α, and a is in First(α), then when a = input, replace A with α. (a is one of the first symbols of α, so when A is on the stack and a is the input, POP A and PUSH α.)
Follow: Is used when First has a conflict, to resolve choices.
When α → ε or α ⇒* ε, then what follows A dictates the next choice to be made.
Example: If A → α, and b is in Follow(A), then when α ⇒* ε and b is the input character, we expand A with α, which will eventually expand to ε, after which b follows! (Above, α ⇒* ε; here First(α) contains ε.)

Constructing Parsing Table
Algorithm:
1. Repeat Steps 2 & 3 for each rule A → α
2. Terminal a in First(α)? Add A → α to M[A, a]
3.1 ε in First(α)? Add A → α to M[A, b] for all terminals b in Follow(A)
3.2 ε in First(α) and $ in Follow(A)? Add A → α to M[A, $]
4. All undefined entries are errors

Constructing Parsing Table - Example
  E  → TE’                First(E, F, T) = { (, id }    Follow(E, E’) = { ), $ }
  E’ → +TE’ | ε           First(E’) = { +, ε }          Follow(F) = { *, +, ), $ }
  T  → FT’                First(T’) = { *, ε }          Follow(T, T’) = { +, ), $ }
  T’ → *FT’ | ε
  F  → ( E ) | id
Expression example:
  E → TE’: First(TE’) = First(T) = { (, id }
    M[E, ( ]:  E → TE’   (by rule 2)
    M[E, id ]: E → TE’   (by rule 2)
  E’ → +TE’: First(+TE’) = { + }
    M[E’, + ]: E’ → +TE’ (by rule 2)
  E’ → ε: ε is in First(ε)
    M[E’, ) ]: E’ → ε    (3.1)
    M[E’, $ ]: E’ → ε    (3.2)   (due to Follow(E’))
  T’ → ε: ε is in First(ε)
    M[T’, + ]: T’ → ε    (3.1)
    M[T’, ) ]: T’ → ε    (3.1)
    M[T’, $ ]: T’ → ε    (3.2)

Constructing Parsing Table – Example 2
  S  → i E t S S’ | a     First(S)  = { i, a }    Follow(S)  = { e, $ }
  S’ → e S | ε            First(S’) = { e, ε }    Follow(S’) = { e, $ }
  E  → b                  First(E)  = { b }       Follow(E)  = { t }
  S  → i E t S S’:  First(i E t S S’) = { i }
  S  → a:           First(a) = { a }
  E  → b:           First(b) = { b }
  S’ → eS:          First(eS) = { e }
  S’ → ε:           First(ε) = { ε },  Follow(S’) = { e, $ }
  Nonterminal |                  INPUT SYMBOL
              | a        b        e                 i             t      $
  S           | S → a                               S → iEtSS’
  S’          |                   S’ → ε, S’ → eS                        S’ → ε
  E           |          E → b

Example
Step 1: Compute First and Follow for
  S  → E $
  E  → T E’
  E’ → + T E’ | ε
  T  → F T’
  T’ → * F T’ | ε
  F  → ( E ) | Id
Overall:
  First(S) = First(E) = First(T) = First(F) = { (, id }
  First(E’) = { +, ε }
  First(T’) = { *, ε }
  Follow(E) = Follow(E’) = { ), $ }
  Follow(T) = Follow(T’) = { +, ), $ }
  Follow(F) = { +, *, ), $ }

Example
Step 2: Build the parser table
Step 3: Input: Id + Id * Id $
Parser table for the grammar above:
  NT  |                      Input Symbols
      | Id        +           *           (         )         $
  S   | S → E$                            S → E$
  E   | E → TE’                           E → TE’
  E’  |           E’ → +TE’                         E’ → ε    E’ → ε
  T   | T → FT’                           T → FT’
  T’  |           T’ → ε      T’ → *FT’             T’ → ε    T’ → ε
  F   | F → Id                            F → (E)

Parsing Process Over Time
[Animation: a sequence of slides steps the parser table above through the input Id + Id * Id $ over time.]

LL(1) Grammars
L: Scan input from Left to Right
L: Construct a Leftmost derivation
1: Use "1" input symbol as lookahead, in conjunction with the stack, to decide on the parsing action
LL(1) grammars have no multiply-defined entries in the parsing table.
Properties of LL(1) grammars:
• Grammar can’t be ambiguous or left recursive
• Grammar is LL(1) when, for every pair of alternatives A → α | β:
  1. α and β do not derive strings starting with the same terminal a
  2. Either α or β can derive ε, but not both
  3. If β ⇒* ε, then α does not derive any string starting with a terminal in Follow(A)
     (the grammar of Example 2 meets 1 and 2 but fails here, hence its doubly-defined entry M[S’, e])
Note: It may not be possible for a grammar to be manipulated into an LL(1) grammar

Error Recovery
When do errors occur? Recall the predictive parser function:
[Figure: input a + b $, stack X Y Z $, predictive parsing program, output, parsing table M[A,a].]
1. If X is a terminal and it doesn’t match the input
2. If M[X, Input] is empty – no allowable actions
Consider two recovery techniques:
A. Panic mode
B. Phrase-level recovery

Panic Mode Recovery
Augment the parsing table with actions that attempt to realign / synchronize the token stream with the expected input.
Suppose: A on top of the stack doesn’t mesh with the current input symbol
1. Use Follow(A) to remove input tokens – sync (discard)
2. Use First(A) to determine when to restart parsing
3. Incorporate higher level language concepts (begin/end, while, repeat/until) into the sync actions so that we don’t skip tokens unnecessarily
Other actions:
4. When A → ε, use it to manipulate the stack to postpone error detection
5. Use a non-matching terminal on the stack as a token that is inserted into the input
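Panic mode leans on First(A) and Follow(A), and the LL(1) test leans on the parsing table, so it may help to see the earlier rules for First, Follow, and table construction as one runnable sketch. The Python below is an illustration, not the course's implementation: the grammar encoding (a dict from non-terminal to a list of bodies, with `[]` for an ε-body) and the names `first_of`, `compute_first`, `compute_follow`, `build_table`, and `is_ll1` are assumptions made for this example.

```python
EPS = "ε"

# The expression grammar from the slides; an empty body stands for ε.
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}


def first_of(symbols, first):
    """First set of a string of grammar symbols, given First of each non-terminal."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}  # terminal: {itself}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result  # stop at the first symbol that cannot vanish
    result.add(EPS)  # every symbol can derive ε (or the string is empty)
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # repeat the rules until nothing new is added
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                new = first_of(body, first)
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
    return first


def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")  # rule 1
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue  # Follow is defined for non-terminals only
                    rest = first_of(body[i + 1:], first)
                    new = rest - {EPS}        # rule 2
                    if EPS in rest:
                        new |= follow[head]   # rule 3
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


def build_table(grammar, first, follow):
    """Map (non-terminal, terminal) to the list of bodies placed in that entry."""
    table = {}
    for head, bodies in grammar.items():
        for body in bodies:
            body_first = first_of(body, first)
            targets = body_first - {EPS}      # step 2
            if EPS in body_first:
                targets |= follow[head]       # steps 3.1 and 3.2 ($ is in Follow)
            for terminal in targets:
                table.setdefault((head, terminal), []).append(body)
    return table


def is_ll1(table):
    # LL(1) means no multiply-defined entries.
    return all(len(bodies) == 1 for bodies in table.values())
```

Calling `compute_first(GRAMMAR)` and `compute_follow(GRAMMAR, first, "E")` yields the sets listed on the slides, and `is_ll1` holds for the resulting table. Encoding the grammar of Example 2 the same way puts two bodies in the entry for (S’, e), which is the multiply-defined entry shown in its table.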
Revised Parsing Table / Example
  Nonterminal |                     INPUT SYMBOL
              | id       +          *          (        )        $
  E           | E→TE’                          E→TE’    synch    synch
  E’          |          E’→+TE’                        E’→ε     E’→ε
  T           | T→FT’    synch                 T→FT’    synch    synch
  T’          |          T’→ε       T’→*FT’             T’→ε     T’→ε
  F           | F→id     synch      synch      F→(E)    synch    synch
synch: entries come from the Follow sets. Pop the stack entry – T or NT.
Blank: skip the input symbol.

Skip & Synch
Meaning
• Skip: discard input symbol
• Synch: pop top of stack
Messages
• Constructed based on lookahead and non-terminal
Example
• NT = F, Lookahead = +
• "Expecting a FACTOR. Got + for a Term. So a factor is missing."

Revised Parsing Table / Example (2)
  STACK       INPUT             Remark
  $E          ) id * + id$      error, skip )
  $E          id * + id$        id is in First(E)
  $E’T        id * + id$
  $E’T’F      id * + id$
  $E’T’id     id * + id$
  $E’T’       * + id$
  $E’T’F*     * + id$
  $E’T’F      + id$             error, M[F,+] = synch
  $E’T’       + id$             F has been popped
  $E’         + id$
  $E’T+       + id$
  $E’T        id$
  $E’T’F      id$
  $E’T’id     id$
  $E’T’       $
  $E’         $
  $           $

Phrase-Level Recovery
Fill in the blank entries of the parsing table with error handling routines.
These routines
• modify the stack and / or the input stream
• issue an error message
Problems:
• Modifying the stack has to be done with care, so as not to create the possibility of derivations that aren’t in the language
• Infinite loops must be avoided
Can be used in conjunction with panic mode to have more complete error handling.

How Would You Implement TD Parser
• Stack – Easy to handle. Write an ADT to manipulate its contents
• Input Stream – Responsibility of the lexical analyzer
• Key Issue – How is the parsing table implemented?
One approach: Assign unique IDs
  Nonterminal |                     INPUT SYMBOL
              | id       +          *          (        )        $
  E           | E→TE’                          E→TE’    synch    synch
  E’          |          E’→+TE’                        E’→ε     E’→ε
  T           | T→FT’    synch                 T→FT’    synch    synch
  T’          |          T’→ε       T’→*FT’             T’→ε     T’→ε
  F           | F→id     synch      synch      F→(E)    synch    synch
• All rules have unique IDs
• Ditto for synch actions
• Also for blanks, which handle errors

Revised Parsing Table:
  Nonterminal |        INPUT SYMBOL
              | id    +     *     (     )     $
  E           | 1     18    19    1     9     10
  E’          | 20    2     21    22    3     3
  T           | 4     11    23    4     12    13
  T’          | 24    6     5     25    6     6
  F           | 8     14    15    7     16    17
  1  E → TE’        5  T’ → *FT’
  2  E’ → +TE’      6  T’ → ε
  3  E’ → ε         7  F → (E)
  4  T → FT’        8  F → id
  9 – 17: Sync actions
  18 – 25: Error handlers

Revised Parsing Table: (2)
Each # (or set of #s) corresponds to a procedure that:
• Uses the Stack ADT
• Gets tokens
• Prints error messages
• Prints diagnostic messages
• Handles errors

How is Parser Constructed?
One large CASE statement:
  state = M[ top(s), current_token ]
  switch (state) {
    case 1:  proc_E_TE’( ) ; break ;
    …
    case 8:  proc_F_id( ) ; break ;
    case 9:  proc_sync_9( ) ; break ;
    …
    case 17: proc_sync_17( ) ; break ;
    case 18:
    …                        /* procs to handle errors */
    case 25:
  }
• Some sync actions may be the same: combine them, or put them in another switch
• Some error handlers may be similar

Final Comments – Top-Down Parsing
So far,
• We’ve examined grammars and language theory and its relationship to parsing
• Key concepts: rewriting a grammar into an acceptable form
• Examined top-down parsing:
  – Brute force: transition diagrams & recursion
  – Elegant: table driven
• We’ve identified its shortcomings: not all grammars can be made LL(1)!
• Bottom-up parsing – next up!
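As a closing illustration of the table-driven approach with skip and synch actions, here is a minimal Python sketch of the driver. It is not the numbered CASE-statement design described above: it looks productions up in a dictionary instead of dispatching on IDs, and the names `TABLE`, `SYNCH`, and `parse` are hypothetical. One further assumption is flagged in the code: the table marks M[E, )] as synch, yet the trace skips `)` in its first row, so this sketch treats a synch entry as a skip when popping would empty the stack, which reproduces that trace.

```python
# Parsing table for the expression grammar; [] stands for an ε-body.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
# synch entries, taken from the Follow sets as in the revised table.
SYNCH = {
    ("E", ")"), ("E", "$"),
    ("T", "+"), ("T", ")"), ("T", "$"),
    ("F", "+"), ("F", "*"), ("F", ")"), ("F", "$"),
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def parse(tokens):
    """Parse a token list and return the list of error remarks (empty if none)."""
    tokens = list(tokens) + ["$"]
    stack = ["$", "E"]  # top of stack is the end of the list
    pos = 0
    errors = []
    while stack[-1] != "$":
        top, look = stack[-1], tokens[pos]
        if top not in NONTERMINALS:
            stack.pop()
            if top == look:
                pos += 1  # terminal matched: expend input
            else:
                errors.append(f"missing {top!r} before {look!r}")
        elif (top, look) in TABLE:
            stack.pop()
            stack.extend(reversed(TABLE[(top, look)]))
        else:
            is_synch = (top, look) in SYNCH
            # Assumption: synch acts as skip when popping would empty the
            # stack; '$' is never skipped, so the input cannot be overrun.
            if look == "$" or (is_synch and len(stack) > 2):
                errors.append(f"synch: popped {top} at {look!r}")
                stack.pop()
            else:
                errors.append(f"skip: discarded {look!r}")
                pos += 1
    if tokens[pos] != "$":
        errors.append(f"unexpected trailing input at {tokens[pos]!r}")
    return errors
```

On the erroneous input from the trace, `parse([")", "id", "*", "+", "id"])` reports two remarks, a skip of `)` followed by a synch pop of F at `+`, and then finishes the parse; a well-formed input such as `["id", "+", "id", "*", "id"]` returns an empty list.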