Chapter 3
Chang Chi-Chung
2015.05.18

The Role of the Parser
  [Figure: Source Program → Lexical Analyzer → Parser → Rest of Front End → intermediate representation. The Lexical Analyzer hands a Token to the Parser each time the Parser calls getNextToken; the Parser passes a Parse tree to the Rest of Front End. A Symbol Table also appears in the figure.]

  How do we describe the grammar of a programming language?
  Use a context-free grammar (CFG).
  A CFG is a more powerful notation than a regular expression (RE).

Context-Free Grammar
  A context-free grammar is a 4-tuple G = <T, N, P, S> where
    T is a finite set of tokens (terminal symbols)
    N is a finite set of nonterminals
    P is a finite set of productions of the form α → β, where α ∈ N and β ∈ (N ∪ T)*
    S ∈ N is a designated start symbol

Derivations
  The one-step derivation is defined by
    α A β ⇒ α γ β
  where A → γ is a production in the grammar.
  In addition, we define
    ⇒ is leftmost (written ⇒lm) if α does not contain a nonterminal
    ⇒ is rightmost (written ⇒rm) if β does not contain a nonterminal
    Reflexive-transitive closure ⇒* (zero or more steps)
    Positive closure ⇒+ (one or more steps)

Example of the Derivations
  Productions
    list → list + digit
    list → list - digit
    list → digit
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

  Derivation of 9-5+2
    list ⇒ list + digit
         ⇒ list - digit + digit
         ⇒ digit - digit + digit
         ⇒ 9 - digit + digit
         ⇒ 9 - 5 + digit
         ⇒ 9 - 5 + 2

  A leftmost derivation replaces the leftmost nonterminal in each step.
  A rightmost derivation replaces the rightmost nonterminal in each step.

Example of the Parse Tree
  Parse tree of the string 9-5+2 using grammar G:

              list
            /  |  \
        list   +   digit
       /  |  \       |
    list  -  digit   2
     |         |
   digit       5
     |
     9

  The sequence of leaves, read left to right, is called the yield of the parse tree.

Sentence and Language
  Sentential form
    If S ⇒* α in the grammar G, then α is a sentential form of G.
  Sentence
    A sentential form of G that has no nonterminals.
  Language
    The language generated by G is its set of sentences.
    The language generated by G is defined by L(G) = { w ∈ T* | S ⇒* w }.
    A language that can be generated by a context-free grammar is said to be a context-free language.
    If two grammars generate the same language, the grammars are said to be equivalent.
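The leftmost derivation of 9-5+2 from the list/digit grammar can be replayed mechanically. The following Java sketch is only an illustration: the class name LeftmostDerivation and the helpers step and derive are hypothetical, and a sentential form is assumed to be a list of symbol strings. Each call to step rewrites the leftmost nonterminal, which is exactly the ⇒lm relation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class LeftmostDerivation {
    // Nonterminals of the example grammar; every other symbol is a terminal.
    static final Set<String> NONTERMINALS = Set.of("list", "digit");

    // One leftmost step: the leftmost nonterminal in `form` must be `head`,
    // and it is replaced by the production body `body`.
    static List<String> step(List<String> form, String head, List<String> body) {
        for (int i = 0; i < form.size(); i++) {
            if (NONTERMINALS.contains(form.get(i))) {
                if (!form.get(i).equals(head)) {
                    throw new IllegalArgumentException(head + " is not the leftmost nonterminal");
                }
                List<String> next = new ArrayList<>(form.subList(0, i));
                next.addAll(body);
                next.addAll(form.subList(i + 1, form.size()));
                return next;
            }
        }
        throw new IllegalArgumentException("no nonterminal left to rewrite");
    }

    // Replays list =>* 9 - 5 + 2, one production per step.
    static List<String> derive() {
        List<String> form = List.of("list");                       // start symbol
        form = step(form, "list", List.of("list", "+", "digit"));  // list -> list + digit
        form = step(form, "list", List.of("list", "-", "digit"));  // list -> list - digit
        form = step(form, "list", List.of("digit"));               // list -> digit
        form = step(form, "digit", List.of("9"));                  // digit -> 9
        form = step(form, "digit", List.of("5"));                  // digit -> 5
        form = step(form, "digit", List.of("2"));                  // digit -> 2
        return form;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", derive()));
    }
}
```

The final sentential form contains no nonterminals, so it is a sentence of L(G); running main prints the yield 9 - 5 + 2.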
An Example
  Grammar
    Expr → ( Expr ) | Expr Op name | name
    Op   → + | - | x | /

  Derivation of (a + b) x c
    Expr ⇒ Expr Op c
         ⇒ ( Expr ) Op c
         ⇒ ( Expr Op b ) Op c
         ⇒ ( a Op b ) Op c
         ⇒ ( a + b ) Op c
         ⇒ ( a + b ) x c

Ambiguity
  A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
  Example: id + id * id with the grammar
    E → E + E | E * E | ( E ) | id
  has two distinct leftmost derivations:
    E ⇒ E + E                 E ⇒ E * E
      ⇒ id + E                  ⇒ E + E * E
      ⇒ id + E * E              ⇒ id + E * E
      ⇒ id + id * E             ⇒ id + id * E
      ⇒ id + id * id            ⇒ id + id * id

Example
  Consider the following context-free grammar
    G = <T, N, P, S> = <{+, -, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {string}, P, string>
    P = string → string + string | string - string | 0 | 1 | … | 9
  This grammar is ambiguous, because more than one parse tree represents the string 9-5+2.

Example
  [Figure: two parse trees for 9-5+2. In one, the root expands to string + string and the left subtree derives 9 - 5, grouping the input as (9 - 5) + 2. In the other, the root expands to string - string and the right subtree derives 5 + 2, grouping the input as 9 - (5 + 2).]

Ambiguity
  Dangling-else grammar
    stmt → if expr then stmt
         | if expr then stmt else stmt
         | other
  Example statement: if E1 then S1 else if E2 then S2 else S3

Eliminating Ambiguity (2)
  if E1 then if E2 then S1 else S2
  Under the dangling-else grammar this statement has two parse trees, because else S2 can be attached to either if. The usual rule is to match each else with the closest unmatched then.

Parsing
  The process of determining whether a string of terminals (tokens) can be generated by a grammar.
  Time complexity:
    For any CFG there is a parser that takes at most O(n^3) time to parse a string of n terminals.
    Linear algorithms suffice to parse essentially all languages that arise in practice.
  Two kinds of methods
    Top-down: constructs a parse tree from root to leaves
    Bottom-up: constructs a parse tree from leaves to root

Two Approaches to Syntax Analysis
  Top-down Parsing
    Leftmost derivation
    No left recursion allowed
    No common left factors allowed
    Unambiguous grammar
  Bottom-up Parsing
    Rightmost derivation
    No right recursion allowed
    No common right factors allowed
    Unambiguous grammar
  [Figure labels: RG, LL(1), LR(1), CFG]

Notational Conventions
  Terminals
    a, b, c, … ∈ T
    example: 0, 1, +, *, id, if
  Nonterminals
    A, B, C, … ∈ N
    example: expr, term, stmt
  Grammar symbols
    X, Y, Z ∈ (N ∪ T)
  Strings of terminals
    u, v, w, x, y, z ∈ T*
  Strings of grammar symbols (sentential forms)
    α, β, γ ∈ (N ∪ T)*
  The head of the first production is the start symbol, unless stated otherwise.
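To make the ambiguity of string → string + string | string - string | 0 | … | 9 concrete, the number of distinct parse trees for an input can simply be counted. The Java sketch below is a hypothetical illustration (the names AmbiguityCount and countTrees are not from the slides) and assumes single-character tokens. It tries every operator position as the root of the tree and multiplies the counts of the two sides.

```java
class AmbiguityCount {
    // Number of parse trees for s[lo, hi) under
    //   string -> string + string | string - string | 0 | ... | 9
    static long countTrees(String s, int lo, int hi) {
        if (hi - lo == 1) {
            // A single token must be a digit: string -> 0 | ... | 9
            return Character.isDigit(s.charAt(lo)) ? 1 : 0;
        }
        long total = 0;
        // Try each operator as the root production; both sides stay non-empty.
        for (int k = lo + 1; k < hi - 1; k++) {
            char c = s.charAt(k);
            if (c == '+' || c == '-') {
                total += countTrees(s, lo, k) * countTrees(s, k + 1, hi);
            }
        }
        return total;
    }

    static long countTrees(String s) {
        return s.isEmpty() ? 0 : countTrees(s, 0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(countTrees("9-5+2"));
    }
}
```

For 9-5+2 the count is 2, one tree per grouping, which is exactly what makes the grammar ambiguous. A count of 0 means the input is not in the language. The plain recursion is exponential in the worst case; it is meant for short inputs only.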
Top-down Parsing
  Recursive-descent parsing
  LL(1): Left-to-right scan, Leftmost derivation
  Creates the nodes of the parse tree in preorder (depth-first)

  Grammar
    E → T + T
    T → ( E )
    T → - E
    T → id

  Leftmost derivation
    E ⇒lm T + T
      ⇒lm id + T
      ⇒lm id + id

  [Figure: the parse tree after each step. First E with children T + T, then the left T expanded to id, then the right T expanded to id.]

Recursive Descent Parsing
  Every nonterminal has one (recursive) procedure responsible for parsing the nonterminal's syntactic category of input tokens.
  When a nonterminal has multiple productions, each production is implemented in a branch of a selection statement based on input lookahead information.

Recursive Descent Parsing
    void A() {
        Choose an A-production, A → X1 X2 … Xk;
        for (i = 1 to k) {
            if (Xi is a nonterminal)
                call procedure Xi();
            else if (Xi = current input symbol a)
                advance the input to the next symbol;
            else
                /* an error has occurred */
        }
    }

Conclusion: Parsing and Translation Scheme (Complete)
    import java.io.*;

    class Parser {
        static int lookahead;

        public Parser() throws IOException {
            lookahead = System.in.read();      // read the first input character
        }

        // expr: a term followed by any number of (+ term) or (- term);
        // the operator is written after its operands (postfix output).
        void expr() throws IOException {
            term();
            while (true) {
                if (lookahead == '+') {
                    match('+'); term(); System.out.write('+');
                } else if (lookahead == '-') {
                    match('-'); term(); System.out.write('-');
                } else {
                    return;
                }
            }
        }

        // term: a single digit, copied to the output
        void term() throws IOException {
            if (Character.isDigit((char) lookahead)) {
                System.out.write((char) lookahead);
                match(lookahead);
            } else {
                throw new Error("syntax error");
            }
        }

        // Consume the expected character t and read the next one.
        void match(int t) throws IOException {
            if (lookahead == t) lookahead = System.in.read();
            else throw new Error("syntax error");
        }
    }

LL(1) Grammar
  Predictive parsers, that is, recursive-descent parsers needing no backtracking, can be constructed for a class of grammars called LL(1).
  The first "L" means the input is scanned from left to right.
  The second "L" means a leftmost derivation is produced.
  The "1" means one input symbol of lookahead is used at each step to make parsing action decisions.
  An LL(1) grammar is not left-recursive.
  An LL(1) grammar is not ambiguous.
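The top-down grammar E → T + T, T → ( E ) | - E | id is itself suitable for a predictive parser, because each T-production starts with a different terminal. The following Java sketch is an illustration under assumptions: the class name TopDownRecognizer, the helper accepts, and the convention that tokens arrive as a list of strings with "$" standing for end of input are all hypothetical. There is one procedure per nonterminal, and T() selects its production from a single lookahead token.

```java
import java.util.Arrays;
import java.util.List;

class TopDownRecognizer {
    private final List<String> tokens;
    private int pos = 0;

    TopDownRecognizer(List<String> tokens) { this.tokens = tokens; }

    // Current lookahead token; "$" marks the end of the input.
    private String lookahead() { return pos < tokens.size() ? tokens.get(pos) : "$"; }

    // Consume the expected terminal or report a syntax error.
    private void match(String t) {
        if (!lookahead().equals(t)) {
            throw new IllegalStateException("syntax error at token " + pos);
        }
        pos++;
    }

    // E -> T + T
    void E() { T(); match("+"); T(); }

    // T -> ( E ) | - E | id, chosen by one token of lookahead
    void T() {
        switch (lookahead()) {
            case "(":  match("("); E(); match(")"); break;
            case "-":  match("-"); E(); break;
            case "id": match("id"); break;
            default:   throw new IllegalStateException("syntax error at token " + pos);
        }
    }

    // True if the whole token sequence is a sentence of the grammar.
    static boolean accepts(String... input) {
        TopDownRecognizer p = new TopDownRecognizer(Arrays.asList(input));
        try {
            p.E();
            p.match("$");   // nothing may follow the expression
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("id", "+", "id"));
    }
}
```

The order of procedure calls for id + id mirrors the leftmost derivation E ⇒lm T + T ⇒lm id + T ⇒lm id + id, and no backtracking is ever needed.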
FIRST and FOLLOW
  [Figure: a parse tree rooted at S whose frontier is α A a β, where A derives c γ]
  c is in FIRST(A)
  a is in FOLLOW(A)

FIRST and FOLLOW
  The construction of both top-down and bottom-up parsers is aided by two functions, FIRST and FOLLOW, associated with a grammar G.
  During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply.
  During panic-mode error recovery, sets of tokens produced by FOLLOW can be used as synchronizing tokens.

FIRST
  FIRST(α)
    The set of terminals that begin the strings derived from α
    (plus ε, if α can derive ε).

  FIRST(a) = { a }  if a ∈ T
  FIRST(ε) = { ε }
  FIRST(A) = ∪ FIRST(α)  over all A → α ∈ P
  FIRST(X1 X2 … Xk) =
    if ε ∈ FIRST(Xj) for all j = 1, …, i-1 then
      add the non-ε elements of FIRST(Xi) to FIRST(X1 X2 … Xk)
    if ε ∈ FIRST(Xj) for all j = 1, …, k then
      add ε to FIRST(X1 X2 … Xk)

FIRST (1)
  By the definition of FIRST, we can compute FIRST(X):
    If X ∈ T, then FIRST(X) = { X }.
    If X ∈ N and X → ε, then add ε to FIRST(X).
    If X ∈ N and X → Y1 Y2 … Yn, then add all non-ε elements of FIRST(Y1) to FIRST(X);
      if ε ∈ FIRST(Y1), then add all non-ε elements of FIRST(Y2) to FIRST(X);
      ...;
      if ε ∈ FIRST(Yi) for every i = 1, …, n, then add ε to FIRST(X).

FOLLOW
  FOLLOW(A)
    The set of terminals that can immediately follow nonterminal A in some sentential form.

  FOLLOW(A) =
    for all (B → α A β) ∈ P do
      add FIRST(β) - { ε } to FOLLOW(A)
    for all (B → α A β) ∈ P with ε ∈ FIRST(β) do
      add FOLLOW(B) to FOLLOW(A)
    for all (B → α A) ∈ P do
      add FOLLOW(B) to FOLLOW(A)
    if A is the start symbol S then
      add $ to FOLLOW(A)

FOLLOW (1)
  By the definition of FOLLOW, we can compute FOLLOW(X):
    Put $ into FOLLOW(S).
    For each A → α B β, add all non-ε elements of FIRST(β) to FOLLOW(B).
    For each A → α B, or A → α B β where ε ∈ FIRST(β), add all of FOLLOW(A) to FOLLOW(B).
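The FIRST and FOLLOW rules can be applied repeatedly until no set changes, which gives a simple fixed-point computation. The Java sketch below is a minimal illustration, not a definitive implementation. The class FirstFollow, its methods add, firstOf and compute, and the factory expressionGrammar are hypothetical names. It assumes that a symbol is a nonterminal exactly when it has at least one production, and that an empty body stands for an ε-production. The grammar built in expressionGrammar is the usual non-left-recursive expression grammar, chosen here as test input.

```java
import java.util.*;

class FirstFollow {
    static final String EPS = "\u03b5";   // the symbol ε, written as an escape
    static final String END = "$";        // end-of-input marker
    final Map<String, List<List<String>>> prods = new LinkedHashMap<>();
    final Map<String, Set<String>> first = new HashMap<>();
    final Map<String, Set<String>> follow = new HashMap<>();
    final String start;

    FirstFollow(String start) { this.start = start; }

    // Register head -> body; calling add(head) alone registers head -> ε.
    void add(String head, String... body) {
        prods.computeIfAbsent(head, k -> new ArrayList<>()).add(Arrays.asList(body));
    }

    // FIRST(X1 X2 ... Xk) with respect to the FIRST sets computed so far.
    Set<String> firstOf(List<String> symbols) {
        Set<String> out = new TreeSet<>();
        for (String x : symbols) {
            if (!prods.containsKey(x)) { out.add(x); return out; }   // terminal
            for (String t : first.get(x)) if (!t.equals(EPS)) out.add(t);
            if (!first.get(x).contains(EPS)) return out;             // Xi not nullable
        }
        out.add(EPS);   // every Xi can derive ε, or the string is empty
        return out;
    }

    void compute() {
        for (String a : prods.keySet()) {
            first.put(a, new TreeSet<>());
            follow.put(a, new TreeSet<>());
        }
        boolean changed = true;
        while (changed) {                       // FIRST: iterate to a fixed point
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : prods.entrySet())
                for (List<String> body : e.getValue())
                    changed |= first.get(e.getKey()).addAll(firstOf(body));
        }
        follow.get(start).add(END);             // put $ into FOLLOW(S)
        changed = true;
        while (changed) {                       // FOLLOW: iterate to a fixed point
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : prods.entrySet())
                for (List<String> body : e.getValue())
                    for (int i = 0; i < body.size(); i++) {
                        String b = body.get(i);
                        if (!prods.containsKey(b)) continue;         // skip terminals
                        Set<String> rest = firstOf(body.subList(i + 1, body.size()));
                        boolean nullable = rest.remove(EPS);
                        changed |= follow.get(b).addAll(rest);       // FIRST(β) - {ε}
                        if (nullable)                                // add FOLLOW(head)
                            changed |= follow.get(b).addAll(new ArrayList<>(follow.get(e.getKey())));
                    }
        }
    }

    // Assumed test input: E -> T E', E' -> + T E' | ε, T -> F T', T' -> * F T' | ε, F -> ( E ) | id
    static FirstFollow expressionGrammar() {
        FirstFollow g = new FirstFollow("E");
        g.add("E", "T", "E'");
        g.add("E'", "+", "T", "E'");
        g.add("E'");
        g.add("T", "F", "T'");
        g.add("T'", "*", "F", "T'");
        g.add("T'");
        g.add("F", "(", "E", ")");
        g.add("F", "id");
        g.compute();
        return g;
    }

    public static void main(String[] args) {
        FirstFollow g = expressionGrammar();
        System.out.println("FIRST  = " + g.first);
        System.out.println("FOLLOW = " + g.follow);
    }
}
```

Both loops terminate because the sets only grow and are bounded by the finite set of terminals plus ε and $. Computing FIRST completely before FOLLOW keeps firstOf stable during the second loop.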
For Example
  Given a grammar G
    E  → T E’
    E’ → + T E’ | ε
    T  → F T’
    T’ → * F T’ | ε
    F  → ( E ) | id

  Nonterminal   FIRST     FOLLOW
  E             ( id      $ )
  E’            + ε       $ )
  T             ( id      + $ )
  T’            * ε       + $ )
  F             ( id      * + $ )

Using FIRST and FOLLOW to Write a Recursive Descent Parser
  Grammar
    expr → term rest
    rest → + term rest
         | - term rest
         | ε
    term → id

  FIRST(+ term rest) = { + }
  FIRST(- term rest) = { - }
  FOLLOW(rest) = { $ }

    rest() {
        if (lookahead in FIRST(+ term rest)) {
            match('+'); term(); rest()
        } else if (lookahead in FIRST(- term rest)) {
            match('-'); term(); rest()
        } else if (lookahead in FOLLOW(rest))
            return
        else
            error()
    }
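The rest() procedure translates almost line for line into Java. The sketch below is a hypothetical illustration: the class PredictiveRest, the accepts helper, and the token-list input with "$" as the end marker are assumptions, while the three sets are the ones given for this grammar. The ε-production is selected when the lookahead is in FOLLOW(rest).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

class PredictiveRest {
    // Selection sets for the three rest-productions.
    static final Set<String> FIRST_PLUS = Set.of("+");    // FIRST(+ term rest)
    static final Set<String> FIRST_MINUS = Set.of("-");   // FIRST(- term rest)
    static final Set<String> FOLLOW_REST = Set.of("$");   // FOLLOW(rest)

    private final List<String> tokens;
    private int pos = 0;

    PredictiveRest(List<String> tokens) { this.tokens = tokens; }

    // Current lookahead token; "$" marks the end of the input.
    private String lookahead() { return pos < tokens.size() ? tokens.get(pos) : "$"; }

    private void match(String t) {
        if (!lookahead().equals(t)) {
            throw new IllegalStateException("syntax error at token " + pos);
        }
        pos++;
    }

    // expr -> term rest
    void expr() { term(); rest(); }

    // rest -> + term rest | - term rest | ε
    void rest() {
        if (FIRST_PLUS.contains(lookahead())) {
            match("+"); term(); rest();
        } else if (FIRST_MINUS.contains(lookahead())) {
            match("-"); term(); rest();
        } else if (FOLLOW_REST.contains(lookahead())) {
            return;                                  // apply rest -> ε
        } else {
            throw new IllegalStateException("syntax error at token " + pos);
        }
    }

    // term -> id
    void term() { match("id"); }

    // True if the token sequence is a sentence of the grammar.
    static boolean accepts(String... input) {
        PredictiveRest p = new PredictiveRest(Arrays.asList(input));
        try {
            p.expr();        // rest() only returns when the lookahead is $
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("id", "-", "id", "+", "id"));
    }
}
```

Because the three selection sets are pairwise disjoint, one token of lookahead always identifies a single production, which is the LL(1) condition for rest.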