Chapter 4 Lexical and Syntax Analysis ISBN 0-321-49362-1 Chapter 4 Topics • • • • • Introduction Lexical Analysis The Parsing Problem Recursive-Descent Parsing Bottom-Up Parsing Copyright © 2009 Addison-Wesley. All rights reserved. 1-2 Introduction • Language implementation systems must analyze source code, regardless of the specific implementation approach • Nearly all syntax analysis is based on a formal description of the syntax of the source language (BNF) Copyright © 2009 Addison-Wesley. All rights reserved. 1-3 Syntax Analysis • The syntax analysis portion of a language processor nearly always consists of two parts: – A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar) – A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF) Copyright © 2009 Addison-Wesley. All rights reserved. 1-4 Advantages of Using BNF to Describe Syntax • Provides a clear and concise syntax description • The parser can be based directly on the BNF • Parsers based on BNF are easy to maintain Copyright © 2009 Addison-Wesley. All rights reserved. 1-5 Reasons to Separate Lexical and Syntax Analysis • Simplicity - less complex approaches can be used for lexical analysis; separating them simplifies the parser • Efficiency - separation allows optimization of the lexical analyzer • Portability - parts of the lexical analyzer may not be portable, but the parser always is portable Copyright © 2009 Addison-Wesley. All rights reserved. 1-6 Lexical Analysis • A lexical analyzer is a pattern matcher for character strings • A lexical analyzer is a “front-end” for the parser • Identifies substrings of the source program that belong together - lexemes – Lexemes match a character pattern, which is associated with a lexical category called a token – sum is a lexeme; its token may be IDENT Copyright © 2009 Addison-Wesley. All rights reserved. 1-7 Lexical Analysis (continued) • The lexical analyzer is usually a function that is called by the parser when it needs the next token • Three approaches to building a lexical analyzer: – Write a formal description of the tokens and use a software tool that constructs table-driven lexical analyzers given such a description – Design a state diagram that describes the tokens and write a program that implements the state diagram – Design a state diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram Copyright © 2009 Addison-Wesley. All rights reserved. 1-8 State Diagram Design – A naïve state diagram would have a transition from every state on every character in the source language - such a diagram would be very large! Copyright © 2009 Addison-Wesley. All rights reserved. 1-9 Lexical Analysis (cont.) • In many cases, transitions can be combined to simplify the state diagram – When recognizing an identifier, all uppercase and lowercase letters are equivalent • Use a character class that includes all letters – When recognizing an integer literal, all digits are equivalent - use a digit class Copyright © 2009 Addison-Wesley. All rights reserved. 1-10 Lexical Analysis (cont.) • Reserved words and identifiers can be recognized together (rather than having a part of the diagram for each reserved word) – Use a table lookup to determine whether a possible identifier is in fact a reserved word Copyright © 2009 Addison-Wesley. All rights reserved. 1-11 State Diagram Copyright © 2009 Addison-Wesley. All rights reserved. 1-12 digit 常數 開始 i digit i+1 非digit i+2 識別字 開始 j letter j+1 非letter 且非digit j+2 letter 或 digit 任何一種單語(token)均可用如上述之 transition diagram 表示! Q1:這些 transition diagrams 如何自動被產生? A: 詳情請看下圖說明 Copyright © 2004 Pearson Addison-Wesley. All rights reserved. For automatic SCANNER !! 4-13 RE 正規表示式(Regular Expression) NFA 非決定性有限自動機(Nondeterministic Finite State Automaton) DFA 決定性有限自動機(Deterministic Finite State Automaton) 最小之決定性有限自動機(Minimized DFA) = Transition Diagram Lex by Lesk in 1975 Source Program MDFA + Driver Copyright © 2004 Pearson Addison-Wesley. All rights reserved. token 4-14 . . . ““ ; /* IGNORE BLANKS */ let return(LET); “*” return(MUL); “=” return (ASSIGN); [a-zA-Z] [a-zA-Z0-9]* {make entries in tables; return (ID)} (a) %token ASSIGN ID LET MUL ... . . . statement : LET ID ASSIGN EXPR {...} expr : expr MUL expr {$$ = build ( MUL, $1 , $3 ) ; } expr : ID {...} . . . (b) Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-15 The Parsing Problem • Goals of the parser, given an input program: – Find all syntax errors; for each, produce an appropriate diagnostic message and recover quickly – Produce the parse tree, or at least a trace of the parse tree, for the program Copyright © 2009 Addison-Wesley. All rights reserved. 蹤影 1-16 The Parsing Problem (cont.) • Two categories of parsers – Top down - produce the parse tree, beginning at the root – Bottom up - produce the parse tree, beginning at the leaves • Useful parsers look only one token ahead in the input Copyright © 2009 Addison-Wesley. All rights reserved. 1-17 The Parsing Problem (cont.) • Top-down Parsers – Given a sentential form, xA , the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A • The most common top-down parsing algorithms: – Recursive descent - a coded implementation – LL parsers - table driven implementation Copyright © 2009 Addison-Wesley. All rights reserved. 1-18 The Parsing Problem (cont.) • Bottom-up parsers – Given a right sentential form, , determine what substring of is the right-hand side of the rule in the grammar that must be reduced to produce the previous sentential form in the right derivation – The most common bottom-up parsing algorithms are in the LR family Copyright © 2009 Addison-Wesley. All rights reserved. 1-19 The Parsing Problem (cont.) • The Complexity of Parsing – Parsers that work for any unambiguous grammar are complex and inefficient[ due to reparsing] ( O(n3), where n is the length of the input ) – Compilers use parsers that only work for a subset of all unambiguous grammars, but do it in linear time ( O(n), where n is the length of the input ) 目前泰半之 編譯程式 屬之 Copyright © 2009 Addison-Wesley. All rights reserved. 1-20 Recursive-Descent Parsing • There is a subprogram for each nonterminal in the grammar, which can parse sentences that can be generated by that nonterminal • EBNF is ideally suited for being the basis for a recursive-descent parser, because EBNF minimizes the number of nonterminals Copyright © 2009 Addison-Wesley. All rights reserved. 1-21 (一) 最早的語法解析方式 1. 利用recursive procedure撰寫 2. 可能需要back-tracking token 例示 S ::= cAd A ::= ab A ::= a 若input string 是 cad 其 top-down parsing 如下: S c A (1) S S d c A a d b (2) c A d a (3) Procedure A( ) begin Procedure S( ) isave = input-point begin if input symbol = ‘a’ then if input symbol = ‘c’ then ADVANCE( ) ADVANCE( ) if input symbol = ‘b’ then if A( ) then ADVANCE( ) if input symbol = ‘d’ then return true ADVANCE( ) end if return true input-point = isave // 無法找到 ab // end if if input symbol = ‘a’ then ADVANCE( ) end if return true end if end if return false else end return false Copyright © 2004 Pearson Addison-Wesley. All rights endreserved. if end 4-22 1. 若文法有 left-recursion 則陷入無限循環; 如 A A . 上述方法之缺失 2. Backtracking 造成時間浪費, 若能 lookahead 則可免 . 3. 選擇 production rules之次序會影響執行parsing 之效率 . 4. 一旦錯誤被檢測出, 其 error messages 往往只能用 syntax error 唐塞. 改善之策 1. Elimination of Left Recursion 若 A A 1 | A 2 | . . . A n | 1 | 2 | . . . | m 則可改成: A 1 A’| 2 A’| . . . | m A’ A’ 1 A’ | 2 A’ | . . . n A’| 實例 1. 2. 3. 4. 5. 6. E E T T F F ::= E + T ::= T ::= T * F ::= F ::= (E) ::= id Copyright © 2004 Pearson Addison-Wesley. All rights reserved. E TE’ E’ +TE’ | T FT’ T’ *FT’ | F (E) | id 4-23 Recursive-Descent Parsing (cont.) • A nonterminal that has more than one RHS requires an initial process to determine which RHS is the right one. – The correct RHS is chosen on the basis of the next token of input (the lookahead) – The next token is compared with the first token that can be generated by each RHS until a match is found – If no match is found, it is a syntax error Copyright © 2009 Addison-Wesley. All rights reserved. 1-24 Recursive-Descent Parsing (cont.) • The LL Grammar Class – The Left Recursion Problem • If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser – A grammar can be modified to remove left recursion For each nonterminal, A, 1. Group the A-rules as A → Aα1 | … | Aαm | β1 | β2 | … | βn where none of the β‘s begins with A 2. Replace the original A-rules with A → β1A’ | β2A’ | … | βnA’ A’ → α1A’ | α2A’ | … | αmA’ | ε Copyright © 2009 Addison-Wesley. All rights reserved. 1-25 Recursive-Descent Parsing • The other characteristic of grammars that disallows top-down parsing is the lack of pairwise disjointness – The inability to determine the correct RHS on the basis of one token of lookahead – Def: FIRST() = {a | =>* a } (If =>* , is in FIRST()) Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-26 Recursive-Descent Parsing • Pairwise Disjointness Test: – For each nonterminal, A, in the grammar that has more than one RHS, for each pair of rules, A i and A j, it must be true that FIRST(i) ∩ FIRST(j) = • Examples: A a | bB | cAb is OK A a | aB is NOT OK ! Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-27 Recursive-Descent Parsing • Left factoring can resolve the problem Replace <variable> identifier | identifier [<expression>] with <variable> identifier <new> <new> | [<expression>] or <variable> identifier [[<expression>]] (the outer brackets are metasymbols of EBNF) Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-28 Bottom-up Parsing • The parsing problem is finding the correct RHS in a right-sentential form to reduce to get the previous right-sentential form in the derivation Copyright © 2009 Addison-Wesley. All rights reserved. 1-29 Bottom-up Parsing • The parsing problem is finding the correct RHS in a right-sentential form to reduce to get the previous right-sentential form in the derivation Pls refer to the next page Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-30 Given the grammar: EE+T |T TT*F |F F (E) | id Rightmost derivation EE+T E+T*F E + T * id E + F * id E + id * id T + id * id F + id * id id + id * id Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-31 Bottom-up Parsing • Advantages of LR parsers: – They will work for nearly all grammars that describe programming languages. – They work on a larger class of grammars than other bottom-up algorithms, but are as efficient as any other bottom-up parser. – They can detect syntax errors as soon as it is possible. – The LR class of grammars is a superset of the class parsable by LL parsers. Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-32 Bottom-up Parsing • LR parsers must be constructed with a tool • Knuth’s insight: A bottom-up parser could use the entire history of the parse, up to the current point, to make parsing decisions – There were only a finite and relatively small number of different parse situations that could have occurred, so the history could be stored in a parser state, on the parse stack Historically sensitive states are there! Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-33 Bottom-up Parsing • An LR configuration stores the state of an LR parser (S0X1S1X2S2…XmSm, aiai+1…an$) Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-34 Bottom-up Parsing • LR parsers are table driven, where the table has two components, an ACTION table and a GOTO table – The ACTION table specifies the action of the parser, given the parser state and the next token • Rows are state names; columns are terminals – The GOTO table specifies which state to put on top of the parse stack after a reduction action is done • Rows are state names; columns are nonterminals Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-35 Structure of An LR Parser Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-36 語法解析程式之設計: 如何判定輸入敘述句合乎文法 ? • 語言由文法來描述其 static 結構 • 文法藉由 BNF (Backus-Normal / Naur Form)表達 英文句子( Sentence )用 BNF 描述,則如下述 <Sentence> :: = <Subject> <Verb> <Object> non-terminal <Subject> :: = <Noun> | <Noun> <Adverb> <Verb> :: = likes | gets | helps <Object> :: = <Noun> | <Adjective> <Noun> <Noun> ::= Horse | Man <Adverb> ::= extremely | always <Adjective> ::= beautiful | dirty terminal 終端記號 以上述之文法( 以 BNF 描述者 )可以判斷下面三句是 真句子或不合乎句子之文法 Horse always helps dirty Man. (是句子) dirty Man extremely likes beautiful Horse. (不是句子) Man dirty gets Horse beautiful. (不是句子) Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-37 以圖形(構文樹 Parse tree ; 語法樹)表示上頁之第一個句子 < Sentence > < Noun > < Adverb > Horse always < Adjective > < Noun > dirty Man helps Button-up parsing < Object > < Verb > < Subject > 而語法(構文)解析程式之功能如下圖: ( 輸入 ) 單語串 驅動副程式 driver 構文解析表 parsing table 文法解析 : parsing Copyright © 2004 Pearson Addison-Wesley. All rights reserved. ( 輸出 ) 構文樹 4-38 例如:Simplified Pascal Grammar. <prog> ::= PROGRAM <prog-name> VAR <dec-list> BEGIN <stmt-list> END. <prog-name> ::= id <dec-list> ::= <dec> | <dec-list> ; <dec> <dec> ::= <id-list> : <type> <type> ::= INTEGER <id-list> ::= id | <id-list> , id <stmt-list> ::= <stmt> | <stmt-list> ; <stmt> <stmt> ::= <assign> | <read> | <write> | <for> <assign> ::= id := <exp> <exp> ::= <term> | <exp> + <term> | <exp> - <term> <term> ::= <factor> | <term> * <factor> | <term> DIV <factor> <factor> ::= id | int | (<exp>) <read> ::= READ (<id-list>) <write> ::= WRITE (<id-list>) <for> ::= FOR <index-exp> DO <body> <index-exp> ::= id := <exp> TO <exp> <body> <stmt> | BEGIN <stmt-list> END Copyright © 2004 ::= Pearson Addison-Wesley. All rights reserved. 4-39 ℗ 根據 文法規則 將 單語串 建構成 語法樹 語法解析程式 單語串 Driver Driver stack Parsing table 語法樹 如何建構 語法解析表 (parsing table)? 方式有二: 1. Bottom-up parsing 2. Top-down parsing Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-40 1. (LR 解析法) 例示:有一組文法G定義如下: 1. 2. 3. 4. 5. 6. E E T T F F ::= E + T ::= T ::= T * F ::= F ::= (E) ::= id acc: accept; 接受 (done!) si: shift to state i ri: reduce the ith production blank: error occurs 根據左邊之文法G定義之文法解析表(parsing state Action id 0 1 2 3 4 5 6 7 8 9 10 11 table) + * s5 Goto ( ) $ s4 s6 r2 s7 r2 acc r2 r4 r4 r4 r4 s5 s4 r6 r6 s5 s5 r6 E T F 1 2 3 8 2 3 9 3 10 r6 s4 s4 s6 r1 r3 r5 s7 r3 r5 如下: s11 r1 r3 r5 r1 r3 r5 LR Parser 單語串 Input buffer Driver Driver stack Copyright © 2004 Pearson Addison-Wesley. All rights reserved. Parsing table 語法樹 4-41 STACK INPUT BUFFER actions id * id + id $ s5 0 id 5 * id + id $ r6 0 F 3 * id + id $ goto 3及 r4 0 T 2 * id + id $ goto 2及 s7 0 0 T 2 * 7 id + id $ s5 0 T 2 * 7 id 5 + id $ r6 0 T 2 * 7 F 10 + id $ goto 10 及r3 0 T 2 + id $ goto 2及r2 0 E 1 + id $ goto 1及s6 0 E 1 + 6 id $ s5 0 E 1 + 6 id 5 $ r6 0 E 1 + 6 F 3 $ goto 3及r4 0 E 1 + 6 T 9 $ goto 9及r1 0 E 1 $ goto 1及accept ! Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-42 E 8 E 5 T 2 F 1 * 4 F 3 T 7 F 6 id id Bottom-up parsing T + id id * id + id F * id + id T * id + id T * F + id T + id E + id E + F E + T E Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-43 Bottom-up Parsing • Initial configuration: (S0, a1…an$) • Parser actions: – If ACTION[Sm, ai] = Shift S, the next configuration is: (S0X1S1X2S2…XmSmaiS, ai+1…an$) – If ACTION[Sm, ai] = Reduce A and S = GOTO[Sm-r, A], where r = the length of , the next configuration is (S0X1S1X2S2…Xm-rSm-rAS, aiai+1…an$) Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-44 Bottom-up Parsing • Parser actions (continued): – If ACTION[Sm, ai] = Accept, the parse is complete and no errors were found. – If ACTION[Sm, ai] = Error, the parser calls an error-handling routine. Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-45 LR Parsing Table Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-46 Bottom-up Parsing • A parser table can be generated from a given grammar with a tool, e.g., yacc yet another compiler-compilers Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 4-47 Summary • Syntax analysis is a common part of language implementation • A lexical analyzer is a pattern matcher that isolates small-scale parts of a program – Detects syntax errors – Produces a parse tree • A recursive-descent parser is an LL parser – EBNF • Parsing problem for bottom-up parsers: find the substring of current sentential form • The LR family of shift-reduce parsers is the most common bottom-up parsing approach Copyright © 2009 Addison-Wesley. All rights reserved. 1-48