Ch4

Chapter 4
Lexical and Syntax
Analysis
ISBN 0-321-49362-1
Chapter 4 Topics
•
•
•
•
•
Introduction
Lexical Analysis
The Parsing Problem
Recursive-Descent Parsing
Bottom-Up Parsing
Copyright © 2009 Addison-Wesley. All rights reserved.
1-2
Introduction
• Language implementation systems must
analyze source code, regardless of the
specific implementation approach
• Nearly all syntax analysis is based on a
formal description of the syntax of the
source language (BNF)
Copyright © 2009 Addison-Wesley. All rights reserved.
1-3
Syntax Analysis
• The syntax analysis portion of a language
processor nearly always consists of two
parts:
– A low-level part called a lexical analyzer
(mathematically, a finite automaton based on a
regular grammar)
– A high-level part called a syntax analyzer, or
parser (mathematically, a push-down
automaton based on a context-free grammar,
or BNF)
Copyright © 2009 Addison-Wesley. All rights reserved.
1-4
Advantages of Using BNF to Describe
Syntax
• Provides a clear and concise syntax
description
• The parser can be based directly on the BNF
• Parsers based on BNF are easy to maintain
Copyright © 2009 Addison-Wesley. All rights reserved.
1-5
Reasons to Separate Lexical and Syntax
Analysis
• Simplicity - less complex approaches can
be used for lexical analysis; separating
them simplifies the parser
• Efficiency - separation allows optimization
of the lexical analyzer
• Portability - parts of the lexical analyzer
may not be portable, but the parser always
is portable
Copyright © 2009 Addison-Wesley. All rights reserved.
1-6
Lexical Analysis
• A lexical analyzer is a pattern matcher for
character strings
• A lexical analyzer is a “front-end” for the
parser
• Identifies substrings of the source program
that belong together - lexemes
– Lexemes match a character pattern, which is
associated with a lexical category called a token
– sum is a lexeme; its token may be IDENT
Copyright © 2009 Addison-Wesley. All rights reserved.
1-7
Lexical Analysis (continued)
• The lexical analyzer is usually a function that is
called by the parser when it needs the next token
• Three approaches to building a lexical analyzer:
– Write a formal description of the tokens and use a
software tool that constructs table-driven lexical
analyzers given such a description
– Design a state diagram that describes the tokens and
write a program that implements the state diagram
– Design a state diagram that describes the tokens and
hand-construct a table-driven implementation of the
state diagram
Copyright © 2009 Addison-Wesley. All rights reserved.
1-8
State Diagram Design
– A naïve state diagram would have a transition
from every state on every character in the
source language - such a diagram would be
very large!
Copyright © 2009 Addison-Wesley. All rights reserved.
1-9
Lexical Analysis (cont.)
• In many cases, transitions can be combined
to simplify the state diagram
– When recognizing an identifier, all uppercase
and lowercase letters are equivalent
• Use a character class that includes all letters
– When recognizing an integer literal, all digits are
equivalent - use a digit class
Copyright © 2009 Addison-Wesley. All rights reserved.
1-10
Lexical Analysis (cont.)
• Reserved words and identifiers can be
recognized together (rather than having a
part of the diagram for each reserved word)
– Use a table lookup to determine whether a
possible identifier is in fact a reserved word
Copyright © 2009 Addison-Wesley. All rights reserved.
1-11
State Diagram
Copyright © 2009 Addison-Wesley. All rights reserved.
1-12
digit
常數
開始
i
digit
i+1
非digit
i+2
識別字
開始
j
letter
j+1
非letter
且非digit
j+2
letter 或
digit
任何一種單語(token)均可用如上述之 transition diagram 表示!
Q1:這些 transition diagrams 如何自動被產生?
A: 詳情請看下圖說明
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
For automatic SCANNER !!
4-13
RE
正規表示式(Regular Expression)
NFA
非決定性有限自動機(Nondeterministic Finite State Automaton)
DFA
決定性有限自動機(Deterministic Finite State Automaton)
最小之決定性有限自動機(Minimized DFA) = Transition Diagram
Lex by Lesk in 1975
Source
Program
MDFA + Driver
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
token
4-14
.
.
.
““
; /* IGNORE BLANKS */
let
return(LET);
“*”
return(MUL);
“=”
return (ASSIGN);
[a-zA-Z] [a-zA-Z0-9]*
{make entries in tables; return (ID)}
(a)
%token
ASSIGN ID LET MUL ...
.
.
.
statement : LET ID ASSIGN EXPR
{...}
expr
: expr MUL expr
{$$ = build ( MUL, $1 , $3 ) ; }
expr
: ID
{...}
.
.
.
(b)
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-15
The Parsing Problem
• Goals of the parser, given an input program:
– Find all syntax errors; for each, produce an
appropriate diagnostic message and recover
quickly
– Produce the parse tree, or at least a trace of the
parse tree, for the program
Copyright © 2009 Addison-Wesley. All rights reserved.
蹤影
1-16
The Parsing Problem (cont.)
• Two categories of parsers
– Top down - produce the parse tree, beginning
at the root
– Bottom up - produce the parse tree, beginning
at the leaves
• Useful parsers look only one token ahead in
the input
Copyright © 2009 Addison-Wesley. All rights reserved.
1-17
The Parsing Problem (cont.)
• Top-down Parsers
– Given a sentential form, xA , the parser must
choose the correct A-rule to get the next
sentential form in the leftmost derivation, using
only the first token produced by A
• The most common top-down parsing
algorithms:
– Recursive descent - a coded implementation
– LL parsers - table driven implementation
Copyright © 2009 Addison-Wesley. All rights reserved.
1-18
The Parsing Problem (cont.)
• Bottom-up parsers
– Given a right sentential form, , determine what
substring of  is the right-hand side of the rule
in the grammar that must be reduced to
produce the previous sentential form in the
right derivation
– The most common bottom-up parsing
algorithms are in the LR family
Copyright © 2009 Addison-Wesley. All rights reserved.
1-19
The Parsing Problem (cont.)
• The Complexity of Parsing
– Parsers that work for any unambiguous
grammar are complex and inefficient[ due to
reparsing] ( O(n3), where n is the length of the
input )
– Compilers use parsers that only work for a
subset of all unambiguous grammars, but do it
in linear time ( O(n), where n is the length of the
input ) 目前泰半之 編譯程式 屬之
Copyright © 2009 Addison-Wesley. All rights reserved.
1-20
Recursive-Descent Parsing
• There is a subprogram for each
nonterminal in the grammar, which can
parse sentences that can be generated by
that nonterminal
• EBNF is ideally suited for being the basis for
a recursive-descent parser, because EBNF
minimizes the number of nonterminals
Copyright © 2009 Addison-Wesley. All rights reserved.
1-21
(一) 最早的語法解析方式
1. 利用recursive procedure撰寫
2. 可能需要back-tracking token
例示
S ::= cAd
A ::= ab
A ::= a
若input string 是 cad 其 top-down parsing 如下:
S
c
A
(1)
S
S
d
c
A
a
d
b
(2)
c
A
d
a
(3)
Procedure A( )
begin
Procedure S( )
isave = input-point
begin
if input symbol = ‘a’ then
if input symbol = ‘c’ then
ADVANCE( )
ADVANCE( )
if input symbol = ‘b’ then
if A( ) then
ADVANCE( )
if input symbol = ‘d’ then
return true
ADVANCE( )
end if
return true
input-point = isave
// 無法找到 ab //
end if
if input symbol = ‘a’ then
ADVANCE( )
end if
return true
end if
end if
return false
else
end
return false
Copyright © 2004 Pearson Addison-Wesley. All rights
endreserved.
if
end
4-22
1. 若文法有 left-recursion 則陷入無限循環; 如 A  A .
上述方法之缺失
2. Backtracking 造成時間浪費, 若能 lookahead 則可免 .
3. 選擇 production rules之次序會影響執行parsing 之效率 .
4. 一旦錯誤被檢測出, 其 error messages 往往只能用 syntax error 唐塞.
改善之策
1. Elimination of Left Recursion
若 A  A 1 | A 2 | . . . A n |  1 |  2 | . . . |  m
則可改成:
A   1 A’|  2 A’| . . . |  m A’
A’   1 A’ |  2 A’ | . . .  n A’| 
實例
1.
2.
3.
4.
5.
6.
E
E
T
T
F
F
::= E + T
::= T
::= T * F
::= F
::= (E)
::= id
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
E  TE’
E’  +TE’ | 
T  FT’
T’ *FT’ | 
F  (E) | id
4-23
Recursive-Descent Parsing (cont.)
• A nonterminal that has more than one RHS
requires an initial process to determine
which RHS is the right one.
– The correct RHS is chosen on the basis of the
next token of input (the lookahead)
– The next token is compared with the first token
that can be generated by each RHS until a match
is found
– If no match is found, it is a syntax error
Copyright © 2009 Addison-Wesley. All rights reserved.
1-24
Recursive-Descent Parsing (cont.)
• The LL Grammar Class
– The Left Recursion Problem
• If a grammar has left recursion, either direct or
indirect, it cannot be the basis for a top-down
parser
– A grammar can be modified to remove left recursion
For each nonterminal, A,
1. Group the A-rules as A → Aα1 | … | Aαm | β1 | β2 | … |
βn
where none of the β‘s begins with A
2. Replace the original A-rules with
A → β1A’ | β2A’ | … | βnA’
A’ → α1A’ | α2A’ | … | αmA’ | ε
Copyright © 2009 Addison-Wesley. All rights reserved.
1-25
Recursive-Descent Parsing
• The other characteristic of grammars that
disallows top-down parsing is the lack of
pairwise disjointness
– The inability to determine the correct RHS on
the basis of one token of lookahead
– Def: FIRST() = {a |  =>* a }
(If  =>* ,  is in FIRST())
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-26
Recursive-Descent Parsing
• Pairwise Disjointness Test:
– For each nonterminal, A, in the grammar that
has more than one RHS, for each pair of rules, A
 i and A  j, it must be true that
FIRST(i) ∩ FIRST(j) = 
• Examples:
A  a | bB | cAb
is OK
A  a | aB
is NOT OK !
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-27
Recursive-Descent Parsing
• Left factoring can resolve the problem
Replace
<variable>  identifier | identifier [<expression>]
with
<variable>  identifier <new>
<new>   | [<expression>]
or
<variable>  identifier [[<expression>]]
(the outer brackets are metasymbols of EBNF)
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-28
Bottom-up Parsing
• The parsing problem is finding the correct
RHS in a right-sentential form to reduce to
get the previous right-sentential form in
the derivation
Copyright © 2009 Addison-Wesley. All rights reserved.
1-29
Bottom-up Parsing
• The parsing problem is finding the correct
RHS in a right-sentential form to reduce to
get the previous right-sentential form in
the derivation
Pls refer to the next page
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-30
Given the grammar:
EE+T |T
TT*F |F
F  (E) | id
Rightmost derivation
EE+T
E+T*F
 E + T * id
 E + F * id
 E + id * id
 T + id * id
 F + id * id
 id + id * id
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-31
Bottom-up Parsing
• Advantages of LR parsers:
– They will work for nearly all grammars that
describe programming languages.
– They work on a larger class of grammars than
other bottom-up algorithms, but are as efficient
as any other bottom-up parser.
– They can detect syntax errors as soon as it is
possible.
– The LR class of grammars is a superset of the
class parsable by LL parsers.
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-32
Bottom-up Parsing
• LR parsers must be constructed with a tool
• Knuth’s insight: A bottom-up parser could
use the entire history of the parse, up to
the current point, to make parsing
decisions
– There were only a finite and relatively small
number of different parse situations that could
have occurred, so the history could be stored in
a parser state, on the parse stack
Historically sensitive states are there!
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-33
Bottom-up Parsing
• An LR configuration stores the state of an
LR parser
(S0X1S1X2S2…XmSm, aiai+1…an$)
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-34
Bottom-up Parsing
• LR parsers are table driven, where the
table has two components, an ACTION
table and a GOTO table
– The ACTION table specifies the action of the
parser, given the parser state and the next
token
• Rows are state names; columns are terminals
– The GOTO table specifies which state to put
on top of the parse stack after a reduction
action is done
• Rows are state names; columns are nonterminals
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-35
Structure of An LR Parser
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-36
語法解析程式之設計: 如何判定輸入敘述句合乎文法 ?
• 語言由文法來描述其 static 結構
• 文法藉由 BNF (Backus-Normal / Naur Form)表達
英文句子( Sentence )用 BNF 描述,則如下述
<Sentence> :: = <Subject> <Verb> <Object>
non-terminal <Subject> :: = <Noun> | <Noun> <Adverb>
<Verb> :: = likes | gets | helps
<Object> :: = <Noun> | <Adjective> <Noun>
<Noun> ::= Horse | Man
<Adverb> ::= extremely | always
<Adjective> ::= beautiful | dirty
terminal 終端記號
以上述之文法( 以 BNF 描述者 )可以判斷下面三句是
真句子或不合乎句子之文法
Horse always helps dirty Man.
(是句子)
dirty Man extremely likes beautiful Horse.
(不是句子)
Man dirty gets Horse beautiful.
(不是句子)
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-37
以圖形(構文樹 Parse tree ; 語法樹)表示上頁之第一個句子
< Sentence >
< Noun >
< Adverb >
Horse
always
< Adjective >
< Noun >
dirty
Man
helps
Button-up
parsing
< Object >
< Verb >
< Subject >
而語法(構文)解析程式之功能如下圖:
( 輸入 )
單語串
驅動副程式
driver
構文解析表
parsing table
文法解析 : parsing
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
( 輸出 )
構文樹
4-38
例如:Simplified Pascal Grammar.
<prog>
::= PROGRAM <prog-name> VAR <dec-list> BEGIN <stmt-list> END.
<prog-name> ::= id
<dec-list>
::= <dec>
| <dec-list> ; <dec>
<dec>
::= <id-list> : <type>
<type>
::= INTEGER
<id-list>
::= id | <id-list> , id
<stmt-list>
::= <stmt> | <stmt-list> ; <stmt>
<stmt>
::= <assign> | <read> | <write> | <for>
<assign>
::= id := <exp>
<exp>
::= <term> | <exp> + <term> | <exp> - <term>
<term>
::= <factor> | <term> * <factor> | <term> DIV <factor>
<factor>
::= id | int | (<exp>)
<read>
::= READ (<id-list>)
<write>
::= WRITE (<id-list>)
<for>
::= FOR <index-exp> DO <body>
<index-exp> ::= id := <exp> TO <exp>
<body>
<stmt>
| BEGIN
<stmt-list>
END
Copyright © 2004 ::=
Pearson
Addison-Wesley.
All rights
reserved.
4-39
℗ 根據 文法規則
將 單語串 建構成 語法樹
語法解析程式
單語串
Driver
Driver
stack
Parsing
table
語法樹
如何建構 語法解析表 (parsing table)?
方式有二:
1. Bottom-up parsing
2. Top-down parsing
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-40
1. (LR 解析法)
例示:有一組文法G定義如下:
1.
2.
3.
4.
5.
6.
E
E
T
T
F
F
::= E + T
::= T
::= T * F
::= F
::= (E)
::= id
acc: accept; 接受 (done!)
si:
shift to state i
ri:
reduce the ith production
blank: error occurs
根據左邊之文法G定義之文法解析表(parsing
state
Action
id
0
1
2
3
4
5
6
7
8
9
10
11
table)
+
*
s5
Goto
(
)
$
s4
s6
r2
s7
r2
acc
r2
r4
r4
r4
r4
s5
s4
r6
r6
s5
s5
r6
E
T
F
1
2
3
8
2
3
9
3
10
r6
s4
s4
s6
r1
r3
r5
s7
r3
r5
如下:
s11
r1
r3
r5
r1
r3
r5
LR Parser
單語串
Input buffer
Driver
Driver
stack
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
Parsing
table
語法樹
4-41
STACK
INPUT BUFFER
actions
id * id + id $
s5
0 id 5
* id + id $
r6
0 F 3
* id + id $
goto 3及 r4
0 T 2
* id + id $
goto 2及 s7
0
0 T 2 * 7
id + id $
s5
0 T 2 * 7 id 5
+ id $
r6
0 T 2 * 7 F 10
+ id $
goto 10 及r3
0 T 2
+ id $
goto 2及r2
0 E 1
+ id $
goto 1及s6
0 E 1 + 6
id $
s5
0 E 1 + 6 id 5
$
r6
0 E 1 + 6 F 3
$
goto 3及r4
0 E 1 + 6 T 9
$
goto 9及r1
0 E 1
$
goto 1及accept !
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-42
E 8
E
5
T
2
F
1
*
4
F
3
T
7
F
6
id
id
Bottom-up parsing
T
+
id
id * id + id  F * id + id  T * id + id  T * F + id
 T + id  E + id  E + F  E + T  E
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-43
Bottom-up Parsing
• Initial configuration: (S0, a1…an$)
• Parser actions:
– If ACTION[Sm, ai] = Shift S, the next
configuration is:
(S0X1S1X2S2…XmSmaiS, ai+1…an$)
– If ACTION[Sm, ai] = Reduce A   and S =
GOTO[Sm-r, A], where r = the length of , the
next configuration is
(S0X1S1X2S2…Xm-rSm-rAS, aiai+1…an$)
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-44
Bottom-up Parsing
• Parser actions (continued):
– If ACTION[Sm, ai] = Accept, the parse is
complete and no errors were found.
– If ACTION[Sm, ai] = Error, the parser calls an
error-handling routine.
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-45
LR Parsing Table
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-46
Bottom-up Parsing
• A parser table can be generated from a
given grammar with a tool, e.g., yacc
yet another compiler-compilers
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
4-47
Summary
• Syntax analysis is a common part of language
implementation
• A lexical analyzer is a pattern matcher that isolates
small-scale parts of a program
– Detects syntax errors
– Produces a parse tree
• A recursive-descent parser is an LL parser
– EBNF
• Parsing problem for bottom-up parsers: find the
substring of current sentential form
• The LR family of shift-reduce parsers is the most
common bottom-up parsing approach
Copyright © 2009 Addison-Wesley. All rights reserved.
1-48