Chapter 3
Chang Chi-Chung
2015.05.18
The Role of the Parser
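[Figure: source program → lexical analyzer → parser → rest of front end → intermediate representation. The parser calls getNextToken and the lexical analyzer returns a token; the parser passes a parse tree to the rest of the front end; all phases consult the symbol table.]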
How do we describe the grammar of a programming language?
Use a context-free grammar (CFG).
A CFG is a more powerful notation than a regular expression (RE).
Context-Free Grammar
A context-free grammar is a 4-tuple G = <T, N, P, S> where
T is a finite set of tokens (terminal symbols)
N is a finite set of nonterminals
P is a finite set of productions of the form α → β, where α ∈ N and β ∈ (N ∪ T)*
S ∈ N is a designated start symbol
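The 4-tuple maps directly onto a small data structure. The following is a minimal Java sketch, not part of the original slides: the class name `Cfg`, its field names, and the `isWellFormed` and `listGrammar` helpers are all illustrative assumptions. It stores the list/digit grammar used later in this chapter and checks the side conditions of the definition.

```java
import java.util.*;

// Hypothetical container for G = <T, N, P, S>; every name here is illustrative.
final class Cfg {
    final Set<String> terminals;                        // T
    final Set<String> nonterminals;                     // N
    final Map<String, List<List<String>>> productions;  // P: head -> alternative bodies
    final String start;                                 // S

    Cfg(Set<String> t, Set<String> n, Map<String, List<List<String>>> p, String s) {
        terminals = t; nonterminals = n; productions = p; start = s;
    }

    // Checks the definition: S is in N, every head is in N,
    // and every body symbol is in N or T.
    boolean isWellFormed() {
        if (!nonterminals.contains(start)) return false;
        for (Map.Entry<String, List<List<String>>> e : productions.entrySet()) {
            if (!nonterminals.contains(e.getKey())) return false;
            for (List<String> body : e.getValue())
                for (String sym : body)
                    if (!nonterminals.contains(sym) && !terminals.contains(sym)) return false;
        }
        return true;
    }

    // Builds the list/digit grammar: list -> list + digit | list - digit | digit.
    static Cfg listGrammar() {
        Set<String> t = new LinkedHashSet<>(Arrays.asList("+", "-"));
        List<List<String>> digitBodies = new ArrayList<>();
        for (int d = 0; d <= 9; d++) {
            t.add(Integer.toString(d));
            digitBodies.add(Collections.singletonList(Integer.toString(d)));
        }
        Set<String> n = new LinkedHashSet<>(Arrays.asList("list", "digit"));
        Map<String, List<List<String>>> p = new LinkedHashMap<>();
        p.put("list", Arrays.asList(
                Arrays.asList("list", "+", "digit"),
                Arrays.asList("list", "-", "digit"),
                Arrays.asList("digit")));
        p.put("digit", digitBodies);
        return new Cfg(t, n, p, "list");
    }

    public static void main(String[] args) {
        Cfg g = listGrammar();
        System.out.println("well-formed: " + g.isWellFormed());
        System.out.println("number of terminals: " + g.terminals.size());
    }
}
```

Storing P as a map from head to bodies is only one possible choice; it makes "all productions for A" a single lookup, which the FIRST and FOLLOW computations rely on.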
Derivations
The one-step derivation is defined by
α A β ⇒ α γ β
where A → γ is a production in the grammar
In addition, we define
⇒ is leftmost (written ⇒lm) if α does not contain a nonterminal
⇒ is rightmost (written ⇒rm) if β does not contain a nonterminal
Reflexive-transitive closure ⇒* (zero or more steps)
Positive closure ⇒+ (one or more steps)
Example of Derivations
Productions
list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Leftmost derivation of 9 - 5 + 2
list ⇒ list + digit
⇒ list - digit + digit
⇒ digit - digit + digit
⇒ 9 - digit + digit
⇒ 9 - 5 + digit
⇒ 9 - 5 + 2
replaces the leftmost nonterminal in each step.
Rightmost derivation of 9 - 5 + 2
list ⇒ list + digit
⇒ list + 2
⇒ list - digit + 2
⇒ list - 5 + 2
⇒ digit - 5 + 2
⇒ 9 - 5 + 2
replaces the rightmost nonterminal in each step.
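A leftmost derivation can be replayed mechanically: find the leftmost nonterminal in the current sentential form and substitute a production body for it. The sketch below does this for 9 - 5 + 2 with the list/digit grammar. The class name `LeftmostDerivation` and the hard-coded sequence of production choices are assumptions made for illustration.

```java
import java.util.*;

final class LeftmostDerivation {
    static final Set<String> NONTERMINALS = new HashSet<>(Arrays.asList("list", "digit"));

    // Replaces the leftmost nonterminal of `form` with `body`.
    // `head` must be that nonterminal, otherwise the step is not leftmost.
    static List<String> step(List<String> form, String head, List<String> body) {
        for (int i = 0; i < form.size(); i++) {
            if (NONTERMINALS.contains(form.get(i))) {
                if (!form.get(i).equals(head))
                    throw new IllegalArgumentException("leftmost nonterminal is " + form.get(i));
                List<String> next = new ArrayList<>(form.subList(0, i));
                next.addAll(body);
                next.addAll(form.subList(i + 1, form.size()));
                return next;
            }
        }
        throw new IllegalArgumentException("no nonterminal left to replace");
    }

    // Derives 9 - 5 + 2 and returns every sentential form along the way.
    static List<String> derive() {
        String[][] steps = {
            {"list", "list + digit"},
            {"list", "list - digit"},
            {"list", "digit"},
            {"digit", "9"},
            {"digit", "5"},
            {"digit", "2"},
        };
        List<String> form = new ArrayList<>(Collections.singletonList("list"));
        List<String> trace = new ArrayList<>();
        trace.add(String.join(" ", form));
        for (String[] s : steps) {
            form = step(form, s[0], Arrays.asList(s[1].split(" ")));
            trace.add(String.join(" ", form));
        }
        return trace;
    }

    public static void main(String[] args) {
        // Prints the seven sentential forms, one per line.
        System.out.println(String.join("\n=> ", derive()));
    }
}
```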
Example of a Parse Tree
Parse tree of the string 9-5+2 using grammar G
list
├── list
│   ├── list
│   │   └── digit
│   │       └── 9
│   ├── -
│   └── digit
│       └── 5
├── +
└── digit
    └── 2
The sequence of leaves, read left to right, is called the yield of the parse tree
Sentence and Language
Sentential form
If S ⇒* α in the grammar G, then α is a sentential form of G
Sentence
A sentence of G is a sentential form of G that has no nonterminals.
Language
The language generated by G is its set of sentences.
The language generated by G is defined by
L(G) = { w ∈ T* | S ⇒* w }
A language that can be generated by a context-free grammar is said to be a context-free language.
If two grammars generate the same language, the grammars are said to be equivalent.
An Example
Grammar
Expr → ( Expr )
| Expr Op name
| name
Op → +
| -
| x
| /
Derivation of (a + b) x c
Expr ⇒ Expr Op c
⇒ ( Expr ) Op c
⇒ ( Expr Op b ) Op c
⇒ ( a Op b ) Op c
⇒ ( a + b ) Op c
⇒ ( a + b ) x c
Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
Example
E → E + E | E * E | ( E ) | id
The sentence id + id * id has two distinct leftmost derivations:
E ⇒ E + E
⇒ id + E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
and
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
Example
Consider the following context-free grammar
G = <{+,-,0,1,2,3,4,5,6,7,8,9}, {string}, P, string>
P = string → string + string | string - string | 0 | 1 | … | 9
This grammar is ambiguous, because more than one parse tree represents the string 9-5+2
Example
Parse tree 1, grouping the string as (9-5)+2
string
├── string
│   ├── string
│   │   └── 9
│   ├── -
│   └── string
│       └── 5
├── +
└── string
    └── 2
Parse tree 2, grouping the string as 9-(5+2)
string
├── string
│   └── 9
├── -
└── string
    ├── string
    │   └── 5
    ├── +
    └── string
        └── 2
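Ambiguity matters because the two parse trees give the same string different meanings. The sketch below builds both trees for 9-5+2 by hand and evaluates them. The class `AmbiguityDemo` and its `Node` type are hypothetical helpers introduced only for this illustration.

```java
final class AmbiguityDemo {
    // A tiny tree node: either a leaf digit (op == 0) or an operator with two subtrees.
    static final class Node {
        final char op; final Node left, right; final int value;
        Node(int value) { this.op = 0; this.left = null; this.right = null; this.value = value; }
        Node(Node left, char op, Node right) {
            this.op = op; this.left = left; this.right = right; this.value = 0;
        }
    }

    // Evaluates a tree bottom-up; only + and - occur in this grammar.
    static int eval(Node n) {
        if (n.op == 0) return n.value;
        int l = eval(n.left), r = eval(n.right);
        return n.op == '+' ? l + r : l - r;
    }

    // Concatenates the leaves left to right, i.e. the yield of the tree.
    static String leaves(Node n) {
        return n.op == 0 ? Integer.toString(n.value) : leaves(n.left) + n.op + leaves(n.right);
    }

    // (9 - 5) + 2: the subtraction is the left operand of the addition.
    static Node leftTree() {
        return new Node(new Node(new Node(9), '-', new Node(5)), '+', new Node(2));
    }

    // 9 - (5 + 2): the addition is the right operand of the subtraction.
    static Node rightTree() {
        return new Node(new Node(9), '-', new Node(new Node(5), '+', new Node(2)));
    }

    public static void main(String[] args) {
        System.out.println(leaves(leftTree()) + " = " + eval(leftTree()));   // 9-5+2 = 6
        System.out.println(leaves(rightTree()) + " = " + eval(rightTree())); // 9-5+2 = 2
    }
}
```

Both trees have the yield 9-5+2, yet they evaluate to 6 and 2 respectively, which is why a grammar intended for a compiler has to be made unambiguous.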
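Ambiguity
Dangling-else grammar
stmt → if expr then stmt
| if expr then stmt else stmt
| other
Example sentence: if E1 then S1 else if E2 then S2 else S3
Eliminating Ambiguity (2)
The sentence
if E1 then if E2 then S1 else S2
has two parse trees under this grammar, because the else can attach to either if.
The usual rule is to match each else with the closest unmatched then.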
Parsing



The process of determining if a string of terminals
(tokens) can be generated by a grammar.
Time complexity:

For any CFG there is a parser that takes at most O(n^3) time
to parse a string of n terminals.

Linear algorithms suffice to parse essentially all languages
that arise in practice.
Two kinds of methods

Top-down: constructs a parse tree from root to leaves

Bottom-up: constructs a parse tree from leaves to root
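Two Approaches to Syntax Analysis
Top-down Parsing, LL(1)
Leftmost derivation
No left recursion
No common left factors
Unambiguous grammar
Bottom-up Parsing, LR(1)
Rightmost derivation, traced in reverse
Left and right recursion are both allowed, although right recursion deepens the parse stack
Common prefixes and suffixes need no factoring
Unambiguous grammar
[Figure: nested grammar classes, with RG inside LL(1), LL(1) inside LR(1), and LR(1) inside CFG]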
Notational Conventions
Terminals
a, b, c, … ∈ T
example: 0, 1, +, *, id, if
Nonterminals
A, B, C, … ∈ N
example: expr, term, stmt
Grammar symbols
X, Y, Z ∈ (N ∪ T)
Strings of terminals
u, v, w, x, y, z ∈ T*
Strings of grammar symbols (sentential forms)
α, β, γ ∈ (N ∪ T)*
The head of the first production is the start symbol, unless stated otherwise.
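Top-down Parsing
Recursive-descent parsing
LL(1): Left-to-right scan, Leftmost derivation
Creates the nodes of the parse tree in preorder (depth-first)
Grammar
E → T + T
T → ( E )
T → - E
T → id
Leftmost derivation
E ⇒lm T + T
⇒lm id + T
⇒lm id + id
[Figure: the parse tree grows one snapshot per derivation step. First the root E alone; then E with children T + T; then the left T expanded to id; then the right T expanded to id, giving the yield id + id.]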
Recursive Descent Parsing

Every nonterminal has one (recursive)
procedure responsible for parsing the
nonterminal’s syntactic category of input
tokens

When a nonterminal has multiple productions,
each production is implemented in a branch of
a selection statement based on input lookahead information
Recursive Descent Parsing
void A() {
    Choose an A-production, A → X1 X2 … Xk;
    for (i = 1 to k) {
        if (Xi is a nonterminal)
            call procedure Xi();
        else if (Xi = current input symbol a)
            advance the input to the next symbol;
        else
            /* an error has occurred */
    }
}
Conclusion: Parsing and Translation Scheme

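Complete program
import java.io.*;

class Parser {
    static int lookahead;

    public Parser() throws IOException {
        lookahead = System.in.read();   // prime the one-symbol lookahead
    }

    // expr -> term { (+ | -) term }, emitting each operator after its operands
    void expr() throws IOException {
        term();
        while (true) {
            if (lookahead == '+') {
                match('+'); term();
                System.out.write('+');
            } else if (lookahead == '-') {
                match('-'); term();
                System.out.write('-');
            } else return;
        }
    }

    // term -> digit, echoed to the output
    void term() throws IOException {
        if (Character.isDigit((char) lookahead)) {
            System.out.write((char) lookahead);
            match(lookahead);
        } else throw new Error("syntax error");
    }

    // Consumes the lookahead if it equals t, otherwise reports a syntax error
    void match(int t) throws IOException {
        if (lookahead == t) lookahead = System.in.read();
        else throw new Error("syntax error");
    }

    // Entry point added so the class runs on its own: reads one expression from standard input
    public static void main(String[] args) throws IOException {
        new Parser().expr();
        System.out.write('\n');
        System.out.flush();
    }
}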
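The translator reads from standard input, which makes it awkward to try on a fixed example. The following variant is a sketch under that motivation, not the slides' program: it applies the same expr/term/match structure to a String and returns the postfix form. The class name `StringPostfix` and the `translate` method are hypothetical.

```java
final class StringPostfix {
    private final String input;
    private int pos = 0;
    private final StringBuilder out = new StringBuilder();

    StringPostfix(String input) { this.input = input; }

    // Current character, or -1 at end of input (the same convention as InputStream.read()).
    private int lookahead() { return pos < input.length() ? input.charAt(pos) : -1; }

    private void match(int t) {
        if (lookahead() == t) pos++;
        else throw new IllegalStateException("syntax error at position " + pos);
    }

    // term -> digit, copied to the output
    private void term() {
        int c = lookahead();
        if (c != -1 && Character.isDigit((char) c)) { out.append((char) c); match(c); }
        else throw new IllegalStateException("syntax error at position " + pos);
    }

    // expr -> term { (+ | -) term }, with each operator emitted after its two operands
    private void expr() {
        term();
        while (lookahead() == '+' || lookahead() == '-') {
            char op = (char) lookahead();
            match(op);
            term();
            out.append(op);
        }
    }

    // Translates a whole infix string; leftover input is treated as a syntax error.
    static String translate(String infix) {
        StringPostfix p = new StringPostfix(infix);
        p.expr();
        if (p.lookahead() != -1) throw new IllegalStateException("unexpected trailing input");
        return p.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("9-5+2")); // prints 95-2+
    }
}
```

Emitting the operator only after the second term() call is what turns infix into postfix: by then both operands have already been written.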
LL(1)
LL(1) Grammar

Predictive parsers, that is, recursive-descent parsers
needing no backtracking, can be constructed for a
class of grammars called LL(1)

The first "L" means the input is scanned from left to right.

The second "L" means the parser produces a leftmost derivation.

The "1" means one input symbol of lookahead is used at each
step to make parsing action decisions.

An LL(1) grammar is not left-recursive.

An LL(1) grammar is not ambiguous.
FIRST and FOLLOW
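[Figure: a parse tree rooted at S with an interior node A. The yield has the form α A a β, and A derives a string c γ.]
c is in FIRST(A)
a is in FOLLOW(A)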
FIRST and FOLLOW
The construction of both top-down and bottom-up parsers is aided by two functions, FIRST and
FOLLOW, associated with a grammar G.
During top-down parsing, FIRST and FOLLOW
allow us to choose which production to apply.
During panic-mode error recovery, sets of
tokens produced by FOLLOW can be used as
synchronizing tokens.

FIRST
FIRST(α)
The set of terminals that begin the strings derived from α, plus ε if α ⇒* ε
FIRST(a) = { a } if a ∈ T
FIRST(ε) = { ε }
FIRST(A) = ∪ FIRST(α) over all productions A → α ∈ P
FIRST(X1 X2 … Xk) =
for each i = 1, …, k: if ε ∈ FIRST(Xj) for all j = 1, …, i-1 then
add the non-ε symbols in FIRST(Xi) to FIRST(X1 X2 … Xk)
if ε ∈ FIRST(Xj) for all j = 1, …, k then
add ε to FIRST(X1 X2 … Xk)
FIRST(1)
By the definition of FIRST, we can compute FIRST(X)
If X ∈ T, then FIRST(X) = {X}.
If X ∈ N and X → ε, then add ε to FIRST(X).
If X ∈ N and X → Y1 Y2 … Yn, then add all non-ε
elements of FIRST(Y1) to FIRST(X); if ε ∈ FIRST(Y1),
then add all non-ε elements of FIRST(Y2) to FIRST(X);
and so on; if ε ∈ FIRST(Yi) for every i = 1, …, n, then
add ε to FIRST(X).
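The rules for FIRST are usually applied repeatedly until no set changes. The sketch below does this fixed-point iteration for the expression grammar E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id. The class name `FirstSets`, the map-based grammar encoding, and the use of an empty body for ε are assumptions of this example.

```java
import java.util.*;

final class FirstSets {
    static final String EPS = "\u03b5";   // the epsilon symbol

    // The expression grammar; an empty body stands for an epsilon production.
    static Map<String, List<List<String>>> grammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("E",  Arrays.asList(Arrays.asList("T", "E'")));
        g.put("E'", Arrays.asList(Arrays.asList("+", "T", "E'"), Collections.<String>emptyList()));
        g.put("T",  Arrays.asList(Arrays.asList("F", "T'")));
        g.put("T'", Arrays.asList(Arrays.asList("*", "F", "T'"), Collections.<String>emptyList()));
        g.put("F",  Arrays.asList(Arrays.asList("(", "E", ")"), Arrays.asList("id")));
        return g;
    }

    // FIRST of a symbol string X1 X2 ... Xk, using the FIRST sets computed so far.
    // Any symbol without an entry in `first` is treated as a terminal.
    static Set<String> firstOf(List<String> symbols, Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (String x : symbols) {
            Set<String> fx = first.containsKey(x) ? first.get(x) : Collections.singleton(x);
            for (String s : fx) if (!s.equals(EPS)) result.add(s);
            if (!fx.contains(EPS)) return result;   // Xi cannot vanish, so stop here
        }
        result.add(EPS);   // every Xi can derive epsilon, or the string is empty
        return result;
    }

    // Applies the rules over all productions until no FIRST set grows.
    static Map<String, Set<String>> compute(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> first = new LinkedHashMap<>();
        for (String a : g.keySet()) first.put(a, new TreeSet<String>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet())
                for (List<String> body : e.getValue())
                    if (first.get(e.getKey()).addAll(firstOf(body, first))) changed = true;
        }
        return first;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Set<String>> e : compute(grammar()).entrySet())
            System.out.println("FIRST(" + e.getKey() + ") = " + e.getValue());
    }
}
```

The loop terminates because each pass either adds at least one symbol to some set or stops, and the sets are bounded by the finite terminal alphabet plus ε.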
FOLLOW
FOLLOW(A)
The set of terminals that can immediately follow nonterminal A in some sentential form
FOLLOW(A) =
for all (B → α A β) ∈ P do
add FIRST(β) - {ε} to FOLLOW(A)
for all (B → α A β) ∈ P and ε ∈ FIRST(β) do
add FOLLOW(B) to FOLLOW(A)
for all (B → α A) ∈ P do
add FOLLOW(B) to FOLLOW(A)
if A is the start symbol S then
add $ to FOLLOW(A)
FOLLOW(1)
By the definition of FOLLOW, we can compute FOLLOW(X)
Put $ into FOLLOW(S).
For each A → α B β, add all non-ε elements of
FIRST(β) to FOLLOW(B).
For each A → α B, or A → α B β where
ε ∈ FIRST(β), add all of FOLLOW(A) to
FOLLOW(B).
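The three FOLLOW rules can also be run to a fixed point once FIRST is available. The sketch below repeats a compact FIRST computation so that it is self-contained, then applies the rules to the expression grammar E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id. The class name `FollowSets` and the grammar encoding (a map from head to bodies, with an empty body for ε) are assumptions of this example.

```java
import java.util.*;

final class FollowSets {
    static final String EPS = "\u03b5", END = "$";

    static Map<String, List<List<String>>> grammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("E",  Arrays.asList(Arrays.asList("T", "E'")));
        g.put("E'", Arrays.asList(Arrays.asList("+", "T", "E'"), Collections.<String>emptyList()));
        g.put("T",  Arrays.asList(Arrays.asList("F", "T'")));
        g.put("T'", Arrays.asList(Arrays.asList("*", "F", "T'"), Collections.<String>emptyList()));
        g.put("F",  Arrays.asList(Arrays.asList("(", "E", ")"), Arrays.asList("id")));
        return g;
    }

    // FIRST of a symbol string; symbols without an entry in `first` are terminals.
    static Set<String> firstOf(List<String> symbols, Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (String x : symbols) {
            Set<String> fx = first.containsKey(x) ? first.get(x) : Collections.singleton(x);
            for (String s : fx) if (!s.equals(EPS)) result.add(s);
            if (!fx.contains(EPS)) return result;
        }
        result.add(EPS);
        return result;
    }

    static Map<String, Set<String>> first(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> first = new LinkedHashMap<>();
        for (String a : g.keySet()) first.put(a, new TreeSet<String>());
        for (boolean changed = true; changed; ) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet())
                for (List<String> body : e.getValue())
                    if (first.get(e.getKey()).addAll(firstOf(body, first))) changed = true;
        }
        return first;
    }

    static Map<String, Set<String>> follow(Map<String, List<List<String>>> g, String start) {
        Map<String, Set<String>> first = first(g);
        Map<String, Set<String>> follow = new LinkedHashMap<>();
        for (String a : g.keySet()) follow.put(a, new TreeSet<String>());
        follow.get(start).add(END);                        // rule 1: $ follows the start symbol
        for (boolean changed = true; changed; ) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                for (List<String> body : e.getValue()) {
                    for (int i = 0; i < body.size(); i++) {
                        String b = body.get(i);
                        if (!g.containsKey(b)) continue;   // only nonterminals have FOLLOW sets
                        Set<String> rest = firstOf(body.subList(i + 1, body.size()), first);
                        boolean restCanVanish = rest.remove(EPS);
                        if (follow.get(b).addAll(rest)) changed = true;          // rule 2
                        if (restCanVanish) {                                     // rule 3
                            List<String> fromHead = new ArrayList<>(follow.get(e.getKey()));
                            if (follow.get(b).addAll(fromHead)) changed = true;
                        }
                    }
                }
            }
        }
        return follow;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Set<String>> e : follow(grammar(), "E").entrySet())
            System.out.println("FOLLOW(" + e.getKey() + ") = " + e.getValue());
    }
}
```

Rule 3 covers both cases from the slide at once: when B is the last symbol of the body, the remaining string is empty and its FIRST set is just ε.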
Example
Given a grammar G
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

Nonterminal   FIRST     FOLLOW
E             ( id      $ )
E'            + ε       $ )
T             ( id      + $ )
T'            * ε       + $ )
F             ( id      * + $ )
Using FIRST and FOLLOW to Write a
Recursive Descent Parser
Grammar
expr → term rest
rest → + term rest
| - term rest
| ε
term → id

FIRST(+ term rest) = { + }
FIRST(- term rest) = { - }
FOLLOW(rest) = { $ }

rest()
{
    if (lookahead in FIRST(+ term rest)) {
        match('+'); term(); rest()
    }
    else if (lookahead in FIRST(- term rest)) {
        match('-'); term(); rest()
    }
    else if (lookahead in FOLLOW(rest))
        return
    else error()
}
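The rest() procedure translates almost line for line into Java. The sketch below is one possible rendering for the grammar expr → term rest, rest → + term rest | - term rest | ε, term → id. It assumes the input is already a list of token strings and appends "$" as the end marker; the class name `PredictiveParser` and the `accepts` helper are hypothetical.

```java
import java.util.*;

final class PredictiveParser {
    private final List<String> tokens;   // token stream, terminated by "$"
    private int pos = 0;

    PredictiveParser(List<String> tokens) {
        this.tokens = new ArrayList<>(tokens);
        this.tokens.add("$");
    }

    private String lookahead() { return tokens.get(pos); }

    private void match(String t) {
        if (lookahead().equals(t)) pos++;
        else throw new IllegalStateException("expected " + t + " but saw " + lookahead());
    }

    // expr -> term rest
    void expr() { term(); rest(); }

    // term -> id
    void term() { match("id"); }

    // rest -> + term rest | - term rest | epsilon, chosen by one token of lookahead
    void rest() {
        String la = lookahead();
        if (la.equals("+"))      { match("+"); term(); rest(); }  // FIRST(+ term rest) = { + }
        else if (la.equals("-")) { match("-"); term(); rest(); }  // FIRST(- term rest) = { - }
        else if (la.equals("$")) { return; }                      // FOLLOW(rest) = { $ }: take epsilon
        else throw new IllegalStateException("syntax error at " + la);
    }

    // Returns true if the tokens form a sentence of the grammar.
    static boolean accepts(String... input) {
        try {
            PredictiveParser p = new PredictiveParser(Arrays.asList(input));
            p.expr();
            return p.lookahead().equals("$");
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("id", "+", "id", "-", "id")); // true
        System.out.println(accepts("id", "id"));                 // false
    }
}
```

Because the three lookahead sets { + }, { - } and { $ } are pairwise disjoint, the parser never has to backtrack.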