Parsing
G22.2110 Programming Languages
May 24, 2012
New York University
Chanseok Oh (chanseok@cs.nyu.edu)
• Chapter 2
– Scanning
– Parsing
• Overview
– Scanner, Tokenizer, Lexer, Lexical Analyzer
IF ( A >= .30 ) THEN { …
IF, LPAREN, IDENT(A), GTE, FPN(.30), RPAREN, THEN, …
• Tokens, Lexemes
• DFA , NFA, Regular expressions
• lex, flex, Jlex
– Parser
• DPDA, Deterministic context-free grammars
• Yacc, Bison
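A minimal scanner sketch for the IF ( A >= .30 ) THEN { … example above. The token names follow the slide; the regular expressions, the LBRACE token, and the tokenize function are assumptions made for illustration, not the output of lex or flex.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("FPN",    r"\d*\.\d+"),       # floating-point number such as .30
    ("GTE",    r">="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("LBRACE", r"\{"),             # hypothetical name, not on the slide
    ("IF",     r"IF\b"),
    ("THEN",   r"THEN\b"),
    ("IDENT",  r"[A-Za-z_]\w*"),   # must come after the keywords
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    """Turn a string of lexemes into a list of token names."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind = m.lastgroup
        if kind != "SKIP":
            # Identifiers and numbers keep their lexeme; other tokens do not.
            tokens.append(f"{kind}({m.group()})" if kind in ("IDENT", "FPN") else kind)
        pos = m.end()
    return tokens

print(tokenize("IF ( A >= .30 ) THEN {"))
```

Each pattern is a regular expression, so the whole scanner corresponds to one DFA; tools such as lex build that automaton ahead of time instead of trying alternatives at run time.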
• Table of Contents
– Practical parsers (Linear time)
• LL (top-down, predictive)
• LR (bottom-up, shift-reduce)
– Related side-topics
• Ambiguity, Language and parser hierarchy
– Examples: Simple Calculator Language
• A Language
– A set of strings (of given symbols)
• { finite, set, with, five, strings }
• { ab, aaba, abbaba, … }
• { 0^n 1^n }
• { a^i b^j | i < j }
• { void main() { int i = 0 }, … }
– Is an input string in the language?
• cf. Recursive, Turing-decidable languages
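The membership question can be made concrete for two of the example languages. This is a sketch only: it reads the examples as { 0^n 1^n } and { a^i b^j | i < j }, and it assumes n ≥ 0 and i ≥ 0, which the slide does not state. The function names are hypothetical.

```python
import re

def in_0n1n(s):
    """Is s in { 0^n 1^n }? (n >= 0 is an assumption)"""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * n + "1" * n

def in_ai_bj(s):
    """Is s in { a^i b^j | i < j }? (i >= 0 is an assumption)"""
    m = re.fullmatch(r"(a*)(b*)", s)
    # The regular expression checks the shape a...ab...b; the comparison of
    # the two counts is the part no regular expression can express.
    return m is not None and len(m.group(1)) < len(m.group(2))

print(in_0n1n("000111"), in_0n1n("0101"))
print(in_ai_bj("abb"), in_ai_bj("aab"))
```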
• Context-Free Languages (CFL)
– Languages that can be generated by
• CFG’s
– Languages that can be recognized by
• PDA’s
– Not all languages are CF.
– CFG: suitable for most PL’s.
• <sentence> := <subject> <verb> <object> PERIOD
– Deterministic CFL
• Example
Here is our CFG:
S := id A
A := , id A
A := ;
Input:
sum , a1 , ptr ;
• Parse Tree
S
    id (sum)
    A
        ,
        id (a1)
        A
            ,
            id (ptr)
            A
                ;
• Ambiguous Grammars
E → E + E
  | E – E
  | E * E
  | E / E
– Is it ambiguous? Undecidable.
– No general procedure for converting to
unambiguous grammars
– Can be allowed to some extent for deterministic
parsing, e.g., by defining precedence or
associativity.
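To see the ambiguity concretely, the sketch below counts the distinct parse trees an expression grammar of this shape assigns to a token string. It assumes the productions E → E op E for the four operators plus E → id for operands (the id rule is an assumption here), and the function name is hypothetical.

```python
from functools import lru_cache

OPERATORS = {"+", "-", "*", "/"}

def count_parse_trees(tokens):
    """Count parse trees for 'id op id op ... id' under E -> E op E | id."""
    operands, ops = tokens[0::2], tokens[1::2]
    if len(tokens) % 2 == 0 or any(t != "id" for t in operands) \
            or any(o not in OPERATORS for o in ops):
        return 0  # not a sentence of the assumed grammar

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of trees covering operands i..j (inclusive).
        if i == j:
            return 1
        # Choose which operator sits at the root: the one after operand k.
        return sum(count(i, k) * count(k + 1, j) for k in range(i, j))

    return count(0, len(operands) - 1)

print(count_parse_trees("id + id * id".split()))
```

Any count above 1 means the string has more than one parse tree. Precedence and associativity declarations work by telling the parser which of those trees to keep.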
• Parsers
– LL (Left-to-right, Left-most derivation)
• Top-down
• Predictive
• Simple and easy to understand
– LR (Left-to-right, Right-most derivation)
• Bottom-up
• Shift-reduce
• Most common in production-level compilers
• SLR (Simple)
• LALR (Look-ahead)
• LL(k) Parser
– LL(k) Parser
• Uses  k look-ahead symbols
• Does not backtrack (deterministic).
– LL(1) is the most popular kind of LL parser.
– LL(k) Languages
• Not all CFL’s are LL(k) languages.
LL(k) ⊂ CFL
• LL Parsing Example
<id_list> := id <id_list_tail>
<id_list_tail> := , id <id_list_tail>
<id_list_tail> := ;
It is an LL grammar.
The language is also LL.
(Diagram: this language lies inside LL, which lies inside CFL.)
Input to parse:
sum , a1 , ptr ;
• Parse Tree
<id_list> := id <id_list_tail>
<id_list_tail> := , id <id_list_tail>
<id_list_tail> := ;
<id_list>
    id (sum)
    <id_list_tail>
        ,
        id (a1)
        <id_list_tail>
            ,
            id (ptr)
            <id_list_tail>
                ;
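A recursive-descent sketch of the top-down parse for this grammar: one function per nonterminal, each choosing its production from a single look-ahead token. The function names and the use of "$$" as an end marker are assumptions for illustration.

```python
def parse_id_list(tokens):
    """Return the identifiers in tokens, or raise SyntaxError."""
    pos = 0
    names = []

    def peek():
        return tokens[pos] if pos < len(tokens) else "$$"

    def match_id():
        nonlocal pos
        tok = peek()
        if not tok.isidentifier():
            raise SyntaxError(f"expected id, got {tok!r}")
        names.append(tok)
        pos += 1

    def match(symbol):
        nonlocal pos
        if peek() != symbol:
            raise SyntaxError(f"expected {symbol!r}, got {peek()!r}")
        pos += 1

    def id_list():
        # <id_list> := id <id_list_tail>
        match_id()
        id_list_tail()

    def id_list_tail():
        if peek() == ",":
            # <id_list_tail> := , id <id_list_tail>
            match(",")
            match_id()
            id_list_tail()
        elif peek() == ";":
            # <id_list_tail> := ;
            match(";")
        else:
            raise SyntaxError(f"expected ',' or ';', got {peek()!r}")

    id_list()
    if peek() != "$$":
        raise SyntaxError("trailing input after ';'")
    return names

print(parse_id_list("sum , a1 , ptr ;".split()))
```

The calls nest in the same way as the parse tree: the tree is built from the root down, and no production is ever retried.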
• LR Parser
– LR(k) parser
• Uses  k look-ahead symbols.
• Usually k is 1, and the term LR Parser is often intended
to refer to this case.
– LR(k) Languages
• Not all CFL’s are LR(k) languages.
LR(k) ⊂ CFL
• Language Relationships
[Figure: containment diagram of language classes, with regions labeled Unambiguous languages, Ambiguous languages, LL(0), LL(1), LR(0), SLR, LALR, LR(1)]
• LR Parsing Example
With the same grammar,
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;
It is also an LR grammar,
and the language is LR.
(Diagram: this language lies inside LL, which lies inside LR(1), which lies inside CFL.)
Input to parse (as before):
sum , a1 , ptr ;
• Parse Tree
<id_list> := id <id_list_tail>
<id_list_tail> := , id <id_list_tail>
<id_list_tail> := ;
<id_list>
    id (sum)
    <id_list_tail>
        ,
        id (a1)
        <id_list_tail>
            ,
            id (ptr)
            <id_list_tail>
                ;
• Another LR Parsing Example
Consider a modified grammar,
<id_list> := <id_list_prefix> ;
<id_list_prefix> := <id_list_prefix> , id
<id_list_prefix> := id
The grammar is not LL
(though the language itself is both LR and LL).
• LR Parsing
<id_list>
    <id_list_prefix>
        <id_list_prefix>
            <id_list_prefix>
                id (sum)
            ,
            id (a1)
        ,
        id (ptr)
    ;
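The bottom-up parse of this left-recursive grammar can be traced with a small shift-reduce loop. This is a hand-coded sketch: it recognizes the three handles directly on the stack instead of consulting generated LR tables, and the function name and trace format are assumptions.

```python
def shift_reduce(tokens):
    """Parse with the <id_list_prefix> grammar; return (accepted, trace)."""
    stack, trace = [], []
    for tok in tokens:
        # Shift: push the next token (identifiers become the terminal 'id').
        stack.append("id" if tok.isidentifier() else tok)
        trace.append(("shift", tok))
        # Reduce for as long as a handle sits on top of the stack.
        while True:
            if stack[-3:] == ["<id_list_prefix>", ",", "id"]:
                stack[-3:] = ["<id_list_prefix>"]
                trace.append(("reduce", "<id_list_prefix> := <id_list_prefix> , id"))
            elif stack == ["id"]:
                stack[:] = ["<id_list_prefix>"]
                trace.append(("reduce", "<id_list_prefix> := id"))
            elif stack == ["<id_list_prefix>", ";"]:
                stack[:] = ["<id_list>"]
                trace.append(("reduce", "<id_list> := <id_list_prefix> ;"))
            else:
                break
    return stack == ["<id_list>"], trace

accepted, trace = shift_reduce("sum , a1 , ptr ;".split())
for action, detail in trace:
    print(action, detail)
print("accepted:", accepted)
```

Because the grammar is left-recursive, each identifier is reduced as soon as it is shifted, so the stack never grows beyond three symbols.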
• Simple Calculator Language
3+(4*1)
total := 7
read n
write ( 10 – ( total + 1 ) / 3 * n )
• Simple Arithmetic Expression
E → E + E | E – E
  | E * E | E / E
E → id | number | ( E )
• Simple Arithmetic Expression
expr → term | expr add_op term
term → factor | term mult_op factor
factor → id | number | ( expr )
add_op → + | -
mult_op → * | /
– An LL language, but not an LL grammar (it is an LR grammar)
– Two most common obstacles to “LL(1)-ness”
• Left-recursion
stmt_list → stmt_list stmt | є
• Common prefixes
stmt → id := expr
stmt → id ( arg_list )
• Converting to LL-Grammars
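– Left-recursion: make the rule right-recursive
stmt_list → stmt_list stmt | є
becomes
stmt_list → stmt stmt_list | є
– Common prefixes: left-factor
stmt → id := expr
stmt → id ( arg_list )
becomes
stmt → id stmt_list_tail
stmt_list_tail → := expr | ( arg_list )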
– Alternatively, you can employ conflict-resolution rules.
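Removing immediate left recursion is mechanical: A → A α | β becomes A → β A_tail and A_tail → α A_tail | є. The sketch below applies that textbook transformation to one nonterminal. The function name and the "_tail" suffix are assumptions, and indirect left recursion is not handled.

```python
def eliminate_left_recursion(nonterminal, productions):
    """Rewrite A -> A a | b as A -> b A_tail, A_tail -> a A_tail | (empty).

    Each production is a list of symbols; [] stands for the empty string.
    """
    tail = nonterminal + "_tail"  # hypothetical naming convention
    recursive = [p[1:] for p in productions if p and p[0] == nonterminal]
    others = [p for p in productions if not p or p[0] != nonterminal]
    if not recursive:
        return {nonterminal: productions}  # nothing to do
    return {
        nonterminal: [beta + [tail] for beta in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }

# expr -> term | expr add_op term
for lhs, alts in eliminate_left_recursion(
        "expr", [["term"], ["expr", "add_op", "term"]]).items():
    print(lhs, "->", " | ".join(" ".join(a) or "є" for a in alts))
```

Applied to expr → term | expr add_op term, this yields expr → term expr_tail and expr_tail → add_op term expr_tail | є, the same shape as a hand-written tail rule.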
• Converted LL(1) Grammar
expr → term term_tail
term_tail → add_op term term_tail | є
term → factor factor_tail
factor_tail → mult_op factor factor_tail | є
factor → ( expr ) | id | number
add_op → + | -
mult_op → * | /
Not every CFG can be converted to an LL grammar. Why?
(Diagram: LL inside CFL.)
• LL(1) for Simple Calculator Language
program → stmt_list $$
stmt_list → stmt stmt_list | є
stmt → id := expr | read id | write expr
expr → term term_tail
term_tail → add_op term term_tail | є
term → factor factor_tail
factor_tail → mult_op factor factor_tail | є
factor → ( expr ) | id | number
add_op → + | -
mult_op → * | /
Added three more production rules to the previous LL(1)
grammar for expressions.
• LL Parsing
– Input program
read A
read B
sum := A + B
write sum
write sum / 2
• Predict Sets
program → stmt_list $$ {id, read, write, $$}
stmt_list → stmt stmt_list {id, read, write} | є {$$}
stmt → id := expr {id} | read id {read} | write expr {write}
expr → term term_tail {(, id, number}
term_tail → add_op term term_tail {+, -} | є {), id, read, write, $$}
term → factor factor_tail {(, id, number}
factor_tail → mult_op factor factor_tail {*, /} | є {+, -, ), id, read, write, $$}
factor → ( expr ) {(} | id {id} | number {number}
add_op → + {+} | - {-}
mult_op → * {*} | / {/}
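The PREDICT sets can be computed from FIRST and FOLLOW by iterating to a fixed point. The sketch below does that for the LL(1) calculator grammar. The dictionary encoding of the grammar (an empty list for є) and the function names are assumptions, and the loop is the plain textbook iteration, not an optimized algorithm.

```python
GRAMMAR = {
    "program":     [["stmt_list", "$$"]],
    "stmt_list":   [["stmt", "stmt_list"], []],
    "stmt":        [["id", ":=", "expr"], ["read", "id"], ["write", "expr"]],
    "expr":        [["term", "term_tail"]],
    "term_tail":   [["add_op", "term", "term_tail"], []],
    "term":        [["factor", "factor_tail"]],
    "factor_tail": [["mult_op", "factor", "factor_tail"], []],
    "factor":      [["(", "expr", ")"], ["id"], ["number"]],
    "add_op":      [["+"], ["-"]],
    "mult_op":     [["*"], ["/"]],
}

def first_of(seq, first, nullable):
    """FIRST of a symbol sequence, and whether the sequence can derive є."""
    out = set()
    for sym in seq:
        if sym not in GRAMMAR:          # terminal: it ends the scan
            out.add(sym)
            return out, False
        out |= first[sym]
        if sym not in nullable:
            return out, False
    return out, True

def predict_sets():
    first = {nt: set() for nt in GRAMMAR}
    follow = {nt: set() for nt in GRAMMAR}
    nullable = set()
    changed = True
    while changed:                      # iterate until no set grows
        changed = False
        for lhs, alts in GRAMMAR.items():
            for rhs in alts:
                f, eps = first_of(rhs, first, nullable)
                if not f <= first[lhs]:
                    first[lhs] |= f; changed = True
                if eps and lhs not in nullable:
                    nullable.add(lhs); changed = True
                for i, sym in enumerate(rhs):
                    if sym in GRAMMAR:  # FOLLOW is defined for nonterminals
                        rest_first, rest_eps = first_of(rhs[i + 1:], first, nullable)
                        new = rest_first | (follow[lhs] if rest_eps else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new; changed = True
    predict = {}
    for lhs, alts in GRAMMAR.items():
        for rhs in alts:
            f, eps = first_of(rhs, first, nullable)
            # PREDICT = FIRST(rhs), plus FOLLOW(lhs) when rhs can vanish.
            predict[(lhs, " ".join(rhs) or "є")] = f | (follow[lhs] if eps else set())
    return predict

for (lhs, rhs), tokens in predict_sets().items():
    print(f"{lhs} -> {rhs}  {sorted(tokens)}")
```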
• Predict Sets
stmt → id := expr {id}
stmt → read id {read}
stmt → write expr {write}
– Notice the pairwise disjoint sets:
{id}, {read}, {write}
– Suppose you are to expand stmt.
– Looking ahead 1 token (LL(1)) is enough to pick the production.
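The disjointness check and the one-token choice can be written down directly. In this sketch the PREDICT sets for stmt are copied from the slide, while the helper names is_ll1_choice and choose are hypothetical.

```python
from itertools import combinations

# PREDICT sets for the three stmt productions, as listed on the slide.
STMT_PREDICT = {
    "id := expr": {"id"},
    "read id":    {"read"},
    "write expr": {"write"},
}

def is_ll1_choice(predict):
    """True if the PREDICT sets of the alternatives are pairwise disjoint."""
    return all(a.isdisjoint(b) for a, b in combinations(predict.values(), 2))

def choose(predict, lookahead):
    """Pick the production to expand from one look-ahead token."""
    matches = [rhs for rhs, tokens in predict.items() if lookahead in tokens]
    if len(matches) != 1:
        raise SyntaxError(f"cannot predict a production on {lookahead!r}")
    return matches[0]

print(is_ll1_choice(STMT_PREDICT))
print(choose(STMT_PREDICT, "read"))
# Two alternatives with a common prefix share the same PREDICT set:
print(is_ll1_choice({"id := expr": {"id"}, "id ( arg_list )": {"id"}}))
```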
• LL(1)
program → stmt_list $$
stmt_list → stmt stmt_list | є
stmt → id := expr | read id | write expr
expr → term term_tail
term_tail → add_op term term_tail | є
term → factor factor_tail
factor_tail → mult_op factor factor_tail | є
factor → ( expr ) | id | number
add_op → + | -
mult_op → * | /
• Better grammar: LR(1)
program → stmt_list $$
stmt_list → stmt_list stmt | stmt
stmt → id := expr | read id | write expr
expr → term | expr add_op term
term → factor | term mult_op factor
factor → id | number | ( expr )
add_op → + | -
mult_op → * | /
– More intuitive than LL
• However, not exactly the same language (no empty string)
– Left recursion is advantageous.
• LR Parsing
– With the same input program,
read A
read B
sum := A + B
write sum
write sum / 2
• State Transition Diagram
State 0 (Initial state)
program → ● stmt_list $$
stmt_list → ● stmt_list stmt
stmt_list → ● stmt
stmt → ● id := expr
stmt → ● read id
stmt → ● write expr
– on stmt: go to State 0’
– on stmt_list: go to State 2
– on read: go to State 1
State 0’
stmt_list → stmt ●
Reduce (shifting stmt_list)
State 1
stmt → read ● id
– on id: go to State 1’
State 1’
stmt → read id ●
Reduce (shifting stmt from a viewpoint of State 0)
State 2
program → stmt_list ● $$
stmt_list → stmt_list ● stmt
stmt → ● id := expr
stmt → ● read id
stmt → ● write expr
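The item sets in these states follow from two standard operations on LR items, closure and goto. The sketch below computes them for the statement-level productions only. The expression productions are left out, so states reached through expr would be incomplete. The tuple encoding of items and the function names are assumptions.

```python
# Statement-level productions of the LR grammar (expression rules omitted).
GRAMMAR = [
    ("program",   ("stmt_list", "$$")),
    ("stmt_list", ("stmt_list", "stmt")),
    ("stmt_list", ("stmt",)),
    ("stmt",      ("id", ":=", "expr")),
    ("stmt",      ("read", "id")),
    ("stmt",      ("write", "expr")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    """Add X -> ● rhs for every nonterminal X that appears right after a dot."""
    items = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for l, r in GRAMMAR:
                if l == rhs[dot] and (l, r, 0) not in items:
                    items.add((l, r, 0))
                    work.append((l, r, 0))
    return frozenset(items)

def goto(state, symbol):
    """Move the dot over symbol in every item that allows it, then close."""
    moved = {(l, r, d + 1) for l, r, d in state if d < len(r) and r[d] == symbol}
    return closure(moved)

def show(item):
    lhs, rhs, dot = item
    return f"{lhs} → " + " ".join(rhs[:dot] + ("●",) + rhs[dot:])

state0 = closure({("program", ("stmt_list", "$$"), 0)})
for name, state in [("State 0", state0),
                    ("on stmt", goto(state0, "stmt")),
                    ("on read", goto(state0, "read")),
                    ("on stmt_list", goto(state0, "stmt_list"))]:
    print(name)
    for line in sorted(show(i) for i in state):
        print("   ", line)
```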
• Shift/Reduce Conflicts
expr → … ● term
factor → id ●
• Reduce/Reduce Conflicts
expr → id ●
factor → id ●
• Resolving Conflicts
• LR(0)
– Any LR language has an LR(0) grammar (with $$).
– Not practical: prohibitively large and unintuitive
• SLR
– SLR grammar: no shift/reduce or reduce/reduce conflicts when
using FOLLOW sets
– FOLLOW sets: also used in LL to generate PREDICT sets
• LALR(1)
– LALR(1) grammar (may not be SLR)
– Same states as SLR
– Improvement over SLR with local look-ahead
– LALR’s are the most common parsers in practice.
• LR(1)
– LR(1) grammars (may not be LALR(1) or SLR)