
Compilation
0368-3133
Lecture 3:
Syntax Analysis
Top Down parsing
Bottom Up parsing
Noam Rinetzky
1
The Real Anatomy of a Compiler
[Figure: the compiler pipeline]
Source text → (Lexical Analysis) → tokens → (Syntax Analysis) → AST → (Semantic Analysis) → annotated AST → (Intermediate code generation) → IR → (Intermediate code optimization) → IR → (Code generation) → symbolic instructions → (Target code optimization) → SI → (Machine code generation) → MI → (Write executable output) → executable code (exe)
2
Context free grammars (CFG)
G = (V,T,P,S)
• V – non-terminals (syntactic variables)
• T – terminals (tokens)
• P – derivation rules
– Each rule of the form V → (T ∪ V)*
• S – start symbol
3
Pushdown Automata (PDA)
• Nondeterministic PDAs define all CFLs
• Deterministic PDAs model parsers.
– Most programming languages can be parsed by a
deterministic PDA
– Efficient implementation
4
CFG terminology
S → S ; S
S → id := E
E → id
E → num
E → E + E
Symbols:
Terminals (tokens): ; := ( ) id num print
Non-terminals: S E L
Start non-terminal: S
Convention: the non-terminal appearing
in the first derivation rule
Grammar productions (rules)
N → μ
5
CFG terminology
• Derivation - a sequence of replacements of
non-terminals using the derivation rules
• Language - the set of strings of terminals
derivable from the start symbol
• Sentential form - the result of a partial
derivation
– May contain non-terminals
6
Derivations
• Show that a sentence ω is in a grammar G
– Start with the start symbol
– Repeatedly replace one of the non-terminals
by a right-hand side of a production
– Stop when the sentence contains only
terminals
• Given a sentential form αNβ and a rule N → µ:
αNβ ⇒ αµβ
• ω is in L(G) if S =>* ω
7
Parse trees: Traces of
Derivations
• Tree nodes are symbols, children ordered left-to-right
• Each internal node is non-terminal
– Children correspond to one of its productions
– A node labeled N whose children are µ1 … µk (left to right) corresponds to the production N → µ1 … µk
• Root is start non-terminal
• Leaves are tokens
• Yield of parse tree: left-to-right walk over leaves
8
Parse Tree
Derivation:
S ⇒ S ; S ⇒ id := E ; S ⇒ id := id ; S ⇒ id := id ; id := E ⇒ id := id ; id := E + E ⇒ id := id ; id := E + id ⇒ id := id ; id := id + id
Input: x := z ; y := x + z
[Figure: the corresponding parse tree, with the two assignments as children of the ';' node]
9
Ambiguity
S S ; S
S  id := E | …
E  id | E + E | E * E | …
x := y+z*w
S
id
S
:=
E
E
+
E
id
E
*
id
id
:=
E
*
E
E
E
id
id
+
E
E
id
id
10
Leftmost/rightmost Derivation
• Leftmost derivation
– always expand leftmost non-terminal
• Rightmost derivation
– Always expand rightmost non-terminal
• Allows us to describe derivation by listing the
sequence of rules
– we always know which symbol each rule is applied to
11
Broad kinds of parsers
• Parsers for arbitrary grammars
– Earley’s method, CYK method
– Usually, not used in practice (though might change)
• Top-down parsers
– Construct the parse tree in a top-down manner
– Find the leftmost derivation
• Bottom-up parsers
– Construct parse tree in a bottom-up manner
– Find the rightmost derivation, in reverse order
12
Top-Down Parsing: Predictive parsing
• Recursive descent
• LL(k) grammars
13
Predictive parsing
• Given a grammar G and a word w, attempt to derive
w using G
• Idea
– Apply production to leftmost nonterminal
– Pick production rule based on next input token
• General grammar
– More than one option for choosing the next
production based on a token
• Restricted grammars (LL)
– Know exactly which single rule to apply
– May require some lookahead to decide
14
Recursive descent parsing
• Define a function for every nonterminal
• Every function works as follows
– Find applicable production rule
– Terminal function checks match with next
input token
– Nonterminal function calls (recursively) other
functions
• If there are several applicable productions
for a nonterminal, use lookahead
15
Recursive descent
void A() {
  choose an A-production, A → X1 X2 … Xk;
  for (i = 1; i ≤ k; i++) {
    if (Xi is a nonterminal)
      call procedure Xi();
    else if (Xi == current)
      advance input;
    else
      report error;
  }
}
• How do you pick the right A-production?
• Generally – try them all and use
backtracking
• In our case – use lookahead
16
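As a concrete illustration of the scheme above, here is a minimal, hypothetical Python sketch of a recursive-descent parser with one token of lookahead for the toy grammar A → a A b | c that is used in later slides; the class and method names (Parser, parse_A, match) are illustrative and not part of the lecture.

# Minimal recursive-descent sketch for the grammar A -> a A b | c.
# Names and structure are illustrative, not from the slides.
class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens          # e.g. list("aacbb") + ["$"]
        self.pos = 0

    @property
    def current(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.current != t:
            raise ParseError(f"expected {t!r}, got {self.current!r}")
        self.pos += 1

    def parse_A(self):
        # one token of lookahead picks the production
        if self.current == "a":       # A -> a A b
            self.match("a")
            self.parse_A()
            self.match("b")
        elif self.current == "c":     # A -> c
            self.match("c")
        else:
            raise ParseError(f"unexpected token {self.current!r}")

p = Parser(list("aacbb") + ["$"])
p.parse_A()
p.match("$")   # accept only if the whole input was consumed
print("accepted")

With one token of lookahead the choice between the two A-productions is deterministic, so no backtracking is needed for this grammar.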
Problem 1: productions with common prefix
term → ID | indexed_elem
indexed_elem → ID [ expr ]
• The function for indexed_elem will never be tried…
– What happens for input of the form ID[expr]
17
Problem 2: null productions
S → A a b
A → a | ε

int S() {
  return A() && match(token('a')) && match(token('b'));
}
int A() {
  return match(token('a')) || 1;
}

• What happens for input “ab”?
• What happens if you flip the order of the alternatives and try “aab”?
18
Problem 3: left recursion
p. 127
E → E - term | term

int E() {
  return E() && match(token('-')) && term();
}

• What happens with this procedure?
• Recursive descent parsers cannot handle left-recursive grammars
19
Constructing an LL(1) Parser
20
FIRST sets
X → Y Y | Z Z | Y Z | 1 Y
Y → 4 | ε
Z → 2
L(Z) = {2}
L(Y) = {4, ε}
L(X) = {44, 4, ε, 22, 42, 2, 14, 1}
21
FIRST sets
• FIRST(X) = { t | X ⇒* tβ } ∪ { ε | X ⇒* ε }
– FIRST(X) = all terminals that can appear first in some derivation from X
• plus ε if ε can be derived from X
• Example:
– FIRST( LIT ) = { true, false }
– FIRST( ( E OP E ) ) = { ‘(’ }
– FIRST( not E ) = { not }
E → LIT | (E OP E) | not E
LIT → true | false
OP → and | or | xor
23
Computing FIRST sets
• FIRST(t) = { t } // “t” terminal
• ε ∈ FIRST(X) if
– X → ε, or
– X → A1 … Ak and ε ∈ FIRST(Ai) for i = 1…k
• FIRST(α) ⊆ FIRST(X) if
– X → A1 … Ak α and ε ∈ FIRST(Ai) for i = 1…k
25
Computing FIRST sets
• Assume no null productions A → ε
1. Initially, for all nonterminals A, set
   FIRST(A) = { t | A → tω for some ω }
2. Repeat the following until no changes occur:
   for each nonterminal A
     for each production A → Bω
       set FIRST(A) = FIRST(A) ∪ FIRST(B)
• This is known as fixed-point computation
26
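A small runnable sketch of this fixed-point computation (in Python, handling ε as well, so it also covers the general rules from the previous slide) may make the iteration concrete; the grammar encoding and all names below are assumptions for illustration only.

# Sketch of the FIRST-set fixed-point computation (handles epsilon);
# grammar representation and names are illustrative.
EPS = "eps"

grammar = {   # nonterminal -> list of productions (tuples of symbols)
    "E":   [("LIT",), ("(", "E", "OP", "E", ")"), ("not", "E")],
    "LIT": [("true",), ("false",)],
    "OP":  [("and",), ("or",), ("xor",)],
}
terminals = {"(", ")", "not", "true", "false", "and", "or", "xor"}

def first_of_seq(seq, FIRST):
    """FIRST of a sentential form (a tuple of symbols)."""
    out = set()
    for X in seq:
        fx = {X} if X in terminals else FIRST[X]
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)          # every symbol of seq can derive epsilon
    return out

FIRST = {A: set() for A in grammar}
changed = True
while changed:            # iterate until a fixed point is reached
    changed = False
    for A, prods in grammar.items():
        for prod in prods:
            new = first_of_seq(prod, FIRST)
            if not new <= FIRST[A]:
                FIRST[A] |= new
                changed = True

print(FIRST)   # e.g. FIRST['E'] == {'true', 'false', '(', 'not'}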
FIRST sets computation example
STMT → if EXPR then STMT
     | while EXPR do STMT
     | EXPR ;
EXPR → TERM -> id
     | zero? TERM
     | not EXPR
     | ++ id
     | -- id
TERM → id
     | constant

Compute FIRST(STMT), FIRST(EXPR), FIRST(TERM).
27
1. Initialization
(Same grammar as above.)
FIRST(STMT) = { if, while }
FIRST(EXPR) = { zero?, not, ++, -- }
FIRST(TERM) = { id, constant }
28
2. Iterate 1
(Same grammar as above.)
FIRST(STMT) = { if, while } ∪ { zero?, not, ++, -- }   (from STMT → EXPR ;)
FIRST(EXPR) = { zero?, not, ++, -- }
FIRST(TERM) = { id, constant }
29
2. Iterate 2
(Same grammar as above.)
FIRST(STMT) = { if, while, zero?, not, ++, -- }
FIRST(EXPR) = { zero?, not, ++, -- } ∪ { id, constant }   (from EXPR → TERM -> id)
FIRST(TERM) = { id, constant }
30
2. Iterate 3 – fixed-point
(Same grammar as above.)
FIRST(STMT) = { if, while, zero?, not, ++, --, id, constant }
FIRST(EXPR) = { zero?, not, ++, --, id, constant }
FIRST(TERM) = { id, constant }
No further changes – a fixed point has been reached.
31
FOLLOW sets
p. 189
• What do we do with nullable (ε) productions?
– e.g., A → B C D with B → ε, C → ε
– Use what comes afterwards to predict the right production
• For every production rule A → α
– FOLLOW(A) = set of tokens that can immediately follow A
• Can predict the alternative Ak for a non-terminal N when the lookahead token is in the set
– FIRST(Ak) ∪ (FOLLOW(N) if Ak is nullable)
32
FOLLOW sets: Constraints
• $ ∈ FOLLOW(S)
• FIRST(β) – {ε} ⊆ FOLLOW(X)
– For each A → α X β
• FOLLOW(A) ⊆ FOLLOW(X)
– For each A → α X β with ε ∈ FIRST(β) (including β empty)
33
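A matching sketch of the FOLLOW fixed point, in the same hypothetical Python encoding and reusing the first_of_seq helper from the FIRST sketch above; names are illustrative.

# Sketch of the FOLLOW-set fixed-point computation; assumes the grammar,
# terminals and first_of_seq from the earlier FIRST sketch.
def compute_follow(grammar, start, terminals, FIRST, EPS="eps"):
    FOLLOW = {A: set() for A in grammar}
    FOLLOW[start].add("$")                        # $ in FOLLOW(start)
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                for i, X in enumerate(prod):
                    if X in terminals:
                        continue
                    beta = prod[i + 1:]
                    fb = first_of_seq(beta, FIRST)
                    add = fb - {EPS}              # FIRST(beta) - {eps} into FOLLOW(X)
                    if EPS in fb:                 # beta nullable (or empty):
                        add = add | FOLLOW[A]     #   FOLLOW(A) into FOLLOW(X)
                    if not add <= FOLLOW[X]:
                        FOLLOW[X] |= add
                        changed = True
    return FOLLOW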
Example: FOLLOW sets
• E → T X
• T → ( E ) | int Y
• X → + E | ε
• Y → * T | ε

FOLLOW sets of terminals:
  +   : { int, ( }
  (   : { int, ( }
  *   : { int, ( }
  )   : { +, ), $ }
  int : { *, +, ), $ }

FOLLOW sets of non-terminals:
  E : { ), $ }
  T : { +, ), $ }
  X : { ), $ }
  Y : { +, ), $ }
34
Prediction Table
• For a production A → α:
• T[A,t] = α if t ∈ FIRST(α)
• T[A,t] = α if ε ∈ FIRST(α) and t ∈ FOLLOW(A)
– t can also be $
• If T is not well defined (some entry holds more than one production), the grammar is not LL(1)
35
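Continuing the same hypothetical Python encoding (and the FIRST/FOLLOW sketches above), building the prediction table and detecting LL(1) conflicts might look like this; names are illustrative.

# Sketch: building the LL(1) prediction table T[A, t] and detecting
# conflicts, given FIRST/FOLLOW as sketched above.
def build_ll1_table(grammar, terminals, FIRST, FOLLOW, EPS="eps"):
    table = {}                                    # (A, t) -> production
    for A, prods in grammar.items():
        for prod in prods:
            fa = first_of_seq(prod, FIRST)
            lookaheads = fa - {EPS}
            if EPS in fa:                         # nullable alternative:
                lookaheads |= FOLLOW[A]           #   also predicted on FOLLOW(A)
            for t in lookaheads:                  # t may also be '$'
                if (A, t) in table and table[(A, t)] != prod:
                    raise ValueError(f"not LL(1): conflict at T[{A},{t}]")
                table[(A, t)] = prod
    return table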
LL(k) grammars
• A grammar is in the class LL(k) when it can be derived via:
– Top-down derivation
– Scanning the input from left to right (L)
– Producing the leftmost derivation (L)
– With lookahead of k tokens (k)
• A language is said to be LL(k) when it has an
LL(k) grammar
36
LL(1) grammars
• A grammar is in the class LL(1) iff
– For every two productions A → α and A → β we have
• FIRST(α) ∩ FIRST(β) = {} // including ε
• If ε ∈ FIRST(α) then FIRST(β) ∩ FOLLOW(A) = {}
• If ε ∈ FIRST(β) then FIRST(α) ∩ FOLLOW(A) = {}
37
Problem: Non LL Grammars
S → A a b
A → a | ε

bool S() {
  return A() && match(token('a')) && match(token('b'));
}
bool A() {
  return match(token('a')) || true;
}

• What happens for input “ab”?
• What happens if you flip the order of the alternatives and try “aab”?
38
Problem: Non LL Grammars
S → A a b
A → a | ε
• FIRST(S) = { a }
• FIRST(A) = { a, ε }
FOLLOW(S) = { $ }
FOLLOW(A) = { a }
• FIRST/FOLLOW conflict
39
Back to problem 1
term → ID | indexed_elem
indexed_elem → ID [ expr ]
• FIRST(term) = { ID }
• FIRST(indexed_elem) = { ID }
• FIRST/FIRST conflict
40
Solution: left factoring
• Rewrite the grammar to be in LL(1)
term → ID | indexed_elem
indexed_elem → ID [ expr ]

term → ID after_ID
after_ID → [ expr ] | ε
Intuition: just like factoring x*y + x*z into x*(y+z)
41
Left factoring – another example
S → if E then S else S
  | if E then S
  | T

S → if E then S S’
  | T
S’ → else S | ε
42
Back to problem 2
S → A a b
A → a | ε
• FIRST(S) = { a }
• FIRST(A) = { a, ε }
FOLLOW(S) = { $ }
FOLLOW(A) = { a }
• FIRST/FOLLOW conflict
43
Solution: substitution
S → A a b
A → a | ε
Substitute A in S:
S → a a b | a b
Left factoring:
S → a after_A
after_A → a b | b
44
Back to problem 3
E → E - term | term
• Left recursion cannot be handled with a
bounded lookahead
• What can we do?
45
Left recursion removal
p. 130
G1:  N → Nα | β
G2:  N → βN’
     N’ → αN’ | ε
• L(G1) = β, βα, βαα, βααα, …
• L(G2) = same
Can be done algorithmically.
Problem: the grammar becomes mangled beyond recognition.
• For our 3rd example:
E → E - term | term      becomes      E → term TE
                                      TE → - term TE | ε
46
LL(k) Parsers
• Recursive Descent
– Manual construction
– Uses recursion
• Wanted
– A parser that can be generated automatically
– Does not use recursion
47
LL(k) parsing via pushdown
automata
• Pushdown automaton uses
– Prediction stack
– Input stream
– Transition table
• nonterminals x tokens -> production alternative
• Entry indexed by nonterminal N and token t contains the alternative of N that must be predicted when the current input starts with t
48
LL(k) parsing via pushdown
automata
• Two possible moves
– Prediction
• When top of stack is nonterminal N, pop N, lookup table[N,t].
If table[N,t] is not empty, push table[N,t] on prediction stack,
otherwise – syntax error
– Match
• When top of prediction stack is a terminal T, must be equal to
next input token t. If (t == T), pop T and consume t. If (t ≠ T)
syntax error
• Parsing terminates when prediction stack is empty
– If input is empty at that point, success. Otherwise,
syntax error
49
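A compact sketch of this table-driven predictive parser, written in Python for the grammar A → aAb | c whose run is traced a few slides below; the table encoding and names are illustrative assumptions.

# Sketch of the non-recursive predictive parser: prediction stack +
# transition table, for the grammar A -> a A b | c.
TABLE = {("A", "a"): ["a", "A", "b"],
         ("A", "c"): ["c"]}
NONTERMINALS = {"A"}

def ll1_parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "A"]                  # start symbol on top, $ at the bottom
    i = 0
    while stack:
        top = stack.pop()
        t = tokens[i]
        if top in NONTERMINALS:         # prediction move
            prod = TABLE.get((top, t))
            if prod is None:
                return False            # empty table entry -> syntax error
            stack.extend(reversed(prod))
        else:                           # match move (terminals, including $)
            if top != t:
                return False
            i += 1
    return i == len(tokens)             # success iff all input consumed

print(ll1_parse(list("aacbb")))   # True
print(ll1_parse(list("abcbb")))   # False

Running it on aacbb$ performs exactly the predict/match sequence shown in the "Running parser example" slide below.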
Example transition table
(1) E → LIT
(2) E → ( E OP E )
(3) E → not E
(4) LIT → true
(5) LIT → false
(6) OP → and
(7) OP → or
(8) OP → xor

Rows: nonterminals; columns: input tokens; an entry tells which rule should be used.

        (    )   not  true  false  and  or  xor  $
E       2        3    1     1
LIT                   4     5
OP                                 6    7   8
50
Model of non-recursive
predictive parser
[Figure: the predictive parsing program reads the input (a + b $), maintains a prediction stack (X Y Z $), consults the parsing table, and produces output]
51
Running parser example
A → aAb | c        Input: aacbb$

Prediction table:
        a          b    c
   A    A → aAb         A → c

Input suffix   Stack content   Move
aacbb$         A$              predict(A,a) = A → aAb
aacbb$         aAb$            match(a,a)
acbb$          Ab$             predict(A,a) = A → aAb
acbb$          aAbb$           match(a,a)
cbb$           Abb$            predict(A,c) = A → c
cbb$           cbb$            match(c,c)
bb$            bb$             match(b,b)
b$             b$              match(b,b)
$              $               match($,$) – success
52
Errors
53
Handling Syntax Errors
• Report and locate the error
• Diagnose the error
• Correct the error
• Recover from the error in order to discover more errors
– without reporting too many “strange” errors
54
Error Diagnosis
• Line number
– may be far from the actual error
• The current token
• The expected tokens
• Parser configuration
55
Error Recovery
• Becomes less important in interactive
environments
• Example heuristics:
– Search for a semicolon and ignore the statement
– Try to “replace” tokens for common errors
– Refrain from reporting 3 subsequent errors
• Globally optimal solutions
– For every input w, find a valid program w’ with a
“minimal-distance” from w
56
Illegal input example
A → aAb | c        Input: abcbb$

Prediction table:
        a          b    c
   A    A → aAb         A → c

Input suffix   Stack content   Move
abcbb$         A$              predict(A,a) = A → aAb
abcbb$         aAb$            match(a,a)
bcbb$          Ab$             predict(A,b) = ERROR
57
Error handling in LL parsers
S → a c | b S        Input: c$

Prediction table:
        a          b          c
   S    S → a c    S → b S

Input suffix   Stack content   Move
c$             S$              predict(S,c) = ERROR

• Now what?
– Predict b S anyway: “missing token b inserted in line XXX”
58
Error handling in LL parsers
S → a c | b S        Input: c$ (after inserting the missing b: bc$)

Input suffix   Stack content   Move
bc$            S$              predict(S,b) = S → b S
bc$            bS$             match(b,b)
c$             S$              Looks familiar?

• Result: infinite loop
59
Error handling and recovery
• x = a * (p+q * ( -b * (r-s);
• Where should we report the error?
• The valid prefix property
60
The Valid Prefix Property
• For every prefix of tokens t1, t2, …, ti that the parser identifies as legal:
• there exist tokens ti+1, ti+2, …, tn such that t1, t2, …, tn is a syntactically valid program
• If every token is considered as a single character:
– For every prefix word u that the parser identifies as legal, there exists w such that u·w is a valid program
61
Recovery is tricky
• Heuristics for dropping tokens, skipping to
semicolon, etc.
62
Building the Parse Tree
63
Adding semantic actions
• Can add an action to perform on each
production rule
• Can build the parse tree
– Every function returns an object of type Node
– Every Node maintains a list of children
– Function calls can add new children
64
Building the parse tree
Node E() {
  result = new Node();
  result.name = “E”;
  if (current ∈ {TRUE, FALSE}) {            // E → LIT
    result.addChild(LIT());
  } else if (current == LPAREN) {           // E → ( E OP E )
    result.addChild(match(LPAREN));
    result.addChild(E());
    result.addChild(OP());
    result.addChild(E());
    result.addChild(match(RPAREN));
  } else if (current == NOT) {              // E → not E
    result.addChild(match(NOT));
    result.addChild(E());
  } else
    error;
  return result;
}
65
Bottom-up parsing
67
Intuition: Bottom-Up Parsing
• Begin with the user's program
• Guess parse (sub)trees
• Check if root is the start symbol
68
Bottom-up parsing
Unambiguous grammar:
E → E * T
E → T
T → T + F
T → F
F → id
F → num
F → ( E )

Input: 1 + 2 * 3
69
Bottom-up parsing
(Same unambiguous grammar as above.)
[Figure: a parse subtree F is guessed over the token 1; input 1 + 2 * 3]
70
Bottom-up parsing
(Same unambiguous grammar as above.)
[Figure: more subtrees are guessed bottom-up: F over each of 1, 2 and 3, T over F(1), and T over T + F(2); input 1 + 2 * 3]
71
Top-Down vs Bottom-Up
• Top-down (predict match / scan-complete)
A → aAb | c        Input: aacbb$
[Figure: a partially built parse tree grown from the root A; the frontier separates the input already read from the input still to be read]
72
Top-Down vs Bottom-Up
• Top-down (predict match / scan-complete): the tree is grown from the root downward
• Bottom-up (shift reduce): the tree is grown from the leaves upward
A → aAb | c        Input: aacbb$
[Figure: the same partially built parse tree, constructed top-down vs. bottom-up]
73
Bottom-up parsing:
LR(k) Grammars
• A grammar is in the class LR(k) when it can be derived via:
– Bottom-up derivation
– Scanning the input from left to right (L)
– Producing the rightmost derivation (R)
– With lookahead of k tokens (k)
• A language is said to be LR(k) if it has an LR(k)
grammar
• The simplest case is LR(0), which we will discuss
74
Terminology: Reductions & Handles
• The opposite of derivation is called
reduction
– Let A → α be a production rule
– Derivation: βAµ ⇒ βαµ
– Reduction: βαµ ⇒ βAµ
• A handle is the reduced substring
– α is the handle of βαµ
75
Goal: Reduce the Input to the Start Symbol
E→E*B|E+B|B
B→0|1
Example:
0+0*1
B+0*1
E+0*1
E+B*1
E*1
E*B
E
[Figure: the parse tree for 0+0*1 built by these reductions]
Go over the input so far, and upon seeing a right-hand side of a rule, “invoke”
the rule and replace the right-hand side with the left-hand side (reduce)
76
Use Shift & Reduce
In each stage, we
shift a symbol from the input to the stack, or
reduce according to one of the rules.
77
Use Shift & Reduce
Example: “0+0*1”
In each stage, we shift a symbol from the input to the stack,
or reduce according to one of the rules.
E → E * B | E + B | B
B → 0 | 1

Stack   Input     Action
        0+0*1$    shift
0       +0*1$     reduce (B → 0)
B       +0*1$     reduce (E → B)
E       +0*1$     shift
E+      0*1$      shift
E+0     *1$       reduce (B → 0)
E+B     *1$       reduce (E → E + B)
E       *1$       shift
E*      1$        shift
E*1     $         reduce (B → 1)
E*B     $         reduce (E → E * B)
E       $         accept
78
How does the parser know what to do?
[Figure: the token stream ( ( 23 + 7 ) * x ), i.e. LP LP Num(23) OP(+) Num(7) RP OP(*) Id(x) RP, is the input to a parser that consults an action table and a goto table, maintains a stack, and produces output]
79
How does the parser know what to do?
• A state will keep the info gathered on handle(s)
– A state in the “control” of the PDA
– Also (part of) the stack alphabet
– A state is a set of LR(0) items
• A table will tell it “what to do” based on the current state and the next token
– The transition function of the PDA
• A stack will record the “nesting level”
– Prefixes of handles
80
LR item
N → α • β
α – already matched;  β – still to be matched against the remaining input
Hypothesis: αβ may be a possible handle; so far we’ve matched α, expecting to see β
81
Example: LR(0) Items
• All items can be obtained by placing a dot at
every position for every production:
Grammar:
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

LR(0) items:
1: S → • E $      6: E → • E + T      11: T → id •
2: S → E • $      7: E → E • + T      12: T → • ( E )
3: S → E $ •      8: E → E + • T      13: T → ( • E )
4: E → • T        9: E → E + T •      14: T → ( E • )
5: E → T •        10: T → • id        15: T → ( E ) •
82
LR(0) items
N → α • β    (the dot is not at the end)   – shift item
N → αβ •                                   – reduce item
83
States and LR(0) Items
E→E*B|E+B|B
B→0|1
• The state will “remember” the potential derivation rules
given the part that was already identified
• For example, if we have already identified E then the state
will remember the two alternatives:
(1) E → E * B, (2) E → E + B
• Actually, we will also remember where we are in each of
them:
(1) E → E ● * B,
(2) E → E ● + B
• A derivation rule with a location marker is called LR(0)
item.
• The state is actually a set of LR(0) items. E.g.,
q13 = { E → E ● * B , E → E ● + B}
84
Intuition
• Gather input token by token until we find a
right-hand side of a rule and then replace it
with the non-terminal on the left hand side
– Going over a token and remembering it in the
stack is a shift
• Each shift moves to a state that remembers what
we’ve seen so far
– A reduce replaces a string in the stack with the
non-terminal that derives it
85
Model of an LR parser
[Figure: the LR parsing program reads the input (id + id + id $), maintains a stack of alternating states and symbols (e.g. 0 T 2 + 7 id 5), consults the action and goto tables, and produces output]
86
LR parser stack
• Sequence made of state, symbol pairs
• For instance a possible stack for the
grammar
S → E $
E → T
E → E + T
T → id
T → ( E )
could be: 0 T 2 + 7 id 5
Stack grows this way
87
Form of LR parsing table
Rows: states (0, 1, …). Columns: terminals (shift/reduce actions) and non-terminals (goto part).
Entries:
  sn  – shift and go to state n
  rk  – reduce by rule k
  acc – accept
  gm  – goto state m
  empty – error
88
LR parser table example
STATE   action                              goto
        id     +     (     )     $          E     T
0       s5           s7                     g1    g6
1              s3                s2
2                                acc
3       s5           s7                           g4
4       r3     r3    r3    r3    r3
5       r4     r4    r4    r4    r4
6       r2     r2    r2    r2    r2
7       s5           s7                     g8    g6
8              s3          s9
9       r5     r5    r5    r5    r5

(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )
89
Shift move
[Figure: the parser configuration — state q on top of the stack, next input token a]
• If action[q, a] = sn: shift
90
Result of shift
[Figure: after the shift — a and the new state n are pushed on top of the stack, and the input advances]
• If action[q, a] = sn: push a and n
91
Reduce move
[Figure: the parser configuration — state qn on top of the stack, next input token a; the topmost 2·|β| stack entries are about to be popped]
• If action[qn, a] = rk
• Production: (k) A → β
• If β = σ1 … σn, the top of the stack looks like q1 σ1 … qn σn
• goto[q, A] = qm
92
Result of reduce move
[Figure: after the reduce — the 2·|β| entries were popped, and A and qm were pushed]
• If action[qn, a] = rk
• Production: (k) A → β
• If β = σ1 … σn, the top of the stack looked like q1 σ1 … qn σn
• goto[q, A] = qm is the new state pushed on top of A
93
Accept move
[Figure: the parser configuration at the end of the input]
If action[q, a] = accept:
parsing completed
94
Error move
[Figure: the parser configuration]
If action[q, a] = error (usually an empty entry):
parsing discovered a syntax error
95
Example
Z → E $
E → T | E + T
T → i | ( E )
96
Example: parsing with LR items
Z → E $
E → T | E + T
T → i | ( E )

Input:  i + i $

Z → • E $
E → • T
E → • E + T
T → • i
T → • ( E )
Why do we need these additional LR items?
Where do they come from?
What do they mean?
97
ε-closure
• Given a set S of LR(0) items
• If P → α • N β is in S
• then for each rule N → α in the grammar, S must also contain N → • α

Z → E $
E → T | E + T
T → i | ( E )

ε-closure({Z → • E $}) = { Z → • E $,  E → • T,  E → • E + T,  T → • i,  T → • ( E ) }
98
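A small Python sketch of ε-closure over LR(0) items, using a (lhs, rhs, dot-position) encoding of items; the encoding and names are assumptions for illustration.

# Sketch of epsilon-closure over LR(0) items; an item is represented as
# (lhs, rhs, dot) with dot = index of the bullet.
GRAMMAR = {             # for Z -> E$,  E -> T | E+T,  T -> i | (E)
    "Z": [("E", "$")],
    "E": [("T",), ("E", "+", "T")],
    "T": [("i",), ("(", "E", ")")],
}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:     # dot before a nonterminal N
                N = rhs[dot]
                for prod in GRAMMAR[N]:                    # add N -> . prod
                    item = (N, prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

q0 = closure({("Z", ("E", "$"), 0)})
# q0 now contains Z -> .E$, E -> .T, E -> .E+T, T -> .i, T -> .(E)
print(len(q0))   # 5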
Example: parsing with LR items
Input:  i + i $

Z → E $
E → T | E + T
T → i | ( E )

Remember the position from which we’re trying to reduce.
Items denote possible future handles.

Current item set:
Z → • E $
E → • T
E → • E + T
T → • i
T → • ( E )
99
Example: parsing with LR items
Input:  i + i $   (current token: i)
(Same grammar.)

Match the items against the current token:
Z → • E $
E → • T
E → • E + T
T → • i
T → • ( E )

T → i •   – reduce item!
100
Example: parsing with LR items
After reducing T → i, the stack holds T (covering the first i); remaining input: + i $
(Same grammar.)
E → T •   – reduce item!
101
Example: parsing with LR items
After reducing E → T, the stack holds E (covering the first i); remaining input: + i $
(Same grammar.)
102
Example: parsing with LR items
With E on the stack and remaining input + i $, the relevant items are:
Z → E • $
E → E • + T
103
Example: parsing with LR items
After shifting +, the item E → E • + T advances and new items are predicted:
E → E + • T
T → • i
T → • ( E )
104
Example: parsing with LR items
After shifting the second i, the stack holds E + i; remaining input: $
(Same grammar and items as above.)
105
Example: parsing with LR items
T → i •   – reduce item!  Reducing T → i leaves E + T on the stack, and now
E → E + T •   is a reduce item as well.
106
Example: parsing with LR items
After reducing E → E + T, the stack holds a single E covering i + i; remaining input: $
Relevant items:
Z → E • $
E → E • + T
107
Example: parsing with LR items
After shifting $:
Z → E $ •   – reduce item!
108
Example: parsing with LR items
Reducing Z → E $ yields the start symbol Z covering the entire input – parsing succeeded.
109
GOTO/ACTION tables
empty entry – error move

GOTO table (states × grammar symbols) and ACTION table (per state):

State   i     +     (     )     $     E     T        action
q0      q5          q7                q1    q6       shift
q1            q3                q2                   shift
q2                                                   Z → E$
q3      q5          q7                      q4       shift
q4                                                   E → E+T
q5                                                   T → i
q6                                                   E → T
q7      q5          q7                q8    q6       shift
q8            q3          q9                         shift
q9                                                   T → (E)
110
LR(0) parser tables
• Two types of rows:
– Shift row – tells which state to GOTO for
current token
– Reduce row – tells which rule to reduce
(independent of current token)
• GOTO entries are blank
111
LR parser data structures
• Input – remainder of text to be processed
• Stack – sequence of pairs N, qi
– N – symbol (terminal or non-terminal)
– qi – state at which decisions are made
[Figure: input i + i $; after processing the first token the stack is q0 i q5]
• Initial stack contains q0
112
LR(0) pushdown automaton
• Two moves: shift and reduce
• Shift move
– Remove the first token from the input
– Push it on the stack
– Compute the next state based on the GOTO table
– Push the new state on the stack
– If the new state is error – report an error

[Figure: shifting i — the stack q0 becomes q0 i q5 and the input i + i $ becomes + i $ (GOTO row of q0: i → q5, action: shift)]
113
LR(0) pushdown automaton
• Reduce move
– Using a rule N → α
– The symbols of α and their following states are removed from the stack
– The new state is computed from the GOTO table (using the state at the top of the stack, before pushing N)
– N is pushed on the stack
– The new state is pushed on top of N

[Figure: reducing T → i — the stack q0 i q5 becomes q0 T q6; the input + i $ is unchanged (GOTO row of q0: T → q6)]
114
GOTO/ACTION table
State   i     +     (     )     $     E     T
q0      s5          s7                s1    s6
q1            s3                s2
q2      r1    r1    r1    r1    r1    r1    r1
q3      s5          s7                      s4
q4      r3    r3    r3    r3    r3    r3    r3
q5      r4    r4    r4    r4    r4    r4    r4
q6      r2    r2    r2    r2    r2    r2    r2
q7      s5          s7                s8    s6
q8            s3          s9
q9      r5    r5    r5    r5    r5    r5    r5

(1) Z → E $
(2) E → T
(3) E → E + T
(4) T → i
(5) T → ( E )

Warning: the numbers mean different things!
rn = reduce using rule number n
sm = shift to state m
115
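To make the shift/reduce moves concrete, here is a hypothetical Python sketch of the LR(0) driver for this grammar, using the GOTO/ACTION tables reconstructed above; the dictionary encoding and names are assumptions, not part of the slides.

# Sketch of the LR(0) pushdown-automaton driver for
#   (1) Z -> E $  (2) E -> T  (3) E -> E + T  (4) T -> i  (5) T -> ( E )
GOTO = {(0, "i"): 5, (0, "("): 7, (0, "E"): 1, (0, "T"): 6,
        (1, "+"): 3, (1, "$"): 2,
        (3, "i"): 5, (3, "("): 7, (3, "T"): 4,
        (7, "i"): 5, (7, "("): 7, (7, "E"): 8, (7, "T"): 6,
        (8, "+"): 3, (8, ")"): 9}
# In LR(0) the ACTION depends only on the state
ACTION = {0: "shift", 1: "shift", 3: "shift", 7: "shift", 8: "shift",
          2: ("Z", 2),            # reduce Z -> E $  (accept)
          4: ("E", 3),            # reduce E -> E + T
          5: ("T", 1),            # reduce T -> i
          6: ("E", 1),            # reduce E -> T
          9: ("T", 3)}            # reduce T -> ( E )

def lr0_parse(tokens):
    stack = [0]                               # alternating: state, symbol, state, ...
    i = 0
    while True:
        state = stack[-1]
        act = ACTION[state]
        if act == "shift":
            tok = tokens[i]; i += 1
            nxt = GOTO.get((state, tok))
            if nxt is None:
                return False                  # empty entry -> syntax error
            stack += [tok, nxt]
        else:                                 # reduce by N -> alpha
            N, n = act
            del stack[len(stack) - 2 * n:]    # pop |alpha| (symbol, state) pairs
            if N == "Z":
                return True                   # reduced the start rule: accept
            top = stack[-1]
            stack += [N, GOTO[(top, N)]]

print(lr0_parse(list("i+i") + ["$"]))   # True
print(lr0_parse(list("i+)") + ["$"]))   # False

The run on i+i$ performs the same sequence of shifts and reductions as the id+id$ trace in the following slides.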
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

Parsing id+id$   (action/goto table as in the earlier “LR parser table example” slide)

Stack   Input       Action
0       id + id $   s5

Initialize with state 0.
116
Parsing id+id$ (continued; same grammar and table as above)

Stack    Input       Action
0        id + id $   s5
0 id 5   + id $      r4
118
Parsing id+id$ (continued; same grammar and table as above)

Stack    Input       Action
0        id + id $   s5
0 id 5   + id $      r4

Reduce by rule 4 (T → id): pop id 5 from the stack.
119
Parsing id+id$ (continued; same grammar and table as above)

Stack    Input       Action
0        id + id $   s5
0 id 5   + id $      r4

Reduce by rule 4 (T → id): after popping, push T and then goto[0,T] = 6.
120
Parsing id+id$ (continued; same grammar and table as above)

Stack    Input       Action
0        id + id $   s5
0 id 5   + id $      r4
0 T 6    + id $      r2
121
Parsing id+id$ (continued; same grammar and table as above)

Stack    Input       Action
0        id + id $   s5
0 id 5   + id $      r4
0 T 6    + id $      r2
0 E 1    + id $      s3
122
Parsing id+id$ (continued; same grammar and table as above)

Stack       Input       Action
0           id + id $   s5
0 id 5      + id $      r4
0 T 6       + id $      r2
0 E 1       + id $      s3
0 E 1 + 3   id $        s5
123
Parsing id+id$ (continued; same grammar and table as above)

Stack            Input       Action
0                id + id $   s5
0 id 5           + id $      r4
0 T 6            + id $      r2
0 E 1            + id $      s3
0 E 1 + 3        id $        s5
0 E 1 + 3 id 5   $           r4
124
Parsing id+id$ (continued; same grammar and table as above)

Stack            Input       Action
0                id + id $   s5
0 id 5           + id $      r4
0 T 6            + id $      r2
0 E 1            + id $      s3
0 E 1 + 3        id $        s5
0 E 1 + 3 id 5   $           r4
0 E 1 + 3 T 4    $           r3
125
Parsing id+id$ (continued; same grammar and table as above)

Stack            Input       Action
0                id + id $   s5
0 id 5           + id $      r4
0 T 6            + id $      r2
0 E 1            + id $      s3
0 E 1 + 3        id $        s5
0 E 1 + 3 id 5   $           r4
0 E 1 + 3 T 4    $           r3
0 E 1            $           s2
126
Constructing an LR parsing table
• Construct a (determinized) transition
diagram from LR items
• If there are conflicts – stop
• Fill table entries from diagram
127
LR item
N → α • β
α – already matched;  β – still to be matched against the remaining input
Hypothesis: αβ may be a possible handle; so far we’ve matched α, expecting to see β
128
Types of LR(0) items
N → α • β    (the dot is not at the end)   – shift item
N → αβ •                                   – reduce item
129
LR(0) automaton example
[Figure: the LR(0) automaton — shift states contain an item with the dot before a symbol; reduce states consist of a single item with the dot at the end]

q0: Z → •E$,  E → •T,  E → •E+T,  T → •i,  T → •(E)      (E → q1, T → q6, i → q5, ( → q7)
q1: Z → E•$,  E → E•+T                                    ($ → q2, + → q3)
q2: Z → E$•                                               (reduce state)
q3: E → E+•T,  T → •i,  T → •(E)                          (T → q4, i → q5, ( → q7)
q4: E → E+T•                                              (reduce state)
q5: T → i•                                                (reduce state)
q6: E → T•                                                (reduce state)
q7: T → (•E),  E → •T,  E → •E+T,  T → •i,  T → •(E)      (E → q8, T → q6, i → q5, ( → q7)
q8: T → (E•),  E → E•+T                                   () → q9, + → q3)
q9: T → (E)•                                              (reduce state)
130
Computing item sets
• Initial set
– Z is the start symbol
– ε-closure({ Z → • α | Z → α is in the grammar })
• Next set from a set S and the next symbol X
– step(S, X) = { N → αX•β | N → α•Xβ is in the item set S }
– nextSet(S, X) = ε-closure(step(S, X))
131
Operations for transition
diagram construction
• Initial = { S’ → • S $ }
• For an item set I:
Closure(I) = Closure(I) ∪ { X → • µ | X → µ is in the grammar, N → α • X β is in Closure(I) }
• Goto(I, X) = { N → α X • β | N → α • X β is in I }
132
Initial example
• Initial = { S → • E $ }
Grammar
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )
133
Closure example
• Initial = { S → • E $ }
• Closure({ S → • E $ }) = {
    S → • E $
    E → • T
    E → • E + T
    T → • id
    T → • ( E ) }
Grammar
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )
134
Goto example
Grammar
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )
• Initial = { S → • E $ }
• Closure({ S → • E $ }) = {
    S → • E $
    E → • T
    E → • E + T
    T → • id
    T → • ( E ) }
• Goto({ S → • E $, E → • E + T, T → • id }, E) = { S → E • $, E → E • + T }
135
Constructing the transition diagram
• Start with state 0 containing item
Closure({ S → • E $ })
• Repeat until no new states are discovered
– For every state p containing item set Ip, and
symbol N, compute state q containing item set
Iq = Closure(goto(Ip, N))
136
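A sketch of this construction in Python, reusing the closure() function and GRAMMAR encoding from the ε-closure sketch earlier; names are illustrative.

# Sketch of the transition-diagram construction (canonical LR(0)
# collection) via a worklist; assumes closure() and GRAMMAR from above.
def goto_set(items, X):
    """Advance the dot over X in every item that has the dot before X."""
    moved = {(lhs, rhs, dot + 1)
             for (lhs, rhs, dot) in items
             if dot < len(rhs) and rhs[dot] == X}
    return closure(moved)

def build_automaton(start_item):
    q0 = closure({start_item})
    states, worklist, edges = {q0}, [q0], {}
    while worklist:
        I = worklist.pop()
        symbols = {rhs[dot] for (_, rhs, dot) in I if dot < len(rhs)}
        for X in symbols:
            J = goto_set(I, X)
            edges[(I, X)] = J
            if J not in states:            # newly discovered state
                states.add(J)
                worklist.append(J)
    return states, edges

states, edges = build_automaton(("Z", ("E", "$"), 0))
print(len(states))   # 10 states (q0..q9) for this grammar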
LR(0) automaton example
[Figure: the LR(0) automaton for the grammar, repeated — the same states q0–q9 and transitions as shown earlier]
137
Automaton construction example
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

Initialize:  q0 = { S → • E $ }
138
Automaton construction example
(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

Apply Closure:
q0 = Closure({ S → • E $ }) = { S → • E $,  E → • T,  E → • E + T,  T → • id,  T → • ( E ) }
139
Automaton construction example
(1) S → E $ … (5) T → ( E )   (same grammar)

From q0, add successor states:
  on T:  { E → T • }                                                   (q6)
  on id: { T → id • }                                                  (q5)
  on (:  { T → ( • E ),  E → • T,  E → • E + T,  T → • id,  T → • ( E ) }
  on E:  { S → E • $,  E → E • + T }                                   (q1)
140
Automaton construction example
(1) S → E $ … (5) T → ( E )   (same grammar)

[Figure: the complete LR(0) automaton (states q0–q9) as shown earlier]

• A non-terminal transition corresponds to a goto action in the parse table
• A terminal transition corresponds to a shift action in the parse table
• A state consisting of a single reduce item corresponds to a reduce action
141
Are we done?
• Can make a transition diagram for any
grammar
• Can make a GOTO table for every grammar
• Cannot make a deterministic ACTION table
for every grammar
142
LR(0) conflicts
Z → E$
E → T
E → E + T
T → i
T → ( E )
T → i [ E ]

[Figure: part of the LR(0) automaton for this grammar; the state reached from q0 on i is]
q5:  T → i •
     T → i • [ E ]
Shift/reduce conflict
143
LR(0) conflicts
Z → E$
E → T
E → E + T
T → i
V → i
T → ( E )

[Figure: part of the LR(0) automaton for this grammar; the state reached from q0 on i is]
q5:  T → i •
     V → i •
Reduce/reduce conflict
144
LR(0) conflicts
• Any grammar with an ε-rule cannot be LR(0)
• Inherent shift/reduce conflict
– A → •  – reduce item
– P → α • A β – shift item
– A → ε can always be predicted from P → α • A β
145
Conflicts
• Can construct a diagram for every grammar
but some may introduce conflicts
• shift-reduce conflict: an item set contains
at least one shift item and one reduce item
• reduce-reduce conflict: an item set
contains two reduce items
146
LR variants
• LR(0) – what we’ve seen so far
• SLR
– Removes infeasible reduce actions via FOLLOW-set reasoning
• LR(1)
– LR(0) with one lookahead token in the items
• LALR(1)
– LR(1) with merging of states that have the same LR(0) component
147
LR (0) GOTO/ACTIONS tables
The GOTO table is indexed by a state and a grammar symbol from the stack.
The ACTION table is determined only by the state – it ignores the input.

State   i     +     (     )     $     E     T        action
q0      q5          q7                q1    q6       shift
q1            q3                q2                   shift
q2                                                   Z → E$
q3      q5          q7                      q4       shift
q4                                                   E → E+T
q5                                                   T → i
q6                                                   E → T
q7      q5          q7                q8    q6       shift
q8            q3          q9                         shift
q9                                                   T → (E)
148
SLR parsing
• A handle should not be reduced to a non-terminal N if the
lookahead is a token that cannot follow N
• A reduce item N → α • is applicable only when the lookahead is in FOLLOW(N)
– If b is not in FOLLOW(N) we just proved there is no derivation S ⇒* βNb
– Thus, it is safe to remove the reduce item from the conflicted
state
• Differs from LR(0) only on the ACTION table
– Now a row in the parsing table may contain both shift actions and
reduce actions and we need to consult the current token to
decide which one to take
149
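A minimal sketch of how one SLR ACTION row could be filled, assuming FOLLOW sets computed as sketched earlier and the (lhs, rhs, dot) item encoding used above; the function name and encoding are assumptions.

# Sketch: fill the ACTION row of one LR(0) state, SLR-style — a reduce
# entry is added only under lookaheads in FOLLOW(lhs).
def slr_action_row(state_items, FOLLOW, terminals):
    row = {}
    for (lhs, rhs, dot) in state_items:
        if dot < len(rhs) and rhs[dot] in terminals:        # shift item
            t = rhs[dot]
            if row.get(t, ("shift",)) != ("shift",):
                raise ValueError(f"SLR conflict on lookahead {t!r}")
            row[t] = ("shift",)
        elif dot == len(rhs):                               # reduce item
            for t in FOLLOW[lhs]:                           # only on FOLLOW(lhs)
                if t in row:
                    raise ValueError(f"SLR conflict on lookahead {t!r}")
                row[t] = ("reduce", lhs, rhs)
    return row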
Lookahead token from the input

SLR action table

[Table: states 0–9 × tokens i + ( ) [ ] $. Shift entries and the accept entry depend on the lookahead; a reduce entry such as E → E+T, T → i, E → T or T → (E) appears only under the tokens in FOLLOW of the reduced non-terminal. In particular, in the state containing T → i • and T → i • [ E ], SLR shifts on ‘[’ and reduces by T → i on the other applicable lookaheads.]

SLR – uses 1 token of look-ahead        vs.        LR(0) – no look-ahead (the ACTION column depends only on the state, as in the earlier LR(0) table)
150
LR(1) grammars
• In SLR: a reduce item N → α • is applicable only when the lookahead is in FOLLOW(N)
• But FOLLOW(N) merges lookahead for all
alternatives for N
– Insensitive to the context of a given production
• LR(1) keeps lookahead with each LR item
• Idea: a more refined notion of follows
computed per item
151
LR(1) items
• LR(1) item is a pair
– LR(0) item
– Lookahead token
• Meaning
– We matched the part left of the dot, looking to match the part on
the right of the dot, followed by the lookahead token
• Example
– The production L → id yields the following items:

Grammar:
(0) S’ → S
(1) S → L = R
(2) S → R
(3) L → * R
(4) L → id
(5) R → L

LR(0) items:      LR(1) items:
[L → ● id]        [L → ● id, *]   [L → ● id, =]   [L → ● id, id]   [L → ● id, $]
[L → id ●]        [L → id ●, *]   [L → id ●, =]   [L → id ●, id]   [L → id ●, $]
152
LALR(1)
• LR(1) tables have huge number of entries
• Often don’t need such refined observation (and
cost)
• Idea: find states with the same LR(0) component
and merge their lookaheads component as long
as there are no conflicts
• LALR(1) not as powerful as LR(1) in theory but
works quite well in practice
– Merging cannot introduce new shift-reduce conflicts, only
reduce-reduce conflicts, which are unlikely in practice
153
Summary
154
LR is More Powerful than LL
• Any LL(k) language is also in LR(k), i.e., LL(k) ⊂ LR(k).
– LR is more popular in automatic tools
• But less intuitive
• Also, the lookahead is counted differently in the two cases
– In an LL(k) derivation the algorithm sees the left-hand side of the
rule + k input tokens and then must select the derivation rule
– In LR(k), the algorithm “sees” the entire right-hand side of the derivation
rule + k input tokens and then reduces
• LR(0) sees the entire right-hand side, but no input token
155
Grammar Hierarchy
[Figure: containment of grammar classes — LR(0) ⊆ SLR(1) ⊆ LALR(1) ⊆ LR(1) ⊆ non-ambiguous CFG, with LL(1) also contained in LR(1)]
156
Earley Parsing
Jay Earley, PhD
157
Earley Parsing
• Invented by Jay Earley [PhD. 1968]
• Handles arbitrary context free grammars
– Can handle ambiguous grammars
• Complexity O(N³) where N = |input|
• Uses dynamic programming
– Compactly encodes ambiguity
158
Dynamic programming
• Break a problem P into subproblems P1…Pk
– Solve P by combining solutions for P1…Pk
– Memo-ize (store) solutions to subproblems
instead of re-computation
• Bellman-Ford shortest path algorithm
– Sol(x,y,i) = minimum of
• Sol(x,y,i-1)
• Sol(t,y,i-1) + weight(x,t) for edges (x,t)
159
Earley Parsing
• Dynamic programming implementation of
a recursive descent parser
– S[N+1] Sequence of sets of “Earley states”
• N = |INPUT|
• Earley state (item) s is a sentential form + aux info
– S[i]: all parse trees that can be produced (by a
RDP) after reading the first i tokens
• S[i+1] built using S[0] … S[i]
160
Earley Parsing
• Parse arbitrary grammars in O(|input|3)
– O(|input|²) for unambiguous grammars
– Linear for most LR(k) languages (next lesson)
• Dynamic programming implementation of a
recursive descent parser
– S[N+1] Sequence of sets of “Earley states”
• N = |INPUT|
• Earley states is a sentential form + aux info
– S[i]: all parse trees that can be produced (by an RDP)
after reading the first i tokens
• S[i+1] built using S[0] … S[i]
161
Earley States
• s = < constituent, back >
– constituent: a dotted rule for A → αβ
  A → •αβ    predicted constituents
  A → α•β    in-progress constituents
  A → αβ•    completed constituents
– back: the previous Earley state in the derivation
162
Earley Parser
Input = x[1…N]
S[0] = { <E’ → •E, 0> };  S[1] = … = S[N] = {}
for i = 0 ... N do
  until S[i] does not change do
    foreach s ∈ S[i]
      if s = <A → …•a…, b> and a = x[i+1] then          // scan
        S[i+1] = S[i+1] ∪ { <A → …a•…, b> }
      if s = <A → …•X…, b> and X → α then               // predict
        S[i] = S[i] ∪ { <X → •α, i> }
      if s = <A → …•, b> and <X → …•A…, k> ∈ S[b] then  // complete
        S[i] = S[i] ∪ { <X → …A•…, k> }
164
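A runnable Python sketch of this recognizer for the grammar E → E + E | n used in the figure below; the state encoding (lhs, rhs, dot, back), the added start symbol S and the names are assumptions for illustration.

# Earley recognizer sketch corresponding to the pseudocode above.
GRAMMAR = {"S": [("E",)],            # S is an added start symbol (like E' above)
           "E": [("E", "+", "E"), ("n",)]}

def earley_recognize(tokens, start="S"):
    n = len(tokens)
    # a state is (lhs, rhs, dot, back); sets are kept in insertion order
    S = [[] for _ in range(n + 1)]
    for rhs in GRAMMAR[start]:
        S[0].append((start, rhs, 0, 0))
    for i in range(n + 1):
        j = 0
        while j < len(S[i]):                       # S[i] may grow while scanned
            lhs, rhs, dot, back = S[i][j]
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in GRAMMAR:                 # predict
                    for prod in GRAMMAR[sym]:
                        st = (sym, prod, 0, i)
                        if st not in S[i]:
                            S[i].append(st)
                elif i < n and tokens[i] == sym:   # scan
                    st = (lhs, rhs, dot + 1, back)
                    if st not in S[i + 1]:
                        S[i + 1].append(st)
            else:                                  # complete
                for (l2, r2, d2, b2) in S[back]:
                    if d2 < len(r2) and r2[d2] == lhs:
                        st = (l2, r2, d2 + 1, b2)
                        if st not in S[i]:
                            S[i].append(st)
            j += 1
    return any(st == (start, rhs, len(rhs), 0)
               for rhs in GRAMMAR[start] for st in S[n])

print(earley_recognize(["n", "+", "n"]))   # True
print(earley_recognize(["n", "+"]))        # False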
Example
Grammar: E → E + E | n        Input: n + n

S0:  S → •E        ,0
     E → •E + E    ,0
     E → •n        ,0
        — n —
S1:  E → n•        ,0
     S → E•        ,0
     E → E • +E    ,0
        — + —
S2:  E → E + •E    ,0
     E → •E + E    ,2
     E → •n        ,2
        — n —
S3:  E → n•        ,2
     E → E + E•    ,0
     E → E • +E    ,2
     S → E•        ,0

Figure (from “Practical Earley Parsing”): Earley sets for the grammar E → E + E | n and the input n + n. Items in bold are the ones which correspond to the input’s derivation.
165
166