ITS 015: Compiler Construction

advertisement
Compiler Construction
Parsing Part I
제4주 Parsing
1
What We Did Last Time

The cycle in lexical analysis





RE
NFA
DFA
DFA
→
→
→
→
NFA
DFA
Minimal DFA
RE
Engineering issues in building scanners
2
Today’s Goals

Parsing Part I





Context-free grammars
Sentence derivations
Grammar ambiguity
Left recursion problem with top-down parsing and
how to fix it
Predictive top-down parsing


LL(1) condition
Recursive descent parsing
3
Compilers
4
The Front End
Parser
Checks the stream of words and their parts of speech
(produced by the scanner) for grammatical correctness
Determines if the input is syntactically well formed
Guides checking at deeper levels than syntax
Builds an IR representation of the code
Think of this as the mathematics of diagramming sentences
5
The Study of Parsing (syntax analysis)
The process of discovering a derivation for some sentence
 Need a mathematical model of syntax — a grammar G
 Need an algorithm for testing membership in L(G)
 Need to keep in mind that our goal is building parsers, not studying the
mathematics of arbitrary languages
Roadmap
1. Context-free grammars and derivations
2. Top-down parsing


3.
Hand-coded recursive descent parsers
LL(1) parsers
LL(1) parsed top-down, left to right scan, leftmost derivation, 1 symbol
lookahead
Bottom-up parsing


Operator precedence parsing
LR(1) parsers
LR(1) parsed bottom-up, left to right scan, reverse rightmost derivation, 1
symbol lookahead
6
Syntax analysis

Every PL has rules for syntactic structure.

The rules are normally specified by a CFG (ContextFree Grammar) or BNF (Backus-Naur Form)

Usually, we can automatically construct an efficient
parser from a CFG or BNF.

Grammars also allow SYNTAX-DIRECTED
TRANSLATION.
7
Specifying Syntax with a Grammar
Context-free syntax is specified with a context-free grammar
SheepNoise → SheepNoise baa
| baa
This CFG defines the set of noises sheep normally make
It is written in a variant of Backus–Naur form
Formally, a grammar is a four tuple, G = (S,N,T,P)
 S is the start symbol
(set of strings in L(G))
 N is a set of non-terminal symbols
(syntactic variables)
 T is a set of terminal symbols
(words)
 P is a set of productions or rewrite rules (P :N →(N ∪T)+)
8
The Big Picture
Chomsky Hierarchy of Language Grammars (1956)
9
Deriving Syntax
We can use the SheepNoise grammar to create sentences

use the productions as rewriting rules
While it is cute, this example quickly runs out of intellectual steam ...
10
A More Useful Grammar
To explore the uses of CFGs, we need a more complex
grammar


Such a sequence of rewrites is called a derivation
Process of discovering a derivation is called parsing
We denote this derivation: Expr ⇒* id – num * id
11
Derivations


At each step, we choose a non-terminal to replace
Different choices can lead to different derivations
Two derivations are of interest
 Leftmost derivation — replace leftmost NT at each step
 Rightmost derivation — replace rightmost NT at each step
These are the two systematic derivations
(We don’t care about randomly-ordered derivations!)
The example on the preceding slide was a leftmost derivation

Of course, there is also a rightmost derivation

Interestingly, it turns out to be different
12
The Two Derivations for x – 2 * y
Leftmost derivation
Rightmost derivation
In both cases, Expr ⇒* id – num * id
 The two derivations produce different parse trees


Actually, each of two different derivations produces both parse trees as the grammar
itself is ambiguous
The parse trees imply different evaluation orders!
13
Derivations and Parse Trees
Leftmost derivation
This evaluates as x – ( 2 * y )
14
Derivations and Parse Trees
Rightmost derivation
This evaluates as ( x – 2 ) * y
15
Ambiguity

Definitions




If a grammar has more than one leftmost derivation for a
single sentential form, the grammar is ambiguous
If a grammar has more than one rightmost derivation for a
single sentential form, the grammar is ambiguous
The leftmost and rightmost derivations for a sentential form
may differ, even in an unambiguous grammar
Examples


Associativity and precedence
Dangling else
16
Ambiguous Grammars



This grammar allows multiple leftmost derivations for x - 2 * y
Hard to automate derivation if > 1 choice
The grammar is ambiguous
different choice
than the first time
17
Two Leftmost Derivations for x – 2 * y
The Difference:
 Different productions chosen on the second step
Original choice

New choice
Both derivations succeed in producing x - 2 * y
18
Derivations and Precedence/Association
These two derivations point out a problem with the grammar:
It has no notion of precedence, or implied order of evaluation
To add precedence

Create a non-terminal for each level of precedence

Isolate the corresponding part of the grammar

Force the parser to recognize high precedence subexpressions first
For algebraic expressions

Multiplication and division, first (level one)

Subtraction and addition, next (level two)
To add association

On same precedence

Left-associative : The next-level (higher) nonterminal places at the last of a
production
19
Derivations and Precedence
Adding the standard algebraic precedence produces:
20
Derivations and Precedence
This produces x – ( 2 * y ), along with an appropriate parse tree.
Both the leftmost and rightmost derivations give the same expression,
because the grammar directly encodes the desired precedence.
21
Ambiguous Grammars by dangling else
Classic example — the if-then-else problem
Stmt → if Expr then Stmt
| if Expr then Stmt else Stmt
| … other stmts …
This ambiguity is entirely grammatical in nature
22
Ambiguity
This sentential form has two derivations
if Expr1 then if Expr2 then Stmt1 else Stmt2
production 2, then
production 1
production 1, then
production 2
23
Ambiguity
Removing the ambiguity

Must rewrite the grammar to avoid generating the problem

Match each else to innermost unmatched if (common sense rule)
Intuition: a NoElse always has no else on its last
cascaded else if statement
With this grammar, the example has only one derivation
24
Ambiguity
if Expr1 then if Expr2 then Stmt1 else Stmt2
This binds the else controlling S2 to the inner if
25
Deeper Ambiguity
Ambiguity usually refers to confusion in the CFG
Overloading can create deeper ambiguity
a = f(17)
In many Algol-like languages, f could be either a function or a
subscripted variable
Disambiguating this one requires context
 Need values of declarations
 Really an issue of type, not context-free syntax
 Requires an extra-grammatical solution (not in CFG)
 Must handle these with a different mechanism

Step outside grammar rather than use a more complex grammar
26
Ambiguity - The Final Word
Ambiguity arises from two distinct sources
 Confusion in the context-free syntax
(if-then-else)
 Confusion that requires context to resolve (overloading)
Resolving ambiguity
 To remove context-free ambiguity, rewrite the grammar
 To handle context-sensitive ambiguity takes cooperation



Knowledge of declarations, types, …
Accept a superset of L(G) & check it by other means†
This is a language design problem
Sometimes, the compiler writer accepts an ambiguous grammar


Parsing techniques that “do the right thing”
i.e., always select the same derivation
27
Parsing Techniques
Top-down parsers
(LL(1), recursive descent)
 Start at the root of the parse tree and grow toward leaves
 Pick a production & try to match the input
 Bad “pick” ⇒ may need to backtrack
 Some grammars are backtrack-free
(predictive parsing)
Bottom-up parsers
(LR(1), operator precedence)
 Start at the leaves and grow toward root
 As input is consumed, encode possibilities in an internal state
 Start in a state valid for legal first tokens
 Bottom-up parsers handle a large class of grammars
28
Top-down Parsing
A top-down parser starts with the root of the parse tree
The root node is labeled with the goal symbol of the
grammar

Top-down parsing algorithm:
Construct the root node of the parse tree
Repeat until the fringe of the parse tree matches the input string
1. At a node labeled A, select a production with A on its lhs and, for
each symbol on its rhs, construct the appropriate child
2. When a terminal symbol is added to the fringe and it doesn’t match
the fringe, backtrack
3. Find the next node to be expanded
(label ∈ NT)

The key is picking the right production in step 1

That choice should be guided by the input string
29
The Expression Grammar
Version with precedence derived last lecture
And the input x – 2 * y
30
Example
Let’s try x – 2 * y :
Leftmost derivation, choose productions in an order that exposes problems
31
Example
Let’s try x – 2 * y :
This worked well, except that “–” doesn’t match “+”
The parser must backtrack to here
32
Example
Continuing with x – 2 * y :
This time, “–”
and “–” matched
We can advance past
“–” to look at “2”
⇒ Now, we need to expand Term - the last NT on the fringe
33
Example
Trying to match the “2” in x – 2 * y :
Where are we?
 “2” matches “2”
 We have more input, but no NTs left to expand
 The expansion terminated too soon
⇒ Need to backtrack
34
Example
Trying again with “2” in x – 2 * y :
This time, we matched & consumed all the input
⇒ Success!
35
Another Possible Parse
Other choices for expansion are possible
This doesn’t terminate
(obviously)

Wrong choice of expansion leads to non-termination

Non-termination is a bad property for a parser to have

Parser must make the right choice
36
Left Recursion
Top-down parsers cannot handle left-recursive grammars
Formally,
A grammar is left recursive if ∃ A ∈ NT such that
∃ a derivation A ⇒+ Aα, for some string α ∈ (NT ∪ T )+
Our expression grammar is left recursive

This can lead to non-termination in a top-down parser

For a top-down parser, any recursion must be right recursion

We would like to convert the left recursion to right recursion
Non-termination is a bad property in any part of a compiler
37
Eliminating Left Recursion
To remove left recursion, we can transform the grammar
Consider a grammar fragment of the form
Fee → Fee α
|β
where neither α nor β start with Fee
We can rewrite this as
Fee → β Fie
Fie → α Fie
|ε
where Fie is a new non-terminal
This accepts the same language, but uses only right recursion
38
Eliminating Left Recursion
The expression grammar contains two cases of left recursion
Expr → Expr + Term
| Expr – Term
| Term
Term → Term * Factor
| Term / Factor
| Factor
Applying the transformation yields
Expr → Term Expr′
Expr′ | + Term Expr′
| – Term Expr′
|ε
Term → Factor Term′
Term′ | * Factor Term′
| / Factor Term′
|ε
These fragments use only right recursion
They retain the original left associativity
39
Eliminating Left Recursion
Substituting them back into the grammar yields
• This grammar is correct,
if somewhat non-intuitive.
• It is left associative, as was
the original
• A top-down parser will
terminate using it.
• A top-down parser may
need to backtrack with it.
40
Eliminating Left Recursion
The transformation eliminates immediate left recursion
What about more general, indirect left recursion ?
The general algorithm:
arrange the NTs into some order A1, A2, …, An
Must start with 1 to ensure that A1 →
for i ← 1 to n
A1 β is transformed
for s ← 1 to i – 1
replace each production Ai → Asγ with Ai→ δ1γ |δ2γ|…|δkγ,
where As→ δ1|δ2|…|δk are all the current productions for As
eliminate any immediate left recursion on Ai using the direct
transformation
This assumes that the initial grammar has no cycles
(Ai ⇒+ Ai ), and no epsilon productions
And back
41
Eliminating Left Recursion
How does this algorithm work?
1. Impose arbitrary order on the non-terminals
2. Outer loop cycles through NT in order
3. Inner loop ensures that a production expanding Ai has no nonterminal As in its rhs, for s < i
4. Last step in outer loop converts any direct recursion on Ai to
right recursion using the transformation showed earlier
5. New non-terminals are added at the end of the order & have no
left recursion
At the start of the ith outer loop iteration
For all k < i, no production that expands Ak contains a non-terminal
As in its rhs, for s < k
42
Example

Order of symbols: G, E, T
1. Ai = G
G →E
E→E+T
E→T
T→E~T
T → id
2. Ai = E
3. Ai = T, As = E
4. Ai = T
G →E
E → T E'
E' → + T E'
E' → ε
T→E~T
T → id
G →E
E → T E'
E' → + T E'
E' → ε
T → T E' ~ T
T → id
G →E
E → T E'
E' → + T E'
E' → ε
T → id T'
T' →E' ~ T T'
T' → ε
Go to
Algorithm
43
Roadmap (Where are We?)
We set out to study parsing

Specifying syntax



Top-down parsers



Context-free grammars
Ambiguity
Algorithm & its problem with left recursion
Left-recursion removal
Predictive top-down parsing


The LL(1) condition
Simple recursive descent parsers
44
Picking the “Right” Production
If it picks the wrong production, a top-down parser may backtrack
Alternative is to look ahead in input & use context to pick correctly
How much lookahead is needed?


In general, an arbitrarily large amount
Use the Cocke-Younger, Kasami algorithm or Earley’s algorithm
Fortunately,


Large subclasses of CFGs can be parsed with limited lookahead
Most programming language constructs fall in those subclasses
Among the interesting subclasses are LL(1) and LR(1) grammars
45
Predictive Parsing
Basic idea
Given A → α | β, the parser should be able to choose between α & β
FIRST sets
For some rhs α∈G, define FIRST(α) as the set of tokens that appear as
the first symbol in some string that derives from α
That is, x ∈ FIRST(α) iff α ⇒* x γ, for some γ
We will defer the problem of how to compute FIRST sets until we look
at the LR(1) table construction algorithm
46
Predictive Parsing
Basic idea
Given A → α | β, the parser should be able to choose between α & β
FIRST sets
For some rhs α∈G, define FIRST(α) as the set of tokens that appear
as the first symbol in some string that derives from α
That is, x ∈ FIRST(α) iff α ⇒* x γ, for some γ
The LL(1) Property
If A → α and A → β both appear in the grammar, we would like
FIRST(α) ∩ FIRST(β) = ∅
This would allow the parser to make a correct choice with a
lookahead of exactly one symbol !
This is almost correct
See the next slide
47
Predictive Parsing
What about ε-productions?
⇒ They complicate the definition of LL(1)
If A → α and A → β and ε ∈ FIRST(α), then we need to ensure
that FIRST(β) is disjoint from FOLLOW(α), too
Define FIRST+(α) as

FIRST(α) ∪ FOLLOW(α), if ε ∈ FIRST(α)

FIRST(α), otherwise
Then, a grammar is LL(1) iff A → α and A → β implies
FIRST+(α) ∩ FIRST+(β) = ∅
FOLLOW(α) is the set of
all words in the grammar
that can legally appear
immediately after an α
48
Predictive Parsing
Given a grammar that has the LL(1) property
 Can write a simple routine to recognize each lhs
 Code is both simple & fast
Consider A → β1 | β2 | β3, with
FIRST+(β1) ∩ FIRST+ (β2) ∩ FIRST+ (β3) = ∅
/* find an A */
if (current_word ∈ FIRST(β1))
find a β1 and return true
else if (current_word ∈ FIRST(β2))
find a β2 and return true
else if (current_word ∈ FIRST(β3))
find a β3 and return true
else
report an error and return false
Of course, there is more detail to
“find a βi” (§ 3.3.4 in EAC)
Grammars with the LL(1)
property are called predictive
grammars because the parser
can “predict” the correct
expansion at each point in the
parse.
Parsers that capitalize on the
LL(1) property are called
predictive parsers.
One kind of predictive parser
is the recursive descent
parser.
49
Recursive Descent Parsing
Recall the expression grammar, after transformation
This produces a parser with six
mutually recursive routines:
• Goal
• Expr
• EPrime
• Term
• TPrime
• Factor
Each recognizes one NT or T
The term descent refers to the
direction in which the parse tree
is built.
50
E:
0
E' :
3
T
+
1
4
E'
T
10
2
5
E'
10
6
(Grammar 4.11 )

T:
7
T' :
10
F
*
8
11
T'
F
10
9
12
T'
10
13

F:
14
(
15
E
16
)
E
E'
T
T'
F





TE'
+TE' | 
FT'
*FT' | 
(E) | id
10
17
id
Fig. 4.10. Transition diagrams for grammar (4.11).
51
(a)
(b)

E' :
3
+

4
T
5

0
T
3
+

(c)
3
+

10
6
T
E:
E' :
T
4
10
6
+
4

E:
0
T
3

10
6
10
6
(d)
Fig. 4.11. Simplified transition diagrams.
52
+
E:
0
T
3

10
6
*
T:
7
F:
14
F
(
8
15

E
10
13
16
)
10
17
id
Fig. 4.12. Simplified transition diagrams for arithmetic
expressions.
53
Recursive Descent Parsing
A couple of routines from the expression parser
54
Recursive Descent Parsing
To build a parse tree:
Expr( )
 Augment parsing routines to
result ←true;
build nodes
if (Term( ) = false)
 Pass nodes between routines
then return false;
using a stack
else if (EPrime( ) = false)
 Node for each symbol on rhs
then result ←false;
 Action is to pop rhs nodes,
else
make them children of lhs node,
build an Expr node
and push this subtree
pop EPrime node
To build an abstract syntax tree
 Build fewer nodes
 Put them together in a different
order
pop Term node
make EPrime & Term
children of Expr
push Expr node
return result;
Success ⇒ build a piece of the parse tree
55
Left Factoring
What if my grammar does not have the LL(1) property?
⇒ Sometimes, we can transform the grammar
The Algorithm
∀A ∈ NT,
find the longest prefix α that occurs in two
or more right-hand sides of A
if α ≠ ε then replace all of the A productions,
A → αβ1 | αβ2 | … | αβn | γ ,
with
A → αZ | γ
Z → β 1 | β2 | … | βn
where Z is a new element of NT
Repeat until no common prefixes remain
56
Left Factoring
A graphical explanation for the same idea
A → αβ1
| αβ2
| αβ3
becomes …
A→αZ
Z → β1
| β2
| βn
57
Left Factoring: An Example
Consider the following fragment of the expression grammar
Factor → Identifier
| Identifier [ ExprList ]
| Identifier ( ExprList )
FIRST(rhs1) = { Identifier }
FIRST(rhs2) = { Identifier }
FIRST(rhs3) = { Identifier }
After left factoring, it becomes
Factor
→ Identifier Arguments
Arguments → [ ExprList ]
| ( ExprList )
| ε
FIRST(rhs1) = { Identifier }
FIRST(rhs2) = { [ }
FIRST(rhs3) = { ( }
FIRST(rhs4) = FOLLOW(Factor)
⇒ It has the LL(1) property
This form has the same syntax, with the LL(1) property
58
Left Factoring
Graphically
Identifier
Factor
No basis for choice
Identifier
[
Identifier
]
Identifier
[
Identifier
]
Becomes …
Factor
ε
Identifier
Word determines
correct choice
[
ExprList
]
[
ExprList
]
59
Recursive Descent (Summary)
1.
2.
Build FIRST (and FOLLOW) sets
Massage grammar to have LL(1) condition
a.
b.
3.
Define a procedure for each non-terminal
a.
b.
4.
Remove left recursion
Left factor it
Implement a case for each right-hand side
Call procedures as needed for non-terminals
Add extra code, as needed
a.
b.
Perform context-sensitive checking
Build an IR to record the code
Can we automate this process?
60
Summary

Parsing Part I

Introduction to parsing
grammar, derivation, ambiguity, left recursion

Predictive top-down parsing


LL(1) condition
Recursive descent parsing
61
Next Class


Table-driven LL(1) parsing
Bottom-up parsing
62
Download