Introduction to Parsing
Ambiguity and Syntax Errors
Compiler Design 1 (2011)

Outline
• Regular languages revisited
• Parser overview
• Context-free grammars (CFGs)
• Derivations
• Ambiguity
• Syntax errors

Languages and Automata
• Formal languages are very important in CS
– Especially in programming languages
• Regular languages
– The weakest formal languages widely used
– Many applications
• We will also study context-free languages

Limitations of Regular Languages
Intuition: A finite automaton that runs long enough must repeat states
• A finite automaton cannot remember the number of times it has visited a particular state
• because a finite automaton has finite memory
– Only enough to store in which state it is
– Cannot count, except up to a finite limit
• Many languages are not regular
• E.g., the language of balanced parentheses is not regular: $\{\, (^i\,)^i \mid i \geq 0 \,\}$

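The counting limitation can be made concrete with a small sketch. The function name `balanced` is hypothetical; it recognizes the language above by keeping two unbounded integer counters, which is exactly the resource a finite automaton lacks.

```python
def balanced(s: str) -> bool:
    """Recognize { (^i )^i | i >= 0 } using unbounded counters."""
    i = 0
    opens = 0
    # Count the leading run of '('.
    while i < len(s) and s[i] == "(":
        opens += 1
        i += 1
    closes = 0
    # Count the run of ')' that follows it.
    while i < len(s) and s[i] == ")":
        closes += 1
        i += 1
    # Accept only if the whole input was consumed and the counts agree.
    return i == len(s) and opens == closes
```

A finite automaton with k states could track the count only up to some bound depending on k, so for a deep enough nesting it would have to confuse two different counts.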
The Functionality of the Parser
• Input: sequence of tokens from lexer
• Output: parse tree of the program

Example
• If-then-else statement
if (x == y) then z = 1; else z = 2;
• Parser input
IF (ID == ID) THEN ID = INT; ELSE ID = INT;
• Possible parser output
IF-THEN-ELSE
    ==
        ID
        ID
    =
        ID
        INT
    =
        ID
        INT

Comparison with Lexical Analysis
Phase     Input                     Output
Lexer     Sequence of characters    Sequence of tokens
Parser    Sequence of tokens        Parse tree

The Role of the Parser
• Not all sequences of tokens are programs . . .
• . . . Parser must distinguish between valid and invalid sequences of tokens
• We need
– A language for describing valid sequences of tokens
– A method for distinguishing valid from invalid sequences of tokens

Context-Free Grammars
• Many programming language constructs have a recursive structure
• A STMT is of the form
if COND then STMT else STMT , or
while COND do STMT , or
…
• Context-free grammars are a natural notation for this recursive structure

CFGs (Cont.)
• A CFG consists of
– A set of terminals T
– A set of non-terminals N
– A start symbol S (a non-terminal)
– A set of productions
Assuming X ∈ N the productions are of the form
X → ε , or
X → Y1 Y2 ... Yn where Yi ∈ N ∪ T

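The four components of a CFG map directly onto a small data structure. The following is a minimal sketch; the class name `Grammar`, the method `is_well_formed`, and the constant `PARENS` are illustrative names, not part of the lecture. An empty right-hand side stands for ε.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # T
    nonterminals: frozenset   # N
    start: str                # S, must be a member of N
    productions: tuple        # pairs (X, (Y1, ..., Yn)); () encodes X -> ε

    def is_well_formed(self) -> bool:
        """Check S ∈ N, T and N disjoint, X ∈ N and every Yi ∈ N ∪ T."""
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            and not (self.terminals & self.nonterminals)
            and all(
                lhs in self.nonterminals and all(y in symbols for y in rhs)
                for lhs, rhs in self.productions
            )
        )


# Grammar for balanced parentheses: S -> ( S ) | ε
PARENS = Grammar(
    terminals=frozenset({"(", ")"}),
    nonterminals=frozenset({"S"}),
    start="S",
    productions=(("S", ("(", "S", ")")), ("S", ())),
)
```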
Notational Conventions
• In these lecture notes
– Non-terminals are written upper-case
– Terminals are written lower-case
– The start symbol is the left-hand side of the first production

Examples of CFGs
A fragment of our example language (simplified):
STMT → if COND then STMT else STMT
     | while COND do STMT
     | id = int

Examples of CFGs (cont.)
Grammar for simple arithmetic expressions:
E → E * E
  | E + E
  | ( E )
  | id

The Language of a CFG
Read productions as replacement rules:
X → Y1 ... Yn
Means X can be replaced by Y1 ... Yn
X → ε
Means X can be erased (replaced with empty string)

Key Idea
(1) Begin with a string consisting of the start symbol “S”
(2) Replace any non-terminal X in the string by a right-hand side of some production $X \rightarrow Y_1 \cdots Y_n$
(3) Repeat (2) until there are no non-terminals in the string

The Language of a CFG (Cont.)
More formally, we write
$X_1 \cdots X_i \cdots X_n \rightarrow X_1 \cdots X_{i-1}\, Y_1 \cdots Y_m\, X_{i+1} \cdots X_n$
if there is a production
$X_i \rightarrow Y_1 \cdots Y_m$

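A single rewriting step can be sketched as a function on tuples of symbols. The name `step` and the pair encoding of a production are assumptions made for this illustration.

```python
def step(form, i, production):
    """Apply one derivation step: rewrite the symbol at index i.

    form is a tuple of symbols, production is a pair (X, rhs) where
    rhs is a tuple of symbols (the empty tuple encodes ε).
    """
    lhs, rhs = production
    if form[i] != lhs:
        raise ValueError(f"symbol at index {i} is {form[i]!r}, not {lhs!r}")
    # X1 ... Xi-1, then Y1 ... Ym, then Xi+1 ... Xn
    return form[:i] + tuple(rhs) + form[i + 1:]
```

For example, `step(("E", "+", "E"), 0, ("E", ("E", "*", "E")))` rewrites the first E and yields the form E * E + E.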
The Language of a CFG (Cont.)
Write
$X_1 \cdots X_n \rightarrow^{*} Y_1 \cdots Y_m$
if
$X_1 \cdots X_n \rightarrow \cdots \rightarrow \cdots \rightarrow Y_1 \cdots Y_m$
in 0 or more steps

The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
$\{\, a_1 \cdots a_n \mid S \rightarrow^{*} a_1 \cdots a_n \text{ and every } a_i \text{ is a terminal} \,\}$

Terminals
• Terminals are called so because there are no rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be tokens of the language

Examples
L(G) is the language of the CFG G
Strings of balanced parentheses $\{\, (^i\,)^i \mid i \geq 0 \,\}$
Two grammars:
S → (S)
S → ε
OR
S → (S) | ε

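The definition of L(G) can be explored by enumerating derivations. The sketch below is a bounded breadth-first search over left-most derivations; the function name `language` and its parameters are hypothetical. It assumes that repeatedly applying productions eventually adds terminals, which holds for the parentheses grammar but not for every CFG, so it is not a general-purpose algorithm.

```python
from collections import deque


def language(productions, start, nonterminals, max_len):
    """Terminal strings of length <= max_len derivable from start.

    Assumption: every cycle of productions adds at least one terminal,
    otherwise the search below would not terminate.
    """
    seen = {(start,)}
    queue = deque([(start,)])
    result = set()
    while queue:
        form = queue.popleft()
        # Index of the left-most non-terminal, or None if the form is all terminals.
        idx = next((k for k, sym in enumerate(form) if sym in nonterminals), None)
        if idx is None:
            result.add("".join(form))
            continue
        for lhs, rhs in productions:
            if lhs != form[idx]:
                continue
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            n_terminals = sum(1 for sym in new if sym not in nonterminals)
            # Prune forms that already contain too many terminals.
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return result


PAREN_PRODUCTIONS = [("S", ("(", "S", ")")), ("S", ())]
```

Restricting the search to left-most steps loses nothing, because every string in the language has a left-most derivation.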
Example
A fragment of our example language (simplified):
STMT → if COND then STMT
     | if COND then STMT else STMT
     | while COND do STMT
     | id = int
COND → (id == id)
     | (id != id)

Example (Cont.)
Some elements of our language
id = int
if (id == id) then id = int else id = int
while (id != id) do id = int
while (id == id) do while (id != id) do id = int
if (id != id) then if (id == id) then id = int else id = int

Arithmetic Example
Simple arithmetic expressions:
E → E+E | E ∗ E | (E) | id
Some elements of the language:
id            id + id
(id)          id ∗ id
(id) ∗ id     id ∗ (id)

Notes
The idea of a CFG is a big step. But:
• Membership in a language is just “yes” or “no”; we also need the parse tree of the input
• Must handle errors gracefully
• Need an implementation of CFGs (e.g., yacc)

More Notes
• Form of the grammar is important
– Many grammars generate the same language
– Parsing tools are sensitive to the grammar
Note: Tools for regular languages (e.g., lex/ML-Lex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

Derivations and Parse Trees
A derivation is a sequence of productions
$S \rightarrow \cdots \rightarrow \cdots \rightarrow \cdots$
A derivation can be drawn as a tree
– Start symbol is the tree’s root
– For a production $X \rightarrow Y_1 \cdots Y_n$ add children $Y_1 \cdots Y_n$ to node X

Derivation Example
• Grammar
E → E+E | E ∗ E | (E) | id
• String
id ∗ id + id

Derivation Example (Cont.)
E
→ E+E
→ E ∗ E+E
→ id ∗ E + E
→ id ∗ id + E
→ id ∗ id + id

Parse tree:
        E
      / | \
     E  +  E
   / | \   |
  E  *  E  id
  |     |
  id    id

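The parse tree of the example can be written down as a data structure. This is a sketch with made-up names (`Node`, `leaves`, and the helpers `E` and `tok`); it assumes a grammar without ε-productions, so a node without children is always a terminal.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str
    children: list = field(default_factory=list)


def leaves(node):
    """Collect the leaf symbols of the tree from left to right."""
    if not node.children:
        return [node.symbol]
    out = []
    for child in node.children:
        out.extend(leaves(child))
    return out


def E(*children):
    """Interior node for the non-terminal E."""
    return Node("E", list(children))


def tok(symbol):
    """Leaf node for a terminal."""
    return Node(symbol)


# Tree for id * id + id: the root uses E -> E + E, its left child uses E -> E * E.
TREE = E(E(E(tok("id")), tok("*"), E(tok("id"))), tok("+"), E(tok("id")))
```

Reading the leaves of `TREE` from left to right gives back the input string id ∗ id + id, while the nesting records that the multiplication is grouped first.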
Derivation in Detail (1) to (6)
Each step extends the derivation and the parse tree together. Trees are written with the children of a node in brackets.

(1) E
    Tree: E
(2) → E+E
    Tree: E[ E + E ]
(3) → E ∗ E+E
    Tree: E[ E[ E ∗ E ] + E ]
(4) → id ∗ E + E
    Tree: E[ E[ E[id] ∗ E ] + E ]
(5) → id ∗ id + E
    Tree: E[ E[ E[id] ∗ E[id] ] + E ]
(6) → id ∗ id + id
    Tree: E[ E[ E[id] ∗ E[id] ] + E[id] ]

Notes on Derivations
• A parse tree has
– Terminals at the leaves
– Non-terminals at the interior nodes
• An in-order traversal of the leaves is the original input
• The parse tree shows the association of operations, the input string does not

Left-most and Right-most Derivations
• The example is a left-most derivation
– At each step, replace the left-most non-terminal
• There is an equivalent notion of a right-most derivation
E
→ E+E
→ E+id
→ E ∗ E + id
→ E ∗ id + id
→ id ∗ id + id

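Both derivation orders can be replayed with one small function. This is a sketch under stated assumptions: `derive` and the constants `PLUS`, `TIMES`, `ID` are hypothetical names, and the function trusts the caller to pass right-hand sides that are valid productions for the selected non-terminal.

```python
def derive(start, choices, nonterminals, leftmost=True):
    """Replay a derivation and return every sentential form as a string.

    choices lists the right-hand side used at each step; the non-terminal
    that gets replaced is the left-most or the right-most one in the form.
    """
    form = (start,)
    forms = [form]
    for rhs in choices:
        positions = [k for k, sym in enumerate(form) if sym in nonterminals]
        idx = positions[0] if leftmost else positions[-1]
        form = form[:idx] + tuple(rhs) + form[idx + 1:]
        forms.append(form)
    return [" ".join(f) for f in forms]


PLUS, TIMES, ID = ("E", "+", "E"), ("E", "*", "E"), ("id",)

LEFT = derive("E", [PLUS, TIMES, ID, ID, ID], {"E"})
RIGHT = derive("E", [PLUS, ID, TIMES, ID, ID], {"E"}, leftmost=False)
```

`LEFT` and `RIGHT` pass through different intermediate forms (for instance `E * E + E` versus `E + id` at the third position) but end in the same string.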
Right-most Derivation in Detail (1) to (6)
Each step extends the derivation and the parse tree together. Trees are written with the children of a node in brackets.

(1) E
    Tree: E
(2) → E+E
    Tree: E[ E + E ]
(3) → E+id
    Tree: E[ E + E[id] ]
(4) → E ∗ E + id
    Tree: E[ E[ E ∗ E ] + E[id] ]
(5) → E ∗ id + id
    Tree: E[ E[ E ∗ E[id] ] + E[id] ]
(6) → id ∗ id + id
    Tree: E[ E[ E[id] ∗ E[id] ] + E[id] ]

Derivations and Parse Trees
• Note that right-most and left-most derivations have the same parse tree
• The difference is just in the order in which branches are added

Summary of Derivations
• We are not just interested in whether s ∈ L(G)
– We need a parse tree for s
• A derivation defines a parse tree
– But one parse tree may have many derivations
• Left-most and right-most derivations are important in parser implementation

Ambiguity
• Grammar
E → E + E | E * E | ( E ) | int
• String
int * int + int

Ambiguity (Cont.)
This string has two parse trees

Tree 1, grouping (int * int) + int:
        E
      / | \
     E  +  E
   / | \   |
  E  *  E  int
  |     |
 int   int

Tree 2, grouping int * (int + int):
      E
    / | \
   E  *  E
   |   / | \
  int E  +  E
      |     |
     int   int

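The claim that the string has exactly two parse trees can be checked by counting. The sketch below is specific to the grammar E → E + E | E * E | ( E ) | int; the function name `count_trees` is hypothetical, and the approach (memoized counting over token ranges) is one of several possible ways to do this.

```python
from functools import lru_cache


def count_trees(tokens):
    """Number of parse trees for tokens under E -> E+E | E*E | (E) | int."""
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways to derive toks[i:j] from E.
        n = 0
        if j - i == 1 and toks[i] == "int":
            n += 1                                  # E -> int
        if j - i >= 3 and toks[i] == "(" and toks[j - 1] == ")":
            n += count(i + 1, j - 1)                # E -> ( E )
        for k in range(i + 1, j - 1):
            if toks[k] in ("+", "*"):
                n += count(i, k) * count(k + 1, j)  # E -> E op E, split at k
        return n

    return count(0, len(toks))
```

A count of 0 means the string is not in the language, 1 means it has a single parse tree, and anything larger is a witness that the grammar is ambiguous.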
Ambiguity (Cont.)
• A grammar is ambiguous if it has more than one parse tree for some string
– Equivalently, there is more than one right-most or left-most derivation for some string
• Ambiguity is bad
– Leaves meaning of some programs ill-defined
• Ambiguity is common in programming languages
– Arithmetic expressions
– IF-THEN-ELSE

Dealing with Ambiguity
• There are several ways to handle ambiguity
• Most direct method is to rewrite grammar unambiguously
E → T + E | T
T → int * T | int | ( E )
• Enforces precedence of * over +

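The rewritten grammar can be parsed with one function per non-terminal. The following recursive-descent parser is a sketch, not the method used by parser generators; the names `parse`, `parse_E` and `parse_T` are illustrative, and the result is a nested tuple rather than a full parse tree.

```python
def parse(tokens):
    """Parse tokens with E -> T + E | T and T -> int * T | int | ( E )."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def parse_E():
        left = parse_T()
        if peek() == "+":          # E -> T + E
            expect("+")
            return ("+", left, parse_E())
        return left                # E -> T

    def parse_T():
        if peek() == "int":
            expect("int")
            if peek() == "*":      # T -> int * T
                expect("*")
                return ("*", "int", parse_T())
            return "int"           # T -> int
        expect("(")                # T -> ( E )
        inner = parse_E()
        expect(")")
        return inner

    tree = parse_E()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree
```

Because `*` is only reachable through T, a product is always grouped before the surrounding sum, whichever operator comes first in the input.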
Ambiguity: The Dangling Else
• Consider the following grammar
S → if C then S
  | if C then S else S
  | OTHER
• This grammar is also ambiguous

The Dangling Else: Example
• The expression
if C1 then if C2 then S3 else S4
has two parse trees (children of a node in brackets):
– First form, else attached to the outer if: if[ C1, if[ C2, S3 ], S4 ]
– Second form, else attached to the inner if: if[ C1, if[ C2, S3, S4 ] ]
• Typically we want the second form

The Dangling Else: A Fix
• else matches the closest unmatched then
• We can describe this in the grammar
S → MIF    /* all then are matched */
  | UIF    /* some then are unmatched */
MIF → if C then MIF else MIF
    | OTHER
UIF → if C then S
    | if C then MIF else UIF
• Describes the same set of strings

The Dangling Else: Example Revisited
• The expression if C1 then if C2 then S3 else S4
• A valid parse tree (for a UIF): if[ C1, if[ C2, S3, S4 ] ]
• Not valid because the then expression is not a MIF: if[ C1, if[ C2, S3 ], S4 ]

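The rule that else matches the closest unmatched then is also what a simple hand-written parser produces when it consumes an else as soon as it sees one. The sketch below uses that greedy technique rather than the MIF/UIF grammar itself; `parse_stmt` is a hypothetical name, and conditions and OTHER statements are assumed to be single tokens.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at pos; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1            # OTHER: a single token
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError(f"expected 'then' after condition at position {pos + 1}")
    cond = tokens[pos + 1]
    then_branch, pos = parse_stmt(tokens, pos + 3)
    # Greedy choice: an else seen here belongs to this, the innermost open if.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return ("if", cond, then_branch), pos
```

On the tokens of `if C1 then if C2 then S3 else S4` the inner call claims the else, so the result is the tree in which S4 belongs to the inner if.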
Ambiguity
• No general techniques for handling ambiguity
• Impossible to automatically convert an ambiguous grammar to an unambiguous one
• Used with care, ambiguity can simplify the grammar
– Sometimes allows more natural definitions
– We need disambiguation mechanisms

Precedence and Associativity Declarations
• Instead of rewriting the grammar
– Use the more natural (ambiguous) grammar
– Along with disambiguating declarations
• Most tools allow precedence and associativity declarations to disambiguate grammars
• Examples …

Associativity Declarations
• Consider the grammar
E → E + E | int
• Ambiguous: two parse trees of int + int + int
– One groups the string as (int + int) + int
– The other groups it as int + (int + int)
• Left associativity declaration: %left +

Precedence Declarations
• Consider the grammar E → E + E | E * E | int
– And the string int + int * int
– One parse tree groups it as (int + int) * int
– The other groups it as int + (int * int)
• Precedence declarations: %left +
                           %left *

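The effect of such declarations can be imitated in a hand-written parser with a precedence table. The sketch below uses precedence climbing, which is a different mechanism from the conflict resolution a yacc-style tool performs; `PREC` and `parse_expr` are hypothetical names. As in yacc, the operator declared later (`*`) is given the higher precedence, and both operators are treated as left associative.

```python
# Mirrors "%left +" followed by "%left *": later declarations bind tighter.
PREC = {"+": 1, "*": 2}


def parse_expr(tokens, pos=0, min_prec=1):
    """Parse int (op int)* starting at pos; return (tree, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != "int":
        raise SyntaxError(f"expected 'int' at position {pos}")
    left, pos = "int", pos + 1
    while (
        pos < len(tokens)
        and tokens[pos] in PREC
        and PREC[tokens[pos]] >= min_prec
    ):
        op = tokens[pos]
        # Left associativity: the right operand may only contain operators
        # that bind strictly tighter than op.
        right, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
        left = (op, left, right)
    return left, pos
```

With this table, `int + int + int` is grouped to the left and `int + int * int` keeps the product together under the sum.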
Error Handling
• Purpose of the compiler is
– To detect non-valid programs
– To translate the valid ones
• Many kinds of possible errors (e.g. in C)
Error kind     Example                   Detected by …
Lexical        …$…                       Lexer
Syntax         … x *% …                  Parser
Semantic       … int x; y = x(3); …      Type checker
Correctness    your favorite program     Tester/User

Syntax Error Handling
• Error handler should
– Report errors accurately and clearly
– Recover from an error quickly
– Not slow down compilation of valid code
• Good error handling is not easy to achieve

Approaches to Syntax Error Recovery
• From simple to complex
– Panic mode
– Error productions
– Automatic local or global correction
• Not all are supported by all parser generators

Error Recovery: Panic Mode
• Simplest, most popular method
• When an error is detected:
– Discard tokens until one with a clear role is found
– Continue from there
• Such tokens are called synchronizing tokens
– Typically the statement or expression terminators

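Panic mode can be sketched for a toy language whose only statement is `id = int ;`. Everything here is an assumption made for illustration: the function name `parse_program`, the statement form, and the choice of `;` as the synchronizing token. The token sequence must be a Python list.

```python
def parse_program(tokens, sync=";"):
    """Parse `id = int ;` statements; return (statement starts, error starts)."""
    parsed, errors, pos = [], [], 0
    while pos < len(tokens):
        if tokens[pos:pos + 4] == ["id", "=", "int", sync]:
            parsed.append(pos)        # a well-formed statement starts here
            pos += 4
            continue
        errors.append(pos)            # report the error once
        # Panic mode: discard tokens up to the next synchronizing token ...
        while pos < len(tokens) and tokens[pos] != sync:
            pos += 1
        pos += 1                      # ... and continue right after it
    return parsed, errors
```

A malformed statement produces one error entry, and parsing resumes at the statement after the next `;`, so later statements are still checked.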
Syntax Error Recovery: Panic Mode (Cont.)
• Consider the erroneous expression
(1 + + 2) + 3
• Panic-mode recovery:
– Skip ahead to next integer and then continue
• (ML)-Yacc: use the special terminal error to describe how much input to skip
E → int | E + E | ( E ) | error int | ( error )

Syntax Error Recovery: Error Productions
• Idea: specify in the grammar known common mistakes
• Essentially promotes common errors to alternative syntax
• Example:
– Write 5 x instead of 5 * x
– Add the production E → … | E E
• Disadvantage
– Complicates the grammar

Syntax Error Recovery: Past and Present
• Past
– Slow recompilation cycle (even once a day)
– Find as many errors in one cycle as possible
– Researchers could not let go of the topic
• Present
– Quick recompilation cycle
– Users tend to correct one error/cycle
– Complex error recovery is needed less
– Panic-mode seems enough