Context-Free Grammars

advertisement
CPSC 388 – Compiler Design
and Construction
Parsers – Context Free Grammars
Announcements
 HW3 due via Sakai
 Solution to HW3
 Homework: HW4 assigned today, due next
Friday, via Sakai
 Reading: 4.1-4.4
 Progress Report Grades due Sept 28th
Compilers
Lexical Analyzer
(Scanner)
Source
Code
Syntax Analyzer
(Parser)
Token
Stream
Abstract
Syntax
Tree
Symantic Analyzer
Intermediate Code
Generator
Optimizer
Code Generator
Scanner
Source Code:
position = initial + rate * 60 ;
Corresponding Tokens:
IDENT(position)
ASSIGN
IDENT(position)
PLUS
IDENT(rate)
TIMES
INT-LIT(60)
SEMI-COLON
Example Parse
Source Code:
position = initial + rate * 60 ;
Abstract-Syntax Tree:
position
=
+
•Interior nodes are operators.
•A node's children are operands.
initial
*
•Each subtree forms "logical unit"
e.g., the subtree with * at its root shows that
because multiplication has higher precedence
than addition, this operation must be performed
rate
as a unit (not initial+rate).
60
Limitations of RE and FSA
 Regular Expressions and Finite State
Automata cannot express all
languages
 For Example the language that
consists of all balanced parenthesis:
() and ((())) and (((((())))))
 Parsers can recognize more types of
languages than RE or FSA
Parsers
 Input: Sequence of Tokens
 Output: a representation of program
 Often AST, but could be other things
 Also find syntax errors
 CFGs used to define Parser
(Context Free Grammar)
Context Free Grammars
stmt → if ( expr ) stmt else stmt
terminals
Rule or Production
non-terminals
Context Free Grammar
CFGs consist of:
Σ – Set of Terminals (use tokens from scanner)
N – Set of Non-Terminals (variables)
P – Set of Productions (also called rules)
S – the Start Non-Terminal (one on left of first
rule if not specified)
Productions
stmt → if ( expr ) stmt else stmt
Sequence of zero or more terminals and non-terminals
Single non-terminal
Example with Boolean Expressions
 “true” and “false” are boolean
expressions
 If exp1 and exp2 are boolean
expressions, then so are:




exp1 || exp2
exp1 && exp2
! exp1
( exp1 )
Corresponding CFG
bexp
bexp
bexp
bexp
bexp
bexp
→
→
→
→
→
→
TRUE
FALSE
bexp OR bexp
bexp AND bexp
NOT bexp
LPAREN bexp RPAREN
UPPERCASE represent tokens (thus terminals)
lowercase represent non-terminals
CFG for Assignments
 Here is CFG for simple assignment
statements
(Can only assign boolean expressions
to identifiers)
stmt → ID ASSIGN bexp SEMICOLON
CFG for simple IF statements
Combine these CFGs and add 2 more rules for
simple IF statements:
1. stmt → IF LPAREN bexp RPAREN stmt
2. stmt → IF LPAREN bexp RPAREN stmt ELSE stmt
3. stmt → ID ASSIGN bexp SEMICOLON
4. bexp → TRUE
5. bexp → FALSE
6. bexp → bexp OR bexp
7. bexp → bexp AND bexp
8. bexp → NOT bexp
9. bexp → LPAREN bexp RPAREN
You Try It
Write a context-free grammar for the
language of very simple while loops
(in which the loop body only contains
one statement) by adding a new
production with nonterminal stmt on
the left-hand side.
CFG Languages
 The language defined by a context-free
grammar is the set of strings (sequences of
terminals) that can be derived from the
start nonterminal.
 Think of productions as rewriting rules
Set cur_seq = starting non-terminal
While (non-terminal, X, exists in cur_seq):
Select production with X on left of “→”
Replace X with right portion of selected
production
 Try it with given CFG
What Strings are in Language
1. stmt → IF LPAREN bexp RPAREN stmt
2.
3.
4.
5.
6.
7.
8.
9.
stmt → IF LPAREN bexp RPAREN stmt ELSE stmt
stmt → ID ASSIGN bexp SEMICOLON
bexp → TRUE
bexp → FALSE
bexp → bexp OR bexp
bexp → bexp AND bexp
bexp → NOT bexp
bexp → LPAREN bexp RPAREN
Set cur_seq = starting non-terminal
While (non-terminal, X, exists in cur_seq):
Select production with X on left of “→”
Replace X with right portion of selected production
Try It Again
exp
exp
exp
term
term
term
factor
factor
→
→
→
→
→
→
→
→
exp PLUS term
exp MINUS term
term
term TIMES factor
term DIVIDE factor
factor
LPAREN exp RPAREN
ID
What is the language?
Leftmost and Rightmost Derivations
 A derivation is a leftmost derivation
if it is always the leftmost
nonterminal that is chosen to be
replaced.
 It is a rightmost derivation if it is
always the rightmost one.
Derivation Notation
 E => a
 E =>* a
 E =>+ a
Parse Trees
Start with the start nonterminal.
Repeat:
choose a leaf nonterminal X
choose a production X --> alpha
the symbols in alpha become the children of
X in the tree
until there are no more leaf
nonterminals left.
The derived string is formed by reading the
leaf nodes from left to right.
Ambiguous Grammars
 Consider the grammar
exp → exp PLUS exp
exp → exp MINUS exp
exp → exp TIMES exp
exp → exp DIVIDE exp
exp → INT_LIT
 Construct Parse tree for 3-4/2
 Are there more than one parse trees?
 If there is more than one parse tree for a string then
the grammar is ambiguous
 Ambiguity causes problems with parsing (what is the
correct structure)?
Precedence in Grammars
To write a grammar whose parse trees
express precedence correctly, use a
different nonterminal for each
precedence level. Start by writing a
rule for the operator(s) with the
lowest precedence ("-" in our case),
then write a rule for the operator(s)
with the next lowest precedence, etc:
Precedence in Grammars
exp
term
factor
→ exp MINUS exp | term
→ term DIVIDE term | factor
→ INT_LIT | LPAREN exp RPAREN
 Now try constructing multiple parse
trees for 3-4/2
 Grammar is still ambiguous. Look at
associativity. Construct 2 parses tree
for 5-3-2.
Recursion on CFGs
 A grammar is recursive in nonterminal X
if: X derives a sequence of symbols that
includes an X.
 A grammar is left recursive in X if: X
derives a sequence of symbols that starts
with an X.
 A grammar is right recursive in X if: X
derives a sequence of symbols that ends
with an X.
 For left associativity, use left recursion.
 For right associativity, use right
recursion.
Ambiguity Removed in CFG
exp
term
factor
→ exp MINUS term | term
→ term DIVIDE factor | factor
→ INT_LIT | LPAREN exp RPAREN
 One level for each order of operation
 Left recursion for left assiciativity
 Now try constructing 2 parse trees for
5-3-2.
You Try it
 Construct a grammar for arithmetic
expressions with addition, multiplication,
exponentiation (right assoc.), subtraction,
division, and unary negative.
exp
→
exp PLUS exp |
exp MINUS exp |
exp TIMES exp |
exp DIVIDE exp |
exp POW exp |
MINUS exp |
LPAREN exp RPAREN |
INT_LIT
Solution
exp
→
term
→
factor
→
exponent
→
final
→
exp PLUS term |
exp MINUS term |
term
term TIMES factor |
term DIVIDE factor |
factor
exponent POW factor |
exponent
MINUS exponent |
final
INT_LIT |
LPAREN exp RPAREN
You Try It
 Write a grammar for the language of boolean
expressions, with two possible operands: true false,
and three possible operators: and or not. Add
nonterminals so that or has lowest precedence, then
and, then not. Finally, change the grammar to reflect
the fact that both and and or are left associative.
bexp → TRUE
bexp → FALSE
bexp → bexp OR bexp
bexp → bexp AND bexp
bexp → NOT bexp
bexp → LPAREN bexp RPAREN
List Grammars
 Several types of lists can be created using
CFGs
 One or more x's (without any separator or
terminator):
1. xList → X | xList xList
2. xList → X | xList X
3. xList → X | X xList
 One or more x's, separated by commas:
1. xList → X | xList COMMA xList
2. xList → X | xList COMMA X
3. xList → X | X COMMA xList
List Grammars
 One or more x's, each x terminated by a
semi-colon:

1.
2.
3.
You Try
xList →
xList →
xList →
It
X SEMICOLON | xList xList
X SEMICOLON | xList X SEMICOLON
X SEMICOLON | X SEMICOLON xList
 Zero or more x's (without any separator or
terminator):
1. xList → ε | X | xList xList
2. xList → ε | X | xList X
3. xList → ε | X | X xList
List Grammars
 Zero or more x's, each terminated by a
semi-colon:
1. xList → ε | X SEMICOLON | xList xList
2. xList → ε | X SEMICOLON | xList X SEMICOLON
3. xList → ε | X SEMICOLON | X SEMICOLON xList
 Zero or more x's, separated by commas:

You Try It
1. xList
→ ε | nonEmptyXList
nonEmptyXList → X | X COMMA nonEmptyXList
CFGs for Whole Languages
 To write a grammar for a whole
programming language, break down
the problem into pieces. For example,
think about a Java program: a
program consists of one or more
classes:
program
→ classlist
classlist
→ class | class classlist
CFGs for Whole Languages
 A class is the word "class", optionally
preceded by the word "public",
followed by an identifier, followed by
an open curly brace, followed by the
class body, followed by a closing curly
brace:
class → PUBLIC CLASS ID LCURLY classbody RCURLY |
CLASS ID LCURLY classbody RCURLY
CFGs for Whole Languages
 A class body is a list of zero or more
field and/or method definitions:
classbody
deflist
→ ε | deflist
→ def | def deflist
And So On…
Download