Uploaded by Wolfe Weeks

Context-Free Grammar and Parsing

Cmp Sci 4280
Context-free grammar
• What is a context-free grammar?
• A set of recursive rules used to generate patterns of strings
• We can think of a computer program as a pattern of strings
Context-free grammar
• Nearly the entire syntax of a programming language can be
described by a context-free grammar
• Developed by Noam Chomsky in the 1950s
Noam Chomsky
• Backus-Naur Form (BNF)
– Nearly identical to Chomsky’s context-free
grammars
– Natural for describing syntax
• Similar method used by Panini to describe Sanskrit
syntax more than 2000 years ago
• We’ll use ‘BNF’ and ‘grammar’ interchangeably
BNF
• BNF is a metalanguage for programming languages
– It is a language that is used to describe another language
• Uses terminals
– Tokens in computer program
• Uses abstractions called nonterminals
– Nonterminals are enclosed in <> brackets or start with
an uppercase letter
<assign> → <var> = <expression>
• LHS is the nonterminal being defined
• RHS consists of some mixture of:
• Terminals (tokens)
• References to other abstractions (nonterminals)
• Altogether the definition is called a rule or production
A BNF grammar is a collection of productions
BNF
• Nonterminals can have two or more distinct
definitions
– These can be listed separately in different rules
– Or they can be listed in one rule and separated by ‘|’
<expression> → <var> + <var>
<expression> → <var> - <var>
<expression> → <var> + <var>
| <var> - <var>
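As a rough illustration of why the two notations are equivalent, the sketch below stores productions as (LHS, RHS) pairs and groups them by LHS. The representation and the names `collect` and `to_bnf` are assumptions made for this example, not part of BNF itself.

```python
# Hypothetical representation: a rule is a (LHS, RHS) pair, where the RHS is a
# list of symbols. Nonterminals keep their <> brackets; everything else is a terminal.
separate_rules = [
    ("<expression>", ["<var>", "+", "<var>"]),
    ("<expression>", ["<var>", "-", "<var>"]),
]

def collect(rules):
    """Group rules that share an LHS, which is what '|' expresses in BNF."""
    grammar = {}
    for lhs, rhs in rules:
        grammar.setdefault(lhs, []).append(rhs)
    return grammar

def to_bnf(grammar):
    """Render each nonterminal as one rule with '|' between its alternatives."""
    lines = []
    for lhs, alternatives in grammar.items():
        body = " | ".join(" ".join(alt) for alt in alternatives)
        lines.append(lhs + " → " + body)
    return lines

grammar = collect(separate_rules)
print(to_bnf(grammar))
```

Both ways of writing the rules end up as the same mapping from a nonterminal to its list of alternatives.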
Recursion
• A rule is recursive if its LHS appears in its RHS
<ident_list> → identifier
| identifier, <ident_list>
• BNF can create lists of any length using recursion
• BNF is sufficiently powerful to describe:
• Lists
• The order in which constructs must appear
• Nested structures to any depth
• Operator precedence and associativity (implied by the structure of the rules)
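The recursive <ident_list> rule can be mirrored directly by a recursive function. The following is a minimal sketch, assuming tokens are plain strings and that every identifier is represented by the literal token "identifier"; the function names are made up for this example.

```python
def ident_list(n):
    """Derive a list of n identifiers using
    <ident_list> -> identifier | identifier, <ident_list>."""
    if n < 1:
        raise ValueError("the grammar generates at least one identifier")
    if n == 1:
        return ["identifier"]                       # first alternative
    return ["identifier", ","] + ident_list(n - 1)  # recursive alternative

def is_ident_list(tokens):
    """Recognize a token list by trying the same two alternatives."""
    if not tokens or tokens[0] != "identifier":
        return False
    if len(tokens) == 1:
        return True                                 # first alternative
    # Recursive alternative: a comma followed by another <ident_list>.
    return tokens[1] == "," and is_ident_list(tokens[2:])

print(" ".join(ident_list(3)))
print(is_ident_list(["identifier", ","]))
```

Because the rule refers to itself, there is no upper bound on the length of the lists it can generate or recognize.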
• A grammar is a generative device for defining languages
• Begin with a special nonterminal of the grammar called the
start symbol
• Start symbol for a computer program is often named
<program> or <start>
What is the shortest
program we could
generate?
A sequence of rule applications is called a derivation
What is one derivation of this language?
Each successive string is derived from the previous string by
replacing one of the nonterminals with one of that nonterminal’s
definitions
Here the leftmost nonterminal is replaced at each step
Referred to as ‘leftmost derivation’
Parentheses
What leftmost derivation
would produce:
A=B*(A+C)
One approach to determine if a given program is in given CFG:
Not efficient in practice
Language Recognizers
CFG Notes
Derivation is useful for testing for syntax errors, but a compiler must also recover the
program's structure in order to generate target code
Parse trees are used for this purpose
Leftmost derivation of A = B * ( A + C )
• Every internal node is labeled with a nonterminal
• Every leaf is labeled with a terminal
Parentheses are usually omitted
Consider this grammar:
E -> E + E | E - E | E * E | E / E | ( E ) | ID
ID -> x | y | z
What is leftmost derivation of:
(x+y)/(x-y)
Build the parse tree
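One way to check a leftmost derivation by hand is to replay it mechanically. The sketch below assumes the grammar above is stored as a dictionary and that the derivation is given as a list of chosen right-hand sides; `GRAMMAR`, `leftmost_derivation`, and `choices` are names invented for this example.

```python
GRAMMAR = {
    "E": [["E", "+", "E"], ["E", "-", "E"], ["E", "*", "E"],
          ["E", "/", "E"], ["(", "E", ")"], ["ID"]],
    "ID": [["x"], ["y"], ["z"]],
}

def leftmost_derivation(start, choices):
    """Apply each chosen alternative to the leftmost nonterminal in turn."""
    form = [start]
    steps = [" ".join(form)]
    for rhs in choices:
        # Locate the leftmost nonterminal in the current sentential form.
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        if rhs not in GRAMMAR[form[i]]:
            raise ValueError(f"{form[i]} has no alternative {rhs}")
        form = form[:i] + rhs + form[i + 1:]
        steps.append(" ".join(form))
    return steps

# One possible sequence of choices for (x+y)/(x-y).
choices = [
    ["E", "/", "E"], ["(", "E", ")"], ["E", "+", "E"],
    ["ID"], ["x"], ["ID"], ["y"],
    ["(", "E", ")"], ["E", "-", "E"],
    ["ID"], ["x"], ["ID"], ["y"],
]
steps = leftmost_derivation("E", choices)
print("\n=> ".join(steps))
```

Each printed line is one sentential form; the last line contains only terminals, so it is a sentence of the language.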
Consider this grammar:
E -> E + E | E - E | E * E | E / E | ( E ) | ID
ID -> x | y | z
What is leftmost derivation of:
x+y*z
Build the parse tree
Can you draw a structurally different
(not isomorphic) tree?
Let x = 1, y = 2, z = 3
Do these trees produce the same results?
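To see whether the two trees agree, they can be evaluated directly. This is a simplified sketch: it keeps only the operator nodes of each parse tree (the E and ID nodes are dropped) and writes a tree as a nested tuple, which is a representation chosen for this example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(tree, env):
    """Evaluate a tree that is either a variable name or an (op, left, right) tuple."""
    if isinstance(tree, str):
        return env[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, env), evaluate(right, env))

env = {"x": 1, "y": 2, "z": 3}
plus_at_root = ("+", "x", ("*", "y", "z"))   # x + (y * z)
times_at_root = ("*", ("+", "x", "y"), "z")  # (x + y) * z

print(evaluate(plus_at_root, env))   # 7
print(evaluate(times_at_root, env))  # 9
```

The same sentence yields 7 with one tree and 9 with the other, so the choice of tree changes the meaning of the program.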
Ambiguity
• A grammar is ambiguous if it can generate more than one
distinct parse tree for a given sentence
– The trees must differ in structure, not just be isomorphic redrawings of the same tree
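Ambiguity can be checked for a particular sentence by counting its parse trees. The sketch below does this for the grammar E -> E + E | E - E | E * E | E / E | ID, under the simplifying assumption that the sentence contains no parentheses (the ( E ) alternative is ignored); `count_parse_trees` is a name made up for this example.

```python
from functools import lru_cache

OPERATORS = {"+", "-", "*", "/"}
IDENTIFIERS = {"x", "y", "z"}

def count_parse_trees(tokens):
    """Count parse trees for a parenthesis-free token sequence derived from E."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from E.
        if hi - lo == 1:
            return 1 if tokens[lo] in IDENTIFIERS else 0
        total = 0
        # Try every operator as the root: E -> E op E.
        for k in range(lo + 1, hi - 1):
            if tokens[k] in OPERATORS:
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees(["x", "+", "y", "*", "z"]))  # 2
```

A count greater than one for any sentence is enough to show that the grammar is ambiguous.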
The problem is that the grammar does not encode operator precedence
What is 3 x 4 + 2 ^ 2 ?
What is 3 x (4 + 2) ^ 2 ?
Standard precedence:
1. Parentheses
2. Exponentiation
3. Multiplication / Division
4. Addition / Subtraction
PEMDAS
What about -2 ^ 2 x 3 ?
Under the standard rules, unary negation has higher precedence than multiplication/division
but lower precedence than exponentiation
Not always standard precedence in this class!
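Python's built-in operators happen to follow the standard precedence rules listed above, so the three questions can be checked quickly. Note that Python writes exponentiation as ** and multiplication as *, where the slides use ^ and x; the variable names are only for this example.

```python
a = 3 * 4 + 2 ** 2       # exponent first, then multiply, then add
b = 3 * (4 + 2) ** 2     # parentheses first, then exponent, then multiply
c = -2 ** 2 * 3          # exponent binds tighter than unary minus: -(2 ** 2) * 3
d = (-2) ** 2 * 3        # parentheses force the negation to happen first

print(a, b, c, d)        # 16 108 -12 12
```

Other languages, and the grammars used in this class, may define precedence differently, so the grammar is always the final authority.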
How can we fix this grammar?
E -> E + E | E - E | E * E | E / E | ( E ) | id
id -> x | y | z
E -> E + E | E - E | T
T -> T * T | T / T | F
F -> ( E ) | id
id -> x | y | z
E -> E + E | E - E | T
T -> T * T | T / T | F
F -> ( E ) | id
id -> x | y | z
Use this modified grammar to parse:
x-y*z
Use this modified grammar to parse:
x-y+z
What is a different parse tree for this sentence?
We must consider associativity!
Operator Associativity
If two operators have the same precedence in an expression,
their associativity determines the order in which they are evaluated
What is the direction of associativity for the following in C/C++?
Assignment statements: y = 3
x=y=7
What are the values of x and y?
Assignment is right associative in C/C++
Operator Associativity
What is the direction of associativity for the following in C/C++?
y=3-2+5
What is the value of y?
Addition/subtraction is left associative in C/C++
Multiplication/division is left associative in C/C++
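The effect of associativity on 3 - 2 + 5 can be shown by grouping the same operands in both directions. This is a small sketch; splitting the expression into a list of operands and a list of operators is a convenience assumed for this example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub}

def eval_left(operands, ops):
    """Group from the left: ((a op b) op c) ..."""
    result = operands[0]
    for op, value in zip(ops, operands[1:]):
        result = OPS[op](result, value)
    return result

def eval_right(operands, ops):
    """Group from the right: a op (b op (c ...))"""
    if not ops:
        return operands[0]
    return OPS[ops[0]](operands[0], eval_right(operands[1:], ops[1:]))

operands, ops = [3, 2, 5], ["-", "+"]
print(eval_left(operands, ops))   # (3 - 2) + 5 = 6
print(eval_right(operands, ops))  # 3 - (2 + 5) = -4
```

Left associativity gives 6, which is the value C/C++ assigns to y; grouping from the right would give -4 instead.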
How can we fix this grammar?
We can rewrite the grammar again
while restricting the recursion
accordingly
E -> E + E | E - E | T
T -> T * T | T / T | F
F -> ( E ) | id
id -> x | y | z
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | id
id -> x | y | z
Use this modified grammar to parse:
x-y+z
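The grammar E -> E + T | E - T | T, T -> T * F | T / F | F, F -> ( E ) | id can be turned into a small parser. The sketch below is one possible approach, not the method prescribed by the slides: it is a recursive-descent parser in which each left-recursive rule is replaced by a loop, since a direct recursive call on E would never terminate. It returns a simplified operator tree as nested tuples rather than a full parse tree, and the function name `parse` is made up for this example.

```python
def parse(text):
    """Parse an expression over x, y, z into an (op, left, right) tree."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():                      # E -> E + T | E - T | T
        node = term()
        while peek() in ("+", "-"):  # the loop builds the tree leftward
            op = advance()
            node = (op, node, term())
        return node

    def term():                      # T -> T * F | T / F | F
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():                    # F -> ( E ) | id
        tok = peek()
        if tok == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok in ("x", "y", "z"):
            return advance()
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse("x-y+z"))  # ('+', ('-', 'x', 'y'), 'z')
print(parse("x-y*z"))  # ('-', 'x', ('*', 'y', 'z'))
```

For x-y+z the subtraction ends up nested inside the addition, which is the left-associative grouping, and for x-y*z the multiplication sits below the subtraction, which reflects its higher precedence.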