File - Automata Theory and Formal Languages

Lecture Five:
Context Free Grammar (CFG)
Amjad Ali
CFG, Lecture 5, slide
Definition of Context-Free Grammar
There are four important components in a grammatical description of a language:
1. There is a finite set of symbols that form the strings of the language being
defined. This set was {0,1} in the palindrome example we just saw. We call this
alphabet the terminals, or terminal symbols.
2. There is a finite set of variables, sometimes also called nonterminals or syntactic
categories. Each variable represents a language; i.e., a set of strings. In our
example above, there was only one variable, P, which we used to represent the
class of palindromes over alphabet {0,1}.
3. One of the variables represents the language being defined; it is called the start
symbol. Other variables represent auxiliary classes of strings that are used to
help define the language of the start symbol. In our example, P, the only
variable, is the start symbol.
4. There is a finite set of productions or rules that represent the recursive
definition of a language. Each production consists of:
a) A variable that is being (partially) defined by the production. This variable
is often called the head of the production.
b) The production symbol →.
c) A string of zero or more terminals and variables. This string, called the body
of the production, represents one way to form strings in the language of the
variable of the head. In so doing, we leave terminals unchanged and substitute
for each variable in the body any string that is known to be in the language of that
variable.
Alternate Definition of Context-Free Grammar
A context-free grammar (CFG) is a collection of three things:
1. An alphabet Σ of letters called terminals, from which we are going to make
strings that will be the words of a language.
2. A set of symbols called nonterminals, one of which is the symbol S, standing
for “start here”.
3. A finite set of productions of the form
one nonterminal → finite string of terminals and/or nonterminals
Formal Definition of CFG
A context-free grammar is a 4-tuple (V, Σ, R ,S), where
1. V is a finite set called the variables,
2. Σ is a finite set, disjoint from V, called the terminals,
3. R is a finite set of rules, with each rule being a variable and a string of
variables and terminals, and
4. S ∈ V is the start variable.
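The 4-tuple can be written down directly as a data structure. The following is a minimal Python sketch, not part of the lecture: the class name `CFG`, its field names, and the `check` method are hypothetical choices made for illustration. It encodes the palindrome grammar over {0,1} mentioned earlier and verifies the conditions of the definition (V and Σ disjoint, S ∈ V, every rule built from known symbols).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    variables: frozenset   # V
    terminals: frozenset   # Sigma
    rules: tuple           # R: pairs (head, body); body is a tuple of symbols
    start: str             # S

    def check(self):
        """Return True if the 4-tuple satisfies the definition, else raise ValueError."""
        if self.variables & self.terminals:
            raise ValueError("V and Sigma must be disjoint")
        if self.start not in self.variables:
            raise ValueError("the start variable must belong to V")
        symbols = self.variables | self.terminals
        for head, body in self.rules:
            if head not in self.variables:
                raise ValueError(f"rule head {head!r} is not a variable")
            if any(symbol not in symbols for symbol in body):
                raise ValueError(f"rule body {body!r} uses an unknown symbol")
        return True


# Palindrome grammar over {0, 1}; the empty tuple stands for the empty body.
palindromes = CFG(
    variables=frozenset({"P"}),
    terminals=frozenset({"0", "1"}),
    rules=(("P", ()), ("P", ("0",)), ("P", ("1",)),
           ("P", ("0", "P", "0")), ("P", ("1", "P", "1"))),
    start="P",
)
print(palindromes.check())
```

Storing each body as a tuple of symbols keeps the empty body representable and would also allow multi-character symbol names.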
Palindrome Example
The rules that define the palindromes, expressed in the context-free grammar
notation, are (^ denotes the null string):
1. P → ^
2. P → 0
3. P → 1
4. P → 0P0
5. P → 1P1
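The five palindrome productions can be read as a recursive test for membership. The sketch below is an illustration only; the function name `in_P` is an assumption and does not come from the lecture. Each branch corresponds to one or two of the productions.

```python
def in_P(w):
    """Return True if the string w can be derived from P."""
    # Productions 1-3: P -> ^, P -> 0, P -> 1
    if w in ("", "0", "1"):
        return True
    # Productions 4-5: P -> 0P0, P -> 1P1
    if len(w) >= 2 and w[0] == w[-1] and w[0] in "01":
        return in_P(w[1:-1])
    return False


print(in_P("0110"), in_P("01010"), in_P("011"))
```

Each recursive call strips one matching symbol from both ends, which mirrors applying production 4 or 5 once.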
Notation for CFG Derivations
Some conventions used while discussing CFGs:
1. Lower-case letters near the beginning of the alphabet, a, b, and so on, are terminal
symbols. Digits and other characters such as + or parentheses can also be used as
terminals.
2. Upper-case letters near the beginning of the alphabet, A, B, and so on, are
variables.
3. Lower-case letters near the end of the alphabet, such as w or z, are strings of
terminals. This convention reminds us that the terminals are analogous to the input
symbols of an automaton.
4. Upper-case letters near the end of the alphabet, such as X or Y, are either
terminals or variables.
5. Lower-case Greek letters, such as α and β, are strings consisting of terminals and/or
variables.
There is no special notation for strings that consist of variables only, since this concept plays
no important role. However, a string named α or another Greek letter might happen to
have only variables.
Example:
Consider a more complex CFG that represents (a simplification of) expressions in a typical
programming language. The operators are limited to + and *, representing addition
and multiplication respectively. The arguments are identifiers, but instead of the full set of
typical identifiers (letters followed by zero or more letters and digits), only the letters a
and b and the digits 0 and 1 are allowed. Every identifier begins with a or b, which may be
followed by any string in {a, b, 0, 1}*.
Two variables are used in this grammar:
1. E, which represents expressions. It is the start symbol and represents the language of
expressions we are defining.
2. I, which represents identifiers.
The productions are:
1. E → I
2. E → E+E
3. E → E*E
4. E → (E)
5. I → a
6. I → b
7. I → Ia
8. I → Ib
9. I → I0
10. I → I1
Consider the string a*(a+b00), which is in the language of the above CFG.
One derivation of it is:
E => E * E            (production 3)
  => I * E            (production 1)
  => a * E            (production 5)
  => a * (E)          (production 4)
  => a * (E + E)      (production 2)
  => a * (I + E)      (production 1)
  => a * (a + E)      (production 5)
  => a * (a + I)      (production 1)
  => a * (a + I0)     (production 9)
  => a * (a + I00)    (production 9)
  => a * (a + b00)    (production 6)
Leftmost and Rightmost Derivations
Leftmost derivation:
In order to restrict the number of choices we have in deriving a string, it is
often useful to require that at each step we replace the leftmost variable by one of its
production bodies. Such a derivation is called a leftmost derivation.
Rightmost derivation:
In order to restrict the number of choices we have in deriving a string, it is
often useful to require that at each step we replace the rightmost variable by one of its
production bodies. Such a derivation is called a rightmost derivation.
Example:
The inference that a*(a+b00) is in the language of variable E can
be reflected in a derivation of that string, starting with the string E.
The leftmost derivation is (each leftmost step is written =>_lm):
E =>_lm E * E =>_lm I * E =>_lm a * E =>_lm a * (E) =>_lm a * (E + E)
  =>_lm a * (I + E) =>_lm a * (a + E) =>_lm a * (a + I)
  =>_lm a * (a + I0) =>_lm a * (a + I00) =>_lm a * (a + b00)
We can summarize the leftmost derivation as E =>*_lm a * (a + b00), or express several
steps of the derivation by expressions such as E * E =>*_lm a * (E).
The rightmost derivation is (each rightmost step is written =>_rm):
E =>_rm E * E =>_rm E * (E) =>_rm E * (E + E) =>_rm E * (E + I) =>_rm E * (E + I0)
  =>_rm E * (E + I00) =>_rm E * (E + b00) =>_rm E * (I + b00)
  =>_rm E * (a + b00) =>_rm I * (a + b00) =>_rm a * (a + b00)
So the rightmost derivation can be summarized as E =>*_rm a * (a + b00).
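Both derivations of a*(a+b00) can be replayed mechanically from the numbered productions of the expression grammar. The sketch below is illustrative; the names `PRODUCTIONS`, `step`, and `derive` are assumptions introduced here. A sentential form is held as a string in which E and I are the only variables, and each step rewrites the leftmost or the rightmost variable with the chosen production.

```python
# Productions 1-10 of the expression grammar, as (head, body) pairs.
PRODUCTIONS = {
    1: ("E", "I"), 2: ("E", "E+E"), 3: ("E", "E*E"), 4: ("E", "(E)"),
    5: ("I", "a"), 6: ("I", "b"), 7: ("I", "Ia"), 8: ("I", "Ib"),
    9: ("I", "I0"), 10: ("I", "I1"),
}
VARIABLES = {"E", "I"}


def step(form, number, leftmost=True):
    """Apply one production to the leftmost (or rightmost) variable of form."""
    head, body = PRODUCTIONS[number]
    positions = [i for i, symbol in enumerate(form) if symbol in VARIABLES]
    if not positions:
        raise ValueError("no variable left to replace")
    i = positions[0] if leftmost else positions[-1]
    if form[i] != head:
        raise ValueError(f"production {number} does not apply to {form[i]!r}")
    return form[:i] + body + form[i + 1:]


def derive(numbers, leftmost=True, start="E"):
    """Return the list of sentential forms produced by the given productions."""
    forms = [start]
    for number in numbers:
        forms.append(step(forms[-1], number, leftmost))
    return forms


LEFTMOST = [3, 1, 5, 4, 2, 1, 5, 1, 9, 9, 6]
RIGHTMOST = [3, 4, 2, 1, 9, 9, 6, 1, 5, 1, 5]
print(" => ".join(derive(LEFTMOST)))
print(" => ".join(derive(RIGHTMOST, leftmost=False)))
```

The two production sequences use the same productions the same number of times; only the order of application differs.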
Inference, Derivations and Parse Trees
Given a grammar, a variable A, and a terminal string w, the following five statements
are equivalent:
I. The recursive inference procedure determines that terminal string w is in the
language of variable A.
II. A =>* w.
III. A =>*_lm w.
IV. A =>*_rm w.
V. There is a parse tree with root A and yield w.
Some Examples:
Example#1:
Let the terminal be a and the nonterminal be S, and the productions be
S → aS
S → ^
The language generated by this CFG is a*.
To derive a^6 (the string aaaaaa) in this CFG, the following derivation is used:
S => aS
  => aaS
  => aaaS
  => aaaaS
  => aaaaaS
  => aaaaaaS
  => aaaaaa^
  = aaaaaa
Notice:
i. → means “can be replaced by”, as in S → aS.
ii. => means “can develop into”, as in aaS => aaaS.
Example#2:
Let the terminals be a and b and the only nonterminal be S, and the productions be
S → aS
S → bS
S → a
S → b
The language generated by this CFG is the set of all possible strings of letters a and b
except for the null string, which we cannot generate.
To produce the string baab the following derivations will be used.
S => bS
=> baS
=> baaS
=> baab
Example#3:
Let the terminals be a and b, the only nonterminal be S, and the productions be
S → aS
S → bS
S → a
S → b
S → ^
The word ab can be generated by the derivation
S => aS
  => abS
  => ab^
  = ab
or by the derivation
S => aS
  => ab
The language of this CFG is (a+b)*, but the sequence of productions that is used to
generate a specific word is not unique.
The third and fourth productions are redundant: removing them does not change the language.
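The non-uniqueness can be checked by counting derivations. The following sketch is an illustration under stated assumptions: the function name `count_leftmost_derivations` and the dictionary encoding are hypothetical, symbols are single characters, the empty string stands for ^, and the search terminates only for grammars like this one, where every step either adds a terminal or removes a variable.

```python
def count_leftmost_derivations(productions, start, target):
    """Count the leftmost derivations of target from start."""
    variables = set(productions)

    def count(form):
        # Locate the leftmost variable, if any.
        i = next((k for k, s in enumerate(form) if s in variables), None)
        if i is None:
            return 1 if form == target else 0
        # Prune forms that can no longer develop into the target.
        terminals = sum(1 for s in form if s not in variables)
        if terminals > len(target) or not target.startswith(form[:i]):
            return 0
        return sum(count(form[:i] + body + form[i + 1:])
                   for body in productions[form[i]])

    return count(start)


EXAMPLE_2 = {"S": ["aS", "bS", "a", "b"]}
EXAMPLE_3 = {"S": ["aS", "bS", "a", "b", ""]}   # "" stands for ^
print(count_leftmost_derivations(EXAMPLE_3, "S", "ab"))
```

Under this encoding the grammar of Example 3 yields two derivations of ab, matching the two shown above, while the grammar of Example 2 yields one.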
Example#4:
Let the terminals be a and b, the nonterminals be S and X, and the productions be
S → XaaX
X → aX
X → bX
X → ^
The words generated from S have the form
anything aa anything
or
(a+b)*aa(a+b)*
which is the language of all words with a double a in them somewhere.
For example, to generate baabaab, we can proceed as follows:
S=>XaaX=>bXaaX=>baXaaX=>baaXaaX=>baabXaaX
=>baab^aaX=baabaaX=>baabaabX=>baabaab^=baabaab
Example#5:
Let the terminals be a and b, the nonterminals be S, X, and Y, and the productions be
S → XY
X → aX
X → bX
X → a
Y → Ya
Y → Yb
Y → a
Considering variable X, the X productions are:
X → aX
X → bX
X → a
From these productions, it can be seen that:
o any string of terminals that comes from X must end in an a
o any word ending in an a can be derived from X
To derive the word babba from X, the procedure will be:
X=>bX=>baX=>babX=>babbX=>babba
Considering variable Y, the Y productions are:
Y → Ya
Y → Yb
Y → a
It can be seen that the words that can be derived from Y are:
o exactly those that begin with an a
To derive abbab, the procedure will be:
Y=>Yb=>Yab=>Ybab=>Ybbab=>abbab
Since
S → XY
every word derived from S is a word ending in a followed by a word beginning with a,
so the words that can be derived from S have a double a in them.
To derive babaabb, the procedure will be:
S=>XY=>bXY=>baXY=>babXY=>babaY=>babaYb=>babaYbb
=>babaabb
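The grammars of Example#4 and Example#5 can be compared by listing every word they generate up to a fixed length. This is a sketch only: `words_up_to` and the dictionary encoding are assumptions made for illustration, symbols are single characters, and the search is guaranteed to stop only because the number of variables in a sentential form never grows in these two grammars.

```python
def words_up_to(productions, start, max_len):
    """Return all terminal words of length <= max_len derivable from start."""
    variables = set(productions)
    seen, stack, words = {start}, [start], set()
    while stack:
        form = stack.pop()
        # Always expand the leftmost variable.
        i = next((k for k, s in enumerate(form) if s in variables), None)
        if i is None:
            words.add(form)
            continue
        for body in productions[form[i]]:
            new = form[:i] + body + form[i + 1:]
            terminals = sum(1 for s in new if s not in variables)
            # Terminals are never erased, so longer forms cannot help.
            if terminals <= max_len and new not in seen:
                seen.add(new)
                stack.append(new)
    return words


EXAMPLE_4 = {"S": ["XaaX"], "X": ["aX", "bX", ""]}   # "" stands for ^
EXAMPLE_5 = {"S": ["XY"], "X": ["aX", "bX", "a"], "Y": ["Ya", "Yb", "a"]}

same = words_up_to(EXAMPLE_4, "S", 6) == words_up_to(EXAMPLE_5, "S", 6)
print(same)
```

Agreement up to a bounded length is evidence, not a proof, that both grammars describe the words containing a double a.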
Example#6:
Let the terminals be a and b, and the three nonterminals be S, BALANCED, and
UNBALANCED.
The productions are:
S → SS
S → BALANCED S
S → S BALANCED
S → ^
S → UNBALANCED S UNBALANCED
BALANCED → aa
BALANCED → bb
UNBALANCED → ab
UNBALANCED → ba
From these productions, it can be seen that:
o The language generated is the set of all words with an even number of a’s and an
even number of b’s, i.e. the language EVEN-EVEN.
Derivation of word aababbab:
S=>BALANCED S
=>aaS
=>aa UNBALANCED S UNBALANCED
=>aa ba S UNBALANCED
=>aa ba S ab
=>aa ba BALANCED S ab
=>aa ba bb S ab
=>aa ba bb ^ ab
= aababbab
Example#7:
Let the terminals be a and b, and only one nonterminal S.
The productions are:
S → aSb
S → ^
The language generated by these productions is the nonregular language a^n b^n (n ≥ 0).
Derivation of a^6 b^6 using the above productions:
S=>aSb=>aaSbb
=>aaaSbbb=>aaaaSbbbb
=>aaaaaSbbbbb=>aaaaaaSbbbbbb
=>aaaaaabbbbbb
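The derivation pattern for a^n b^n is the same for every n: apply S → aSb n times, then S → ^ once. A small sketch of that pattern follows; the function name `derive_anbn` is a hypothetical helper, not something from the lecture.

```python
def derive_anbn(n):
    """Return the sentential forms in the derivation of a^n b^n."""
    forms = ["S"]
    for _ in range(n):
        forms.append(forms[-1].replace("S", "aSb"))   # S -> aSb
    forms.append(forms[-1].replace("S", ""))          # S -> ^
    return forms


print(" => ".join(derive_anbn(2)))   # prints: S => aSb => aaSbb => aabb
```

Because every sentential form contains exactly one S, replacing it is both the leftmost and the rightmost step.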
Example#8:
Let the terminals be a and b, and only one nonterminal S.
The productions are:
S → aSa
S → bSb
S → ^
The language generated by these productions is a nonregular language of palindromes
(words that read the same backwards as forwards). Because every production adds letters
in pairs, only palindromes of even length are generated.
Derivation of word abbaabba using the above productions:
S =>aSa
=>abSba
=>abbSbba
=>abbaSabba
=>abbaabba
Example#9:
ODD PALINDROME is the language of palindromes with an odd number of letters.
A general palindrome may contain either an even or an odd number of letters.
A grammar for ODD PALINDROME is:
S → aSa
S → bSb
S → a
S → b
The above grammar can be modified to generate the entire language PALINDROME:
S → aSa
S → bSb
S → a
S → b
S → ^
Example#10:
A nonregular language that can be generated by a CFG is a^n b a^n:
S → aSa
S → b
Example#11:
Let the terminals be a and b, the nonterminals be S, A, and B, and the productions be
S → aB
S → bA
A → a
A → aS
A → bAA
B → b
B → bS
B → aBB
The language that this CFG generates is the language EQUAL of all strings that have an
equal number of a’s and b’s in them.
Some words of this language are abba, aaabbb, and ba.
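Membership in this language can be tested by recursive inference over the productions. The sketch below is illustrative and the function name `derives` is an assumption. It relies on the fact that every production body here starts with a terminal, so each recursive call works on a strictly shorter string.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def derives(variable, w):
    """Return True if variable (S, A, or B) derives the terminal string w."""
    if not w:
        return False                      # no production has an empty body
    first, rest = w[0], w[1:]
    if variable == "S":                   # S -> aB | bA
        if first == "a":
            return derives("B", rest)
        if first == "b":
            return derives("A", rest)
        return False
    own, other = ("a", "b") if variable == "A" else ("b", "a")
    if first == own:                      # A -> a | aS   (B -> b | bS)
        return rest == "" or derives("S", rest)
    if first == other:                    # A -> bAA      (B -> aBB)
        return any(derives(variable, rest[:k]) and derives(variable, rest[k:])
                   for k in range(1, len(rest)))
    return False


for word in ("abba", "aaabbb", "ba", "aab"):
    print(word, derives("S", word))
```

Read this way, A stands for strings with one more a than b, and B for strings with one more b than a.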
Ambiguity
Definition:
A CFG is called ambiguous if for at least one word in the language that it generates
there are two possible derivations of the word that correspond to different syntax trees.
If a CFG is not ambiguous, it is called unambiguous.
Ambiguous Grammars:
Consider the sentential form E + E * E in the expression grammar given earlier. It has
two derivations from E:
1. E => E + E => E + E * E
2. E => E * E => E + E * E
      E                       E
    / | \                   / | \
   E  +  E                 E  *  E
       / | \             / | \
      E  *  E           E  +  E
      fig. I            fig. II
Two parse trees with the same yield
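The number of parse trees for a given string can be counted directly. The sketch below assumes a simplified grammar E → E+E | E*E | a, in which the single terminal a stands in for each operand; that simplification and the function name `count_trees` are assumptions for illustration, not part of the lecture.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_trees(w):
    """Number of parse trees for w under E -> E+E | E*E | a."""
    if w == "a":
        return 1
    # Choose the operator applied at the root (E -> E op E) and
    # multiply the tree counts of the left and right operands.
    return sum(count_trees(w[:i]) * count_trees(w[i + 1:])
               for i, ch in enumerate(w) if ch in "+*")


print(count_trees("a+a*a"))   # prints 2
```

A count above 1 for any string shows that the grammar is ambiguous; the two trees for a+a*a have the shapes of fig. I and fig. II.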
Removing Ambiguity from Grammars
There are two causes of ambiguity in the previous ambiguous grammar:
I. The precedence of operators is not respected. While fig. I properly groups the
* before the + operator, fig. II is also a valid parse tree and groups the +
ahead of the *. We need to force only the structure of fig. I to be legal in an
unambiguous grammar.
II. A sequence of identical operators can group either from the left or from the
right. For example, if the *’s in figs. I and II were replaced by +’s, we would see
two different parse trees for the string E + E + E. Since addition and
multiplication are associative, it doesn’t matter whether we group from the left
or the right, but to eliminate ambiguity, we must pick one. The conventional
approach is to insist on grouping from the left, so the structure of fig. II is the
only correct grouping of two +-signs.
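One common way to enforce both choices is to parse in layers: an expression is a sum of terms, a term is a product of factors, and a factor is an identifier or a parenthesized expression. The text above does not give such a grammar, so the sketch below is an assumption throughout. The layering, the function names, and the nested-tuple tree format are all illustrative. It produces exactly one tree per string, with * grouped before + and equal operators grouped from the left.

```python
import re


def parse(text):
    """Parse an expression into a nested tuple (operator, left, right)."""
    text = text.replace(" ", "")
    # Identifiers begin with a or b, followed by any of a, b, 0, 1.
    tokens = re.findall(r"[ab][ab01]*|[+*()]", text)
    if "".join(tokens) != text:
        raise SyntaxError("unexpected character")
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        pos += 1
        return tok

    def factor():                      # identifier or (expression)
        if peek() == "(":
            take("(")
            tree = expr()
            take(")")
            return tree
        tok = take()
        if tok in "+*)":
            raise SyntaxError(f"unexpected {tok!r}")
        return tok

    def term():                        # factors joined by *, grouped from the left
        tree = factor()
        while peek() == "*":
            take("*")
            tree = ("*", tree, factor())
        return tree

    def expr():                        # terms joined by +, grouped from the left
        tree = term()
        while peek() == "+":
            take("+")
            tree = ("+", tree, term())
        return tree

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree


print(parse("a+b*a"))
print(parse("a+b+a"))
```

For a+b*a the product ends up below the sum, as in fig. I, and for a+b+a the left pair is grouped first.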