courses:cs240-201601:cfl-pumping-lemma.pptx (141.3 KB)

Chomsky Normal Form
• We skipped this section even though it appears earlier in the text: Chomsky Normal Form (CNF)
• The only production forms are
– A → BC
– A → a
– S → ε
• …where A, B, C are nonterminals, a is a terminal, S is the start symbol, and ε is the empty string.
• Every CNF grammar is a CFG
– Every CFG can be transformed into an equivalent CNF grammar
• We will use CNF conversion algorithms to clean
up needlessly complex grammars.
• Recommended: Check out Noam Chomsky's Wikipedia & FB pages!
Cleaning Up Grammars
• We can "simplify" grammars to a great extent, e.g.:
1. Get rid of ε-productions
• Productions of the form variable → ε
• But you lose the ability to generate ε as a string in the language
2. Get rid of useless symbols
• Variables that do not participate in any derivation of a terminal string
3. Get rid of unit productions
• Productions of the form variable → variable
• Any CFG can be converted via these and other methods to Chomsky Normal Form (CNF)
• Again, the only production forms are
– A → BC
– A → a
Getting Rid of the Empty String
• No, didn’t forget S → ε…
• Empty string is a nuisance with grammars and
languages in general
• We will look at languages that do not contain ε
• No loss of generality:
– For language L, let G = (V,T,S,P) be a CFG that generates L - {ε}
– Modify the grammar by adding a new start variable S0 and the productions S0 → S | ε
– This grammar generates L
– Therefore any non-trivial conclusion we make for L - {ε} should transfer to L
Eliminating ε-Productions
• A variable A is nullable if A ⇒* ε
• Find them by a recursive algorithm:
– Basis: If A → ε is a production, then A is nullable
– Induction: If A is the head of a production whose body consists of only nullable symbols, then A is nullable
• Once we have the nullable symbols, we can add additional productions and then throw away the productions of the form A → ε for any A
• If A → X1X2…Xk is a production, add all productions that can be formed by eliminating some or all of those Xi's that are nullable
– But don't eliminate all k if they are all nullable
Example
– If A → BC is a production, and both B and C are nullable, add A → B | C
Example
Grammar:
S → aA
A → aABC | bB | a
B → b | ε
C → c | ε
Nullable:
• B and C are nullable (each derives ε)
• Neither A nor S is nullable (no right hand side with all nullable symbols)
Add productions to account for strings generated when one or more RHS symbols go to ε:
S → aA
A → aABC | bB | a | aAB | aA | aAC | b
B → b | ε
C → c | ε
Eliminate ε-productions:
S → aA
A → aABC | bB | a | aAB | aA | aAC | b
B → b
C → c
Resulting grammar with no ε-productions
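The nullable-symbol search and the production expansion from the slides above can be sketched in Python. This is a minimal sketch, not a reference implementation: the dictionary encoding (variable → list of bodies, each body a tuple of symbols, with `()` standing for ε) and the function names `nullable_symbols` and `eliminate_epsilon` are assumptions made for illustration.

```python
from itertools import product

def nullable_symbols(grammar):
    """Fixed-point computation of the nullable variables."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in nullable:
                continue
            # An empty body () is the ε-production; all() over it is True (basis).
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable

def eliminate_epsilon(grammar):
    """Return a grammar without ε-productions (the language loses ε, if it had it)."""
    nullable = nullable_symbols(grammar)
    result = {}
    for head, bodies in grammar.items():
        new_bodies = set()
        for body in bodies:
            # Each nullable symbol may be kept or dropped; other symbols are always kept.
            choices = [((sym,), ()) if sym in nullable else ((sym,),) for sym in body]
            for pick in product(*choices):
                new_body = tuple(s for part in pick for s in part)
                if new_body:  # never add an empty body
                    new_bodies.add(new_body)
        result[head] = new_bodies
    return result

# The example grammar from the slide (assumed encoding: () stands for ε).
G = {
    "S": [("a", "A")],
    "A": [("a", "A", "B", "C"), ("b", "B"), ("a",)],
    "B": [("b",), ()],
    "C": [("c",), ()],
}
```

Calling `nullable_symbols(G)` gives the set containing B and C, and `eliminate_epsilon(G)["A"]` contains the same seven bodies listed for A on the slide.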
Useless Symbols
• In order for a symbol X to be useful, it
must:
1. Derive some terminal string (possibly X is a
terminal)
2. Be reachable from the start symbol; i.e., S ⇒* αXβ
• Note that X wouldn't really be useful if α or β included a symbol that didn't satisfy (1), so it is important that (1) be tested first, and symbols that don't derive terminal strings be eliminated before testing (2)
Finding Symbols That Don't
Derive Any Terminal String
• Recursive construction:
– Basis: A terminal surely derives a terminal
string
– Induction: If A is the head of a production
whose body is X1X2 …Xk, and each Xi is
known to derive a terminal string, then
surely A derives a terminal string
• Keep going until no more symbols that derive
terminal strings are discovered
Example
S → AB | C
A → 0B | C
B → 1 | A0
C → AC | C1
• Round 1: 0 and 1 are "in"
• Round 2: B → 1 says B is in
• Round 3: A → 0B says A is in
• Round 4: S → AB says S is in
• Round 5: Nothing more can be added
• Thus, C can be eliminated, along with any production that mentions it, leaving S → AB; A → 0B; B → 1 | A0
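The round-by-round construction above is a fixed-point loop. The sketch below is one possible way to write it; the grammar encoding and the names `generating_symbols` and `drop_nongenerating` are assumptions for illustration, and the loop may discover symbols in a different order than the rounds on the slide.

```python
def generating_symbols(grammar, terminals):
    """Symbols that derive some terminal string (basis: every terminal)."""
    generating = set(terminals)
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            # Induction: some body consists only of symbols already known to be "in".
            if head not in generating and any(
                all(sym in generating for sym in body) for body in bodies
            ):
                generating.add(head)
                changed = True
    return generating

def drop_nongenerating(grammar, terminals):
    """Remove non-generating variables and every production that mentions one."""
    keep = generating_symbols(grammar, terminals)
    return {
        head: [body for body in bodies if all(sym in keep for sym in body)]
        for head, bodies in grammar.items()
        if head in keep
    }

# The example grammar from the slide (assumed encoding: body = tuple of symbols).
G = {
    "S": [("A", "B"), ("C",)],
    "A": [("0", "B"), ("C",)],
    "B": [("1",), ("A", "0")],
    "C": [("A", "C"), ("C", "1")],
}
```

On this grammar, `drop_nongenerating(G, {"0", "1"})` removes C and leaves S → AB; A → 0B; B → 1 | A0.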
Finding Symbols That Can't Be
Derived From the Start Symbol
• Another recursive algorithm:
– Basis: S is "in"
– Induction: If variable A is in, then so is
every symbol in the production bodies
for A
• Keep going until no more symbols
derivable from S can be found
Example
S → AB
A → 0B
B → 1 | A0
• Round 1: S is in
• Round 2: A and B are in
• Round 3: 0 and 1 are in
• Round 4: Nothing can be added
• In this case, all symbols are derivable from S, so no change to grammar
• Book has an example where not only are there symbols not derivable from S, but you must eliminate first the symbols that don't derive terminal strings, or you get the wrong grammar
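The reachability test is a graph search from S over production bodies. A minimal sketch follows, assuming the same hypothetical encoding (variable → list of bodies); the name `reachable_symbols` is illustrative.

```python
def reachable_symbols(grammar, start):
    """Symbols (variables and terminals) that appear in some derivation from start."""
    reachable = {start}          # basis: S is "in"
    frontier = [start]
    while frontier:
        head = frontier.pop()
        # Terminals have no productions, so .get() returns an empty list for them.
        for body in grammar.get(head, []):
            for sym in body:
                if sym not in reachable:
                    reachable.add(sym)
                    frontier.append(sym)
    return reachable

# The example grammar from the slide.
G = {
    "S": [("A", "B")],
    "A": [("0", "B")],
    "B": [("1",), ("A", "0")],
}
```

Here `reachable_symbols(G, "S")` returns all five symbols, so the grammar is unchanged. If a variable that S never mentions were added, it would be missing from the result.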
Eliminating Unit Productions
1. Eliminate useless symbols and ε-productions
2. Discover those pairs of variables (A, B) such that A ⇒* B
– Because there are no ε-productions, this derivation can only use unit productions
3. Replace each combination where A ⇒* B ⇒ α and α is other than a single variable by A → α
– I.e., "short circuit" sequences of unit productions, which must eventually be followed by some other kind of production
4. Remove all unit productions
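Steps 2 to 4 can be sketched as follows. The grammar used here is not from the slides: it is the familiar expression grammar E → T | E+T, T → F | T*F, F → a | (E), chosen only because it has a chain of unit productions. The encoding and the name `eliminate_unit_productions` are assumptions.

```python
def eliminate_unit_productions(grammar):
    """Assumes ε-productions are already gone, so A =>* B uses unit productions only."""
    variables = set(grammar)

    def is_unit(body):
        return len(body) == 1 and body[0] in variables

    result = {}
    for a in variables:
        # Step 2: collect every B with A =>* B via unit productions (including A itself).
        unit_reach = {a}
        frontier = [a]
        while frontier:
            b = frontier.pop()
            for body in grammar[b]:
                if is_unit(body) and body[0] not in unit_reach:
                    unit_reach.add(body[0])
                    frontier.append(body[0])
        # Steps 3 and 4: A gets every non-unit body of every such B; unit bodies are dropped.
        result[a] = {
            tuple(body) for b in unit_reach for body in grammar[b] if not is_unit(body)
        }
    return result

# Illustrative grammar (an assumption, not taken from the slides).
G = {
    "E": [("T",), ("E", "+", "T")],
    "T": [("F",), ("T", "*", "F")],
    "F": [("a",), ("(", "E", ")")],
}
```

After the call, E has the bodies E+T, T*F, a, and (E), because E ⇒* T and E ⇒* F through unit productions.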
Chomsky Normal Form
1. Get rid of useless symbols, ε-productions, and unit productions (already done)
2. Get rid of productions whose bodies are mixes of terminals and variables, or consist of more than one terminal
3. Break up production bodies longer than 2
Result
All productions are of the form A → BC or A → a
No Mixed Bodies
1. For each terminal a, introduce a new variable A_a, with one production A_a → a
2. Replace a in any body where it is not the entire body by A_a
– Now, every body is either a single terminal or it consists only of variables
Example
• A → 0B1 becomes A_0 → 0; A_1 → 1; A → A_0 B A_1
Example: Earlier Grammar
• Grammar from which ε-productions were removed
– Contained no unit productions or useless symbols
S → aA
A → aABC | bB | a | aAB | aA | aAC | b
B → b
C → c
• Add A_a → a; we already have variables for b and c (B and C)
S → A_a A
A → A_a A B C | B B | a | A_a A B | A_a A | A_a A C | b
B → b
C → c
A_a → a
Making Bodies Short
• If we have a production like A → BCDE, we can introduce some new variables that allow the variables of the body to be introduced one at a time
– A body of length k requires k - 2 new variables
Example
– Introduce F and G; replace A → BCDE by A → BF; F → CG; G → DE
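Example: Earlier Grammar
S → A_a A
A → A_a A B C | B B | a | A_a A B | A_a A | A_a A C | b
B → b
C → c
A_a → a
• Break up A → A_a A B C by introducing D and E:
S → A_a A
A → A_a D | B B | a | A_a A B | A_a A | A_a A C | b
B → b
C → c
A_a → a
D → A E
E → B C
• Break up the two remaining bodies of length 3 by introducing F and G:
S → A_a A
A → A_a D | B B | a | A_a F | A_a A | A_a G | b
B → b
C → c
A_a → a
D → A E
E → B C
F → A B
G → A C
Chomsky Normal Form!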
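The last two steps (no mixed bodies, then short bodies) can be sketched together. This is a sketch under stated assumptions, not the slides' exact construction: it always introduces a fresh variable named `A_<terminal>` for a terminal and fresh variables named `X1`, `X2`, … for long bodies, so it does not reuse B and C for b and c or share helper variables as the worked example does. It also assumes those generated names do not clash with existing variables.

```python
def to_cnf(grammar, terminals):
    """Apply 'no mixed bodies' and 'short bodies'.

    Assumes ε-productions, unit productions, and useless symbols are already gone.
    """
    result = {head: [] for head in grammar}
    term_var = {}   # terminal -> variable standing for it (hypothetical naming)
    counter = 0

    def var_for(t):
        if t not in term_var:
            term_var[t] = "A_" + t
            result[term_var[t]] = [(t,)]
        return term_var[t]

    for head, bodies in grammar.items():
        for body in bodies:
            if len(body) == 1:
                result[head].append(tuple(body))   # A -> a stays as it is
                continue
            # Step 1: replace terminals inside longer bodies by their variables.
            syms = [var_for(s) if s in terminals else s for s in body]
            # Step 2: peel off one symbol at a time; a body of length k uses k - 2 new variables.
            current = head
            while len(syms) > 2:
                counter += 1
                fresh = "X" + str(counter)
                result[fresh] = []
                result[current].append((syms[0], fresh))
                current, syms = fresh, syms[1:]
            result[current].append(tuple(syms))
    return result

# The ε-free example grammar from the slides.
T = {"a", "b", "c"}
G = {
    "S": [("a", "A")],
    "A": [("a", "A", "B", "C"), ("b", "B"), ("a",), ("a", "A", "B"),
          ("a", "A"), ("a", "A", "C"), ("b",)],
    "B": [("b",)],
    "C": [("c",)],
}
cnf = to_cnf(G, T)
```

Every body in `cnf` is either a single terminal or a pair of variables. The result has more helper variables than the hand-built grammar with D, E, F, and G, which is the price of not sharing them.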
Full Procedure
• Perform each step in order:
1. Eliminate ε-productions
2. Eliminate useless symbols
3. Eliminate unit productions
4. Eliminate mixed bodies
5. Make all bodies short
Summary Theorem
If L is any CFL, there is a grammar G that generates L - {ε}, for which each production is of the form A → BC or A → a, and there are no useless symbols
CFL Pumping Lemma
• Similar to regular-language PL, but you have
to pump two strings in the middle of the
string, in tandem (i.e., the same number of
copies of each). Formally:
– ∀ CFL L
– ∃ integer n
– ∀ z in L, with |z| ≥ n
– ∃ uvwxy = z such that |vwx| ≤ n and |vx| > 0
– ∀ i ≥ 0, uv^i wx^i y is in L
• The part of the string containing the pumped bit does not have to start at the beginning of the string!
Pumping a regular language
[Figure: automaton path with one loop]
• Can take this loop 0 or more times, with no way to control the number of iterations
Pumping a context free language
[Figure: automaton path with two loops]
• Stack ensures that the number of passes through these two loops is coordinated
• If you take one loop you must take the other, but no way to control the number of iterations
Outline of Proof of PL
• Let there be a CFG for L
• Let b be the maximum number of symbols on the right
hand side of a rule (assume at least 2)
– No node can have more than b children
• At most b leaves are 1 step from the start variable, at most b^2 leaves are within 2 steps from the start variable, at most b^h leaves are within h steps from the start variable
– If height of the tree is at most h, length of generated string is at most b^h
– Conversely, if the generated string is at least b^h + 1 long, parse tree must be at least h + 1 high
– Thus, for a long enough string, some path has more variable nodes than there are variables, and some variable must appear twice on that path
• Compare with the DFA argument about a path longer than the
number of states
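[Figure: parse trees rooted at S in which variable A repeats on one path. The upper A yields vwx and the lower A yields w, so the whole tree yields uvwxy. Substituting one A subtree for the other yields uwy or uvvwxxy.]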
• A variable can be replaced by one of
its right hand sides any number of
times
• By repeatedly replacing the lower A's tree by the upper A's tree, we see uv^i wx^i y has a parse tree for all i > 1
– And replacing the upper by the lower shows the case i = 0; i.e., uwy is in L
Consider the derivation S ⇒* uAy ⇒* uvAxy ⇒* uvwxy
• S ⇒* uAy
• A ⇒* vAx
• A ⇒* w
• Therefore uv^i wx^i y ∈ L
Pumping Length
• Pumping lemma constant for CFLs is b^(|V|+1), where |V| is the number of variables in the grammar and b is the length of the longest RHS
– The derivation tree for a sufficiently long string must have a height of at least |V| + 1
– It has at least b^(|V|+1) leaf nodes (by the choice of string length), and a tree of height h has at most b^h leaves, so its height is equal to or greater than |V| + 1
• Consider a longest root-to-leaf path and the lowest |V| + 1 variable nodes on it: since there are only |V| variables, one must appear twice
Using the CFL Pumping Lemma to Prove a
Language is not Context-free
The classic non-CFL
Example
L = {a^i b^i c^i | i ≥ 0} is not a CFL.
• Suppose it were. Then let n be the PL constant for L.
• Consider z = a^n b^n c^n. We can write z = uvwxy, with |vwx| ≤ n and |vx| > 0 (i.e., either v or x is nonempty), and for all i ≥ 0, uv^i wx^i y is in L.
Note that unlike the PL for regular
languages, the pumpable part (vx) need
not start at the beginning of the string
N.B.
• As with the pumping lemma proof for regular languages, must show there is at least one string for which there is no decomposition into uvwxy that satisfies the constraint that for all i ≥ 0, uv^i wx^i y is in L
Because |vwx| ≤ n, vx can contain at most two types of symbols: vwx is a window of at most n consecutive symbols of
a_1 a_2 … a_n b_1 b_2 … b_n c_1 c_2 … c_n
Two cases to consider:
1. Both v and x contain only one type of alphabet symbol: v does not contain both as and bs or both bs and cs, and the same holds for x. But in this case uv^2 wx^2 y cannot contain equal numbers of as, bs, and cs, because at least one symbol type does not occur in vx and its count stays at n while another count grows
2. Either v or x contains more than one type of symbol: in this case uv^2 wx^2 y may contain equal numbers of as, bs, and cs but they won't be in the correct order
• One of these cases must occur, and both result in
contradiction
• So the assumption that L is a CFL is false
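For a small n, the case analysis can be cross-checked by brute force: enumerate every decomposition of z = a^n b^n c^n that meets the two length conditions and test whether any of them survives pumping. The sketch below only tries the exponents 0 and 2 and only one small n, so it illustrates the argument rather than proving it; the function names are assumptions.

```python
from itertools import combinations_with_replacement

def in_L(s):
    """Membership test for L = {a^i b^i c^i | i >= 0}."""
    k = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * k + "b" * k + "c" * k

def decompositions(z, n):
    """Yield every (u, v, w, x, y) with uvwxy = z, |vwx| <= n and |vx| > 0."""
    # Four non-decreasing cut points i <= j <= k <= l split z into five parts.
    for i, j, k, l in combinations_with_replacement(range(len(z) + 1), 4):
        if l - i <= n and (j - i) + (l - k) > 0:
            yield z[:i], z[i:j], z[j:k], z[k:l], z[l:]

def survives_pumping(z, n, member, exponents=(0, 2)):
    """Return a decomposition that stays in the language for all tested exponents, else None."""
    for u, v, w, x, y in decompositions(z, n):
        if all(member(u + v * e + w + x * e + y) for e in exponents):
            return (u, v, w, x, y)
    return None

n = 3
z = "a" * n + "b" * n + "c" * n
print(survives_pumping(z, n, in_L))
```

For this z the search finds no surviving decomposition, in line with the two cases above. Run against a context-free language such as {a^k b^k}, the same search does find one (for example v = a, x = b around the midpoint).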
Example
L = {ww | w ∈ {0,1}*} is not a CFL.
• Suppose it were. Then let n be the PL constant for L
• Choosing a string is less obvious for this language
– Try z = 0^n 1 0^n 1
– But it can be pumped by dividing as follows: u = 0^(n-1), v = 0, w = 1, x = 0, y = 0^(n-1) 1
• Try another string: 0^n 1^n 0^n 1^n
– seems to capture more of the "essence" of the language
• Use PL condition that the string can be pumped by dividing into z = uvwxy, where |vwx| ≤ n
• vwx must straddle the midpoint of z. Otherwise, if vwx lies only in the first half of z, pumping up to uv^2 wx^2 y moves a 1 into the first position of the second half, so it cannot be of form ww. If vwx lies in the second half, a 0 is moved into the last position of the first half, so it cannot be of form ww.
• If vwx straddles the midpoint of z, pumping z down to uwy yields 0^n 1^i 0^j 1^n, where i and j cannot both be n. This string cannot be of form ww. Contradiction!
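The difference between the two candidate strings can be checked directly. The sketch below pumps the first string with the division given in the example (u = 0^(n-1), v = 0, w = 1, x = 0, y = 0^(n-1) 1) and pumps the second string down with one straddling division chosen here for illustration. The value n = 4 and the helper names are assumptions, and one failing division is only an illustration, not the full argument.

```python
def is_ww(s):
    """Membership test for {ww | w in {0,1}*}."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and s[:half] == s[half:]

def pump(u, v, w, x, y, i):
    return u + v * i + w + x * i + y

n = 4

# First attempt: z = 0^n 1 0^n 1, divided around the middle 1.
first = ("0" * (n - 1), "0", "1", "0", "0" * (n - 1) + "1")
first_ok = all(is_ww(pump(*first, i)) for i in range(6))

# Second attempt: z = 0^n 1^n 0^n 1^n with v = 1 and x = 0 straddling the midpoint.
second = ("0" * n + "1" * (n - 1), "1", "", "0", "0" * (n - 1) + "1" * n)
second_down = pump(*second, 0)   # 0^n 1^(n-1) 0^(n-1) 1^n

print(first_ok, is_ww(second_down))
```

The first string stays in the language for every exponent tried, which is why it is a poor choice; the second string leaves the language as soon as it is pumped down.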
Example
L = {a^i b^j c^k | i < j < k} is not a CFL
Suppose it were. Then let n be the PL constant for L.
Consider z = a^n b^(n+1) c^(n+2). We can write z = uvwxy, with |vwx| ≤ n and |vx| > 0, and uv^i wx^i y ∈ L for every i ≥ 0
This time must pump down as well as pump up.
First we consider the case where vx contains at least one a. Then since |vwx| ≤ n, vx can contain no cs. Therefore, uv^2 wx^2 y has at least n + 1 as and exactly n + 2 cs, which is impossible for strings in L: a string in L with at least n + 1 as needs at least n + 2 bs and at least n + 3 cs.
If vx contains no as, then it must contain either b or c. In this case, uv^0 wx^0 y = uwy has either fewer than n + 1 bs or fewer than n + 2 cs, but in either case exactly n as. This is also impossible for strings in L.
By contradiction, L is not a CFL.
Example
L = {a^i b^j c^k | 0 ≤ i ≤ j ≤ k} is not a CFL
Suppose it were. Then let n be the PL constant for L.
Consider z = a^n b^n c^n. We can write z = uvwxy, with |vwx| ≤ n and |vx| > 0, and uv^i wx^i y ∈ L for every i ≥ 0
When both v and x contain only one type of symbol, v does not contain both as and bs or both bs and cs, and the same holds for x. At least one of the symbols a, b, c then does not appear in vx, so we must divide into three sub-cases:
1. No as. Then try pumping down to obtain uv^0 wx^0 y = uwy. It contains too few bs or cs.
2. No bs. Then either as or cs must appear in v or x because both can't be the empty string. If as appear, then uv^2 wx^2 y contains more as than bs. If cs appear, then uv^0 wx^0 y contains more bs than cs.
3. No cs. The string uv^2 wx^2 y contains more as or more bs than cs.
When either v or x contains more than one type of symbol, uv^2 wx^2 y will not contain symbols in the correct order.
By contradiction, L is not a CFL
Example
L = {xyx | x,y ∈ {a,b}* and |x| ≥ 1} is not a CFL
Suppose it were. Then let n be the PL constant for L.
Let z = a^n b^n a^n b^n (of the form xyx with x = a^n b^n and y = ε). Then z = uvwxy for some u, v, w, x, and y, satisfying |vx| > 0, |vwx| ≤ n, and uv^i wx^i y ∈ L for every i ≥ 0
Suppose that vx contains either only as from the first group or only bs from the last group. Then uv^2 wx^2 y is either a^(n+i) b^n a^n b^n or a^n b^n a^n b^(n+i) for some 0 < i ≤ n, and in neither case can this string be in the form xyx for any x with |x| > 0.
Otherwise, vx contains either a b from the first group or an a from the second. In this case uv^0 wx^0 y is either a^i b^j a^k b^n or a^n b^i a^j b^k where in either case i and k are positive and j < n. Neither of these strings can be in the form required for L either.
By contradiction, L is not a CFL.
Example
L = {0^(k^2) | k is any integer} is not a CFL
• Suppose it were. Then let n be the PL constant for L.
• Consider z = 0^(n^2)
• We can write z = uvwxy, with |vwx| ≤ n and |vx| > 0, and for all i ≥ 0, uv^i wx^i y is in L.
• Then uv^2 wx^2 y should be in L
• But n^2 < |uv^2 wx^2 y| ≤ n^2 + n < (n + 1)^2, so there is no perfect square that |uv^2 wx^2 y| could be
• By contradiction, L is not a CFL
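The length bound is easy to check numerically: pumping up once adds |vx| symbols, and 1 ≤ |vx| ≤ n, so every possible pumped length lies strictly between n^2 and (n + 1)^2. A small sketch, with illustrative helper names and an arbitrary test range for n:

```python
from math import isqrt

def is_square(m):
    """True if m is a perfect square."""
    r = isqrt(m)
    return r * r == m

def pumped_lengths(n):
    """All possible values of |uv^2 wx^2 y| = n^2 + |vx| with 1 <= |vx| <= n."""
    return [n * n + k for k in range(1, n + 1)]

# No pumped length is a perfect square for any n in the tested range.
no_squares = all(
    not is_square(m) for n in range(1, 200) for m in pumped_lengths(n)
)
print(no_squares)
```

`math.isqrt` is available from Python 3.8 onward.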
Context-free Pumping Lemma
Broken Proofs, Etc.
L = {a^i b^j c^k | k = max(i, j)}
• Assume L is context free, with pumping length p
• Let s = a^p b^p c^p
• By the Pumping Lemma, s = uvwxy, satisfying the three conditions. By the length condition, if vwx contains characters of a single type, we are done, by "pumping down" or "pumping up".
• Otherwise, vwx cannot contain both a and c.
• The remaining possibilities are:
– vx contains c. Then the number of cs in uv^0 wx^0 y is less than p (there are p of them altogether in s), while the maximum of i and j in uv^0 wx^0 y is still p. Contradiction.
– vx does not contain c. In this case, "pumping up" implies that either the number of as or bs can be increased without altering the number of cs. Again, contradiction.
• What's wrong?
– Need to consider cases where vx spans two symbols.
– Need to be more explicit: The constraint on the language is that the number of cs is the maximum of i and j. Pumping the symbols that appear the minimum of i and j times won't affect the validity of the string until the value exceeds the maximum of i and j.
L = {w t w^R | w,t ∈ {a,b}* and |w| = |t|}
• How to choose s?
– The idea here is that the power of context-free languages allows us to match w with w^R or check that |w| = |t|, but not both
– Choose s = a^p b^p a^p
– Problem?
• If vx is all bs we can pump up or down and the string will still be in the language (Why?)
– Choose s = a^p b^p a^p b^p b^p a^p
• If we pump up or down within any window of p characters in that string, the result will no longer be in the language
– Problem?
• There is not enough detail about why pumping would fail.
L = {a^n b a^(2n) b a^(3n) | n ≥ 0}
• Assume that L is context-free and there exists a pumping length p
• The string s = a^p b a^(2p) b a^(3p) seems to be a natural choice for showing that the pumping lemma fails
• When we partition s as uvwxy, we have the following cases:
– Either v or x contains a b. In this case uv^2 wx^2 y has more than two bs and thus the string is not in the language
– v and x contain only as. We partition all as from s into three segments: the first a^p, the middle a^(2p) and the last a^(3p). According to the third condition of the pumping lemma, the length of vwx is at most p. This means that v and x can contain as from at most two segments, and pumping the string up to uv^2 wx^2 y will violate the 1:2:3 ratio of as (and the string is no longer in the language).
• One of the above options must happen, and thus the pumping lemma fails on all partitionings of s
• Any problems here?
– Need to consider the case where v and x contain as from either the first and middle segments or the middle and last segments.
L = {a^n b^m a^n | n,m ≥ 0 and n ≥ m}
• Let s = a^p b^p a^p
• Because of the constraint |vwx| ≤ p, we have only the following
choices for partitioning s into uvwxy:
– v or x contains at least one a from the first block of as: pumping up or down in this case results in a mismatch with the second block of as, since none of the as from the second block can be in v or x.
– v or x contains at least one a from the last block of as: similar to the above case (v or x cannot reach the as in the beginning of the string).
– v and x are contained within the bs: pumping up will result in violating the n ≥ m constraint, since the number of bs will exceed the number of as in each part (because |vx| > 0)
• Are we done?
– It would be better to explicitly consider vx spanning as and bs, where
either
• one of v or x consists of as and the other consists of bs
• one of v or x consists of as followed by bs or bs followed by as