Chomsky Normal Form
• We skipped this section even though it appears earlier in the text: Chomsky Normal Form (CNF)
• The only production forms are
  – A → BC
  – A → a
  – S → ε
• …where A, B, C are nonterminals, a is a terminal, S is the start symbol, and ε is the empty string.
• Every CNF grammar is a CFG
  – Every CFG can be transformed into an equivalent CNF grammar
• We will use CNF conversion algorithms to clean up needlessly complex grammars.
• Recommended: check out Chomsky's Wikipedia and Facebook pages!

Cleaning Up Grammars
• We can "simplify" grammars to a great extent, e.g.:
  1. Get rid of ε-productions
     • Productions of the form variable → ε
     • But you lose the ability to generate ε as a string in the language
  2. Get rid of useless symbols
     • Symbols that do not participate in any derivation of a terminal string
  3. Get rid of unit productions
     • Productions of the form variable → variable
• Any CFG can be converted via these and other methods to Chomsky Normal Form (CNF)
• Again, the only production forms are
  – A → BC
  – A → a

Getting Rid of the Empty String
• No, didn't forget S → ε…
• The empty string is a nuisance with grammars and languages in general
• We will look at languages that do not contain ε
• No loss of generality:
  – For language L, let G = (V, T, S, P) be a CFG that generates L − {ε}
  – Modify the grammar by adding a new start variable S0 and the productions S0 → S | ε
  – This grammar generates L (assuming ε ∈ L; otherwise G already generates L)
  – Therefore any non-trivial conclusion we make for L − {ε} should transfer to L

Eliminating ε-Productions
• A variable A is nullable if A ⇒* ε
• Find the nullable variables by a recursive algorithm:
  – Basis: If A → ε is a production, then A is nullable
  – Induction: If A is the head of a production whose body consists of only nullable symbols, then A is nullable
• Once we have the nullable symbols, we can add additional productions and then throw away the productions of the form A → ε for any A
• If A → X1X2…Xk is a production, add all productions that can be formed by eliminating some or all of those Xi's that are nullable
  – But don't eliminate all k if they are all nullable
• Example
  – If A → BC is a production, and both B and C are nullable, add A → B | C

Example
• Grammar:
    S → aA
    A → aABC | bB | a
    B → b | ε
    C → c | ε
• Nullable:
  – C and B are nullable: each derives ε
  – Neither A nor S is nullable (no right-hand side with all nullable symbols)
• Add productions to account for strings generated when one or more RHS symbols go to ε:
    S → aA
    A → aABC | bB | a | aAB | aA | aAC | b
    B → b | ε
    C → c | ε
• Eliminate ε-productions:
    S → aA
    A → aABC | bB | a | aAB | aA | aAC | b
    B → b
    C → c
• This is the resulting grammar with no ε-productions

Useless Symbols
• In order for a symbol X to be useful, it must:
  1. Derive some terminal string (possibly X is a terminal)
  2. Be reachable from the start symbol; i.e., S ⇒* αXβ
• Note that X wouldn't really be useful if α or β included a symbol that didn't satisfy (1), so it is important that (1) be tested first, and symbols that don't derive terminal strings be eliminated before testing (2)

Finding Symbols That Don't Derive Any Terminal String
• Recursive construction:
  – Basis: A terminal surely derives a terminal string
  – Induction: If A is the head of a production whose body is X1X2…Xk, and each Xi is known to derive a terminal string, then surely A derives a terminal string
• Keep going until no more symbols that derive terminal strings are discovered

Example
    S → AB | C
    A → 0B | C
    B → 1 | A0
    C → AC | C1
• Round 1: 0 and 1 are "in"
• Round 2: B → 1 says B is in
• Round 3: A → 0B says A is in
• Round 4: S → AB says S is in
• Round 5: Nothing more can be added
• Thus, C can be eliminated, along with any production that mentions it, leaving
    S → AB
    A → 0B
    B → 1 | A0

Finding Symbols That Can't Be Derived From the Start Symbol
• Another recursive algorithm:
  – Basis: S is "in"
  – Induction: If variable A is in, then so is every symbol in the production bodies for A
• Keep going until no more symbols derivable from S can be found

Example
    S → AB
    A → 0B
    B → 1 | A0
• Round 1: S is in
• Round 2: A and B are in
• Round 3: 0 and 1 are in
• Round 4: Nothing can be added
• In this case, all symbols are derivable from S, so no change to the grammar
• The book has an example where not only are there symbols not derivable from S, but you must first eliminate the symbols that don't derive terminal strings, or you get the wrong grammar

Eliminating Unit Productions
1. Eliminate useless symbols and ε-productions
2. Discover those pairs of variables (A, B) such that A ⇒* B
   – Because there are no ε-productions, this derivation can only use unit productions
3. Replace each combination where A ⇒* B ⇒ α and α is other than a single variable by A → α
   – I.e., "short circuit" sequences of unit productions, which must eventually be followed by some other kind of production
4. Remove all unit productions

Chomsky Normal Form
1. Get rid of useless symbols, ε-productions, and unit productions (already done)
2. Get rid of productions whose bodies are mixes of terminals and variables, or consist of more than one terminal
3. Break up production bodies longer than 2
• Result: All productions are of the form A → BC or A → a

No Mixed Bodies
1. For each terminal a, introduce a new variable Aa, with one production Aa → a
2. Replace a in any body where it is not the entire body by Aa
   – Now, every body is either a single terminal or it consists only of variables
• Example
  – A → 0B1 becomes A0 → 0; A1 → 1; A → A0BA1

Example: Earlier Grammar
• Grammar from which ε-productions were removed
  – Contained no unit productions or useless symbols
    S → aA
    A → aABC | bB | a | aAB | aA | aAC | b
    B → b
    C → c
• Introduce Aa → a; we already have variables for b and c (B and C):
    S → AaA
    A → AaABC | BB | a | AaAB | AaA | AaAC | b
    B → b
    C → c
    Aa → a

Making Bodies Short
• If we have a production like A → BCDE, we can introduce some new variables that allow the variables of the body to be introduced one at a time
  – A body of length k requires k − 2 new variables
• Example
  – Introduce F and G; replace A → BCDE by A → BF; F → CG; G → DE

Example: Earlier Grammar
• Before:
    S → AaA
    A → AaABC | BB | a | AaAB | AaA | AaAC | b
    B → b
    C → c
    Aa → a
• After shortening every body longer than 2:
    S → AaA
    A → AaD | BB | a | AaF | AaA | AaG | b
    B → b
    C → c
    Aa → a
    D → AE
    E → BC
    F → AB
    G → AC
• Chomsky Normal Form!

Full Procedure
• Perform each step in order:
  1. Eliminate ε-productions
  2. Eliminate useless symbols
  3. Eliminate unit productions
  4. Eliminate mixed bodies
  5. Make all bodies short
• Note: removing unit productions can leave variables unreachable, so repeat step 2 afterwards if needed

Summary Theorem
• If L is any CFL, there is a grammar G that generates L − {ε}, for which each production is of the form A → BC or A → a, and there are no useless symbols

CFL Pumping Lemma
• Similar to the regular-language PL, but you have to pump two strings in the middle of the string, in tandem (i.e., the same number of copies of each).
• Formally:
  – ∀ CFL L
  – ∃ integer n
  – ∀ z in L, with |z| ≥ n
  – ∃ uvwxy = z such that |vwx| ≤ n and |vx| > 0
  – ∀ i ≥ 0, uv^i wx^i y is in L
• The part of the string containing the pumped bit does not have to start at the beginning of the string!

Pumping a regular language
• [Figure: path containing a single loop]
  – Can take this loop 0 or more times, with no way to control the number of iterations

Pumping a context free language
• [Figure: path containing two loops]
  – The stack ensures that the number of passes through these two loops is coordinated
  – If you take one loop you must take the other, but there is no way to control the number of iterations

Outline of Proof of PL
• Let there be a CFG for L
• Let b be the maximum number of symbols on the right-hand side of a rule (assume at least 2)
  – No node can have more than b children
• At most b leaves are 1 step from the start variable, at most b^2 leaves are within 2 steps from the start variable, and at most b^h leaves are within h steps from the start variable
  – If the height of the tree is at most h, the length of the generated string is at most b^h
  – Conversely, if the generated string is at least b^h + 1 long, the parse tree must be at least h + 1 high
  – Thus, taking h = |V| (the number of variables), some variable must appear twice on the longest root-to-leaf path
• Compare with the DFA argument about a path longer than the number of states

[Figure: three parse trees rooted at S: the tree for uvwxy, with a variable A repeated on one path; the tree for uwy, with the upper A's subtree replaced by the lower A's; and the tree for uv^2 wx^2 y, with the lower A's subtree replaced by the upper A's]
• A variable can be replaced by one of its right-hand sides any number of times
• By repeatedly replacing the lower A's tree by the upper A's tree, we see uv^i wx^i y has a parse tree for all i > 1
  – And replacing the upper by the lower shows the case i = 0; i.e., uwy is in L
• Consider the derivation S ⇒* uAy ⇒* uvAxy ⇒* uvwxy
  – S ⇒* uAy, A ⇒* vAx, and A ⇒* w
  – Hence uv^i wx^i y ∈ L

Pumping Length
• The pumping lemma constant for CFLs is b^(|V|+1), where |V| is the number of variables in the grammar and b is the length of the longest RHS
  – The derivation tree for a sufficiently long string must have a height of at least |V| + 1
  – It has at least b^(|V|+1) leaf nodes (by definition), and therefore its height is equal to or greater than |V| + 1
• Consider a leaf and the |V| + 1 variable nodes above it: since there are only |V| variables, one must appear twice

Using the CFL Pumping Lemma to Prove a Language is not Context-free

The classic non-CFL
Example: L = {a^i b^i c^i | i ≥ 0} is not a CFL.
• Suppose it were. Then let n be the PL constant for L.
• Consider z = a^n b^n c^n.
• We can write z = uvwxy, with |vwx| ≤ n and |vx| > 0 (i.e., either v or x is nonempty), and for all i ≥ 0, uv^i wx^i y is in L.
• Note that unlike the PL for regular languages, the pumpable part (vwx) need not start at the beginning of the string

N.B.
• As with the pumping lemma proof for regular languages, we must show there is at least one string for which there is no decomposition into uvwxy that satisfies the constraint that for all i ≥ 0, uv^i wx^i y is in L
• Because |vwx| ≤ n, vx can contain at most two of the three symbol types (it cannot contain both as and cs)
  [Figure: a window v…w…x of length at most n sliding over a1a2…an b1b2…bn c1c2…cn]
• Two cases to consider:
  1. Both v and x contain only one type of alphabet symbol: v does not contain both as and bs or both bs and cs, and the same holds for x. But in this case uv^2 wx^2 y cannot contain equal numbers of as, bs, and cs, because at least one of the three symbols is not pumped
  2. Either v or x contains more than one type of symbol: in this case uv^2 wx^2 y may contain equal numbers of as, bs, and cs, but they won't be in the correct order
• One of these cases must occur, and both result in contradiction
• So the assumption that L is a CFL is false

Example: L = {ww | w ∈ {0,1}*} is not a CFL.
• Suppose it were. Then let n be the PL constant for L
• Choosing a string is less obvious for this language
  – Try z = 0^n 1 0^n 1
  – But it can be pumped by dividing as follows:
      u = 000…000, v = 0, w = 1, x = 0, y = 000…0001
    (so uv^i wx^i y = 0^(n−1+i) 1 0^(n−1+i) 1, which is in L)
• Try another string: 0^n 1^n 0^n 1^n
  – It seems to capture more of the "essence" of the language
• Use the PL condition that the string can be pumped by dividing into z = uvwxy, where |vwx| ≤ n
• vwx must straddle the midpoint of z. Otherwise, if vwx lies only in the first half of z, pumping up to uv^2 wx^2 y moves a 1 into the first position of the second half, so the result cannot be of the form ww. If vwx lies in the second half, a 0 is moved into the last position of the first half, so the result cannot be of the form ww.
• If vwx straddles the midpoint of z, pumping z down to uwy yields 0^n 1^i 0^j 1^n, where i and j cannot both be n. This string cannot be of the form ww. Contradiction!

Example: L = {a^i b^j c^k | i < j < k} is not a CFL
• Suppose it were. Then let n be the PL constant for L.
• Consider z = a^n b^(n+1) c^(n+2).
• We can write z = uvwxy, with |vwx| ≤ n and |vx| > 0, and uv^i wx^i y ∈ L for every i ≥ 0
• This time we must pump down as well as pump up.
• First we consider the case where vx contains at least one a. Then since |vwx| ≤ n, vx can contain no cs. Therefore, uv^2 wx^2 y has at least n + 1 as and exactly n + 2 cs, which is impossible for strings in L (a string in L with n + 1 as needs at least n + 3 cs).
• If vx contains no as, then it must contain either b or c. In this case, uv^0 wx^0 y = uwy has either fewer than n + 1 bs or fewer than n + 2 cs, but in either case exactly n as. This is also impossible for strings in L.
• By contradiction, L is not a CFL.

Example: L = {a^i b^j c^k | 0 ≤ i ≤ j ≤ k} is not a CFL
• Suppose it were. Then let n be the PL constant for L.
• Consider z = a^n b^n c^n.
• We can write z = uvwxy, with |vwx| ≤ n and |vx| > 0, and uv^i wx^i y ∈ L for every i ≥ 0
• When both v and x contain only one type of symbol, v does not contain both as and bs or both bs and cs, and the same holds for x. We must divide into three sub-cases:
  1. vx contains no as. Then try pumping down to obtain uv^0 wx^0 y = uwy, which contains fewer bs or fewer cs than as.
  2. vx contains no bs. Then either as or cs must appear in v or x, because both can't be the empty string. If as appear, then uv^2 wx^2 y contains more as than bs. If cs appear, then uv^0 wx^0 y contains more bs than cs.
  3. vx contains no cs. The string uv^2 wx^2 y contains more as or more bs than cs.
• When either v or x contains more than one type of symbol, uv^2 wx^2 y will not contain symbols in the correct order.
• By contradiction, L is not a CFL

Example: L = {xyx | x, y ∈ {a,b}* and |x| ≥ 1} is not a CFL
• Suppose it were. Then let n be the PL constant for L.
• Let z = a^n b^n a^n b^n (taking x = a^n b^n and y = ε in the definition of L).
• Then z = uvwxy for some u, v, w, x, and y (the pumping-lemma pieces, not the x and y in the definition of L), satisfying |vx| > 0, |vwx| ≤ n, and uv^i wx^i y ∈ L for every i ≥ 0
• Suppose that vx contains either only as from the first group or only bs from the last group. Then uv^2 wx^2 y is either a^(n+i) b^n a^n b^n or a^n b^n a^n b^(n+i) for some 0 < i ≤ n, and in neither case can this string be in the form xyx for any x with |x| > 0.
• Otherwise, vx contains either a b from the first group or an a from the second. In this case uv^0 wx^0 y is either a^i b^j a^k b^n or a^n b^i a^j b^k, where in either case i and k are positive and j < n. Neither of these strings can be in the form required for L either.
• By contradiction, L is not a CFL.

Example: L = {0^(k^2) | k is any integer} is not a CFL
• Suppose it were. Then let n be the PL constant for L.
• Consider z = 0^(n^2)
• We can write z = uvwxy, with |vwx| ≤ n and |vx| > 0, and for all i ≥ 0, uv^i wx^i y is in L.
• Then uv^2 wx^2 y should be in L
• But n^2 < |uv^2 wx^2 y| ≤ n^2 + n < (n + 1)^2, so there is no perfect square that |uv^2 wx^2 y| could be
• By contradiction, L is not a CFL

Context-free Pumping Lemma Broken Proofs, Etc.

L = {a^i b^j c^k | k = max(i, j)}
• Assume L is context free, with pumping length p
• Let s = a^p b^p c^p
• By the Pumping Lemma, s = uvwxy, satisfying the three conditions. If vwx contains characters of a single type, we are done, by "pumping down" or "pumping up".
• Otherwise, by the length condition, vwx cannot contain both a and c.
• The remaining possibilities are:
  – vx contains c. Then the number of cs in uv^0 wx^0 y is less than p (there are p of them altogether in s), while the maximum of i and j in uv^0 wx^0 y is still p. Contradiction.
  – vx does not contain c. In this case, "pumping up" implies that either the number of as or bs can be increased without altering the number of cs. Again, contradiction.
• What's wrong?
  – Need to consider cases where vx spans two symbols.
  – Need to be more explicit: The constraint on the language is that the number of cs is the maximum of i and j. Pumping the symbols that appear the minimum of i and j times won't affect the validity of the string until the value exceeds the maximum of i and j.

L = {w t w^R | w, t ∈ {a,b}* and |w| = |t|}
• How to choose s?
  – The idea here is that the power of context-free languages allows us to match w with w^R or check that |w| = |t|, but not both
  – Choose s = a^p b^p a^p
  – Problem?
    • If vx is all bs we can pump up or down and the string will still be in the language (Why?)
  – Choose s = a^p b^p a^p b^p b^p a^p
    • If we pump up or down within any window of p characters in that string, the result will no longer be in the language
  – Problem?
    • There is not enough detail about why pumping would fail.

L = {a^n b a^(2n) b a^(3n) | n ≥ 0}
• Assume that L is context-free and there exists a pumping length p
• The string s = a^p b a^(2p) b a^(3p) seems to be a natural choice for showing that the pumping lemma fails
• When we partition s as uvwxy, we have the following cases:
  – Either v or x contains a b. In this case uv^2 wx^2 y has more than two bs and thus the string is not in the language
  – v and x contain only as. We partition all as from s into three segments: the first a^p, the middle a^(2p), and the last a^(3p). According to the length condition of the pumping lemma, the length of vwx is at most p. This means that v and x can contain as from at most two segments, and pumping the string up to uv^2 wx^2 y will violate the 1:2:3 ratio of as (and the string is no longer in the language).
• One of the above options must happen, and thus the pumping lemma fails on all partitionings of s
• Any problems here?
  – Need to consider the case where v and x contain as from either the first and middle segments or the middle and last segments.

L = {a^n b^m a^n | n, m ≥ 0 and n ≥ m}
• Let s = a^p b^p a^p
• Because of the constraint |vwx| ≤ p, we have only the following choices for partitioning s into uvwxy:
  – v or x contains at least one a from the first block of as: pumping up or down in this case results in a mismatch with the second block of as, since none of the as from the second block can be in v or x.
  – v or x contains at least one a from the last block of as: similar to the above case (v or x cannot reach the as at the beginning of the string).
  – v and x are contained within the bs: pumping up will violate the n ≥ m constraint, since the number of bs will exceed the number of as in each block (because |vx| > 0)
• Are we done?
  – It would be better to explicitly consider vx spanning as and bs, where either
    • one of v or x consists of as and the other consists of bs
    • one of v or x consists of as followed by bs or bs followed by as
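
A recurring problem in the broken proofs above is a missed case. One way to sanity-check a case analysis is to enumerate every decomposition z = uvwxy with |vwx| ≤ p and |vx| > 0 for a small p and see which ones survive pumping. The following is only a minimal sketch under stated assumptions: the names decompositions, survivors, in_anbman, and in_wtwr are hypothetical helpers invented for this illustration, and pumping is only tried for i = 0 … max_i. A finite check like this is not a proof. It can expose a decomposition that a proof forgot, or a badly chosen string, but it says nothing about other values of p.

```python
from typing import Callable, Iterator, List, Tuple

Split = Tuple[str, str, str, str, str]  # (u, v, w, x, y)


def decompositions(z: str, p: int) -> Iterator[Split]:
    """Yield every split z = uvwxy with |vwx| <= p and |vx| > 0."""
    n = len(z)
    for start in range(n + 1):                          # u = z[:start]
        for end in range(start + 1, min(start + p, n) + 1):  # vwx = z[start:end]
            for a in range(start, end + 1):             # v = z[start:a]
                for b in range(a, end + 1):             # w = z[a:b], x = z[b:end]
                    if (a - start) + (end - b) > 0:     # require |vx| > 0
                        yield z[:start], z[start:a], z[a:b], z[b:end], z[end:]


def survivors(z: str, p: int, in_lang: Callable[[str], bool],
              max_i: int = 3) -> List[Split]:
    """Splits whose pumped strings stay in the language for i = 0..max_i."""
    return [
        (u, v, w, x, y)
        for (u, v, w, x, y) in decompositions(z, p)
        if all(in_lang(u + v * i + w + x * i + y) for i in range(max_i + 1))
    ]


def in_anbman(s: str) -> bool:
    """Membership in {a^n b^m a^n | n, m >= 0 and n >= m} (assumed helper)."""
    if set(s) - {"a", "b"}:
        return False
    if "b" not in s:                     # m = 0: the string must be a^n a^n
        return len(s) % 2 == 0
    first, last = s.index("b"), s.rindex("b")
    if "a" in s[first:last + 1]:         # the bs must form one block
        return False
    n_left, m, n_right = first, last - first + 1, len(s) - last - 1
    return n_left == n_right and n_left >= m


def in_wtwr(s: str) -> bool:
    """Membership in {w t w^R | w, t in {a,b}* and |w| = |t|} (assumed helper)."""
    if set(s) - {"a", "b"} or len(s) % 3 != 0:
        return False
    k = len(s) // 3
    return s[:k] == s[len(s) - k:][::-1]


if __name__ == "__main__":
    p = 3
    s = "a" * p + "b" * p + "a" * p
    print(len(survivors(s, p, in_anbman)))   # splits that survive for a^n b^m a^n
    print(survivors(s, p, in_wtwr)[:3])      # a few splits that survive for w t w^R
```

For p = 3, no decomposition of a^p b^p a^p survives in the a^n b^m a^n language, which is consistent with the completed case analysis. For the w t w^R language the same string has surviving decompositions made only of bs (for example v = b, w = ε, x = bb), which is exactly the problem the slide points out with choosing s = a^p b^p a^p.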