Chapter 4 Context-Free Languages
Copyright © 2011 The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Using Grammar Rules to Define a Language
• Regular languages and FAs are too simple for many purposes
  – Using context-free grammars allows us to describe more interesting languages
  – Much high-level programming language syntax can be expressed with context-free grammars
  – Context-free grammars with a very simple form provide another way to describe the regular languages
• Grammars can be ambiguous
• We will study how derivations can be related to the structure of the string being derived

Introduction to Computation

• A grammar is a set of rules, usually simpler than those of English, by which strings in a language can be generated
• Consider the language AnBn = {aⁿbⁿ | n ≥ 0}, defined using the recursive definition:
  – Λ ∈ AnBn
  – For every S ∈ AnBn, aSb ∈ AnBn
• Think of S as a variable representing an arbitrary element, and write these rules as
S → Λ
S → aSb
(In the process of obtaining an element of AnBn, S can be replaced by either string)
• If α and β are strings, and α contains at least one occurrence of S, then α ⇒ β means that β is obtained from α in one step, by using one of the two rules to replace a single occurrence of S by either Λ or aSb
• For example, we could write
S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaabbb
to describe a derivation of the string aaabbb
• We can simplify the rules by using the | symbol to mean “or”, so that the rules become
S → Λ | aSb

Context-Free Grammars: Definitions and More Examples
• Definition: A context-free grammar (CFG) is a 4-tuple G = (V, Σ, S, P), where V and Σ are disjoint finite sets, S ∈ V, and P is a finite set of formulas of the form A → α, where A ∈ V and α ∈ (V ∪ Σ)*
  – Elements of Σ are terminal symbols, or terminals, and elements of V are variables, or nonterminals
  – S is the start variable, and elements of P are grammar rules, or productions
  – We use → for productions in a grammar and ⇒ for a step in a derivation
  – The notations ⇒ⁿ and ⇒* refer to n steps and zero or more steps, respectively
• We will sometimes write ⇒_G to indicate a derivation in a particular grammar G
• α ⇒ β means that there are strings α₁, α₂, and γ in (V ∪ Σ)* and a production A → γ in P such that α = α₁Aα₂ and β = α₁γα₂
  – This is a single step in a derivation
• What makes the grammar context-free is that the production above, with left side A, can be applied wherever A occurs in the string (irrespective of the context; i.e., regardless of what α₁ and α₂ are)
• Definition: If G = (V, Σ, S, P) is a CFG, the language generated by G is L(G) = {x ∈ Σ* | S ⇒*_G x} (S is the start variable, and x is a string of terminals)
• A language L is a context-free language (CFL) if there is a CFG G with L = L(G)
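The definition of L(G) can be made concrete with a short sketch. The Python below is an illustration only and is not part of the original slides: it assumes every symbol is a single character, encodes Λ as the empty string, and uses the hypothetical names RULES and generate. It explores leftmost derivation steps breadth-first for the grammar S → Λ | aSb and collects the terminal strings reached. The pruning rule is adequate for this grammar; a grammar whose sentential forms can grow without gaining terminals would need an additional bound.

```python
from collections import deque

# Hypothetical encoding: each variable maps to its right-hand sides;
# the empty string "" stands for the null string Λ.
RULES = {"S": ["", "aSb"]}

def generate(rules, start="S", max_len=6):
    """Return the terminal strings of length <= max_len derivable from start."""
    variables = set(rules)
    seen, results = {start}, set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Locate the leftmost variable occurrence, if there is one.
        pos = next((i for i, c in enumerate(form) if c in variables), None)
        if pos is None:
            results.add(form)  # no variables left: a string of terminals
            continue
        for rhs in rules[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]  # one derivation step
            # Terminals never disappear, so too many terminals is a dead end.
            terminals = sum(1 for c in new if c not in variables)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(results, key=lambda s: (len(s), s))

print(generate(RULES))  # ['', 'ab', 'aabb', 'aaabbb']
```

Each string in the output corresponds to a derivation such as S ⇒ aSb ⇒ aaSbb ⇒ aabb.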
• Consider AEqB = {x ∈ {a,b}* | n_a(x) = n_b(x)}
• Let’s develop a CFG for AEqB
• If x is a non-null string in AEqB then either x = ay, where y ∈ L_b = {z | n_b(z) = n_a(z) + 1}, or x = by, where y ∈ L_a = {z | n_a(z) = n_b(z) + 1}
  – We represent L_b by the variable B and L_a by the variable A
  – The productions so far are S → Λ | aB | bA
  – All we need now are productions for A and B
• If a string x ∈ L_a starts with a, then the remainder is a member of AEqB
• If it starts with b, the rest has two more a’s than b’s
• Observation: a string containing two more a’s than b’s must be the concatenation of two strings, each with one more a; similarly with a and b reversed
• The grammar resulting from these observations is
S → Λ | aB | bA
A → aS | bAA
B → bS | aBB
(Note: if A were the start variable, it would generate L_a)
• Theorem 4.9: If L1 and L2 are CFLs over Σ, then so are L1 ∪ L2, L1L2, and L1*
• Suppose G1 and G2 are CFGs that generate L1 and L2 respectively, and assume that they have no variables in common
• Suppose that S1 and S2 are the start variables. Su, Sc, and Sk, the start variables of the new grammars, will be new variables.
  – Gu just adds the rules Su → S1 | S2 to G1 and G2
  – Gc just adds the rule Sc → S1S2 to G1 and G2
  – Gk just adds the rules Sk → Λ | SkS1 to G1

Regular Languages and Regular Grammars
• The three operations in Theorem 4.9 are the ones involved in the recursive definition of regular languages
• The “basic” regular languages over Σ, ∅ and {σ} for each σ ∈ Σ, are easily seen to be CFLs
• Now we can prove by structural induction that every regular language over Σ is a CFL
• In fact, however, the CFG can be of a simpler form.
• Definition 4.13: A context-free grammar is regular if every production is of the form A → σB or A → Λ
• Theorem 4.14: For every language L ⊆ Σ*, L is regular if and only if L = L(G) for some regular grammar G
• Proof:
  – If L is a regular language, then there is an FA M = (Q, Σ, q0, A, δ) that accepts it
  – Define G = (V, Σ, S, P) by letting V be Q, S the initial state q0, and P the set containing the production T → aU for every transition δ(T, a) = U in M and the production T → Λ for every accepting state T of M
• G is a regular grammar, and G generates the language that M accepts
  – For every x = a1a2…an, the transitions on these symbols that start at q0 end at an accepting state if and only if there is a derivation of x in G
• To prove the other direction we can start with a regular grammar G and reverse the construction to produce M
  – M may be an NFA, but it still accepts L(G), and it follows that L(G) is regular

Derivation Trees and Ambiguity
• So far we’ve been interested in what strings a CFG generates
• It is also useful to consider how a string is generated by a CFG
• A derivation may provide information about the structure of a string, and if a string has several possible derivations, one may be more appropriate than another
• We can draw trees to represent derivations
• The root node represents the start variable S
• Any interior node and its children represent a production A → α used in the derivation; the node represents A, and the children, from left to right, represent the symbols in α.
• Each leaf node represents a symbol or Λ
• The string derived is read off from left to right, ignoring Λ’s
• Every derivation has exactly one derivation tree, but a tree can represent more than one derivation
• In a derivation, at each step some production is applied to some occurrence of a variable
• Consider a derivation that starts S ⇒ S + S. We could apply a production to either the first or second of the S’s, but the resulting trees would be the same
• When we talk about a string having several possible derivations, one being more appropriate, we are talking about derivations corresponding to different trees
• We can distinguish between trivially different derivations and essentially different ones by specifying that in a derivation, we always choose the leftmost variable to expand
• Definition 4.16: A derivation in a CFG is a leftmost derivation (LMD) if, at each step, a production is applied to the leftmost variable-occurrence in the current string
  – A rightmost derivation (RMD) is defined similarly
• Theorem 4.17: If G is a CFG, then for any x ∈ L(G) these three statements are equivalent:
  – x has more than one derivation tree
  – x has more than one LMD
  – x has more than one RMD
• Proof: see book
• Definition 4.18: A CFG G is ambiguous if, for at least one x ∈ L(G), x has more than one derivation tree (or equivalently, according to Theorem 4.17, more than one LMD)
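Definition 4.18 can be checked by hand on small cases by counting derivation trees. The sketch below is an illustration under stated assumptions, not a general ambiguity test: it is hard-wired to the hypothetical two-production grammar S → S + S | a (chosen to match the step S ⇒ S + S mentioned above), and count_trees is a made-up helper name. A tree for a string is either the single production S → a, or a choice of a top-level + together with a tree for each side.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(s):
    """Number of derivation trees for s in the grammar S -> S + S | a."""
    total = 1 if s == "a" else 0      # production S -> a
    for i, ch in enumerate(s):        # production S -> S + S, rooted at this '+'
        if ch == "+":
            total += count_trees(s[:i]) * count_trees(s[i + 1:])
    return total

print(count_trees("a+a"))    # 1
print(count_trees("a+a+a"))  # 2, so this grammar is ambiguous
```

The two trees for a+a+a correspond to the groupings (a+a)+a and a+(a+a); by Theorem 4.17 the string also has two leftmost derivations.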
• A classic example of ambiguity is the dangling else
• In C, an if-statement can be defined by
S → if ( E ) S | if ( E ) S else S | OS
(where OS stands for “other statement”)
• Consider the statement
if (e1) if (e2) f(); else g();
  – In C, the else should belong to the second if, but this grammar does not rule out the other interpretation
• The two derivation trees shown on the next slide show the two interpretations of a dangling else
[Figure: the two derivation trees for the statement above, one for each interpretation]
• Clearly the grammar given is ambiguous, but there are equivalent grammars that allow only the correct interpretation
• Example:
S → S1 | S2
S1 → if ( E ) S1 else S1 | OS
S2 → if ( E ) S | if ( E ) S1 else S2
• Consider the CFG G:
S → S + S | S * S | (S) | a
• G generates simple algebraic expressions
• One reason for ambiguity is that the relative precedence of + and * hasn’t been specified: a+a*a could be interpreted as (a+a)*a or as a+(a*a)
• In fact, S → S + S causes ambiguity by itself, because a+a+a could be interpreted as either (a+a)+a or a+(a+a). Similarly for S → S * S
• We might try to correct both problems by using the productions
S → S + T | T
T → T * F | F
(think of T as “term” and F as “factor”)
• * now has higher precedence than + (all the multiplications are performed within a term)
• By making the production S → S + T, not S → T + S, we make + associate to the left. Similarly for *
• We want parenthetical expressions to be evaluated first; this means we should consider such an expression to be part of a factor.
The resulting unambiguous CFG generating L(G) is
S → S + T | T
T → T * F | F
F → (S) | a
(proofs of unambiguity and equivalence are both somewhat complicated)

Simplified Forms and Normal Forms
• Questions about the strings generated by a CFG are sometimes easier to answer if we know something about the form of the productions
  – For example, if we know that a grammar has no Λ-productions and no unit productions (A → B), we can deduce that no derivation of a string x can take more than 2|x| - 1 steps (see book for details). We could then, in principle, determine whether x can be derived by considering derivations no longer than this
• We show how to modify an arbitrary CFG to have no productions of either of these types
• Suppose we have the production A → BCDCB, and Λ can be derived from either B or C. If we get rid of Λ-productions, then the steps that replace B and C by Λ will no longer be possible, but we must still be able to get all the same non-null strings from A
• We must retain the production A → BCDCB, but we should add A → CDCB, A → DCB, A → BDCB, and so on
• We will need to know what variables can derive Λ (we will call such a variable a nullable variable)
• Definition 4.26: A recursive definition of the set of nullable variables of G
  – If there is a production A → Λ, then A is nullable
  – If A1, A2, …, Ak are nullable variables and there is a production B → A1A2…Ak, then B is nullable
• This leads immediately to an algorithm for identifying the nullable variables
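The algorithm suggested by Definition 4.26 is a fixed-point iteration: keep applying the two rules until no new variable is added. The sketch below is one possible rendering, not the book's code. The grammar encoding (a dict from each variable to a list of right-hand sides, each a tuple of symbols, with the empty tuple standing for Λ), the function name nullable_variables, and the sample productions for B, C, and D are all assumptions made for illustration.

```python
def nullable_variables(productions):
    """Return the set of nullable variables.

    productions maps each variable to a list of right-hand sides;
    each right-hand side is a tuple of symbols, and () stands for Λ.
    """
    nullable = set()
    changed = True
    while changed:                      # repeat until a full pass adds nothing
        changed = False
        for var, bodies in productions.items():
            if var in nullable:
                continue
            # all(...) over an empty body is True, which covers A -> Λ.
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(var)
                changed = True
    return nullable

# Hypothetical grammar built around the production A -> BCDCB.
G = {
    "A": [("B", "C", "D", "C", "B")],
    "B": [("b",), ()],
    "C": [("c",), ("B", "B")],
    "D": [("d",)],
}
print(sorted(nullable_variables(G)))  # ['B', 'C']
```

Here B is nullable by the first rule and C by the second (via C → BB); A is not, because D cannot derive Λ.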
• Theorem 4.27: For every CFG G = (V, Σ, S, P) the following algorithm produces a CFG G1 = (V, Σ, S, P1) having no Λ-productions for which L(G1) = L(G) - {Λ}
  – Identify the nullable variables in V and initialize P1 to P
  – For every production A → α in P, add to P1 every production obtained by deleting from α one or more variable-occurrences involving a nullable variable
  – Delete every Λ-production from P1, as well as every production of the form A → A
• The procedure we use to eliminate unit productions is similar
• We first identify pairs of variables (A, B) for which A ⇒* B (in this case we call B A-derivable); then for each such pair (A, B) and each nonunit production B → α, we add the production A → α
• Such pairs can be found as follows:
  – If A → B is a production, then B is A-derivable
  – If C is A-derivable and C → B is a production, then B is A-derivable
  – No other variables are A-derivable
• Theorem 4.28: For every CFG G = (V, Σ, S, P) without Λ-productions, the CFG G1 = (V, Σ, S, P1) produced by the following algorithm generates the same language as G and has no unit productions:
  – Initialize P1 to P, and for each A ∈ V, identify the A-derivable variables
  – For every such pair (A, B) and every nonunit production B → α, add the production A → α to P1
  – Delete all unit productions from P1
• Definition 4.29: A CFG is said to be in Chomsky normal form if every production is of one of these two types:
A → BC (where B and C are variables)
A → σ (where σ is a terminal)
• Theorem 4.30: For every context-free grammar G, there is another CFG G1 in Chomsky normal form such that L(G1) = L(G) - {Λ}
• The algorithm on the next slide shows how to generate G1
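The algorithms in Theorems 4.27 and 4.28 can be sketched as follows. This is an illustrative rendering under assumptions, not the book's code: a grammar is a dict from each variable to a list of right-hand sides (tuples of symbols, with () standing for Λ), every variable is assumed to appear as a key, and the function names are hypothetical.

```python
from itertools import product

def nullable_variables(P):
    """Fixed-point computation of the nullable variables."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for A, bodies in P.items():
            if A not in nullable and any(all(s in nullable for s in b) for b in bodies):
                nullable.add(A)
                changed = True
    return nullable

def remove_null_productions(P):
    """Theorem 4.27: a grammar with no Λ-productions for L(G) - {Λ}."""
    nullable = nullable_variables(P)
    P1 = {}
    for A, bodies in P.items():
        new_bodies = set()
        for body in bodies:
            # Each nullable occurrence is either kept or deleted, independently.
            options = [((s,), ()) if s in nullable else ((s,),) for s in body]
            for choice in product(*options):
                new = tuple(s for part in choice for s in part)
                if new and new != (A,):   # drop Λ-productions and A -> A
                    new_bodies.add(new)
        P1[A] = new_bodies
    return P1

def remove_unit_productions(P):
    """Theorem 4.28: assumes P has no Λ-productions."""
    variables = set(P)

    def is_unit(body):
        return len(body) == 1 and body[0] in variables

    P1 = {}
    for A in P:
        # Closure: B is A-derivable if reachable from A through unit productions.
        derivable, stack = set(), [A]
        while stack:
            C = stack.pop()
            for body in P[C]:
                if is_unit(body) and body[0] not in derivable:
                    derivable.add(body[0])
                    stack.append(body[0])
        # A receives its own nonunit productions and those of each derivable B.
        P1[A] = {b for B in derivable | {A} for b in P[B] if not is_unit(b)}
    return P1

print(remove_null_productions({"S": [("a", "S", "b"), ()]}))
```

For S → aSb | Λ the first function yields S → aSb | ab (printed as a set of tuples, whose order may vary). Applied to the unambiguous expression grammar from the previous section, the second function gives S the productions S + T, T * F, (S), and a.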
• The first step is to eliminate Λ-productions and unit productions
• The second step is to introduce for every terminal symbol σ a new variable X_σ and production X_σ → σ
• In every production whose right side has length at least 2, replace every terminal by its new variable (the productions X_σ → σ, and any other production whose right side is a single terminal, are left unchanged)
• Replace a production like A → BACB by the productions A → BY1, Y1 → AY2, Y2 → CB, where Y1 and Y2 are new variables
• The resulting CFG is in Chomsky normal form
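The last three steps above can be sketched in a few lines. This is an illustration under assumptions, not the book's code: the input is assumed to have no Λ-productions and no unit productions already, a grammar is a dict from each variable to a list of right-hand sides (tuples of symbols), and the generated names X_σ and Y1, Y2, … are assumed not to clash with existing variables. The function name to_cnf is hypothetical.

```python
from itertools import count

def to_cnf(P):
    """Convert a grammar without Λ- or unit productions to Chomsky normal form."""
    variables = set(P)
    cnf = {A: set() for A in P}
    fresh = count(1)                       # numbering for the new Y variables

    def terminal_variable(t):
        name = "X_" + t                    # new variable X_σ with X_σ -> σ
        cnf.setdefault(name, set()).add((t,))
        return name

    for A, bodies in P.items():
        for body in bodies:
            if len(body) == 1:             # A -> σ is already in normal form
                cnf[A].add(tuple(body))
                continue
            # Replace every terminal in a longer right side by its variable.
            symbols = [s if s in variables else terminal_variable(s) for s in body]
            head = A
            while len(symbols) > 2:        # split off one symbol at a time
                Y = "Y" + str(next(fresh))
                cnf.setdefault(head, set()).add((symbols[0], Y))
                head, symbols = Y, symbols[1:]
            cnf.setdefault(head, set()).add(tuple(symbols))
    return cnf

example = {"A": [("B", "A", "C", "B")], "B": [("b",)], "C": [("c",)]}
result = to_cnf(example)
print(result["A"], result["Y1"], result["Y2"])
# {('B', 'Y1')} {('A', 'Y2')} {('C', 'B')}
```

The example reproduces the replacement of A → BACB by A → BY1, Y1 → AY2, Y2 → CB described above.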