Section 5.2 Parsing and Ambiguity

Rather than just deriving strings from grammars, we also want to be able to determine, given a grammar G and a string w, whether w ∈ L(G). An algorithm that can tell us whether w is in L(G) is called a membership algorithm. The term parsing describes finding a sequence of productions by which a string w ∈ L(G) is derived.

One way to decide whether a string w is in L(G) is to try all possible derivations. Obviously, the first rule applied must have S on its left-hand side, so we begin with one sentential form for the right-hand side of each S rule. If a sentential form is not w, the next step is to replace a variable in it by the right-hand side of a rule for that variable. Clearly we must try all possible ways of replacing that variable, leading to several more sentential forms, each of which must be expanded in turn until we have derived w or determined that it is not possible to do so. If w ∈ L(G) then its derivation is finite, so eventually we will be able to answer whether w ∈ L(G). (Note: there is another algorithm that allows us to do this a bit more efficiently, but we can't discuss it until we look at some standard forms for grammars.) This process is called exhaustive search parsing, which is a form of top-down parsing. It may also be thought of as building a derivation tree from the root down. See Example 5.7 in the text for more on how this works.

Aside from being very inefficient, we may run into the problem of going into an infinite loop for a string that is not in the language. As discussed in the text, if the grammar has no productions of the form A → λ or A → B, then the method described above always terminates. This is Theorem 5.2 in the text. To see why this works, notice that each step in a derivation increases either the length of the sentential form or the number of terminals in it. Since neither the length of the sentential form nor the number of terminals can exceed |w|, the derivation requires at most 2|w| steps. (Remember that no variable can be replaced by λ, so in the worst case we could get a string of variables of length |w|, and each of those variables would then be replaced by a single terminal symbol.) To see how inefficient this may be, look at an upper bound on the total number of sentential forms: |P| + |P|^2 + ... + |P|^(2|w|).
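To make the exhaustive search concrete, here is a small Python sketch (my own illustration, not from the text; the convention that upper-case letters are variables and lower-case letters are terminals, and the names used, are assumptions of the sketch). It expands sentential forms round by round, always rewriting the leftmost variable, and uses the argument above for termination: with no λ- or unit-productions a sentential form never shrinks, so anything longer than |w| can be discarded, and 2|w| rounds suffice.

    # Exhaustive search parsing: a sketch, assuming upper-case letters are
    # variables and lower-case letters are terminals, and that the grammar
    # has no lambda-productions (A -> "") and no unit productions (A -> B).

    def exhaustive_parse(productions, start, w):
        """Return True if w can be derived from start, else False."""
        forms = {start}                       # sentential forms so far
        for _ in range(2 * len(w) + 1):       # Theorem 5.2: 2|w| rounds suffice
            if w in forms:
                return True
            expanded = set()
            for form in forms:
                i = next((k for k, c in enumerate(form) if c.isupper()), None)
                if i is None:
                    continue                  # terminal string != w: dead end
                for rhs in productions[form[i]]:
                    new = form[:i] + rhs + form[i + 1:]
                    if len(new) <= len(w):    # forms never shrink, so prune
                        expanded.add(new)
            forms = expanded
        return False

    # The grammar S -> aSb | ab for {a^n b^n | n >= 1}:
    g = {"S": ["aSb", "ab"]}
    print(exhaustive_parse(g, "S", "aaabbb"))   # True
    print(exhaustive_parse(g, "S", "aab"))      # False

The pruning is exactly what Theorem 5.2 justifies; without the no-λ, no-unit restriction the search could run forever on strings that are not in the language.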
As Theorem 5.3 indicates, we can do much better than exhaustive search. The theorem is stated below without proof, since we haven't yet developed a standard form for context-free grammars that would enable us to prove it.

Theorem 5.3: For every context-free grammar there exists an algorithm that parses any w ∈ L(G) in a number of steps proportional to |w|^3.

Ideally we would like the time to parse a string w to be proportional to |w|, i.e. a linear-time parsing algorithm. The following definition describes a special class of grammars for which linear-time parsing is possible.

Definition 5.4: A context-free grammar G = (V, T, S, P) is said to be a simple grammar or s-grammar if all of its productions are of the form A → ax, where A ∈ V, a ∈ T, x ∈ V*, and any pair (A, a) occurs at most once in P. (This means the grammar has no λ-productions, since the right-hand side of every rule begins with a terminal symbol, and, if variable A is on the left-hand side of a production, the right-hand side of at most one A production begins with a.)

If G is an s-grammar, then any string w in L(G) can be parsed with an effort proportional to |w|. This holds because each step must produce a terminal symbol, so at most |w| steps are needed. Moreover, it is immediately clear which production must be used to substitute for the leftmost variable, since we know the string we are trying to derive and there is at most one rule for each pair (A, a).
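Here is a minimal sketch of that linear-time procedure (again my own illustration; storing each production A → ax under the key (A, a) is an assumption of the encoding). The variables of the current sentential form sit on a stack with the leftmost variable on top, and each input symbol forces exactly one production.

    # s-grammar parsing: each production A -> ax is stored under (A, a),
    # with x a (possibly empty) string of variables.  Since any pair (A, a)
    # occurs at most once, every step is forced by the next input symbol.

    def s_grammar_parse(rules, start, w):
        """Return the list of productions used, or None if w is rejected."""
        stack = [start]                  # unexpanded variables, leftmost on top
        used = []
        for a in w:
            if not stack:
                return None              # input left over but no variables
            A = stack.pop()
            if (A, a) not in rules:
                return None              # no production A -> a... exists
            x = rules[(A, a)]
            used.append(f"{A} -> {a}{x}")
            stack.extend(reversed(x))    # keep the leftmost variable on top
        return used if not stack else None

    # A small s-grammar, S -> aS | bSS | c:
    g = {("S", "a"): "S", ("S", "b"): "SS", ("S", "c"): ""}
    print(s_grammar_parse(g, "S", "abcc"))
    # ['S -> aS', 'S -> bSS', 'S -> c', 'S -> c']

Each input symbol costs one table lookup plus a bounded amount of stack work, which is where the |w| bound comes from.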
Many features of programming languages can be described by s-grammars. Let's look back at some of the grammars we've seen before, beginning with the grammar for L = {a^n b^n | n ≥ 1}:

S → aSb | ab

Note that this is not an s-grammar, since Sb is not in V* (i.e. it does not consist only of variables) and since we have two productions for the pair (S, a). However, we can make this into an s-grammar using the following productions:

S → aA, A → aAB | b, B → b

(Check: A derives the strings a^k b^(k+1) for k ≥ 0, so S derives exactly the strings a^(k+1) b^(k+1), i.e. a^n b^n with n ≥ 1.)

Consider what would happen if we wanted to construct an s-grammar for {a^n b^n | n ≥ 0}: S → aSB | λ, B → b. Can this also be modified? No, because the definition of s-grammar implies that λ ∉ L(G) and that the grammar has no λ-rules.

Now let's consider the grammar for odd-length palindromes: S → aSa | bSb | a | b. We might try the same kind of modification, S → aSA | bSB with A → a and B → b, but now there is no way to terminate a derivation: we would also need S → a and S → b, and that would give us two (S, a) rules and two (S, b) rules, which the definition does not allow. The even-length palindromes fare no better. A grammar for all of them would need a λ-rule to stop a derivation, since λ itself is an even-length palindrome. Suppose instead we try to get an s-grammar for the even-length palindromes of length 2 or more, using S → aSA | bSB, A → a, and B → b. To terminate a derivation we would need S → aA or S → bB, and that again gives us two (S, a) rules and two (S, b) rules, which is not allowed.

Ambiguity in grammars and languages

If a grammar has two different leftmost (or rightmost) derivations for some string, then the grammar is said to be ambiguous. A language is inherently ambiguous if every grammar that generates it is ambiguous. (In general, this is very difficult to prove.) Recall that in a previous example, for the language of strings with an equal number of a's and b's (S → aSbS | bSaS | λ), we had two leftmost derivations for the string abab. Thus that grammar is ambiguous.

In programming languages, ambiguity would mean that some statements have more than one interpretation. This is clearly something to be avoided, since we need uniqueness to guarantee that the code produced by a compiler is the same no matter which compiler we use or how often we compile the same code. That is, without uniqueness we could conceivably obtain two different object files from the same program. Unlike with DFA minimization, there is no algorithm to remove ambiguity from a grammar. However, it is sometimes possible to rewrite the grammar in a way that is not ambiguous. Let's look at a couple of examples of removing ambiguity.

Example 1: S → bS | Sb | a

Clearly the corresponding language is b*ab*. The grammar is ambiguous because it is possible to generate the b's in any order (notice that the right-hand side of the rule S → Sb does not begin with a terminal symbol). For example, to generate the string bab we could use either of these two derivations:

S ⇒ bS ⇒ bSb ⇒ bab
S ⇒ Sb ⇒ bSb ⇒ bab

To eliminate the ambiguity, we can guarantee that the b's are generated in a particular order. One way to do this is to generate the leading b's first, then the a, and then the trailing b's. There are two straightforward ways to do this:

    S → bS | aA          S → bS | A
    A → bA | λ           A → Ab | a

We'll stick with the one on the left, since it is more similar to an s-grammar. Looking at the left grammar, it should be clear that to generate b^n a b^m we must use the first production n times, then the aA production, and finally the A → bA production m times:

S ⇒* b^n S ⇒ b^n aA ⇒* b^n a b^m A ⇒ b^n a b^m

Example 2: Consider the following grammar:

S → aS | aSbS | λ

What's the language? It is the set of strings of a's and b's in which every prefix contains at least as many a's as b's (in particular, every nonempty string in the language begins with a and has at least as many a's as b's overall). The source of the ambiguity is that we have two S productions that begin in the same way: one balances a's against b's and the other just produces a's.

S ⇒ aS ⇒ aaSbS ⇒ aabS ⇒ aab    and    S ⇒ aSbS ⇒ aaSbS ⇒ aabS ⇒ aab

are two distinct leftmost derivations of the string aab. One way to get rid of the ambiguity is to introduce another nonterminal T that generates only balanced a's and b's; that is, for every b we generate a matching a at the same time. Any unmatched a's must then come from the production S → aS. Here are the productions:

S → aS | aTbS | λ
T → aTbT | λ

Now let's look at the derivation of aab again. The derivation S ⇒ aS ⇒ aaTbS ⇒ aabS ⇒ aab uses the S → aS rule first. If we instead try to use S → aTbS first, notice that the only way to get the second a is to use the T → aTbT rule. But that forces us to produce a second b, so we cannot produce the string aab in that case.

Obviously, in a programming language we want to avoid ambiguity. One place we have all seen ambiguity is in algebraic expressions. Read Example 5.11 on page 141 and the modified unambiguous grammar in Example 5.12 in the text.

If every grammar that generates a language L is ambiguous, then L is said to be an inherently ambiguous language. Look at Example 5.13 in the text. The claim there is that the language L = {a^n b^n c^m} ∪ {a^n b^m c^m} is inherently ambiguous. Here's another example. Note that we are not proving the language is inherently ambiguous, but instead giving a rationale to justify the claim.

Claim: L = {a^i b^j c^k | i = j or j = k} is inherently ambiguous. Intuition: look at strings of the form a^i b^i c^i. Two sets of productions are needed, one to match a's with b's and the other to match b's with c's. Here is one grammar for the language; the first S rule matches b's and c's and the second matches a's and b's:

S → AB | CD
A → aA | λ
B → bBc | λ     (generates b's and c's in pairs)
C → aCb | λ     (generates a's and b's in pairs)
D → cD | λ
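Ambiguity in the examples above always surfaced as a second leftmost derivation, and for small grammars like these we can count leftmost derivations mechanically. The sketch below is my own illustration (same upper-case/lower-case convention as before). It terminates on the grammars of this section because every recursive production adds at least one terminal, but it is only a demonstration tool: there is no general algorithm for deciding whether a grammar is ambiguous.

    # Count the distinct leftmost derivations of w.  Each one corresponds
    # to a distinct derivation tree, so a count above 1 proves ambiguity.

    def count_leftmost(productions, start, w):
        def count(form):
            i = next((k for k, c in enumerate(form) if c.isupper()), None)
            if i is None:                     # no variables left
                return 1 if form == w else 0
            if not w.startswith(form[:i]):    # terminal prefix must match w
                return 0
            if sum(c.islower() for c in form) > len(w):
                return 0                      # already too many terminals
            return sum(count(form[:i] + rhs + form[i + 1:])
                       for rhs in productions[form[i]])
        return count(start)

    g_equal = {"S": ["aSbS", "bSaS", ""]}                   # ambiguous
    g_fixed = {"S": ["aS", "aTbS", ""], "T": ["aTbT", ""]}  # Example 2, fixed
    g_claim = {"S": ["AB", "CD"], "A": ["aA", ""], "B": ["bBc", ""],
               "C": ["aCb", ""], "D": ["cD", ""]}           # i = j or j = k
    print(count_leftmost(g_equal, "S", "abab"))   # 2
    print(count_leftmost(g_fixed, "S", "aab"))    # 1
    print(count_leftmost(g_claim, "S", "abc"))    # 2: once via AB, once via CD

For a string of the form a^n b^n c^n the last grammar always has one derivation through AB and one through CD, which is exactly the intuition behind the inherent-ambiguity claim.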
Context-free grammars and programming languages

Perhaps the most basic application of CFGs is to describe programming languages. In particular, a parser is the part of a compiler which "knows" the rules of the grammar and determines whether a program has the correct syntax. Basically, the parser builds a parse tree. In a similar manner, the Document Type Definition in XML is in effect a CFG that describes tags and the ways in which they may be nested.

As we saw earlier, in determining whether an input string conforms to the rules of a grammar, a top-down parser constructs derivations by applying rules to the leftmost variable in a sentential form; it starts with S and works until a terminal string is obtained. There are also bottom-up parsers, which effectively do the reverse: beginning with the string, they repeatedly reduce it until the symbol S is obtained. That is, a bottom-up parser looks for the right-hand side of a production and replaces it with the variable on the left-hand side. This corresponds to a rightmost derivation run in reverse. Here's an example of bottom-up parsing.

V = {S, A, T}, T = {b, +, (, )}

P:  1. S → A
    2. A → T
    3. A → A + T
    4. T → b
    5. T → (A)

Let's consider the following expression: (b) + b

    String      Rule    Replacement
    (b) + b     --      (the input)
    (T) + b     4       replace b by T
    (A) + b     2       replace T by A
    T + b       5       replace (A) by T
    A + b       2       replace T by A
    A + T       4       replace b by T
    A           3       replace A + T by A
    S           1       replace A by S

Thus we started with a string, replaced right-hand sides of productions by their left-hand sides, and ended up with the start symbol S.

Obviously, it can be very time consuming to write a parser by hand for a substantial grammar, so there are automatic ways to construct a parser for a given grammar. One is YACC, a command on UNIX systems. (It stands for Yet Another Compiler Compiler.) Basically, for each production there is an associated action, which is C code that is executed when a node of the parse tree corresponding to that production is created.
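The reduction sequence above can be reproduced mechanically. Here is a toy bottom-up recognizer (my own sketch; it is emphatically not how YACC works, since YACC precomputes LALR parse tables rather than searching). It looks for the right-hand side of a production in the current string, replaces it with the left-hand side, and backtracks when a reduction leads to a dead end. The reduction order it finds may differ from the one in the table, but every successful sequence ends in S.

    # A toy bottom-up recognizer for the expression grammar above: an
    # executable version of "replace a right-hand side by its left-hand
    # side until only S remains", with backtracking over the choices.

    RULES = [("S", "A"), ("A", "T"), ("A", "A+T"), ("T", "b"), ("T", "(A)")]

    def reduce_to_start(s, trace=()):
        """Return one successful list of reduction steps, or None."""
        if s == "S":
            return list(trace)
        for lhs, rhs in RULES:
            i = s.find(rhs)
            while i != -1:                    # try every occurrence of rhs
                step = f"{s}: replace {rhs} by {lhs}"
                result = reduce_to_start(s[:i] + lhs + s[i + len(rhs):],
                                         trace + (step,))
                if result is not None:
                    return result
                i = s.find(rhs, i + 1)        # dead end: try the next match
        return None

    for step in reduce_to_start("(b)+b"):
        print(step)

For a grammar this small, exhaustive reduction is fine; real parser generators avoid the search entirely by deciding each reduction from a precomputed table.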