Ambiguity in grammars and languages

Section 5.2 Parsing and Ambiguity
Rather than just deriving strings from grammars, we also want to be able to determine, given a
grammar G and a string w, whether w ∈ L(G). An algorithm that can tell us whether w is in
L(G) is a membership algorithm. The term parsing describes finding a sequence of
productions by which a string w ∈ L(G) is derived.
One way to decide whether a string w is in L(G) is to try all possible derivations. Obviously,
the first rule to be applied must have an S on the left-hand side, so we obtain one sentential
form corresponding to the right-hand side of each S rule. If a sentential form is not w, the next
step is to replace a variable in the sentential form by the right-hand side of a rule for that
variable. Clearly we must try all possible ways of replacing that variable, leading to several
more sentential forms, each of which must be expanded until we have derived w or
determined it is not possible to do so. If w ∈ L(G) then it has a finite derivation, so eventually
we will be able to answer the question whether w ∈ L(G). Note: there is another algorithm
that allows us to do this a bit more efficiently, but we can’t discuss that until we look at some
standard forms for grammars. This process may be called exhaustive search parsing, which is
a form of top-down parsing. It may also be thought of as building a derivation tree from the
root down. See Example 5.7 in the text for more on how this works. Aside from being very
inefficient, we may run into the problem of going into an infinite loop for a string that is not in
the language.
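To make the idea concrete, here is a minimal sketch of exhaustive search parsing in Python
(mine, not the text’s). It assumes, as Theorem 5.2 below will justify, a grammar with no
λ-productions and no unit productions, so the search can safely stop after 2|w| rounds;
variables are written as upper-case letters.

    def exhaustive_parse(w, rules, start="S"):
        """Exhaustive search parsing, round by round (breadth first).

        Assumes no lambda-productions and no unit productions, so any
        derivation of w has at most 2*len(w) steps and the search stops.
        Returns one derivation (a tuple of sentential forms) or None.
        """
        forms = {start: (start,)}        # sentential form -> derivation so far
        for _ in range(2 * len(w)):
            next_forms = {}
            for form, deriv in forms.items():
                # Expand the leftmost variable in every possible way.
                i = next((k for k, c in enumerate(form) if c.isupper()), None)
                if i is None:
                    continue             # all terminals but not w: dead end
                for rhs in rules[form[i]]:
                    new = form[:i] + rhs + form[i + 1:]
                    if new == w:
                        return deriv + (new,)
                    # Prune forms that can no longer lead to w.
                    if new[:i] == w[:i] and len(new) <= len(w):
                        next_forms.setdefault(new, deriv + (new,))
            forms = next_forms
        return None

    # The grammar S -> aSb | ab for {a^n b^n | n >= 1}:
    print(exhaustive_parse("aaabbb", {"S": ["aSb", "ab"]}))
    # ('S', 'aSb', 'aaSbb', 'aaabbb')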
As discussed in the text, if the grammar has no productions of the form A → λ or A → B, then
the method described above always terminates. This is Theorem 5.2 in the text. To see why
this works, notice that each step in a derivation increases either the length of the sentential
form or the number of terminals in the sentential form. Since neither the length of the
sentential form nor the number of terminals can exceed |w|, the derivation requires at most
2|w| steps. (Remember that no variable can be replaced by λ, so in the worst case we could
get a string of |w| variables, and each of those variables would be replaced by a single
terminal symbol.) To see how inefficient this may be, look at an upper bound on the total
number of sentential forms that might be examined: |P| + |P|² + … + |P|^(2|w|). Even for a
grammar with only |P| = 2 productions and a string of length |w| = 5, this bound is
2 + 2² + … + 2¹⁰ = 2046 sentential forms.
As Theorem 5.3 indicates, we can parse or derive w more efficiently than this. The theorem
is stated below without proof since we haven’t yet developed a standard form for context-free
grammars that will enable us to show this.
Theorem 5.3: For every context-free grammar there exists an algorithm that parses any
w ∈ L(G) in a number of steps proportional to |w|³.
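The standard example of such a cubic-time algorithm is the CYK (Cocke-Younger-Kasami)
algorithm, which requires the grammar to be in Chomsky normal form, one of the standard
forms alluded to above. As a preview, here is a minimal sketch; the CNF grammar for
{aⁿbⁿ | n ≥ 1} used in the demo is one I supply for illustration.

    def cyk(w, rules, start="S"):
        """CYK membership test for a grammar in Chomsky normal form.

        `rules` is a list of (A, rhs) pairs, where rhs is either a single
        terminal such as "a" or a pair of variables such as ("B", "C").
        """
        n = len(w)
        # table[i][j] = set of variables that derive the substring w[i..j]
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, c in enumerate(w):
            table[i][i] = {A for A, rhs in rules if rhs == c}
        for span in range(2, n + 1):             # substring length
            for i in range(n - span + 1):
                j = i + span - 1
                for k in range(i, j):            # split point
                    for A, rhs in rules:
                        if (isinstance(rhs, tuple)
                                and rhs[0] in table[i][k]
                                and rhs[1] in table[k + 1][j]):
                            table[i][j].add(A)
        return start in table[0][n - 1]

    # A CNF grammar for {a^n b^n | n >= 1}:
    # S -> AX | AB, X -> SB, A -> a, B -> b
    RULES = [("S", ("A", "X")), ("S", ("A", "B")), ("X", ("S", "B")),
             ("A", "a"), ("B", "b")]
    print(cyk("aabb", RULES))   # True
    print(cyk("abab", RULES))   # False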
Ideally we would like the time to parse a string w to be proportional to |w|, i.e., a linear-time
parsing algorithm. The following definition describes a special class of grammars for which
linear-time parsing is possible.
Definition 5.4: A context-free grammar G = (V, T, S, P) is said to be a simple grammar or
s-grammar if all of its productions are of the form A → ax, where A ∈ V, a ∈ T, x ∈ V*, and
any pair (A, a) occurs at most once in P. (This means the grammar has no λ-productions,
since the right-hand side of every rule begins with a terminal symbol, and, if variable A is on
the left-hand side of a production, the right-hand side of at most one A production begins
with a.)
If G is an s-grammar then any string w in L(G) can be parsed with an effort proportional to |w|.
This holds because each step must produce a terminal symbol, and thus at most |w| steps are
needed. Moreover, it is immediately clear which production must be used to substitute for the
leftmost variable: we know the string we’re trying to derive, and there is at most one
production for each pair (A, a). Many features of programming languages can be described
by s-grammars.
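As a sketch (mine, not the text’s), this deterministic parse is just a table lookup keyed by
(leftmost variable, next input symbol), consuming one input symbol per step. The demo uses
the s-grammar for {aⁿbⁿ | n ≥ 1} constructed in the next paragraph.

    def s_parse(w, rules, start="S"):
        """Deterministic O(|w|) parse with an s-grammar.

        `rules` maps a pair (variable, terminal) to the unique
        right-hand side a x, where x is a string of variables.
        """
        stack = [start]                      # symbols still to match, top last
        for c in w:
            if stack and stack[-1].isupper():
                rhs = rules.get((stack.pop(), c))
                if rhs is None:
                    return False             # no production fits: reject
                stack.extend(reversed(rhs))  # rhs begins with the terminal c
            if not stack or stack.pop() != c:
                return False
        return not stack                     # accept iff nothing is left over

    # s-grammar for {a^n b^n | n >= 1}: S -> aA, A -> aAB | b, B -> b
    RULES = {("S", "a"): "aA", ("A", "a"): "aAB",
             ("A", "b"): "b", ("B", "b"): "b"}
    print(s_parse("aaabbb", RULES))  # True
    print(s_parse("aabbb", RULES))   # False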
Let’s look back at some of the grammars we’ve seen before, beginning with the productions of
the grammar for L = {aⁿbⁿ | n ≥ 1}: S → aSb | ab. Note that this is not an s-grammar, since Sb
is not in V* (i.e. it doesn’t contain only variables) and since we have two productions for the
pair (S, a). However, we can make this into an s-grammar using the following productions:
S → aA, A → aAB | b, B → b. (Here A generates aᵏbᵏ⁺¹, so S generates aᵏ⁺¹bᵏ⁺¹.) Consider
what would happen if we wanted to construct an s-grammar for {aⁿbⁿ | n ≥ 0}:
S → aSB | λ, B → b. Can this also be modified? No, because the definition of s-grammar
implies that the grammar has no λ-rules and hence that λ ∉ L(G).
Now, let’s consider the grammar for odd-length palindromes: S → aSa | bSb | a | b. It is
tempting to try a similar modification: S → aSA | bSB, A → a, B → b. But this fails: every S
production reintroduces S, so no derivation can ever terminate. To terminate a derivation we
would also need S → a and S → b, and that would give us two (S, a) rules and two (S, b)
rules, which is not allowed. The even-length palindromes fare no better: a grammar for them
would have to have a λ-rule to stop a derivation, and even if we restrict attention to
even-length palindromes of length 2 or more and use S → aSA | bSB, A → a, and B → b,
terminating a derivation would require S → aA or S → bB, again giving two (S, a) rules and
two (S, b) rules. Intuitively, parsing a palindrome requires guessing where the middle of the
string is, and the forced, symbol-by-symbol choices of an s-grammar cannot make such a
guess.
Ambiguity in grammars and languages
If a grammar has two different leftmost (or rightmost) derivations for some string, then the
grammar is said to be ambiguous. A language is inherently ambiguous if every grammar that
generates it is ambiguous. (In general, this is very difficult to prove.)
Recall that in a previous example for strings with an equal number of a’s and b’s
(S → aSbS | bSaS | λ) we had two leftmost derivations for the string abab. Thus, that grammar
is ambiguous.
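For short strings, ambiguity can be checked by brute force: enumerate all leftmost derivations
and count how many produce the string. Here is a small sketch of that idea (mine, not an
algorithm from the text); variables are upper-case letters and λ is encoded as the empty
string. It prunes any sentential form whose terminal prefix already disagrees with w or that
contains more terminals than w does.

    def count_leftmost(w, rules, start="S", max_steps=20):
        """Count the distinct leftmost derivations of w under `rules`."""
        count = 0
        stack = [(start, 0)]              # (sentential form, steps used)
        while stack:
            form, steps = stack.pop()
            if form == w:
                count += 1
                continue
            if steps == max_steps:
                continue                  # safety bound
            # Locate the leftmost variable.
            i = next((k for k, c in enumerate(form) if c.isupper()), None)
            if i is None:
                continue                  # all terminals, but not w
            # Prune forms that can no longer derive w.
            if form[:i] != w[:i] or sum(c.islower() for c in form) > len(w):
                continue
            for rhs in rules[form[i]]:
                stack.append((form[:i] + rhs + form[i + 1:], steps + 1))
        return count

    # S -> aSbS | bSaS | lambda: abab has two leftmost derivations.
    print(count_leftmost("abab", {"S": ["aSbS", "bSaS", ""]}))  # 2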
In programming languages, ambiguity would mean that some statements could have more
than one interpretation. This is clearly something to be avoided, since we need uniqueness to
guarantee that the code produced by a compiler means the same thing no matter which
compiler we use or how often we compile the same code. That is, without uniqueness, we
could conceivably obtain two different object files from the same program. Unlike DFA
minimization, there is no general algorithm to remove ambiguity from a grammar. However, it
is sometimes possible to rewrite the grammar in a way that is not ambiguous.
Let's look at a couple of examples of removing ambiguity.
Example 1: S → bS | Sb | a
Clearly the corresponding language is b*ab*. The grammar is ambiguous because the b’s can
be generated in any order (note that the right-hand side of the rule S → Sb does not begin
with a terminal symbol). For example, to generate the string bab we could use either of these
two derivations:
S ⇒ bS ⇒ bSb ⇒ bab
S ⇒ Sb ⇒ bSb ⇒ bab
To eliminate ambiguity, we can guarantee that the b’s are generated in a particular order.
One way to do this is to generate the leading b’s first, then the a, and then the final b’s. There
are two straightforward ways to do this:

Grammar 1: S → bS | aA, A → bA | λ
Grammar 2: S → bS | A, A → Ab | a

We’ll stick with Grammar 1 since it is more similar to an s-grammar. Looking at Grammar 1, it
should be clear that to generate bⁿabᵐ we must use the production S → bS exactly n times,
then S → aA, and finally A → bA exactly m times followed by A → λ:
S ⇒* bⁿS ⇒ bⁿaA ⇒* bⁿabᵐA ⇒ bⁿabᵐ.
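With the count_leftmost sketch from above, the fix can be spot-checked: the original grammar
gives two leftmost derivations of bab, while Grammar 1 gives exactly one.

    print(count_leftmost("bab", {"S": ["bS", "Sb", "a"]}))              # 2
    print(count_leftmost("bab", {"S": ["bS", "aA"], "A": ["bA", ""]}))  # 1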
Example 2: Consider the following grammar:
S  aS | aSbS | 
What’s the language? Strings of a’s and b’s in which every prefix contains at least as many
a’s as b’s; in particular, every non-empty string in the language begins with a and contains at
least as many a’s as b’s overall. The source of the ambiguity is that we have two S
productions that begin in the same way: one balances a’s against b’s and the other just
produces a’s.
S  aS  aaSbS  aabS  aab and
S  aSbS  aaSbS  aabS  aab are two distinct leftmost derivations of the string aab.
One way to get rid of the ambiguity is to introduce another nonterminal T that generates only
balanced a’s and b’s. That is, for every b we will generate an a at the same time. Thus, any
unmatched a’s must come from the production S → aS. Here are the productions:
S → aS | aTbS | λ
T → aTbS | λ
Now, let’s look at the derivation of aab again:
S ⇒ aS ⇒ aaTbS ⇒ aabS ⇒ aab uses the S → aS rule first. If we first try to use
S → aTbS, notice that the only way to get the second a is to use the T → aTbS rule. But
that forces us to produce a second b. Thus, we cannot produce the string aab in this case.
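Again this can be spot-checked with the count_leftmost sketch: the original grammar yields
two leftmost derivations of aab, the rewritten one only one.

    print(count_leftmost("aab", {"S": ["aS", "aSbS", ""]}))                     # 2
    print(count_leftmost("aab", {"S": ["aS", "aTbS", ""], "T": ["aTbS", ""]}))  # 1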
Obviously, in a programming language we want to avoid ambiguity. One place that we have
all seen ambiguity is in algebraic expressions. Read Example 5.11 on page 141 and the
modified unambiguous grammar in Example 5.12 in the text.
If every grammar that generates a language L is ambiguous, then L is said to be an inherently
ambiguous language. Look at Example 5.13 in the text, which claims that the language
L = {aⁿbⁿcᵐ} ∪ {aⁿbᵐcᵐ} is inherently ambiguous. Here’s another example. Note that we are
not proving the language is inherently ambiguous, but instead giving a rationale to justify the
claim.
Claim: L = {aⁱbʲcᵏ | i = j or j = k} is inherently ambiguous. Intuition: two sets of productions
are needed, one to match a’s with b’s and the other to match b’s with c’s, and strings of the
form aⁱbⁱcⁱ satisfy both conditions, so they can be derived either way.
Here’s one grammar for the language. The first S rule matches b’s with c’s, and the second
matches a’s with b’s:
S → AB | CD
A → aA | λ
B → bBc | λ   (generates b’s and c’s in pairs)
C → aCb | λ   (generates a’s and b’s in pairs)
D → cD | λ
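The count_leftmost sketch illustrates the claim on small strings: a string with i = j = k is
derivable through both halves of the grammar, while a string satisfying only one condition has
a single parse.

    RULES = {"S": ["AB", "CD"], "A": ["aA", ""], "B": ["bBc", ""],
             "C": ["aCb", ""], "D": ["cD", ""]}
    print(count_leftmost("abc", RULES))   # 2: one parse via AB, one via CD
    print(count_leftmost("abcc", RULES))  # 1: i = j only, so just the CD parse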
Context-free grammars and programming languages
Perhaps the most basic application of CFGs is to describe programming languages. In
particular, a parser is the part of a compiler which "knows" the rules of the grammar and
determines if a program has the correct syntax. Basically, the parser builds a parse tree. In a
similar manner, the Document Type Definition in XML is in effect a CFG that describes tags
and the ways in which they may be nested.
As we saw earlier, in determining if an input string corresponds to the rules of a grammar, a
top-down parser constructs derivations by applying rules to the leftmost variable in a
sentential form. Thus, it starts with S and works until a terminal string is obtained.
There are also bottom-up parsers, which effectively do the reverse: beginning with the string
itself, they repeatedly reduce it until the start symbol S is obtained. That is, a bottom-up
parser looks for an occurrence of the right-hand side of a production and replaces it with the
variable from the left-hand side. The sequence of reductions, read in reverse, is a rightmost
derivation. Here’s an example of bottom-up parsing.
V = {S, A, T}
T = {b, +, (, )}
P: 1. S → A
   2. A → T
   3. A → A + T
   4. T → b
   5. T → (A)
Let’s consider the following expression: (b) + b
Reduction    Rule    Replacement
(b) + b              (the input string)
(T) + b      4       replace b by T
(A) + b      2       replace T by A
T + b        5       replace (A) by T
A + b        2       replace T by A
A + T        4       replace b by T
A            3       replace A + T by A
S            1       replace A by S
Thus, we started with a string, repeatedly replaced right-hand sides of productions by their
left-hand sides, and ended up with the start symbol S.
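As an illustration (mine, not the text’s), a naive bottom-up parser can be written as a
backtracking search: at each step, try every way of replacing an occurrence of some
right-hand side by its left-hand side, and succeed when only S remains. Real bottom-up
parsers (e.g. LR parsers) use lookahead tables instead of backtracking, and this sketch may
find a different reduction order than the table above.

    # Productions as (lhs, rhs) pairs, numbered 1-5 as above.
    GRAMMAR = [("S", "A"), ("A", "T"), ("A", "A+T"), ("T", "b"), ("T", "(A)")]

    def reduce_to_start(form, trace=()):
        """Naive bottom-up parse: backtracking search over reductions."""
        if form == "S":
            return trace
        for lhs, rhs in GRAMMAR:
            i = form.find(rhs)
            while i != -1:               # try every occurrence of rhs
                step = (form, "replace " + rhs + " by " + lhs)
                found = reduce_to_start(form[:i] + lhs + form[i + len(rhs):],
                                        trace + (step,))
                if found is not None:
                    return found
                i = form.find(rhs, i + 1)
        return None                      # no reduction sequence reaches S

    for form, action in reduce_to_start("(b)+b"):
        print(form, "  ", action)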
Obviously, it can be very time-consuming to write a parser by hand for a substantial grammar,
so there are automatic ways to construct a parser from a given grammar. One is YACC, which
is a command on UNIX systems. (The name stands for Yet Another Compiler-Compiler.)
Basically, for each production there is an associated action, which is C code that is executed
when a node of the parse tree that corresponds to the production is created.