Lecture Five: Context Free Grammar (CFG)
Amjad Ali
CFG, Lecture 5, slide

Definition of Context-Free Grammar

There are four important components in a grammatical description of a language:

1. There is a finite set of symbols that form the strings of the language being defined. This set was {0,1} in the palindrome example we just saw. We call this alphabet the terminals, or terminal symbols.
2. There is a finite set of variables, sometimes also called nonterminals or syntactic categories. Each variable represents a language, i.e., a set of strings. In our example above, there was only one variable, P, which we used to represent the class of palindromes over the alphabet {0,1}.
3. One of the variables represents the language being defined; it is called the start symbol. Other variables represent auxiliary classes of strings that are used to help define the language of the start symbol. In our example, P, the only variable, is the start symbol.
4. There is a finite set of productions or rules that represent the recursive definition of a language. Each production consists of:
   a) A variable that is being (partially) defined by the production. This variable is often called the head of the production.
   b) The production symbol →.
   c) A string of zero or more terminals and variables. This string, called the body of the production, represents one way to form strings in the language of the variable of the head. In so doing, we leave terminals unchanged and substitute for each variable in the body any string that is known to be in the language of that variable.

Alternate Definition of Context-Free Grammar

A context-free grammar (CFG) is a collection of three things:

1. An alphabet Σ of letters called terminals, from which we make the strings that will be the words of a language.
2. A set of symbols called nonterminals, one of which is the symbol S, standing for “start here”.
3.
A finite set of productions of the form: one nonterminal → finite string of terminals and/or nonterminals.

Formal Definition of CFG

A context-free grammar is a 4-tuple (V, Σ, R, S), where
1. V is a finite set called the variables,
2. Σ is a finite set, disjoint from V, called the terminals,
3. R is a finite set of rules, with each rule being a variable and a string of variables and terminals, and
4. S ∈ V is the start variable.

Palindrome Example

The rules that define the palindromes over {0,1}, expressed in the context-free grammar notation, are (^ denotes the null string):
1. P → ^
2. P → 0
3. P → 1
4. P → 0P0
5. P → 1P1

Notation for CFG Derivations

Some conventions used while discussing CFGs:
1. Lower-case letters near the beginning of the alphabet, a, b, and so on, are terminal symbols. Digits and other characters such as + or parentheses can also be used as terminals.
2. Upper-case letters near the beginning of the alphabet, A, B, and so on, are variables.
3. Lower-case letters near the end of the alphabet, such as w or z, are strings of terminals. This convention reminds us that the terminals are analogous to the input symbols of an automaton.
4. Upper-case letters near the end of the alphabet, such as X or Y, are either terminals or variables.
5. Lower-case Greek letters, such as alpha and beta, are strings consisting of terminals and/or variables.

There is no special notation for strings that consist of variables only, since this concept plays no important role. However, a string named alpha or another Greek letter might happen to have only variables.

Example: A more complex CFG that represents (a simplification of) expressions in a typical programming language. The operators are limited to + and *, representing addition and multiplication respectively. The arguments are identifiers, but not the full set of typical identifiers (letters followed by zero or more letters and digits).
The only letters allowed are a and b, and the only digits are 0 and 1. Every identifier begins with a or b, which may be followed by any string in {a, b, 0, 1}*.

Two variables are used in this grammar:
1. E, which represents expressions. It is the start symbol and represents the language of expressions we are defining.
2. I, which represents identifiers.

The productions are:
 1. E → I
 2. E → E + E
 3. E → E * E
 4. E → (E)
 5. I → a
 6. I → b
 7. I → Ia
 8. I → Ib
 9. I → I0
10. I → I1

Consider the string a*(a+b00), which is in the language of the above CFG. One derivation of it is:

E => E * E           Production no. 3
  => I * E           Production no. 1
  => a * E           Production no. 5
  => a * (E)         Production no. 4
  => a * (E + E)     Production no. 2
  => a * (I + E)     Production no. 1
  => a * (a + E)     Production no. 5
  => a * (a + I)     Production no. 1
  => a * (a + I0)    Production no. 9
  => a * (a + I00)   Production no. 9
  => a * (a + b00)   Production no. 6

Leftmost and Rightmost Derivations

Leftmost derivation: In order to restrict the number of choices we have in deriving a string, it is often useful to require that at each step we replace the leftmost variable by one of its production bodies. Such a derivation is called a leftmost derivation.

Rightmost derivation: Alternatively, we can require that at each step we replace the rightmost variable by one of its production bodies. Such a derivation is called a rightmost derivation.

Example: The inference that a*(a+b00) is in the language of variable E can be reflected in a derivation of that string, starting with the string E.
Writing =>_lm for a derivation step that replaces the leftmost variable, the leftmost derivation is:

E =>_lm E * E
  =>_lm I * E
  =>_lm a * E
  =>_lm a * (E)
  =>_lm a * (E + E)
  =>_lm a * (I + E)
  =>_lm a * (a + E)
  =>_lm a * (a + I)
  =>_lm a * (a + I0)
  =>_lm a * (a + I00)
  =>_lm a * (a + b00)

We can summarize the leftmost derivation as E =>*_lm a*(a+b00), or express several steps of it by expressions such as E * E =>*_lm a * (E).

Writing =>_rm for a derivation step that replaces the rightmost variable, the rightmost derivation is:

E =>_rm E * E
  =>_rm E * (E)
  =>_rm E * (E + E)
  =>_rm E * (E + I)
  =>_rm E * (E + I0)
  =>_rm E * (E + I00)
  =>_rm E * (E + b00)
  =>_rm E * (I + b00)
  =>_rm E * (a + b00)
  =>_rm I * (a + b00)
  =>_rm a * (a + b00)

So the rightmost derivation can be summarized as E =>*_rm a*(a+b00).

Inference, Derivations and Parse Trees

For a grammar with variable A and a terminal string w, the following five statements are equivalent:
I. The recursive inference procedure determines that terminal string w is in the language of variable A.
II. A =>* w.
III. A =>*_lm w.
IV. A =>*_rm w.
V. There is a parse tree with root A and yield w.

Some Examples:

Example#1: Let the terminal be a and the nonterminal be S, and the productions be
S → aS
S → ^
The language generated is a*. To derive a^6 in this CFG, the following derivation is used:
S => aS => aaS => aaaS => aaaaS => aaaaaS => aaaaaaS => aaaaaa^ = aaaaaa
Notice:
i. → means “can be replaced by”, as in S → aS.
ii. => means “can develop into”, as in aaS => aaaS.

Example#2: Let the terminals be a and b and the only nonterminal be S, and the productions be
S → aS
S → bS
S → a
S → b
The language generated by this CFG is the set of all possible strings of the letters a and b except for the null string, which we cannot generate. To produce the string baab, the following derivation is used:
S => bS => baS => baaS => baab

Example#3: Let the terminals be a and b, the only nonterminal be S, and the productions be
S → aS
S → bS
S → a
S → b
S → ^
The word ab can be generated by the derivation
S => aS => abS => ab^ = ab
or by the derivation
S => aS => ab
The language of this CFG is (a+b)*, but the sequence of productions that is used to generate a specific word is not unique. The third and fourth productions are redundant.

Example#4: Let the terminals be a and b, the nonterminals be S and X, and the productions be
S → XaaX
X → aX
X → bX
X → ^
The words generated from S have the form anything aa anything, or (a+b)*aa(a+b)*, which is the language of all words with a double a in them somewhere. For example, to generate baabaab, we can proceed as follows:
S => XaaX => bXaaX => baXaaX => baaXaaX => baabXaaX => baab^aaX = baabaaX => baabaabX => baabaab^ = baabaab

Example#5: Let the terminals be a and b, the nonterminals be S, X and Y, and the productions be
S → XY
X → aX
X → bX
X → a
Y → Ya
Y → Yb
Y → a
The X productions are:
X → aX
X → bX
X → a
From these productions, it can be seen that:
o any string of terminals that comes from X must end in an a
o any word ending in an a can be derived from X
To derive the word babba from X, the procedure is:
X => bX => baX => babX => babbX => babba
The Y productions are:
Y → Ya
Y → Yb
Y → a
It can be seen that the words that can be derived from Y are:
o exactly those that begin with an a
To derive abbab, the procedure is:
Y => Yb => Yab => Ybab => Ybbab => abbab
Since S → XY, every word derived from S is a word ending in a followed by a word beginning with a, so the words that can be derived from S are those that have a double a in them. To derive babaabb, the procedure is:
S => XY => bXY => baXY => babXY => babaY => babaYb => babaYbb => babaabb

Example#6: Let the terminals be a and b, and the three nonterminals be S, BALANCED, and UNBALANCED.
The productions are:
S → SS
S → BALANCED S
S → S BALANCED
S → ^
S → UNBALANCED S UNBALANCED
BALANCED → aa
BALANCED → bb
UNBALANCED → ab
UNBALANCED → ba
From these productions, it can be seen that:
o The language generated is the set of all words with an even number of a’s and an even number of b’s, i.e. the language EVEN-EVEN.
Derivation of the word aababbab:
S => BALANCED S
  => aa S
  => aa UNBALANCED S UNBALANCED
  => aa ba S UNBALANCED
  => aa ba S ab
  => aa ba BALANCED S ab
  => aa ba bb S ab
  => aa ba bb ^ ab = aababbab

Example#7: Let the terminals be a and b, and the only nonterminal be S. The productions are:
S → aSb
S → ^
The language generated by these productions is the nonregular language a^n b^n. Derivation of a^6 b^6 using the above productions:
S => aSb => aaSbb => aaaSbbb => aaaaSbbbb => aaaaaSbbbbb => aaaaaaSbbbbbb => aaaaaabbbbbb

Example#8: Let the terminals be a and b, and the only nonterminal be S. The productions are:
S → aSa
S → bSb
S → ^
The language generated by these productions is the set of even-length words of the nonregular language PALINDROME (words that read the same backwards as forwards). Words of odd length, such as aba, cannot be derived, because every production adds either two letters or none.
Derivation of the word abbaabba using the above productions:
S => aSa => abSba => abbSbba => abbaSabba => abbaabba

Example#9: ODD PALINDROME is the language of palindromes that contain an odd number of letters. A grammar for ODD PALINDROME is:
S → aSa
S → bSb
S → a
S → b
Adding the production S → ^ modifies this grammar to generate the entire language PALINDROME, whose words can have either even or odd length:
S → aSa
S → bSb
S → a
S → b
S → ^

Example#10: A nonregular language that can be generated by a CFG is a^n b a^n:
S → aSa
S → b

Example#11: Let the terminals be a and b, the nonterminals be S, A, and B, and the productions be
S → aB
S → bA
A → a
A → aS
A → bAA
B → b
B → bS
B → aBB
The language that this CFG generates is the language EQUAL of all strings that have an equal number of a’s and b’s in them. Some words of this language are abba, aaabbb, and ba.

Ambiguity

Definition: A CFG is called ambiguous if, for at least one word in the language that it generates, there are two possible derivations of the word that correspond to different syntax trees. If a CFG is not ambiguous, it is called unambiguous.

Ambiguous Grammars: Consider the sentential form E + E * E. It has two derivations from E:
1. E => E + E => E + E * E
2. E => E * E => E + E * E

[Figure: two parse trees with the same yield, E + E * E. Fig. I corresponds to derivation 1: the root E has children E + E, and the right-hand E has children E * E. Fig. II corresponds to derivation 2: the root E has children E * E, and the left-hand E has children E + E.]

Removing Ambiguity from Grammars

There are two causes of ambiguity in the previous ambiguous grammar:
I. The precedence of operators is not respected. While fig. I properly groups the * before the + operator, fig. II is also a valid parse tree and groups the + ahead of the *. We need to force only the structure of fig. I to be legal in an unambiguous grammar.
II. A sequence of identical operators can group either from the left or from the right. For example, if the *’s in figs. I and II were replaced by +’s, we would see two different parse trees for the string E + E + E. Since addition and multiplication are associative, it doesn’t matter whether we group from the left or the right, but to eliminate ambiguity, we must pick one. The conventional approach is to insist on grouping from the left, so the structure of fig. II is the only correct grouping of two +-signs.
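
The ambiguity described above can be checked mechanically on a small case. The following Python sketch counts the parse trees of a terminal string under the grammar E → E+E, E → E*E, E → (E), E → a. This is a cut-down version of the expression grammar in which the identifier productions are replaced by the single terminal a; that simplification and the function name count_trees are assumptions made for illustration and are not part of the lecture.

```python
from functools import lru_cache


def count_trees(s: str) -> int:
    """Count parse trees with root E and yield s for the (assumed,
    simplified) grammar  E -> E+E | E*E | (E) | a."""

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        # Number of parse trees that derive the substring s[i:j] from E.
        if i >= j:
            return 0
        total = 0
        # Production E -> a
        if j - i == 1 and s[i] == "a":
            total += 1
        # Production E -> (E)
        if j - i >= 3 and s[i] == "(" and s[j - 1] == ")":
            total += count(i + 1, j - 1)
        # Productions E -> E+E and E -> E*E:
        # try every operator position k as the operator at the root.
        for k in range(i + 1, j - 1):
            if s[k] in "+*":
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(s))


if __name__ == "__main__":
    for word in ["a", "a+a*a", "a+a+a", "(a+a)*a"]:
        print(word, count_trees(word))
```

Under these assumptions, a+a*a and a+a+a each have two parse trees, corresponding to the two groupings shown in fig. I and fig. II, while the parenthesized string (a+a)*a has only one. A string with more than one parse tree is enough to show that a grammar is ambiguous; a count of one for some strings does not by itself show that a grammar is unambiguous.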