Context Free Languages
October 2, 2001

Announcement
HW 3 due now
Agenda
Context Free Grammars

Derivations
Parse-trees
CFGParse

Ambiguity


Motivating Example
pal = { x ∈ {0,1}* | x = x^R }
= { ε,
0, 1,
00, 11,
000, 010, 101, 111,
0000, 0110, 1001, 1111, … }
We saw last time that pal is non-regular.
But there is a pattern.
Q: If you have one palindrome, how can you
generate another?
Motivating Example
pal = { x ∈ {0,1}* | x = x^R }
A: We can generate pal recursively as
follows.
BASE CASE: ε, 0 and 1 are palindromes
RECURSE: if x is a palindrome, then so
are 0x0 and 1x1.
Motivating Example
pal = { x ∈ {0,1}* | x = x^R }
SUMMARY: In pal any x can be replaced by
any one of {ε, 0, 1, 0x0, 1x1}.
NOTATION: x → ε | 0 | 1 | 0x0 | 1x1
Each pipe “|” is an or, just as in UNIX regexps.
In fact, all elements of pal can be generated
from x by using these rules.
Q: How would you generate 11011011 starting
from the variable x ?
Motivating Example
A: Generate the string from the outside in:
11011011 = 1(1(0(1()1)0)1)1
so that:
x ⇒ 1x1 ⇒ 11x11 ⇒ 110x011
⇒ 1101x1011 ⇒ 1101ε1011 =
11011011
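The outside-in procedure can be sketched in a few lines of Python. This is only an illustration: the function name derive_palindrome is made up here, and the variable is written as the letter x, as on the slide.

```python
def derive_palindrome(w):
    """Return the sentential forms of a derivation x => ... => w.

    Hypothetical helper: uses the rules x -> e | 0 | 1 | 0x0 | 1x1,
    peeling matching outer letters first (outside-in).
    """
    if w != w[::-1] or set(w) - {"0", "1"}:
        raise ValueError("not a palindrome over {0,1}")
    steps = ["x"]
    left, right = "", ""
    lo, hi = 0, len(w)
    while hi - lo >= 2:
        # Apply x -> 0x0 or x -> 1x1 for the current outermost pair.
        left += w[lo]
        right = w[hi - 1] + right
        lo, hi = lo + 1, hi - 1
        steps.append(left + "x" + right)
    # Finish with x -> e, x -> 0 or x -> 1 (whatever is left in the middle).
    steps.append(left + w[lo:hi] + right)
    return steps


print(" => ".join(derive_palindrome("11011011")))
```

The loop applies one recursive rule per matching outer pair, and the final step uses a base-case rule for the middle, which is empty for even-length palindromes.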
Context Free Grammars.
Definition
DEF: A context free grammar consists of
(V, Σ, R, S) with:
V – a finite set of variables (or symbols, or nonterminals)
Σ – a finite set of terminals (or the alphabet)
R – a finite set of rules (or productions) of the
form v → w with v ∈ V, and
w ∈ (Σ ∪ V)*
(read “v yields w” or “v produces w”)
S ∈ V – the start symbol.
Q: What are (V, Σ, R, S) for pal ?
Context Free Grammars.
Definition
A:
V = {x}
Σ = {0,1},
R = {x → ε, x → 0, x → 1, x → 0x0, x → 1x1}
S = x
Standardize pal, reset:
V = {S}, S = S
R = {S → ε, S → 0, S → 1, S → 0S0, S → 1S1}
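As a rough sketch, the 4-tuple can be written down directly as data. The names CFG and is_well_formed are invented for this example. The check that V and Σ are disjoint is an extra assumption that the definition above does not state.

```python
from typing import NamedTuple


class CFG(NamedTuple):
    variables: frozenset  # V
    terminals: frozenset  # Sigma
    rules: tuple          # R as (v, w) pairs; w is a string, "" is the empty string
    start: str            # S


def is_well_formed(g):
    """Check the conditions of the definition for single-character symbols."""
    symbols = g.variables | g.terminals
    return (
        g.start in g.variables
        # Assumption (not on the slide): V and Sigma do not overlap.
        and not (g.variables & g.terminals)
        and all(
            v in g.variables and all(c in symbols for c in w)
            for v, w in g.rules
        )
    )


PAL = CFG(
    variables=frozenset("S"),
    terminals=frozenset("01"),
    rules=(("S", ""), ("S", "0"), ("S", "1"), ("S", "0S0"), ("S", "1S1")),
    start="S",
)
print(is_well_formed(PAL))
```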
Derivations
DEF: The derivation symbol “⇒”, read
“1-step derives” or “1-step produces”, is
a relation between strings in (Σ ∪ V)*.
We write x ⇒ y if x and y can be broken
up as x = svt and y = swt with v → w
a production in R.
Q: What possible y satisfy (in pal)
S0000S ⇒ y ?
Derivations
A: S0000S ⇒ y for
any one of:
0000S, 00000S, 10000S, 0S00000S,
1S10000S, S0000, S00000, S00001,
S00000S0, S00001S1
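The 1-step relation is easy to compute mechanically. The sketch below (the names one_step and PAL_RULES are invented for illustration) tries every position in x and every rule of the standardized pal grammar.

```python
PAL_RULES = [("S", ""), ("S", "0"), ("S", "1"), ("S", "0S0"), ("S", "1S1")]


def one_step(x, rules=PAL_RULES):
    """All y with x => y: write x = s v t and replace v by w for a rule v -> w."""
    results = []
    for i in range(len(x)):
        for v, w in rules:
            if x.startswith(v, i):
                results.append(x[:i] + w + x[i + len(v):])
    return results


print(one_step("S0000S"))
```

For S0000S this yields ten strings, five for each occurrence of S.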
Derivations
DEF: The derivation symbol “⇒*”, read
“derives” or “produces” or “yields”, is a
relation between strings in (Σ ∪ V)*. We
write x ⇒* y if there is a sequence of 1-step
productions from x to y. I.e., there are
strings x_i with i ranging from 0 to n such that
x = x_0, y = x_n and
x_0 ⇒ x_1, x_1 ⇒ x_2, x_2 ⇒ x_3, … , x_{n-1} ⇒ x_n
Q: Which of the LHS’s yield which of the RHS’s in pal?
LHS: 01S, SS
RHS: 01, 0, 01S, 01110111, 0100111
Derivations
A: Not all answers are unique.
01S ⇒* 01
SS ⇒* 0
01S ⇒* 01S
SS ⇒* 01110111
Nothing yields 0100111
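These answers can be checked with a small breadth-first search over 1-step derivations. This is a sketch, not a general algorithm. The pruning relies on an assumption that holds for pal but not for every grammar: no rule removes terminals or increases the number of variables. The names derives and one_step are invented here.

```python
from collections import deque

PAL_RULES = [("S", ""), ("S", "0"), ("S", "1"), ("S", "0S0"), ("S", "1S1")]


def one_step(x, rules):
    """All y with x => y."""
    return [
        x[:i] + w + x[i + len(v):]
        for i in range(len(x))
        for v, w in rules
        if x.startswith(v, i)
    ]


def derives(x, y, rules=PAL_RULES, variables="S"):
    """Search for x =>* y, discarding forms with more terminals than y."""
    def terminals(s):
        return sum(c not in variables for c in s)

    limit = terminals(y)
    seen, queue = {x}, deque([x])
    while queue:
        cur = queue.popleft()
        if cur == y:
            return True
        for nxt in one_step(cur, rules):
            if nxt not in seen and terminals(nxt) <= limit:
                seen.add(nxt)
                queue.append(nxt)
    return False


for lhs in ("01S", "SS"):
    for rhs in ("01", "0", "01S", "01110111", "0100111"):
        print(lhs, "=>*", rhs, ":", derives(lhs, rhs))
```

The output also shows why the answers are not unique: SS yields 01 as well, by deriving 0 from the first S and 1 from the second.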
Language Generated by a CFG
DEF: Let G be a context free grammar.
The language generated by G is the
set of all terminal strings which are
derivable from the start symbol.
Symbolically:
L(G) = { w ∈ Σ* | S ⇒* w }
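To make the definition concrete, here is a sketch that lists all strings of L(G) up to a given length for the pal grammar and compares them with the palindromes. The function name language_up_to is invented, and the search assumes single-character variables and at most one variable per right-hand side, which is true for pal.

```python
from itertools import product

PAL_RULES = [("S", ""), ("S", "0"), ("S", "1"), ("S", "0S0"), ("S", "1S1")]


def language_up_to(max_len, rules=PAL_RULES, start="S", variables="S"):
    """Terminal strings w with start =>* w and len(w) <= max_len."""
    found, seen, stack = set(), {start}, [start]
    while stack:
        cur = stack.pop()
        pos = next((i for i, c in enumerate(cur) if c in variables), None)
        if pos is None:
            found.add(cur)  # no variables left: cur is a terminal string
            continue
        for v, w in rules:
            if cur[pos] == v:
                nxt = cur[:pos] + w + cur[pos + 1:]
                n_terminals = sum(c not in variables for c in nxt)
                if n_terminals <= max_len and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return found


palindromes = {
    w
    for n in range(5)
    for w in map("".join, product("01", repeat=n))
    if w == w[::-1]
}
print(sorted(language_up_to(4), key=lambda w: (len(w), w)))
print(language_up_to(4) == palindromes)
```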
Example. Infix Expressions
Infix expressions involving {+, ×, a, b, c, (, )}
E stands for an expression (most general)
F stands for factor (a multiplicative part)
T stands for term (a product of factors)
V stands for a variable, i.e. a, b, or c
Grammar is given by:
E → T | E+T
T → F | T×F
F → V | (E)
V → a | b | c
CONVENTION: Start variable is the first one (E)
Example. Infix Expressions
EG: Consider the string u given by
a × b + (c + (a + c))
This is a valid infix expression. We should be able
to generate it from E.
1. A sum of two expressions, so the first
production must be E ⇒ E+T
2. Sub-expression a×b is a product, so a term,
so it is generated by the sequence E+T ⇒ T+T ⇒
T×F+T ⇒* a×b+T
3. Second sub-expression is a factor only,
because it is a parenthesized sum: a×b+T ⇒
a×b+F ⇒ a×b+(E) ⇒ a×b+(E+T)
Example. Infix Expressions
Continuing on in this fashion and summarizing:
E ⇒ E+T ⇒ T+T ⇒ T×F+T ⇒ F×F+T
⇒ V×F+T ⇒ a×F+T ⇒ a×V+T ⇒ a×b+T
⇒ a×b+F ⇒ a×b+(E) ⇒ a×b+(E+T)
⇒ a×b+(T+T) ⇒ a×b+(F+T) ⇒ a×b+(V+T)
⇒ a×b+(c+T) ⇒ a×b+(c+F)
⇒ a×b+(c+(E)) ⇒ a×b+(c+(E+T))
⇒ a×b+(c+(T+T)) ⇒ a×b+(c+(F+T))
⇒ a×b+(c+(V+T)) ⇒ a×b+(c+(a+T))
⇒ a×b+(c+(a+F)) ⇒ a×b+(c+(a+V))
⇒ a×b+(c+(a+c))
This is a so-called left-most derivation.
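A left-most derivation is fully determined by the sequence of right-hand sides chosen, so it can be replayed mechanically. In the sketch below the product sign is written as * for convenience, and the names RULES, leftmost_step and replay are invented for the example.

```python
RULES = {
    "E": ["T", "E+T"],
    "T": ["F", "T*F"],
    "F": ["V", "(E)"],
    "V": ["a", "b", "c"],
}


def leftmost_step(form, rhs):
    """Replace the left-most variable in form by rhs, if that is a production."""
    for i, c in enumerate(form):
        if c in RULES:
            if rhs not in RULES[c]:
                raise ValueError(f"{c} -> {rhs} is not a production")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no variable left to replace")


def replay(start, choices):
    """Apply the chosen right-hand sides in order, always at the left-most variable."""
    forms = [start]
    for rhs in choices:
        forms.append(leftmost_step(forms[-1], rhs))
    return forms


CHOICES = [
    "E+T", "T", "T*F", "F", "V", "a", "V", "b",       # ... => a*b+T
    "F", "(E)", "E+T", "T", "F", "V", "c",            # ... => a*b+(c+T)
    "F", "(E)", "E+T", "T", "F", "V", "a",            # ... => a*b+(c+(a+T))
    "F", "V", "c",                                    # ... => a*b+(c+(a+c))
]
print("\n=> ".join(replay("E", CHOICES)))
```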
Right-most derivation
In a right-most derivation, the right-most
variable is replaced at each step.
E ⇒ E+T ⇒ E+F ⇒ E+(E) ⇒ E+(E+T)
⇒ E+(E+F) ⇒ etc.
There is a lot of ambiguity involved in how a
string is derived. However, if we decide that
derivations are left-most, or right-most,
then each derivation is unique (for this grammar).
Another way to describe a derivation in a
unique way is using derivation trees.
Derivation Trees
In a derivation tree (or parse tree) each
node is a symbol. Each parent is a variable
whose children spell out the production from
left to right. For example, v → abcdefg:
      v
a b c d e f g
The root is the start variable. The leaves spell
out the derived string from left to right.
Derivation Trees
EG, a derivation tree for
a × b + (c + (a + c))
Advantage. Derivation trees
also help in understanding
semantics! You can tell how the
expression should be
evaluated from the tree.
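Both points can be illustrated with a tiny tree encoded as nested tuples. Everything in this sketch is an assumption made for the example: the encoding (symbol, children), the helper names, the smaller expression a+b*c (with * for the product sign), and the numeric values given to a, b, c.

```python
def tree_yield(node):
    """Spell out the leaves from left to right."""
    if isinstance(node, str):
        return node
    _, children = node
    return "".join(tree_yield(child) for child in children)


def evaluate(node, env):
    """Evaluate an expression tree bottom-up, following its shape."""
    if isinstance(node, str):
        return env[node]                  # a leaf a, b or c
    symbol, kids = node
    if len(kids) == 1:                    # E -> T, T -> F, F -> V, V -> a|b|c
        return evaluate(kids[0], env)
    if symbol == "F":                     # F -> (E)
        return evaluate(kids[1], env)
    left, op, right = kids                # E -> E+T or T -> T*F
    lhs, rhs = evaluate(left, env), evaluate(right, env)
    return lhs + rhs if op == "+" else lhs * rhs


def factor(name):
    """Subtree for F -> V -> name."""
    return ("F", [("V", [name])])


# Derivation tree for a+b*c: the product sits below the sum.
TREE = ("E", [
    ("E", [("T", [factor("a")])]),
    "+",
    ("T", [("T", [factor("b")]), "*", factor("c")]),
])

print(tree_yield(TREE))
print(evaluate(TREE, {"a": 2, "b": 3, "c": 4}))
```

Because b*c forms its own subtree under T, the product is computed before the sum without any extra precedence rule.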
CFGParse
CFGParse is a tool that I wrote that helps you
play with context free grammars and create
derivation trees. Yoav Hirsch (TA) also added
some functionality that allows you to convert
CFG’s into various forms that we’ll learn about
next week.
USAGE:
java CFGParse <grammar-file> <input-string>
pdflatex <LaTeX-tree-filename>
The second command only works on CUNIX, unless you
install LaTeX plus some libraries (ecltree).
Ambiguity
<sentence> → <action> | <action> with <subject>
<action> → <subject><activity>
<subject> → <noun> | <noun> and <subject>
<activity> → <verb> | <verb><object>
<noun> → Hannibal | Clarice | rice | onions
<verb> → ate | played
<prep> → with | and | or
<object> → <noun> | <noun><prep><object>
Clarice played with Hannibal
Clarice ate rice with onions
Hannibal ate rice with Clarice
Q: Are there any suspect sentences?
Ambiguity
A: Consider “Hannibal ate rice with Clarice”.
Could either mean
Hannibal and Clarice ate rice together.
Hannibal ate rice and ate Clarice.
And this is not absurd, given what we know about
Hannibal!
This ambiguity arises from the fact that the
sentence has two different parse-trees, and
therefore two different interpretations:
Hannibal and Clarice Ate
sentence
  action
    subject
      noun: Hannibal
    activity
      verb: ate
      object
        noun: rice
  with
  subject
    noun: Clarice
Hannibal the Cannibal
sentence
  action
    subject
      noun: Hannibal
    activity
      verb: ate
      object
        noun: rice
        prep: with
        object
          noun: Clarice
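The number of parse trees of a sentence can be counted mechanically. The sketch below is one possible way to do it, not the method used by CFGParse: it tries every way of splitting the words among the symbols of a right-hand side. It assumes the grammar has no empty productions and no cycles of single-variable rules, which holds for the sentence grammar above. The names GRAMMAR and count_trees are invented.

```python
from functools import lru_cache

GRAMMAR = {
    "sentence": [["action"], ["action", "with", "subject"]],
    "action": [["subject", "activity"]],
    "subject": [["noun"], ["noun", "and", "subject"]],
    "activity": [["verb"], ["verb", "object"]],
    "noun": [["Hannibal"], ["Clarice"], ["rice"], ["onions"]],
    "verb": [["ate"], ["played"]],
    "prep": [["with"], ["and"], ["or"]],
    "object": [["noun"], ["noun", "prep", "object"]],
}


def count_trees(words, start="sentence"):
    """Count the parse trees of the word sequence under GRAMMAR."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def count_symbol(sym, i, j):
        # Number of trees rooted at sym whose leaves are words[i:j].
        if sym not in GRAMMAR:  # terminal word
            return 1 if j - i == 1 and words[i] == sym else 0
        return sum(count_seq(tuple(rhs), i, j) for rhs in GRAMMAR[sym])

    @lru_cache(maxsize=None)
    def count_seq(seq, i, j):
        # Number of ways the symbols in seq can jointly cover words[i:j].
        if len(seq) == 1:
            return count_symbol(seq[0], i, j)
        total = 0
        for k in range(i + 1, j):  # first symbol covers words[i:k]
            first = count_symbol(seq[0], i, k)
            if first:
                total += first * count_seq(seq[1:], k, j)
        return total

    return count_symbol(start, 0, len(words))


for s in ("Clarice played with Hannibal",
          "Clarice ate rice with onions",
          "Hannibal ate rice with Clarice"):
    print(s, "->", count_trees(s.split()))
```

A count above 1 means the sentence is ambiguous relative to this grammar.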
Ambiguity.
Definition
DEF: A string x is said to be ambiguous relative
to the grammar G if there are two essentially
different ways to derive x in G. I.e. x
admits two (or more) different parse-trees
(equivalently, x admits different left-most
[resp. right-most] derivations). A grammar G
is said to be ambiguous if there is some string
x in L(G) which is ambiguous.
Q: Is the grammar S → ab | ba | aSb | bSa | SS
ambiguous? What language is generated?
Ambiguity
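A: L(G) = the language of non-empty strings with
an equal number of a’s and b’s
Yes, the grammar is ambiguous:
[Figure: two different parse trees for the same string.]
For example, abab admits the two left-most derivations
S ⇒ SS ⇒ abS ⇒ abab and S ⇒ aSb ⇒ abab.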
CFG’s
Proving Correctness
The recursive nature of CFG’s means that they
are especially amenable to correctness
proofs.
For example let’s consider the grammar
G = ( S → ε | ab | ba | aSb | bSa | SS )
with purported generated language
L(G) = { x ∈ {a,b}* | n_a(x) = n_b(x) }.
Here n_a(x) is the number of a’s in x, and n_b(x) is
the number of b’s.
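Before proving the claim, it can be sanity-checked by brute force on short strings. The sketch below tests membership in L(G) by trying each production of G in turn. The function name generated and the length bound 8 are arbitrary choices for this example. Such a check is evidence, not a proof.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def generated(x):
    """True if S =>* x for S -> e | ab | ba | aSb | bSa | SS (x over {a,b})."""
    if x in ("", "ab", "ba"):
        return True
    # S -> aSb | bSa: different outer letters around a generated middle.
    if len(x) >= 2 and x[0] != x[-1] and generated(x[1:-1]):
        return True
    # S -> SS with both parts non-empty (an empty part gives nothing new).
    return any(generated(x[:k]) and generated(x[k:]) for k in range(1, len(x)))


mismatches = [
    x
    for n in range(9)
    for x in map("".join, product("ab", repeat=n))
    if generated(x) != (x.count("a") == x.count("b"))
]
print(mismatches)
```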
CFG’s
Proving Correctness
Proof. There are two parts. Let L be the
purported language generated by G. We
want to show that L = L(G). The usual
way to prove that sets are equal is to
show both inclusions:
I. L ⊆ L(G). Every string with the purported
pattern can be generated by G.
II. L ⊇ L(G). G only generates strings of the
purported pattern.
Proving Correctness
L ⊆ L(G)
I. L ⊆ L(G): Show that every string x with the
same number of a’s as b’s is generated by
G. Prove by induction on the length n = |x|.
Base case: n = 0. Then x is empty, so it is derived by
the production S → ε.
Inductive step: Assume n > 0 and that every string
in L shorter than x is generated by G. Let u be
the smallest non-empty prefix of x which is
also in L. There are two cases:
1) u = x
2) u is a proper prefix
Proving Correctness
L ⊆ L(G)
I.1) u = x : Notice that u can’t start and end in the
same letter. E.g., if it started and ended with a then
write u = ava. This means that v must have 2 more
b’s than a’s. So somewhere in v the b’s of u catch
up to the a’s, which means that there’s a smaller
non-empty prefix in L, contradicting the definition of u as the
smallest prefix in L. Thus for some string v in L we
have
u = avb
OR u = bva
The length of v is shorter than u so by induction we can
assume that a derivation S ⇒* v exists. Therefore
u can be derived by one of:
S ⇒ aSb ⇒* avb = x
OR
S ⇒ bSa ⇒* bva = x
Proving Correctness
L ⊆ L(G)
I.2) u is a proper prefix. So we can write
x = uv with both u and v in L
As each of u, v is shorter than x, we can
apply the inductive hypothesis to assume
that S ⇒* u and S ⇒* v. To derive x
just use:
S ⇒ SS ⇒* uS ⇒* uv = x
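The two cases of part I amount to a recursive procedure for building a derivation. The sketch below follows them literally. The names parse and replay are invented, and the input is assumed to be a string over {a,b} with equally many a's and b's.

```python
def parse(x):
    """Right-hand sides, in left-most order, of a derivation S =>* x."""
    if x.count("a") != x.count("b"):
        raise ValueError("x is not in L")
    if x == "":
        return [""]                        # base case: S -> e
    # u = shortest non-empty prefix with equally many a's and b's.
    balance = 0
    for k, c in enumerate(x, start=1):
        balance += 1 if c == "a" else -1
        if balance == 0:
            break
    if k == len(x):                        # case 1: u = x = avb or bva
        return [x[0] + "S" + x[-1]] + parse(x[1:-1])
    return ["SS"] + parse(x[:k]) + parse(x[k:])   # case 2: x = uv


def replay(productions):
    """Apply each right-hand side to the left-most S, starting from S."""
    form = "S"
    for rhs in productions:
        form = form.replace("S", rhs, 1)
    return form


print(parse("abba"))
print(replay(parse("abba")))
```

The recursion terminates for the same reason the induction works: every recursive call is on a strictly shorter string.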
Proving Correctness
L ⊇ L(G)
II)
Everything derivable from S conforms to the
pattern. We prove something more general: Any
string in {S,a,b}* derivable from S contains the
same number of a’s as b’s. Again induction is
used, but this time on the number of steps n in
the derivation.
Base case n = 0: The only thing derivable in 0 steps is
S which has 0 a’s and 0 b’s. OK!
Inductive step: Assume that u is any string derivable
from S in n steps, and that u has the same number
of a’s as b’s. Any further step
would have to utilize one of S → ε | ab | ba | aSb |
bSa | SS. But each of these productions
preserves the relative number of a’s vs. b’s. Thus
any string derivable from S in n+1 steps must
also have the same number of a’s as b’s. //QED
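The invariant behind part II can be checked directly: replacing one S by a right-hand side w changes (number of a's) minus (number of b's) by exactly that quantity for w. The sketch below verifies this for every production and then spot-checks some randomly generated sentential forms. The helper names and the use of random choices are assumptions for illustration only.

```python
import random

PRODUCTIONS = ["", "ab", "ba", "aSb", "bSa", "SS"]  # right-hand sides for S


def excess(s):
    """Number of a's minus number of b's in s."""
    return s.count("a") - s.count("b")


# Every right-hand side contributes as many a's as b's.
preserved = all(excess(w) == 0 for w in PRODUCTIONS)


def random_sentential_form(steps, seed=0):
    """Apply up to `steps` randomly chosen productions starting from S."""
    rng = random.Random(seed)
    form = "S"
    for _ in range(steps):
        positions = [i for i, c in enumerate(form) if c == "S"]
        if not positions:
            break  # terminal string reached
        i = rng.choice(positions)
        form = form[:i] + rng.choice(PRODUCTIONS) + form[i + 1:]
    return form


print(preserved)
print([random_sentential_form(8, seed) for seed in range(5)])
```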
CFG Exercises
On black-board