Transformation Schemes for Context

advertisement
Transformation schemes for
context-free grammars
structural, algorithmic, linguistic applications
Eli Shamir
Hebrew university of Jerusalem, Israel
ISCOL
Haifa university
September 2014
Overview
• CFG- Devices producing strings & their derivation
trees (with weights)
• Top down schemes transforming the grammars
• Driven by rotations operations-tree (BOT)
• Preserving derivation trees, semi-ring weights
Enhancing: property tests , parsing & optimal tree
2
algorithms: time down to O(n ), space to O(n).
• Decomposition of bounded ambiguity grammars
(Sam Eilenberg’s question [SE])
• Non-expansive [NE] (quasi-rational) grammars
• Implications to NLP, sequence alignment, …
Schemes - simple to subtle
• Chomsky’s normal form (CNF)
• Elimination of redundant symbols, ε rules
• Greibach’s normal form (GNF) (subtle)
all rules are ATx. T terminal (lexicalization)
GNF destroys derivation trees, however has
many applications (structural…)
Schemes for sub-classes of CFG (in parsing
technology) deterministic, LR(k)…
Context Free Basics 1
Such a grammar G = (V,T,P,S = root) is a well known model
to derive/generate a set of terminal strings in T G
defines a derivation relation between strings overV UT:
One step xy: y is obtained from x by rewriting a
single occurrence of some A by B1..Bk when
A  B1..Bk is production rule in P.
Several steps x  y if x  x1 …y
LA(G) = {wεT | Aw}, L(G)=LS(G), the language
generated by G.
A derivation is best described by a labeled tree in which
the k sons of a node labeled A are labeled B1, .., Bk.
Context Free Basics 2
Ambiguity-deg (Aw) = {number of distinct
trees for (Aw),
deg (GA)= max deg of (Aw).
A - B - defines a partial order on V U T,
denoted A>B. it induces a complete order on
any branch of a derivation tree.
B in G is pumping if B>B'>B. Then B' is also
pumping; both belong to the pumping
equivalence class [B].
Node Type and Spread Lemma
(i) B Pumping,
(ii)
C pre-terminal – if
NOT {C > B, B pumping}
(iii) D spread – D is not pumping but D>B, B pumping.
SPREAD LEMMA
1. Pre-terminal C derives a bounded number of bounded
terminal strings.
2. In each derivation tree a spread node D derives a
bounded sub-tree the leaves of which are terminals or
pump nodes.
3. In G, each spread symbol D derives the bounded
number of sub-trees, as mentioned in 2.
Non Expansive Grammars
G is non-expansive (NE) if no production rule
has the form B  -B'-B''- where the B's are
from the same pumping class Equivalently, no
derivation B  —B—B—
is possible
(sideway pumping is forbidden!).
NE is the quasi- rational class, the substitution
closure of linear grammars[1]. Our BOT
scheme simplifies proofs of its known
properties and new ones (parsing speed).
Bounded Operation Tree (BOT)
BOT Tree-nodes are labeled by:
• Current grammar as a product Π=P1…Pk
• Current operation SPREAD / CYC / TTR
(Depending on the type of the root of P1 or Pk)
Determines the children nodes and their labels
Root of BOT= #G, Leaves of BOT - linear G(i)
Main Claim: each derivation tree for w w.r. to #G
is mapped onto derivation tree for ƍw w.r. to
some G(i), (with the same weight) and V.V.
SPREAD / CYC / TTR Operations
Type=SPREAD: Pk is split to U Q(j), the current
grammar at j’th child is P1…Pk-1 * Q(j)
Type=CYC: Pk is terminal, the (effective) current
grammar at the single child is Pk P1…Pk-1
Type=TTR, if the root of Pk is pumping: let
M= P1…Pk-1 , N=Pk, the top trunk of N is rotated
by 180° and mounted on M, so MN  M*N^
Top Trunk Rotation of MN to (M*N^)
N
for trees:
M*
x1
M
180
y1
x2
y2
x2
y2
x1
y1
EXIT
M
N^
for strings:
m x1x2 … n^ …y2y1
Figure 1.1
…y2y1 m x1x2 … n^
N^
Figure 1.2: TTR For grammars:
N grammar (top trunk)
BB’C
BDB’
BB^, B^α
M* grammar
B’CB
B’BD
B root(M), root
(M) α
All other productions carry over from N to M*;
those of M unchanged.
The TTR rotation is invertible, one-one onto for
the derivation trees, preserving ambiguity in
‘cyclic rotated’ sense.
Termination and Correctness
TTR operations dominate the BOT scheme for NE
grammars. The E-depth of N^ and of the two
sides of the mounted trunk must decrease. The
M* factors become taller and thinner until they
become linear G(i). [without spread symbols]
Claim: each derivation tree for w w.r. to #G is
mapped to derivation tree for ƍw w.r. to some
G(i), (with the same weight) and V.V.
ƍw = CYCLIC rotation of w.
Holds for each SPREAD/CYC/TTR step!
Tabular Dynamic Prog. For parsing G
(CYK/ Earley algorithm for terminal w of length n
the table extends to items of rotated intervals
[i+1, i+k (mod n), A BC], at the same cost.
For linear G(i) total time cost is only O(n2)
Space cost is O(n): one or few diagonals of width
near k are kept in memory with pointers to
few neighbors, enabling table reconstruction.
• Just membership, or total weight algorithm, is
in the parallel class NC(1), as for finite-state
transductions.
Example (from [4])
(u I u )R (v JRv), u , vε {0.1}* = I = J
u =R reversal of u,
• It has unbounded "direct (product) ambiguity"
which increases the time in Earley algorithm. But
after one TTR step MN is rotated to
• (M*)(N^) =
(v u I u vR )R (J) , which has a
linear grammar, (of unbounded ambiguity degree)
And all product ambiguity trees are rotated to
union of trees for the linear M*N^.
• (M)(N) =
Decomposing Bounded Ambiguity
SE Claim: Ambiguity-deg(G)= l < ∞. Then L (G) is
a bounded-size union of languages of
deg 1-grammars. This provides a positive
answer to a question Sam Eilenberg posed, c.
1970.
"Bounded size" means polynomial in |G|, the
size of the grammar G, and l.
Expansive G and Ambiguity
G expansive each pump symbol has ambiguity
- degree=1 or unbounded (exponential in
length)
B==> --B—B—B--… B--… (k times)
k
If degB ≥ 2 then degB ≥ 2
This is a corner stone in the proof of SE
• Extending ambiguity to cyclic-closed strings is
helpful (cf last slides)
Proof of SE
We briefly sketch the scheme for proving the claim.
Starting with # G, and using the SPREAD LEMMA,
the claim is reduces to:
LEMMA Let Π = MN(1)…N(k), deg M=1 deg Π=l
< ∞, N(i) are terminals or with pump roots then
L(M) = U L(M(j)), jεJ and deg M(j)=1, J bounded.
It suffices to prove it for a pair, starting with MN(1),
after which M(j)N(2) are decomposed, and so on.
Proof of SE (2)
For a pair MN the operation TTR is used
transforming it to M*N^. Now deg M* < l and its
ambiguity must be concentrated along the top
trunk which it got from N. An easy direct
argument shows it decomposes into a bounded
union of M(j) of deg 1.
As for N^ its E-depth
is smaller than that of N. so for M(j)N^ we can
use induction on the E-depth of the second factor
or, more explicitly, continue the recursive descent
on N^ until it is consumed.
Approximate G by NE G’
• Easy to achieve by duplicating symbols of the
pumping classes.
• Makes linguistic sense
• Advantages of NE G’ using the BOT scheme
view the linear G’i as finite-state transactions:
powerful tool in several linguistic fields
• Applications to Bio-informatic (stringology)?
• Extension of NE condition to mildly contextsensitive models (LIG, TAG…)?
The Hardest Context Free Grammar
The concept is due to S. Greibach. The simplest
reduction is based on Shamir's homomorphism
theorem([1]), mapping each b in T into a finite set
φ(b) of strings over the vocabulary of the Dyck
language and claiming that w is in L(G) if and only
if φ(w) contains a string in the Dyck language (see
the description in [1]).
In fact, the categorical grammar model in the 1960
article ([2]) provides another homomorphism
which makes it a hardest CFG.
However, those hardest CFG languages are
inherently expansive. Indeed, an NE candidate
grammar for Dyck will be negated by its BOT
scheme, upon using local pump-shrinks, which
for linear grammars can operate near any
point of the (sufficiently long) main branch of
non- terminals.
We conjecture that any hardest CFG must be
expansive. Note that finding a non-expansive
one would entail O(n2 ) complexity of
membership test for any context free
grammar.
Ambiguity and Cyclic Rotation
Ambiguity in natural languages can be resolved (or
created) by cyclic rotation. Consider the bible verse
in book of Job chapter 6 verse 14 (six Hebrew
words). Translated to English: "a friend should
extend# mercy to the sufferer $ , even if he
abandons God's fear."
• The ambiguity here is anaphoric, does the pronoun
"he" refer to the sufferer or to the friend? The
poetic beautiful answer is: to both.
• The rotated sentences, starting at the symbols #
and $, resolve the ambiguity one way or the other.
• Politically loaded example:
the policeman shot# the boy $ with the gun
References
1. J. Autebert, J. Berstel and L. Boasson, Context-free
language and pushdown automata. Chap. 3 In:
handbook of formal languages Vol 1. G. Rozenberg and
A. Salomaa (eds.), Springer-Verlag 1997.
2. Y. Bar-Hillel, H. Gaifman and E. Shamir, On categorical
and phrase structure grammars. Bulletin research
council of Israel, vol. 9f (1960), 1-16.
3. S. Greibach. The hardest context-free language. SIAM J.
on computing 3 (1973), 304-310.
4. E. Shamir. Some inherently ambiguous context-free
languages. Inf. and Control 18 (1971), 355-363.
Download