Transformation schemes for context-free grammars structural, algorithmic, linguistic applications Eli Shamir Hebrew university of Jerusalem, Israel ISCOL Haifa university September 2014 Overview • CFG- Devices producing strings & their derivation trees (with weights) • Top down schemes transforming the grammars • Driven by rotations operations-tree (BOT) • Preserving derivation trees, semi-ring weights Enhancing: property tests , parsing & optimal tree 2 algorithms: time down to O(n ), space to O(n). • Decomposition of bounded ambiguity grammars (Sam Eilenberg’s question [SE]) • Non-expansive [NE] (quasi-rational) grammars • Implications to NLP, sequence alignment, … Schemes - simple to subtle • Chomsky’s normal form (CNF) • Elimination of redundant symbols, ε rules • Greibach’s normal form (GNF) (subtle) all rules are ATx. T terminal (lexicalization) GNF destroys derivation trees, however has many applications (structural…) Schemes for sub-classes of CFG (in parsing technology) deterministic, LR(k)… Context Free Basics 1 Such a grammar G = (V,T,P,S = root) is a well known model to derive/generate a set of terminal strings in T G defines a derivation relation between strings overV UT: One step xy: y is obtained from x by rewriting a single occurrence of some A by B1..Bk when A B1..Bk is production rule in P. Several steps x y if x x1 …y LA(G) = {wεT | Aw}, L(G)=LS(G), the language generated by G. A derivation is best described by a labeled tree in which the k sons of a node labeled A are labeled B1, .., Bk. Context Free Basics 2 Ambiguity-deg (Aw) = {number of distinct trees for (Aw), deg (GA)= max deg of (Aw). A - B - defines a partial order on V U T, denoted A>B. it induces a complete order on any branch of a derivation tree. B in G is pumping if B>B'>B. Then B' is also pumping; both belong to the pumping equivalence class [B]. Node Type and Spread Lemma (i) B Pumping, (ii) C pre-terminal – if NOT {C > B, B pumping} (iii) D spread – D is not pumping but D>B, B pumping. SPREAD LEMMA 1. Pre-terminal C derives a bounded number of bounded terminal strings. 2. In each derivation tree a spread node D derives a bounded sub-tree the leaves of which are terminals or pump nodes. 3. In G, each spread symbol D derives the bounded number of sub-trees, as mentioned in 2. Non Expansive Grammars G is non-expansive (NE) if no production rule has the form B -B'-B''- where the B's are from the same pumping class Equivalently, no derivation B —B—B— is possible (sideway pumping is forbidden!). NE is the quasi- rational class, the substitution closure of linear grammars[1]. Our BOT scheme simplifies proofs of its known properties and new ones (parsing speed). Bounded Operation Tree (BOT) BOT Tree-nodes are labeled by: • Current grammar as a product Π=P1…Pk • Current operation SPREAD / CYC / TTR (Depending on the type of the root of P1 or Pk) Determines the children nodes and their labels Root of BOT= #G, Leaves of BOT - linear G(i) Main Claim: each derivation tree for w w.r. to #G is mapped onto derivation tree for ƍw w.r. to some G(i), (with the same weight) and V.V. SPREAD / CYC / TTR Operations Type=SPREAD: Pk is split to U Q(j), the current grammar at j’th child is P1…Pk-1 * Q(j) Type=CYC: Pk is terminal, the (effective) current grammar at the single child is Pk P1…Pk-1 Type=TTR, if the root of Pk is pumping: let M= P1…Pk-1 , N=Pk, the top trunk of N is rotated by 180° and mounted on M, so MN M*N^ Top Trunk Rotation of MN to (M*N^) N for trees: M* x1 M 180 y1 x2 y2 x2 y2 x1 y1 EXIT M N^ for strings: m x1x2 … n^ …y2y1 Figure 1.1 …y2y1 m x1x2 … n^ N^ Figure 1.2: TTR For grammars: N grammar (top trunk) BB’C BDB’ BB^, B^α M* grammar B’CB B’BD B root(M), root (M) α All other productions carry over from N to M*; those of M unchanged. The TTR rotation is invertible, one-one onto for the derivation trees, preserving ambiguity in ‘cyclic rotated’ sense. Termination and Correctness TTR operations dominate the BOT scheme for NE grammars. The E-depth of N^ and of the two sides of the mounted trunk must decrease. The M* factors become taller and thinner until they become linear G(i). [without spread symbols] Claim: each derivation tree for w w.r. to #G is mapped to derivation tree for ƍw w.r. to some G(i), (with the same weight) and V.V. ƍw = CYCLIC rotation of w. Holds for each SPREAD/CYC/TTR step! Tabular Dynamic Prog. For parsing G (CYK/ Earley algorithm for terminal w of length n the table extends to items of rotated intervals [i+1, i+k (mod n), A BC], at the same cost. For linear G(i) total time cost is only O(n2) Space cost is O(n): one or few diagonals of width near k are kept in memory with pointers to few neighbors, enabling table reconstruction. • Just membership, or total weight algorithm, is in the parallel class NC(1), as for finite-state transductions. Example (from [4]) (u I u )R (v JRv), u , vε {0.1}* = I = J u =R reversal of u, • It has unbounded "direct (product) ambiguity" which increases the time in Earley algorithm. But after one TTR step MN is rotated to • (M*)(N^) = (v u I u vR )R (J) , which has a linear grammar, (of unbounded ambiguity degree) And all product ambiguity trees are rotated to union of trees for the linear M*N^. • (M)(N) = Decomposing Bounded Ambiguity SE Claim: Ambiguity-deg(G)= l < ∞. Then L (G) is a bounded-size union of languages of deg 1-grammars. This provides a positive answer to a question Sam Eilenberg posed, c. 1970. "Bounded size" means polynomial in |G|, the size of the grammar G, and l. Expansive G and Ambiguity G expansive each pump symbol has ambiguity - degree=1 or unbounded (exponential in length) B==> --B—B—B--… B--… (k times) k If degB ≥ 2 then degB ≥ 2 This is a corner stone in the proof of SE • Extending ambiguity to cyclic-closed strings is helpful (cf last slides) Proof of SE We briefly sketch the scheme for proving the claim. Starting with # G, and using the SPREAD LEMMA, the claim is reduces to: LEMMA Let Π = MN(1)…N(k), deg M=1 deg Π=l < ∞, N(i) are terminals or with pump roots then L(M) = U L(M(j)), jεJ and deg M(j)=1, J bounded. It suffices to prove it for a pair, starting with MN(1), after which M(j)N(2) are decomposed, and so on. Proof of SE (2) For a pair MN the operation TTR is used transforming it to M*N^. Now deg M* < l and its ambiguity must be concentrated along the top trunk which it got from N. An easy direct argument shows it decomposes into a bounded union of M(j) of deg 1. As for N^ its E-depth is smaller than that of N. so for M(j)N^ we can use induction on the E-depth of the second factor or, more explicitly, continue the recursive descent on N^ until it is consumed. Approximate G by NE G’ • Easy to achieve by duplicating symbols of the pumping classes. • Makes linguistic sense • Advantages of NE G’ using the BOT scheme view the linear G’i as finite-state transactions: powerful tool in several linguistic fields • Applications to Bio-informatic (stringology)? • Extension of NE condition to mildly contextsensitive models (LIG, TAG…)? The Hardest Context Free Grammar The concept is due to S. Greibach. The simplest reduction is based on Shamir's homomorphism theorem([1]), mapping each b in T into a finite set φ(b) of strings over the vocabulary of the Dyck language and claiming that w is in L(G) if and only if φ(w) contains a string in the Dyck language (see the description in [1]). In fact, the categorical grammar model in the 1960 article ([2]) provides another homomorphism which makes it a hardest CFG. However, those hardest CFG languages are inherently expansive. Indeed, an NE candidate grammar for Dyck will be negated by its BOT scheme, upon using local pump-shrinks, which for linear grammars can operate near any point of the (sufficiently long) main branch of non- terminals. We conjecture that any hardest CFG must be expansive. Note that finding a non-expansive one would entail O(n2 ) complexity of membership test for any context free grammar. Ambiguity and Cyclic Rotation Ambiguity in natural languages can be resolved (or created) by cyclic rotation. Consider the bible verse in book of Job chapter 6 verse 14 (six Hebrew words). Translated to English: "a friend should extend# mercy to the sufferer $ , even if he abandons God's fear." • The ambiguity here is anaphoric, does the pronoun "he" refer to the sufferer or to the friend? The poetic beautiful answer is: to both. • The rotated sentences, starting at the symbols # and $, resolve the ambiguity one way or the other. • Politically loaded example: the policeman shot# the boy $ with the gun References 1. J. Autebert, J. Berstel and L. Boasson, Context-free language and pushdown automata. Chap. 3 In: handbook of formal languages Vol 1. G. Rozenberg and A. Salomaa (eds.), Springer-Verlag 1997. 2. Y. Bar-Hillel, H. Gaifman and E. Shamir, On categorical and phrase structure grammars. Bulletin research council of Israel, vol. 9f (1960), 1-16. 3. S. Greibach. The hardest context-free language. SIAM J. on computing 3 (1973), 304-310. 4. E. Shamir. Some inherently ambiguous context-free languages. Inf. and Control 18 (1971), 355-363.