Parsing for Semidirectional Lambek Grammar is NP-Complete
Jochen Dörre
Institut für maschinelle Sprachverarbeitung
University of Stuttgart
Abstract
We study the computational complexity of the parsing problem of a variant of Lambek Categorial Grammar that we call semidirectional. In semidirectional Lambek calculus SDL there is an additional nondirectional abstraction rule allowing the formula abstracted over to appear anywhere in the premise sequent's left-hand side, thus permitting non-peripheral extraction. SDL grammars are able to generate each context-free language and more than that. We show that the parsing problem for semidirectional Lambek Grammar is NP-complete by a reduction of the 3-Partition problem.
Keywords: computational complexity, Lambek Categorial Grammar
1 Introduction
Categorial Grammar (CG) and in particular Lambek Categorial Grammar (LCG) have well-known benefits for the formal treatment of natural language syntax and semantics. The most outstanding of these benefits is probably that the specific way in which the complete grammar is encoded, namely in terms of 'combinatory potentials' of its words, gives us at the same time recipes for the construction of meanings, once the words have been combined with others to form larger linguistic entities. Although both frameworks are equivalent in weak generative capacity (both derive exactly the context-free languages), LCG is superior to CG in that it can cope in a natural way with extraction and unbounded dependency phenomena. For instance, no special category assignments need to be stipulated to handle a relative clause containing a trace, because it is analyzed, via hypothetical reasoning, like a traceless clause with the trace being the hypothesis to be discharged when combined with the relative pronoun.
Figure 1 illustrates this proof-logical behaviour. Notice that this natural-deduction-style proof in the type logic corresponds very closely to the phrase-structure tree one would like to adopt in an analysis with traces. We can thus derive Bill misses e as an s from the hypothesis that there is a "phantom" np in the place of the trace. Discharging the hypothesis, indicated by index 1, results in Bill misses being analyzed as an s/np from zero hypotheses. Observe, however, that such a bottom-up synthesis of a new unsaturated type is only required if that type is to be consumed (as the antecedent of an implication) by another type. Otherwise there would be a simpler proof without this abstraction. In our example the relative pronoun has such a complex type, triggering an extraction.
A drawback of the pure Lambek Calculus L is that it only allows for so-called 'peripheral extraction', i.e., in our example the trace must be initial or final in the relative clause.
This inflexibility of Lambek Calculus is one of the reasons why many researchers study richer systems today. For instance, the recent work by Moortgat (Moortgat 94) gives a systematic in-depth study of mixed Lambek systems, which integrate the systems L, NL, NLP, and LP. These ingredient systems are obtained by varying the Lambek calculus along two dimensions: adding the permutation rule (P) and/or dropping the assumption that the type combinator (which forms the sequences the systems talk about) is associative (N for non-associative).
Taken by themselves these variants of L are of little use in linguistic descriptions. But in Moortgat's mixed system all the different resource management modes of the different systems are left intact in the combination and can be exploited in different parts of the grammar. The relative pronoun which would, for instance, receive category (np\np)/(np ⊸ s) with ⊸ being implication in LP,1 i.e., it requires
1 The Lambek calculus with permutation LP is also called the "nondirectional Lambek calculus" (Benthem 88). In it the leftward and rightward implications collapse.
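[Figure 1: natural-deduction proof for "(the book) which Bill misses e", with lexical types which : (np\np)/(s/np), Bill : np, misses : (np\s)/np and hypothesis e : np (index 1); intermediate results np\s, s, s/np (discharging hypothesis 1) and np\np.]
Figure 1: Extraction as resource-conscious hypothetical reasoning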
as an argument "an s lacking an np somewhere".2
The present paper studies the computational complexity of a variant of the Lambek Calculus that lies between L and LP, the Semidirectional Lambek Calculus SDL.3 Since LP derivability is known to be NP-complete, it is interesting to study restrictions on the use of the LP operator ⊸. A restriction that leaves its proposed linguistic applications intact is to admit a type B ⊸ A only as the argument type in functional applications, but never as the functor. Stated proof-theoretically for Gentzen-style systems, this amounts to disallowing the left rule for ⊸. Surprisingly, the resulting system SDL can be stated without the need for structural rules, i.e., as a monolithic system with just one structural connective, because the ability of the abstracted-over formula to permute can be directly encoded in the right rule for ⊸.4
Note that our purpose for studying SDL is not that it might be in any sense better suited for a theory of grammar (except perhaps because of its simplicity), but rather that it exhibits a core of logical behaviour that any richer system also needs to include, at least if it should allow for non-peripheral extraction. The sources of complexity uncovered here are thus a fortiori present in all these richer systems as well.
2 Semidirectional Lambek Grammar
2.1 Lambek calculus
The semidirectional Lambek calculus (henceforth SDL) is a variant of J. Lambek's original (Lambek 58) calculus of syntactic types. We start by defining the Lambek calculus and extend it to obtain SDL.
Formulae (also called "syntactic types") are built from a set of propositional variables (or "primitive types") B = {b1, b2, ...} and the three binary connectives •, \, /, called product, left implication, and right implication. We generally use capital letters A, B, C, ... to denote formulae and capitals towards the end of the alphabet T, U, V, ... to denote sequences of formulae. The concatenation of sequences U and V is denoted by (U, V).
The (usual) formal framework of these logics is a Gentzen-style sequent calculus. Sequents are pairs (U, A), written as U ⇒ A, where A is a type and U is a sequence of types.5 The claim embodied by sequent U ⇒ A can be read as "formula A is derivable from the structured database U". Figure 2 shows Lambek's original calculus L.
First of all, since we don't need products to obtain
our results and since they only complicate matters,
we eliminate products from consideration in the sequel.
collapse.
2 Morrill (Morrill 94) achieves the same effect with a permutation modality △ applied to the np gap: (s/△np).
3 This name was coined by Esther König-Baumer, who employs a variant of this calculus in her LexGram system (König 95) for practical grammar development.
4 It should be pointed out that the resource management in this calculus is very closely related to the handling and interaction of local valency and unbounded dependencies in HPSG. The latter, being handled with set-valued features SLASH, QUE and REL, essentially emulates the permutation potential of abstracted categories in semidirectional Lambek Grammar. A more detailed analysis of the relation between HPSG and SDL is given in (König 95).
In Semidirectional Lambek Calculus we add as additional connective the LP implication ⊸, but equip it only with a right rule.
\[
\frac{U, B, V \Rightarrow A}{T \Rightarrow B \multimap A}\ (\multimap R) \quad \text{if } T = (U, V) \text{ nonempty.}
\]
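To make the effect of (⊸R) concrete, the following is a minimal sketch of a naive backward-chaining search over the cut-free, product-free rules of L (Figure 2) together with (⊸R). It is an illustration under stated assumptions, not the procedure used in this paper: the tuple encoding of formulae and the names `over`, `under`, `lolli` and `derivable` are hypothetical, and the search is exponential in general.

```python
from functools import lru_cache

# Encoding (an assumption of this sketch): a primitive type is a str,
# a compound type is a tuple tagged with its main connective.
def over(a, b):    # a/b: yields a when followed by a b
    return ("/", a, b)

def under(b, a):   # b\a: yields a when preceded by a b
    return ("\\", b, a)

def lolli(b, a):   # b -o a: yields a if a b may be inserted anywhere
    return ("-o", b, a)

@lru_cache(maxsize=None)
def derivable(left, goal):
    """Search for a cut-free proof of the sequent left => goal."""
    if len(left) == 1 and left[0] == goal:                  # (Ax)
        return True
    if left and isinstance(goal, tuple):                    # right rules need U nonempty
        op, p, q = goal
        if op == "/" and derivable(left + (q,), p):         # (/R)
            return True
        if op == "\\" and derivable((p,) + left, q):        # (\R)
            return True
        if op == "-o" and any(                              # (-o R): try every position
                derivable(left[:i] + (p,) + left[i:], q)
                for i in range(len(left) + 1)):
            return True
    for i, f in enumerate(left):                            # left rules
        if not isinstance(f, tuple):
            continue
        op, p, q = f
        if op == "/":      # f = p/q, its argument T directly follows f
            for j in range(i + 2, len(left) + 1):
                if (derivable(left[i + 1:j], q)
                        and derivable(left[:i] + (p,) + left[j:], goal)):
                    return True
        elif op == "\\":   # f = p\q, its argument T directly precedes f
            for j in range(i):
                if (derivable(left[j:i], p)
                        and derivable(left[:j] + (q,) + left[i + 1:], goal)):
                    return True
        # there is deliberately no case for "-o": SDL has no left rule for it
    return False

np, s, pp = "np", "s", "pp"
misses = over(under(np, s), np)             # (np\s)/np
put = over(over(under(np, s), pp), np)      # ((np\s)/pp)/np
peripheral = derivable((np, misses), over(s, np))
medial_sdl = derivable((np, put, pp), lolli(np, s))
medial_l = derivable((np, put, pp), over(s, np))
```

Every rule application removes one connective from the sequent, so the search terminates. In this sketch the gap after `misses` is peripheral and is found with s/np, while the gap between `put` and its pp complement is only found with np ⊸ s.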
5 In contrast to Linear Logic (Girard 87) the order of types in U is essential, since the structural rule of permutation is not assumed to hold. Moreover, the fact that only a single formula may appear on the right of ⇒ makes the Lambek calculus an intuitionistic fragment of the multiplicative fragment of non-commutative propositional Linear Logic.
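\[
\begin{array}{ll}
\dfrac{}{A \Rightarrow A}\ (\mathrm{Ax}) &
\dfrac{T \Rightarrow A \qquad U, A, V \Rightarrow C}{U, T, V \Rightarrow C}\ (\mathrm{Cut}) \\[2ex]
\dfrac{T \Rightarrow B \qquad U, A, V \Rightarrow C}{U, A/B, T, V \Rightarrow C}\ (/L) &
\dfrac{U, B \Rightarrow A}{U \Rightarrow A/B}\ (/R) \quad \text{if } U \text{ nonempty} \\[2ex]
\dfrac{T \Rightarrow B \qquad U, A, V \Rightarrow C}{U, T, B\backslash A, V \Rightarrow C}\ (\backslash L) &
\dfrac{B, U \Rightarrow A}{U \Rightarrow B\backslash A}\ (\backslash R) \quad \text{if } U \text{ nonempty} \\[2ex]
\dfrac{U, A, B, V \Rightarrow C}{U, A \bullet B, V \Rightarrow C}\ (\bullet L) &
\dfrac{U \Rightarrow A \qquad V \Rightarrow B}{U, V \Rightarrow A \bullet B}\ (\bullet R)
\end{array}
\]
Figure 2: Lambek calculus L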
Let us define the polarity of a subformula of a sequent A1, ..., Am ⇒ A as follows: A has positive polarity, each of the Ai has negative polarity, and if B/C or C\B has polarity p, then B also has polarity p and C has the opposite polarity of p in the sequent.
A consequence of only allowing the (⊸R) rule, which is easily proved by induction, is that in any derivable sequent ⊸ may only appear in positive polarity. Hence, ⊸ may not occur in the (cut) formula A of a (Cut) application and any subformula B ⊸ A which occurs somewhere in the proof must also occur in the final sequent. When we assume the final sequent's RHS to be primitive (or ⊸-less), then the (⊸R) rule will be used exactly once for each (positively) occurring ⊸-subformula. In other words, (⊸R) may only do what it is supposed to do: extraction, and we can directly read off the category assignment which extractions there will be.
We can show Cut Elimination for this calculus by a straightforward adaptation of the Cut elimination proof for L. We omit the proof for reasons of space.
Proposition 1 (Cut Elimination) Each SDL-derivable sequent has a cut-free proof.
The cut-free system enjoys, as usual for Lambek-like logics, the Subformula Property: in any proof only subformulae of the goal sequent may appear.
In our considerations below we will make heavy use of the well-known count invariant for Lambek systems (Benthem 88), which is an expression of the resource-consciousness of these logics. Define #b(A) (the b-count of A), a function counting positive and negative occurrences of primitive type b in an arbitrary type A, to be
\[
\#_b(A) =
\begin{cases}
1 & \text{if } A = b \\
0 & \text{if } A \text{ primitive and } A \neq b \\
\#_b(B) - \#_b(C) & \text{if } A = B/C \text{ or } A = C\backslash B \text{ or } A = C \multimap B \\
\#_b(B) + \#_b(C) & \text{if } A = B \bullet C
\end{cases}
\]
The invariant now states that for any primitive b, the b-count of the RHS and the LHS of any derivable sequent are the same. By noticing that this invariant is true for (Ax) and is preserved by the rules, we can immediately state:
Proposition 2 (Count Invariant) If ⊢_SDL U ⇒ A, then #b(U) = #b(A) for any b ∈ B.
Let us in parallel to SDL consider the fragment of it in which (/R) and (\R) are disallowed. We call this fragment SDL-. Remarkable about this fragment is that any positive occurrence of an implication must be ⊸ and any negative one must be / or \.
2.2 Lambek Grammar
Definition 3 We define a Lambek grammar to be a quadruple (Σ, F, b_s, l) consisting of the finite alphabet of terminals Σ, the set F of all Lambek formulae generated from some set of propositional variables which includes the distinguished variable b_s, and the lexical map l : Σ → 2^F which maps each terminal to a finite subset of F.
We extend the lexical map l to nonempty strings of terminals by setting l(w1 w2 ... wn) := l(w1) × l(w2) × ... × l(wn) for w1 w2 ... wn ∈ Σ+.
The language generated by a Lambek grammar G = (Σ, F, b_s, l) is defined as the set of all strings w1 w2 ... wn ∈ Σ+ for which there exists a sequence
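\[
\begin{array}{cl}
B_1^n, B_2, C_1^n, C_2, c^{n+1}, b^{n+1} \Rightarrow y \quad (*) & \\
\hline
B_1^n, B_2, C_1^n, C_2, c^n, b^n \Rightarrow c \multimap (b \multimap y) \qquad x \Rightarrow x & 2\times(\multimap R) \\
\hline
A_2, B_1^n, B_2, C_1^n, C_2, c^n, b^n \Rightarrow x & (/L) \\
\vdots & \\
A_1^{n-1}, A_2, B_1^n, B_2, C_1^n, C_2, c, b \Rightarrow x & \\
\hline
A_1^{n-1}, A_2, B_1^n, B_2, C_1^n, C_2 \Rightarrow c \multimap (b \multimap x) \qquad x \Rightarrow x & 2\times(\multimap R) \\
\hline
A_1^n, A_2, B_1^n, B_2, C_1^n, C_2 \Rightarrow x & (/L)
\end{array}
\]
Figure 3: Proof of A1^n, A2, B1^n, B2, C1^n, C2 ⇒ x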
of types U ∈ l(w1 w2 ... wn) and ⊢_L U ⇒ b_s. We denote this language by L(G).
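The definition of L(G) quantifies over all type sequences in l(w1) × ... × l(wn). A minimal sketch of that enumeration follows; the function names and the `derivable` parameter are hypothetical, and the oracle used in the example is a stand-in rather than a prover.

```python
from itertools import product

def type_sequences(lexicon, word):
    """Yield every U in l(w1) x l(w2) x ... x l(wn) as a tuple of types."""
    return product(*(lexicon[w] for w in word))

def in_language(lexicon, word, start, derivable):
    """word is in L(G) iff some type sequence U for it satisfies U => start.

    `derivable` is a caller-supplied decision procedure for the calculus
    (hypothetical here; any prover for L or SDL could be plugged in).
    """
    return any(derivable(U, start) for U in type_sequences(lexicon, word))

# Toy lexicon; types are opaque strings because no prover is involved here.
lexicon = {"Bill": ["np"], "sleeps": ["np\\s", "(np\\s)/np"]}
candidates = list(type_sequences(lexicon, ["Bill", "sleeps"]))

# Stand-in oracle for illustration only: accepts exactly  np, np\s => s.
def toy_oracle(U, goal):
    return U == ("np", "np\\s") and goal == "s"

accepted = in_language(lexicon, ["Bill", "sleeps"], "s", toy_oracle)
```

The number of candidate sequences is the product of the lexical ambiguities of the words, which is one place where nondeterminism enters the membership question.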
An SDL-grammar is defined exactly like a Lambek grammar, except that ⊢_SDL replaces ⊢_L.
Given a grammar G and a string w = w1 w2 ... wn, the parsing (or recognition) problem asks whether w is in L(G).
It is not immediately obvious how the generative capacity of SDL-grammars relates to Lambek grammars or nondirectional Lambek grammars (based on calculus LP). Whereas Lambek grammars generate exactly the context-free languages (modulo the missing empty word) (Pentus 93), the latter generate all permutation closures of context-free languages (Benthem 88). This excludes many context-free or even regular languages, but includes some context-sensitive ones, e.g., the permutation closure of a^n b^n c^n.
Concerning SDL, it is straightforward to show that all context-free languages can be generated by SDL-grammars.
Proposition 4 Every context-free language is generated by some SDL-grammar.
Proof. We can use the standard transformation of an arbitrary context-free grammar G = (N, T, P, S) to a categorial grammar G'. Since ⊸ does not appear in G', each SDL-proof of a lexical assignment must also be an L-proof, i.e. exactly the same strings are judged grammatical by SDL as are judged by L. □
Note that since the {(Ax), (/L), (\L)} subset of L already accounts for the context-free languages, this observation extends to SDL-.
Moreover, some languages which are not context-free can also be generated.
Example. Consider the following grammar G for the language a^n b^n c^n. We use primitive types B = {b, c, x, y, z} and define the lexical map for Σ = {a, b, c} as follows:
l(a) := { x/(c ⊸ (b ⊸ x)), x/(c ⊸ (b ⊸ y)) }, abbreviated A1 and A2, respectively
[The lexical entries for b and c, abbreviated B1, B2 and C1, C2, are not recoverable from the source.]
The distinguished primitive type is x. To simplify the argumentation, we abbreviate types as indicated above.
Now, observe that a sequent U ⇒ x, where U is the image of some string over Σ, may only have balanced primitive counts if U contains exactly one occurrence of each of A2, B2 and C2 (accounting for the one supernumerary x and balanced y and z counts) and, for some number n ≥ 0, n occurrences of each of A1, B1, and C1 (because, resource-oriented speaking, each Bi and Ci "consumes" a b and c, resp., and each Ai "provides" a pair b, c). Hence, only strings containing the same number of a's, b's and c's may be produced. Furthermore, due to the Subformula Property we know that in a cut-free proof of U ⇒ x, the main formula in abstractions (right rules) may only be either c ⊸ (b ⊸ X) or b ⊸ X, where X ∈ {x, y}, since all other implication types have primitive antecedents. Hence, the LHS of any sequent in the proof must be a subsequence of U, with some additional b types and c types interspersed. But then it is easy to show that U can only be of the form
A1^n, A2, B1^n, B2, C1^n, C2,
since any / connective in U needs to be introduced via (/L).
It remains to be shown that there is actually a proof for such a sequent. It is given in Figure 3. The sequent marked with * is easily seen to be derivable without abstractions.
A remarkable point about SDL's ability to cover this language is that neither L nor LP can generate it. Hence, this example substantiates the claim made in (Moortgat 94) that the inferential capacity of mixed Lambek systems may be greater than the sum of its component parts. Moreover, the attentive reader will have noticed that our encoding also extends to languages having more groups of n symbols, i.e., to languages of the form a1^n a2^n ... ak^n.
Finally, we note in passing that for this grammar the rules (/R) and (\R) are irrelevant, i.e. that it is at the same time an SDL- grammar.
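The count argument used for this example grammar can be checked mechanically for the two types assigned to a. The sketch below implements the b-count for product-free formulae; the tuple encoding and the names `count` and `count_seq` are assumptions of this illustration, and the types for b and c are left out because they are not reproduced here.

```python
# Encoding (an assumption of this sketch): a primitive type is a str;
# ("/", A, B) stands for A/B, ("\\", B, A) for B\A, ("-o", B, A) for B -o A.
def count(b, formula):
    """b-count of a product-free formula: result count minus argument count."""
    if isinstance(formula, str):
        return 1 if formula == b else 0
    op, first, second = formula
    # For A/B the result is written first; for B\A and B -o A it is second.
    result, argument = (first, second) if op == "/" else (second, first)
    return count(b, result) - count(b, argument)

def count_seq(b, types):
    """b-count of a sequence of types (the LHS of a sequent)."""
    return sum(count(b, t) for t in types)

A1 = ("/", "x", ("-o", "c", ("-o", "b", "x")))   # x/(c -o (b -o x))
A2 = ("/", "x", ("-o", "c", ("-o", "b", "y")))   # x/(c -o (b -o y))

a1_counts = {p: count(p, A1) for p in "bcxy"}
a2_counts = {p: count(p, A2) for p in "bcxy"}
```

Under this encoding A1 contributes one b and one c and is neutral in x, while A2 additionally carries the one surplus x and a deficit of one y, which matches the resource reading given in the text.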
3 NP-Completeness of the Parsing Problem
We show that the Parsing Problem for SDL-grammars is NP-complete by a reduction of the 3-Partition Problem to it.6 This well-known NP-complete problem is cited in (GareyJohnson 79) as follows.
6 A similar reduction has been used in (LincolnWinkler 94) to show that derivability in the multiplicative fragment of propositional Linear Logic with only the connectives ⊸ and ⊗ (equivalently Lambek calculus with permutation LP) is NP-complete.
Instance: Set A of 3m elements, a bound N ∈ Z+, and a size s(a) ∈ Z+ for each a ∈ A such that N/4 < s(a) < N/2 and Σ_{a ∈ A} s(a) = mN.
Question: Can A be partitioned into m disjoint sets A1, A2, ..., Am such that, for 1 ≤ i ≤ m, Σ_{a ∈ Ai} s(a) = N (note that each Ai must therefore contain exactly 3 elements from A)?
Comment: NP-complete in the strong sense.
Here is our reduction. Let Γ = (A, m, N, s) be a given 3-Partition instance. For notational convenience we abbreviate (...((A/B1)/B2)/...)/Bn by A/Bn • ... • B2 • B1 and similarly Bn ⊸ (...(B1 ⊸ A)...) by Bn • ... • B2 • B1 ⊸ A, but note that this is just an abbreviation in the product-free fragment. Moreover the notation A^k stands for A • A • ... • A (k times).
We then define the SDL-grammar G_Γ = (Σ, F, b_s, l) as follows:
\[
\begin{array}{rcl}
\Sigma &:=& \{v, w_1, \ldots, w_{3m}\} \\
\mathcal{F} &:=& \text{all formulae over primitive types } \mathcal{B} = \{a, d\} \cup \bigcup_{i=1}^{m} \{b_i, c_i\} \\
b_s &:=& a \\
l(v) &:=& \{\, a/(b_1^3 \bullet \cdots \bullet b_m^3 \bullet c_1^N \bullet \cdots \bullet c_m^N \multimap d) \,\} \\
l(w_i) &:=& \bigcup_{1 \le j \le m} \{\, d/d \bullet b_j \bullet c_j^{s(a_i)} \,\} \quad \text{for } 1 \le i \le 3m-1 \\
l(w_{3m}) &:=& \bigcup_{1 \le j \le m} \{\, d/b_j \bullet c_j^{s(a_{3m})} \,\}
\end{array}
\]
The word we are interested in is v w1 w2 ... w3m. We do not care about other words that might be generated by G_Γ. Our claim now is that a given 3-Partition problem Γ is solvable if and only if v w1 ... w3m is in L(G_Γ). We consider each direction in turn.
Lemma 5 (Soundness) If a 3-Partition problem Γ = (A, m, N, s) has a solution, then v w1 ... w3m is in L(G_Γ).
Proof. We have to show, when given a solution to Γ, how to choose a type sequence U ∈ l(v w1 ... w3m) and construct an SDL proof for U ⇒ a. Suppose A = {a1, a2, ..., a3m}. From a given solution (set of triples) A1, A2, ..., Am we can compute in polynomial time a mapping k that sends the index of an element to the index of its solution triple, i.e., k(i) = j iff ai ∈ Aj. To obtain the required sequence U, we simply choose for the wi terminals the type d/d • b_k(i) • c_k(i)^s(a_i) (resp. d/b_k(3m) • c_k(3m)^s(a_3m) for w3m). Hence the complete sequent to solve is:
\[
a/(b_1^3 \bullet \cdots \bullet b_m^3 \bullet c_1^N \bullet \cdots \bullet c_m^N \multimap d),\;
d/d \bullet b_{k(1)} \bullet c_{k(1)}^{s(a_1)},\; \ldots,\;
d/d \bullet b_{k(3m-1)} \bullet c_{k(3m-1)}^{s(a_{3m-1})},\;
d/b_{k(3m)} \bullet c_{k(3m)}^{s(a_{3m})} \Rightarrow a \qquad (*)
\]
Let a/B0, B1, ..., B3m ⇒ a be a shorthand for (*), and let X stand for the sequence of primitive types
\[
b_{k(3m)},\; c_{k(3m)}^{s(a_{3m})},\; b_{k(3m-1)},\; c_{k(3m-1)}^{s(a_{3m-1})},\; \ldots,\; b_{k(1)},\; c_{k(1)}^{s(a_1)}.
\]
Using rule (/L) only, we can obviously prove B1, ..., B3m, X ⇒ d. Now, applying (⊸R) 3m + Nm times we can obtain B1, ..., B3m ⇒ B0, since there are in total, for each i, 3 bi and N ci in X. As final step we have
\[
\frac{B_1, \ldots, B_{3m} \Rightarrow B_0 \qquad a \Rightarrow a}{a/B_0, B_1, \ldots, B_{3m} \Rightarrow a}\ (/L)
\]
which completes the proof. □
Lemma 6 (Completeness) Let Γ = (A, m, N, s) be an arbitrary 3-Partition problem and G_Γ the corresponding SDL-grammar as defined above. Then Γ has a solution, if v w1 ... w3m is in L(G_Γ).
Proof. Let v w1 ... w3m ∈ L(G_Γ) and let
\[
a/(b_1^3 \bullet \cdots \bullet c_m^N \multimap d),\; B_1, \ldots, B_{3m} \Rightarrow a
\]
be a witnessing derivable sequent, i.e., for 1 ≤ i ≤ 3m, Bi ∈ l(wi). Now, since the counts of this sequent must be balanced, the sequence B1, ..., B3m must contain for each 1 ≤ j ≤ m exactly 3 bj and exactly N cj as subformulae. Therefore we can read off the solution to Γ from this sequent by including in Aj (for 1 ≤ j ≤ m) those three ai for which Bi has an occurrence of bj, say these are a_j(1), a_j(2) and a_j(3). We verify, again via balancedness of the primitive counts, that s(a_j(1)) + s(a_j(2)) + s(a_j(3)) = N holds, because these are the numbers of positive and negative occurrences of cj in the sequent. This completes the proof. □
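The two lemmas can be illustrated on small instances. In the sketch below each lexical choice for w_i is encoded as a pair (j, s(a_i)), standing for the type built from b_j and s(a_i) copies of c_j; this encoding and all function names are hypothetical. For this particular grammar the proofs of Lemmas 5 and 6 imply that a choice of types yields a derivable sequent exactly when its b_j and c_j counts balance, so the sketch tests the counts instead of running a prover. The exhaustive loop is exponential and is meant only as an illustration.

```python
from itertools import product

def lexical_choices(sizes, m):
    """For each terminal w_i, the m candidate types, encoded as (j, s(a_i))."""
    return [[(j, s) for j in range(1, m + 1)] for s in sizes]

def counts_balanced(choice, m, N):
    """True iff the chosen types contain exactly 3 b_j and N c_j for every j."""
    b = [0] * (m + 1)
    c = [0] * (m + 1)
    for j, s in choice:
        b[j] += 1      # each type consumes one b_j ...
        c[j] += s      # ... and s(a_i) copies of c_j
    return all(b[j] == 3 and c[j] == N for j in range(1, m + 1))

def read_off_partition(choice, m):
    """Completeness direction: group element indices by their chosen j."""
    parts = [[] for _ in range(m)]
    for i, (j, _) in enumerate(choice):
        parts[j - 1].append(i)
    return parts

def solve_by_parsing(sizes, m, N):
    """Try every type sequence for w_1 ... w_3m; return a partition or None."""
    for choice in product(*lexical_choices(sizes, m)):
        if counts_balanced(choice, m, N):
            return read_off_partition(choice, m)
    return None

sizes = [4, 5, 6, 4, 5, 6]                 # m = 2, N = 15, sizes sum to m * N
solution = solve_by_parsing(sizes, 2, 15)
no_solution = solve_by_parsing([4, 4, 4, 6, 6, 6], 2, 15)
```

The returned lists hold indices into `sizes`; the second instance has no triple summing to 15, so no choice of types balances the counts.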
The reduction above proves NP-hardness of the parsing problem. We need strong NP-completeness of 3-Partition here, since our reduction uses a unary encoding. Moreover, the parsing problem also lies within NP, since for a given grammar G proofs are linearly bounded by the length of the string and hence we can simply guess a proof and check it in polynomial time. Therefore we can state the following:
Theorem 7 The parsing problem for SDL is NP-complete.
Finally, we observe that for this reduction the rules (/R) and (\R) are again irrelevant and that we can extend this result to SDL-.
4 Conclusion
We have defined a variant of Lambek's original calculus of types that allows abstracted-over categories to freely permute. Grammars based on SDL can generate any context-free language and more than that. The parsing problem for SDL, however, we have shown to be NP-complete. This result indicates that efficient parsing for grammars that allow for large numbers of unbounded dependencies from within one node may be problematic, even in the categorial framework. Note that the fact that this problematic case doesn't show up in the correct analysis of normal NL sentences doesn't mean that a parser wouldn't have to try it, unless some arbitrary bound to that number is assumed. For practical grammar engineering one can devise the motto: avoid accumulation of unbounded dependencies by whatever means.
On the theoretical side we think that this result for SDL is also of some importance, since SDL exhibits a core of logical behaviour that any (Lambek-based) logic must have which accounts for non-peripheral extraction by some form of permutation. And hence, this result increases our understanding of the necessary computational properties of such richer systems. To our knowledge the question whether the Lambek calculus itself or its associated parsing problem is NP-hard is still open.
References
J. van Benthem. The Lambek Calculus. In R. T. Oehrle et al. (Eds.), Categorial Grammars and Natural Language Structures, pp. 35-68. Reidel, 1988.
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, Cal., 1979.
J.-Y. Girard. Linear Logic. Theoretical Computer Science, 50(1):1-102, 1987.
E. König. LexGram - a practical categorial grammar formalism. In Proceedings of the Workshop on Computational Logic for Natural Language Processing. A Joint COMPULOGNET/ELSNET/EAGLES Workshop, Edinburgh, Scotland, April 1995.
J. Lambek. The Mathematics of Sentence Structure. American Mathematical Monthly, 65(3):154-170, 1958.
P. Lincoln and T. Winkler. Constant-Only Multiplicative Linear Logic is NP-Complete. Theoretical Computer Science, 135(1):155-169, Dec. 1994.
M. Moortgat. Residuation in Mixed Lambek Systems. In M. Moortgat (Ed.), Lambek Calculus. Multimodal and Polymorphic Extensions, DYANA-2 deliverable R1.1.B. ESPRIT, Basic Research Project 6852, Sept. 1994.
G. Morrill. Type Logical Grammar: Categorial Logic of Signs. Kluwer, 1994.
M. Pentus. Lambek grammars are context free. In Proceedings of Logic in Computer Science, Montreal, 1993.