From: AAAI Technical Report SS-95-01. Compilation copyright © 1995, AAAI (www.aaai.org). All rights reserved.

Two-dimensional clusters in grammatical relations

Mats Rooth

1 Introduction

If someone asked me to explain the word "product" as used below, I might say that it refers to things like drugs, cars, and software which are developed, produced, sold, and used.

(1) The company announced a new product.

While it is drugs, cars and software which are serving as examples of products, the list of verbs has a role as well. If I merely said that products were things like drugs, cars, and software, someone would hardly be in a position to say whether toothpaste counts as a product. And a definition of "product" as simply "something which is produced" gets a more general sense of the word, including things such as toxic waste produced by an industrial process, and carbon dioxide produced by human respiration.

The explanation of "product" by multiplied example involves the implicit claim that drugs, cars, and software are all developed, all produced, all sold, and all used. Not only is this true, but in a large enough text corpus we can find examples of drugs, cars, and software being described as being developed, produced, sold and used, in all combinations:

(2) a. Their goal is to develop new drugs to compete in world markets.
    b. Genzyme sells specialty chemicals used by other companies to produce drugs, diagnostic tests and other products and performs contract research for drug companies.
    c. "We absolutely must be in the U.S. market by 1992," he says, "to sell the new drugs our research labs will be producing."
    d. Milton Bass, a New York attorney for Zenith, asserted that the appeals court's ruling will spur civil suits against other drug companies that have "conducted campaigns to frighten doctors and others not to use generic drugs."

(3) a. The Europeans' problem: the huge sums it takes to develop new cars and bring them to market.
    b. The eight major automakers didn't produce any cars last week because of the holidays.
    c. "This year, for the first time, we've had to work to sell the cars," says Michael J. Jackson, a Saab dealer in suburban Washington, D.C.
    d. A halt to imports of sedans, plus a requirement that top officials use only Chinese-made cars.

(4) a. "This settlement paves the way for the competition to develop software in the operating system arena."
    b. He guessed it will take Microsoft six to 12 months to produce the software based on the Sybase technology.
    c. The company operates retail stores that sell computer software.
    d. They are writing a book to show campaign managers how to use Lotus 1-2-3 software to analyze local voting habits.

These examples are from a corpus of Wall Street Journal newspaper stories, containing roughly 60 million word-like tokens (Liberman [Liberman, 1992]). Each example is identified by its position in the corpus.

The combinatory pattern evident in these sentences is of independent interest, because it suggests a simple way of capturing selectional patterns relating verbs to their objects, or lexical pairs participating in other grammatical relations. In a number of problems arising in computational linguistics, we need to be able to decide whether a given phrase is an appropriate filler for a grammatical relation assigned by another word. This often comes down to a question of semantic compatibility between the head of the filler phrase and the semantic role associated with the grammatical relation. For instance, in the sentence below, resulted and approved each have morphological readings both as tensed verbs and as past participles.

(5) The charge resulted from a settlement approved yesterday.

The morphological indeterminacy results in syntactic ambiguity. In the first structure in the first row of Table 1, charge is the subject of resulted and approved is a postmodifier of settlement. In the second analysis, resulted is a postmodifier of charge, and charge is the subject of approved.
Deciding between the syntactic analyses below comes down, at least in part, to a matter of semantic selection. In choosing an analysis, it would be useful to know (among other things) whether settlement is a good object for approve (it is) and whether charge is a good subject for approve (it is not). (Note that in terms of the underlying role assigned, the postmodified noun phrases are equivalent to objects of their modifiers.) The sentence below has the same ambiguity, and in this case it is the second syntactic analysis which is correct.

(6) Private-sector union contracts signed in the third quarter granted slightly lower wage increases.

[Table 1: First row: correct and incorrect syntactic analyses for sentence (5). Second row: incorrect and correct syntactic analyses for sentence (6).]

Suppose that we wanted to attack such problems using information from a training corpus. Since thousands of verbs and nouns are involved, we cannot conclude that a given verb-object pair is impossible simply because we cannot find it in the corpus. Take for instance a slightly more uncommon product verb such as export, and a more uncommon product noun, such as engine. Although the verb-object pair export engine is intuitively plausible, it was not detected at all by the method described below in a six million word training corpus. A selection model which generalizes among words has the potential of solving the problem, since while the verb export does not occur with engine, it occurs with other product nouns, and engine occurs with other product verbs.

The purpose of this paper is to develop a mathematical and computational model which captures the notion of a selectional dependency between a set of verbs and a set of nouns, or more generally two sets of words participating in a grammatical relation. Section 2 describes a simple categorical selection model, which in section 3 is given a probabilistic twist.
Locally optimal probabilistic models can be generated by an incremental procedure similar to Baum-Welch re-estimation of hidden Markov models. In section 4 we look at results for a reasonably large sample of verbs and nouns. Section 5 discusses applications of the probabilistic model, giving preliminary results for a parsing problem.

[Table 2: Frequency counts for 24 verbs and 24 object heads.]

2 Categorical selection types

Table 2 is a matrix of frequency counts for verb-object occurrences of twenty-four verbs and twenty-four nouns. The table is a sub-part of a verb-object matrix derived from an approximately 6 million word sample of the Wall Street Journal, parsed by Donald Hindle with his Fidditch parser.¹ I extracted verb-noun pairs with a Lisp program from list representations of parses, and mapped them to uninflected forms using a full-form word list. The resulting list contained 84182 non-zero frequency counts. The frequency matrix was reduced by eliminating rows (corresponding to verbs) with fewer than five non-zero entries, and subsequently eliminating columns with fewer than five non-zero entries. This gave a matrix indexed by 992 verbs and 1027 nouns, containing 55251 non-zero entries.
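The two-stage reduction just described can be sketched in a few lines. This is a minimal illustration in Python, not the Lisp program used for the paper: the function name `reduce_counts`, the dictionary keyed by (verb, noun) pairs, and the toy counts are all assumptions made for the example. As in the text, columns are filtered once after the rows, and rows are not re-checked afterwards.

```python
from collections import Counter

def reduce_counts(counts, min_nonzero=5):
    """Filter a sparse verb-object count table.

    counts maps (verb, noun) -> frequency. Rows (verbs) with fewer than
    min_nonzero non-zero entries are dropped first; columns (nouns) are
    then filtered on the row-reduced table. Rows are not re-checked.
    """
    nonzero = {pair: f for pair, f in counts.items() if f > 0}
    # Number of distinct nouns seen with each verb.
    nouns_per_verb = Counter(verb for verb, _ in nonzero)
    rows_kept = {(v, n): f for (v, n), f in nonzero.items()
                 if nouns_per_verb[v] >= min_nonzero}
    # Number of distinct verbs seen with each noun, after row filtering.
    verbs_per_noun = Counter(noun for _, noun in rows_kept)
    return {(v, n): f for (v, n), f in rows_kept.items()
            if verbs_per_noun[n] >= min_nonzero}

# Toy data (hypothetical counts), with the threshold lowered to 2.
toy = {("buy", "stock"): 3, ("buy", "share"): 2, ("buy", "lunch"): 1,
       ("sell", "stock"): 1, ("sell", "share"): 4, ("eat", "stock"): 1}
reduced = reduce_counts(toy, min_nonzero=2)
print(sorted(reduced))
```

With the toy data, the verb eat is dropped in the first stage and the noun lunch in the second, leaving the four buy/sell pairs with stock and share.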
The verbs and nouns in the 24 × 24 sub-table were selected by hand for illustrative purposes. Re-arranging the rows and columns in the small table brings out a dependency between the rows and columns. In Table 3, most of the non-zero entries are in the three blocks on the diagonal. We can think of the block organization as capturing three semantic selectional types within this part of the verb-object grammatical relation. The first block corresponds to a notion of exchange of financial instruments or ownership interests. The second block involves measurement of a scalar motion by some dimensioned quantity, and the third block involves change in some scalar quantity, such as a stock price. To represent the blocks, we need not re-order the matrix: we can equate a block with a pair of a subset of the verb set and a subset of the noun set.

¹ The parser is described in [Hindle, 1983] and [Hindle, 1994]. Similar verb-object data is discussed in [Church et al., 199x].

[Table 3: The same counts in another order. Verb rows, in order: acquire, buy, dump, hold, purchase, retain, sell, trade; climb, decline, drop, fall, gain, jump, rise, plunge; boost, cut, increase, lower, push, raise, reduce, slash.]

A closer examination shows that it is not reasonable to insist that the noun sets for different blocks be disjoint. The noun stake occurs frequently with the verbs of the third block (e.g.
boost, increase, raise, and reduce), as well as with the verbs of the first (e.g. acquire, buy, sell, and trade). Here are some example sentences:

(7) a. Most of Japan's big computer companies hold a 1% stake in Ascii, and Mitsui & Co., one of Japan's largest trading companies, plans to boost its stake to 5%.
    b. Though Koito is resisting, and Mr. Pickens has announced plans to increase his stake to 26%, he maintains "there's nothing hostile" about his investment.
    c. Ford will raise its Jaguar stake to the maximum 15% after the 30-day waiting period expires, the Ford executive said.
    d. The sales reduce the group's stake to 740,500 shares.

(8) a. Galesi Group said it is seeking to acquire Lone Star Technologies Inc.'s stake in American Federal Bank for $58 million.
    b. The Italian financier is close to announcing that he's setting up a holding company to buy controlling stakes in Hungarian companies, according to Hungarian sources.
    c. Domino's Pizza owner Thomas Monaghan may sell his 97% stake in the chain, which is the nation's largest pizza delivery company.
    d. One possibility would be for the U.S. group, formally called Newgateway PLC, to trade its troublesome stake to Isosceles for certain Gateway assets.

The reason for the overlap is that a stake (say in a company) is both something which can be bought or sold, and a scalar quantity which can be increased or decreased.

The verb increase is in principle a symmetric example, though this is not really obvious in these data. It occurs both with objects denoting a changing scalar quantity, and with objects (or pseudo-objects) measuring that change:

(9) a. Jack W. Forrest, Environmental Systems president and chief executive officer, said the change increases the company's effective tax rate to about 35% from 20%.
    b.
If Mr. Wanniski's theory is right, that a tax cut increases the value of capital assets held by owners, wouldn't that increase also represent a rather dramatic and totally unjustified inflation of those values?
    c. Georgia Gulf said it increased the exercise price of the rights to $120 from $50.
    d. "It has increased the level of caution in the market," said the Chicago trader.

(10) a. Economists said the August civilian unemployment rate will have increased 0.1 percentage point to 5.3%.
     b. Unleaded-gasoline futures were mixed, although the September contract increased 0.22 cent to settle at 54.15 cents a gallon.

In the latter group, the changing scalar quantity is realized as subject.

Given two vocabularies V and N, we define a selection type as a pair $\langle V', N' \rangle$, where $V' \subseteq V$ and $N' \subseteq N$. A selectional model is simply a set of selectional types, and given what was said above, we should not impose any requirement of non-overlap for the noun sets or verb sets of a selection model.

A reasonable selection model for the 24 × 24 frequency matrix is:

    verbs: acquire, buy, dump, hold, purchase, retain
    nouns: interest, security, share, stake, stock, unit

    verbs: climb, decline, drop, fall, gain, jump, rise, plunge, increase
    nouns: foot, mark, pence, point, yen

    verbs: boost, cut, increase, lower, push, raise, reduce, slash
    nouns: dividend, price, rate, rating, tax, value, share, stake, stock

3 Probabilistic selection types

The above construction, because it is categorical, encodes no information about the relative frequency of, for instance, buy and dump as verbal realizations of the first selectional type. In applications, having access to graded information is useful, in that it allows large numbers of analyses, such as syntactic parses, to be ranked. Furthermore, once we get beyond simple examples, it is not clear that membership in selectional patterns should be considered discrete.² Among ways of introducing graded distinctions, probabilistic models are appealing, because they have the potential of telling us, in complex situations, how a number of graded distinctions are to be combined.
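To make the limitation of the categorical construction concrete, a model of the kind defined in section 2 can be represented directly as pairs of sets. The sketch below is an illustrative assumption in Python: the two types are abbreviated versions of the ones discussed above, and `licensing_types` is a hypothetical helper name. It can report which types admit a verb-noun pair, but it gives no way of ranking buy against dump within a type.

```python
# A selection type is a pair (V', N'); a model is a collection of such pairs.
# The sets are abbreviated for illustration and may overlap (note "stake").
MODEL = [
    (frozenset({"acquire", "buy", "dump", "hold"}),
     frozenset({"share", "stake", "stock"})),
    (frozenset({"boost", "increase", "raise", "reduce"}),
     frozenset({"price", "rate", "stake"})),
]

def licensing_types(model, verb, noun):
    """Return the indices of the types whose verb set and noun set both match."""
    return [i for i, (verbs, nouns) in enumerate(model)
            if verb in verbs and noun in nouns]

print(licensing_types(MODEL, "buy", "stake"))    # admitted by the first type
print(licensing_types(MODEL, "raise", "stake"))  # admitted by the second type
print(licensing_types(MODEL, "buy", "price"))    # admitted by no type
```

Both buy stake and dump stake are admitted by the first type on equal terms, which is exactly the information gap that the probabilistic version is meant to fill.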
The simple recipe for turning a categorical model into a probability model is to replace characteristic functions of sets with probability distributions. In the present case, we redefine a selection type as a pair of discrete distributions, one on the verbs and one on the nouns:³

$$\langle \lambda v\, p^\tau_v,\ \lambda n\, p^\tau_n \rangle$$

The function $\lambda v\, p^\tau_v$ maps a verb to a number in the interval [0,1], meeting the constraint that the set of verb probabilities sums to one, and similarly for the nouns.

² Furthermore, I did not say what makes one categorical selectional model better than another, and how they are to be discovered computationally. I have experimented with an incremental search for selection models, using a Solomonoff-Kolmogoroff-Chaitin measure to evaluate the combination of the complexity of the model with the complexity of describing the data matrix given the model. I will not describe this method here, since the search was computationally expensive, and the results only moderately encouraging.

³ $\lambda v\, p^\tau_v$ generates a probability distribution $\lambda X \sum_{x \in X} p^\tau_x$ measuring subsets of V. In the text, I suppress the distinction between discrete probability distributions and their generators.

[Table 4: Parameters of a selection model. Type probabilities: type 1 .204, type 2 .315, type 3 .481.]

In order to use this notation, we must assume that the verb, noun and type sets are (or have been rendered) disjoint: we do not want to identify the verb probability $p^\tau_{\mathit{increase}/V}$ with the noun probability $p^\tau_{\mathit{increase}/N}$. We also add a probability distribution over the types.
Using an initial segment of the natural numbers to index the types, a probabilistic selection model for V and N with k types then consists of a probability distribution $\lambda\tau\, p_\tau$ over the set of integers {1, ..., k} (= T), and for each type $\tau$ in {1, ..., k}, a pair of probability distributions $\langle \lambda v\, p^\tau_v,\ \lambda n\, p^\tau_n \rangle$, as described above. Derivatively, for any type $\tau$ we construct a probability distribution on V × N as a product:

$$p^\tau_{v,n} = p^\tau_v\, p^\tau_n$$

We construct a probability distribution on T × V × N as a disjoint union of these products:

$$p_{\tau,v,n} = p_\tau\, p^\tau_v\, p^\tau_n$$

Such a model can be used to assign probabilities to verb-object pairs. In the probability space just defined, if a verb-noun pair is generated, it is generated in some type, and we obtain the probability for the verb-noun pair by summing over the types:

$$p_{v,n} = \sum_\tau p_{\tau,v,n} = \sum_\tau p_\tau\, p^\tau_v\, p^\tau_n$$

In section 5, such probabilities are used to compare two grammatical analyses.

Table 4 gives the parameters of a selection model of order three for the 24 × 24 data.⁴ Notice that the noun stake is ranked high in both the second and third types.

⁴ Or rather, as one can discover by summing the first column of numbers, the approximate parameters.

    type 1 .204                   type 2 .315                    type 3 .481
    fall     .344  point   .364   raise    .268  rate     .238   sell     .319  share    .340
    rise     .343  cent    .281   reduce   .189  price    .207   buy      .288  stake    .198
    gain     .128  pence   .080   cut      .162  cost     .143   hold     .096  stock    .172
    drop     .089  yen     .071   lower    .109  debt     .071   acquire  .093  interest .078
    decline  .038  average .017   increase .100  tax      .063   purchase .063  asset    .068
    climb    .028  bit     .011   boost    .073  rating   .059   trade    .027  unit     .059
    jump     .020  mark    .011   push     .036  dividend .050   retain   .019  security .038
    plunge   .019  foot    .004   slash    .034  value    .044   dump     .009  bond     .037

Table 5: The same model, with verbs and nouns represented only where they are most likely to be generated.
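The summation over types that assigns a probability to a verb-object pair can be sketched as follows. The parameter values are made-up toy numbers chosen so that the arithmetic is easy to follow, not the estimates of Table 4, and the function name `pair_probability` and the list-of-dictionaries layout are assumptions made for illustration.

```python
def pair_probability(type_probs, verb_probs, noun_probs, verb, noun):
    """p(v, n) = sum over types t of p_t * p_t(v) * p_t(n)."""
    return sum(
        p_t * verb_probs[t].get(verb, 0.0) * noun_probs[t].get(noun, 0.0)
        for t, p_t in enumerate(type_probs)
    )

# Toy two-type model (hypothetical numbers); each distribution sums to one.
TYPE_PROBS = [0.5, 0.5]
VERB_PROBS = [{"raise": 0.75, "buy": 0.25}, {"buy": 1.0}]
NOUN_PROBS = [{"price": 0.5, "stake": 0.5}, {"stake": 0.5, "share": 0.5}]

# "stake" is a likely noun in both types, so both contribute to (buy, stake):
# 0.5 * 0.25 * 0.5 + 0.5 * 1.0 * 0.5
print(pair_probability(TYPE_PROBS, VERB_PROBS, NOUN_PROBS, "buy", "stake"))
# No type generates both "raise" and "share", so this pair gets probability zero.
print(pair_probability(TYPE_PROBS, VERB_PROBS, NOUN_PROBS, "raise", "share"))
```

Because each type contributes a product of two distributions weighted by the type probability, the pair probabilities over the whole toy vocabulary sum to one.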
Estimating a model

Given a selection model and an observed verb-object pair (v, n), the probability that it is generated in type $\tau$ is:

$$\frac{p_{\tau,v,n}}{p_{v,n}}$$

Multiplying by the frequency $f_{v,n}$, we obtain the expected number of occurrences of the event $(\tau, v, n)$ given the observed frequency and the model:

$$e_{\tau,v,n} = f_{v,n}\, \frac{p_{\tau,v,n}}{p_{v,n}}$$

This forms a basis for re-estimating the probability parameters:

$$e_{\tau,v} = \sum_n e_{\tau,v,n}, \qquad e_{\tau,n} = \sum_v e_{\tau,v,n}, \qquad e_\tau = \sum_{v,n} e_{\tau,v,n}$$

The probabilities q, the parameters of the new model, are computed as relative frequencies of expected numbers of events, as determined by the old model. For instance, the probability of the verb fall within type 1 would be the expected number of occurrences of fall in that type (given the data and model), divided by the expected number of occurrences of that type of verb-object pair, that is, $q^1_{\mathit{fall}} = e_{1,\mathit{fall}} / e_1$. The formulas are similar to the Baum-Welch re-estimation formulas for hidden Markov models (Baum 1972).⁵ Adapting Baum's result for HMMs, it can be shown that an iterative re-estimation of parameters produces local improvements in the probability of the observed data given the model, and converges to a local maximum of this probability. Table 4 was derived in one hundred iterations starting from a random state, using the frequency data in Table 2. Table 5 is a different way of looking at the same model: each verb or noun is shown only in the type where it is most likely to be generated, i.e. where $p_\tau p^\tau_v$ (or $p_\tau p^\tau_n$) is maximal. This representation reconstructs the three eight-by-eight boxes of Table 3.

⁵ Selection models as described here can be viewed as zero-order HMMs, augmented with a second surface vocabulary and associated emission probabilities. That is, given a state (or type), two surface symbols are independently generated. Furthermore, as used here, the types are hidden in the sense that they are given no prior interpretation, and are not observed in the data.

4 Results for a larger data matrix

Storage requirements for the estimation algorithm are modest. There are $|T||V| + |T||N| + |T|$ probability parameters.
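One re-estimation step, accumulating the expected counts defined above over the non-zero frequencies only, might look like the following. This is a sketch in Python under assumed data structures (a list of type probabilities and per-type dictionaries); the names `joint_by_type`, `reestimate` and `log_likelihood`, the starting parameters, and the toy counts are hypothetical, and the code is not the implementation used for the results reported here.

```python
import math
from collections import defaultdict

def joint_by_type(type_probs, verb_probs, noun_probs, v, n):
    """Return the list of p_t * p_t(v) * p_t(n), one entry per type t."""
    return [p * verb_probs[t].get(v, 0.0) * noun_probs[t].get(n, 0.0)
            for t, p in enumerate(type_probs)]

def reestimate(type_probs, verb_probs, noun_probs, freqs):
    """One re-estimation step; freqs maps (verb, noun) -> observed frequency."""
    k = len(type_probs)
    e_type = [0.0] * k
    e_verb = [defaultdict(float) for _ in range(k)]
    e_noun = [defaultdict(float) for _ in range(k)]
    for (v, n), f in freqs.items():
        joint = joint_by_type(type_probs, verb_probs, noun_probs, v, n)
        total = sum(joint)
        if total == 0.0:
            continue  # the current model cannot generate this pair
        for t in range(k):
            e = f * joint[t] / total  # expected count of the event (t, v, n)
            e_type[t] += e
            e_verb[t][v] += e
            e_noun[t][n] += e
    grand = sum(e_type)  # assumes at least one pair could be generated
    new_type = [e / grand for e in e_type]
    # New parameters are relative frequencies of expected counts.
    new_verb = [{v: e / e_type[t] for v, e in e_verb[t].items()}
                if e_type[t] > 0 else {} for t in range(k)]
    new_noun = [{n: e / e_type[t] for n, e in e_noun[t].items()}
                if e_type[t] > 0 else {} for t in range(k)]
    return new_type, new_verb, new_noun

def log_likelihood(type_probs, verb_probs, noun_probs, freqs):
    """Log probability of the observed counts under the model."""
    return sum(f * math.log(sum(joint_by_type(type_probs, verb_probs, noun_probs, v, n)))
               for (v, n), f in freqs.items())

# Toy counts and an arbitrary strictly positive starting point (both hypothetical).
FREQS = {("buy", "stock"): 10, ("sell", "stock"): 8, ("raise", "price"): 7,
         ("cut", "price"): 5, ("raise", "stake"): 2, ("buy", "stake"): 3}
model = ([0.5, 0.5],
         [{"buy": 0.4, "sell": 0.3, "raise": 0.2, "cut": 0.1},
          {"buy": 0.1, "sell": 0.2, "raise": 0.3, "cut": 0.4}],
         [{"stock": 0.5, "price": 0.2, "stake": 0.3},
          {"stock": 0.2, "price": 0.5, "stake": 0.3}])
before = log_likelihood(*model, FREQS)
for _ in range(20):
    model = reestimate(*model, FREQS)
after = log_likelihood(*model, FREQS)
print(before, after)
```

The accumulators hold one number per type, per type-verb pair and per type-noun pair, which matches the parameter count given above; the frequency table itself is only read sequentially.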
The re-estimation formulas sum over non-zero frequencies, and the expectations can be computed by summing iteratively over such frequencies. In each step, a frequency $f_{v,n}$ is apportioned among the types according to the ratio $p_{\tau,v,n} / p_{v,n}$, and the portion for a type $\tau$ is added to running subtotals of $e_{\tau,v}$ and $e_{\tau,n}$. This procedure requires intermediate storage of the same size as the probability model. Since no random access to the frequencies is required, there is no need to represent a V × N matrix. (This might turn out to be useful. In a larger data matrix based on 60 million words of text, about 3500 verb roots occur with five or more different nouns, and about 7500 noun roots, not counting proper nouns or numbers, occur with five or more verbs. One of these numbers would be multiplied if verbs with complements accompanied by prepositions were included as separate entries.)

The algorithm was implemented in Common Lisp. Tables 6, 7 and 8 give the most probable nouns and verbs in each type of a probability model for the 992 × 1027 matrix with thirty-two types, obtained with four hundred iterations. The three blocks of Table 3 are represented here: type 3 is the product type (e.g. develop software), changing dimensioned objects (raise price) are in type 8, and scalar increments (rise cent) in type 26. Type 30 is a related one where the object typically denotes a scalar motion event, such as a decline or an increase.

The nouns of types 19 and 29 primarily name people. In the nouns, the split seems to amount roughly to a distinction between powerful, active people (executives, lawyers, officials) and weak or passive ones (workers, clients, shareholders). Looking at the verbs, the type 19 people are appointed and replaced;⁶ the type 29 people are given orders and permissions.

Several types are dominated by a common and semantically empty verb, such as be (12), have (23), or make (31). In these cases, the noun sets are not intuitively coherent, presumably because these verbs impose such weak selectional restrictions.
The verb sets are not particularly coherent either, though this is balanced by the fact that most of the verbs have low probabilities. These common verbs also occur in manyother types, for instace be in top position in type 32, which is an intuitively coherent type. Sin verb group 19, a number of items resulting from parsing errors are evident. Presumably, manage comes from managing director misidentified as a verb phrase. 144 t~/pe 1.015 ~pe 2 .021 meet .23606 standard .06597; end .10815 ysar .11873 set .15089 record .05207! trade .07582 Friday .09733 be .04807 Soaa .04920 say .06728 w~k .08977 keep .042O8 need .04483 work .O~l~S, month .08074 bit .03637 requiremmlt .04162 bel;in .06120! d&y .06488 mt-, .03403 level .03564 spend .08076 time .04698 exceed .03322 demand .03168 ~anounco .02880 tod~r .04067 reach .02533 tarpt .02963 close .02638 ~.aday .03573 ~.hieve .01923 high .02828 open .02420 hour .00881 fulfill .01518 .02678 start .02219 Monda~r .02724 pace sut ~..01330 .02158 expire .01718 wsy .02458 llve .01241 date .01860 mark .01818 Wadnwday .02287 surpus .01163 pertinent .01819 lut .01548 c&pitai .02116 ut~bllsh .01126 expocta~on .01764 wait .01396 taik .01933 point .01623 strc~ .00919 follow .01147 night .01526 :eliminate .00888 limit .01898 serve .0U17 w~r .01030 typc 4 .029 t~e 3 .032 type 6 .035 use .16863 product .07639 continue .06015 oper~ion .08075 complete .08198 sslc .10672 de,lop .09296 system .04469 be~n .08647 effort .07703 flnLnco .08686 ! 
acq,,;sition .06132 produce .08613 drug .03406 16unch .04128 progx~un .0g099 include .04893 transaction .08817 introduce .03273 teebnololD, .03110 conduct .03157 csmpaJgn .02961 be .03700 chamge.04977 markct .02933 computer .02236 tlsum@ .02848 production .02943 .03244 purchsee .04142 sell .02778 country .01948 I expand .02828 iuwlctiSatlon .02209 ,,pprove .03176 merger .03092 include .03140 car .01875I start .02476 service .02029 rcprseeut .02?03 order .0270g s~p .02109 equipment .01685 I be .02280 proem .01670 ~ounce .02536 incruue .02646 supply .01770 line .01666 my .02189 work .01808 block .02272 action .02278 version .01500 h-St .01456 negotiation .01414 consider .02208 Set .O16O2 ta~keover .01989 distribute .01149 machine .01410 follow .01338 psyment .01407 involve .01758 proje~ .01827 test .01145 procmKI .01391 ban .01304: activity .01a83 propose .01702 imue .01792 msnufacture .01107 prosrnm .01230 r~trict .01189: practice .01290 explore .01670 offering .01475 promote .01035 ton .01226 support .011871 talk .01264 expect .01570 move .01278 lnm~l .00983 software .01178 o~mme .01148 development .01288 follow .01496 use .01208 build .00971 model .01146 )i~n .01145 use .01224 discuss .01474 iuvwtment .01098 t, ~e 6.024 t]vpe 7.023 We 8.041 file .14736 suit .11129 p~ .09036 money .20670 raise .17768 price .15348 follow .09798 cm .06304 spend .07740 S .07810 incrom .09248 rm .15152 settle .04932 report .06176 raise .05364 role .05575 boost .07783 st’ke .047234 deny .03748 decision .04991 be .03728 fund .06401 reduce .04806 question .03490 imue .03688 charSc .04782 cub .03552 lower .04484 set .03375 rating .03048 he~r .03011 Iowsuit .03991 Isee .03372; lut .02660 cut .03509 v~ue .02881 gay .02876 st&tement .03674 ccot .03356 time .02688 .02377 sale .02844 8&vg dismiss .02466 &pp~ .02439 .03050 dollar .02812 offer .02240 earning .02379 bring .02329 complaint .02390 use .02727 part .02037 ptmh .01979 number .02363 : be .02301 cJsJm.02312 put .02146 i million .01805 
expect .01608 level .02135 connrm .02157 ailelKatiou .021421 lend .01760 gsme .01680 keep .01407 doUar .01719 appeal .02020 .01720 cupitai .01508 ruling .01910; P~Y disclcoe .01198 c~pitai .0170Y include .02019 piss .01561 inv~t .01698 total .01428 double .01191 revenue .01661 review .01744 action .01490 need .01668 billion .01380 lu~wer .01182 sise .01569 reverse .01627 rumor .01408 give .01620 llfe .01291 bring .01062 ¢cet .01369 see .01521 recommendation .01402 add .01520 7ear .01182 maintain .01042 production .01367 ~/pe9 .023 t,/pe 10.014 11 .032 reject .08033 bid .15796 hold .42625 company .29762 I~y .21425 cost .13843 consider .06530 prupceai .11701 acquire .04866 meeting .09623 reduce .14588 debt .08962 accept .08288 call .04400 o~r .11493 ~ .05878 cut .09988 rex .05801 suppo~ .Q3836 plan .06008 attend .02986 t41k .04176 : incremm .03444 dividend .03704 decline .03064 comment .03193 schedule .02436 hesrinK .03762 I cover .02748 price .03451 drop .02975 requ~t .02891 lesve .02433 conference .03723 include .02187 $ .02684 close .02825 ides .02331 .02301 pceitiou .02211 be .01433 fee .02370 submit .02401 attempt .01 774 tell .02178 election .01934 repay .01388 expense .02319 be .02389 claim .01346 force .02111 concern .01878 slub .01313 deRclt .02184 race/re .02382 effort .01319 follow .01705 discussion .01367 keep .01286 interest .019U upprove .02158 issue .01301 be .01290 poet .01207 limit .01228 bill .01868 launch .02135 strateSy .01216 mine .01207 stock .01060 control .01161 IoLn .01818 b4w.k .01941 option .01175 coutrui .01057 hoct~e .01017 ovoid .00944 ~capitaJ-Kains .01785 withdrsw .01771 application .01093 mk .00993 p~ty .00770 trim .00897 premium .01715 review .01711 ameudment .01044 expect .00783 ~ion .00729 raise .00868 mount .01678 eanouuce .01688 $.00988 v~ue .00758 interest .00715 colh~t .00847 force .01879 Table 6: Partof a 32typeselection model. 
145 13 .026 t~ 14 .030 put .03984 do .20440 business .15112 show .10237 075o~ become .02744 w~y .02070 run .10720 job .09334 reflect .06022 Krowth .06279 find .01g18 time .02(JO1 be .09109 thin K .05915 express .03968 vlJue .03758 see .00639 pnmident .02846 cr~te .06320 compe~y .03691 m .03130 economy .03209 live .00600 cempe~y .01500 form .05178 work .03633 improve .02T87 concern .03047 include .00383 reason .01314 lesve .03496 venture .02406 slow .01795 ¯ demand .03012 pt .00316 lot .01245 Set .O3488 lot .01903 coy .01794 siKn .02552 identify .00295 problem .01181 and .01K91 :veroment .01850 enhance .01611 performsnce .02153 remLin .00257 chairmen .01181 m .01579 fund .01547 continue .01538 mum .02080 cite .00242 bume .01158 see .01489 pro~,.ol~t pursue .01512 confidence .02062 ropromnt .00222 cue .01024 eliminate .01072 d,,,-s .01143I grow .01808 ahillty .02002 Nee .00217 unit .00933 imp .0O987 .01139 fuel .01503 strenKth .01637 prove .00217 sign .00926 lose .00979 risk .00082 indicate .01460 inflntion .01438 -tlow .00201 thin .00897 expect .00978 year .0(0)40 cite .01430 benefit .01353 K went .m 8yetem .0(0J26 represent .01363 tez’V~ .00818 ente.r .008~J support .01348 msrk .00197 Nlmblieh .00798 room .00784 mmmumlmmnmmmm ~mmlm 17 .020 ~_6 .o22 qFeemcnt msr .25364 n plen .18452 can-/ .088~ dividend .08273 expead .07881 pruluro .06255 siK .11081 announce .08203 bill .08091 receive .06249 wsmmt .04434 put .06053 business .03365 &pprmm .06146 contract .04019 write .08310 memmge.03976 enter .00256 bolrd .02548 peas .04008 leKisi&tion .03891 declare .04667 book .02798 be .03619 c&pacity .02433 be .03342 letter .02473 note .02597 I~ .02317 industry .02232 Set .04338 ueloti&te .02940 accord .02398 publish .03854 lignal .02243 tep .01821 membership .02207 s~y .02620 settlement .02249 iaue .02821 stock .02046 dominste .01791 line .01869 introduce .01946 dcsJ .01028 rend .02800 ~rticle .02037 keep .01595 economy .01522 .022g7 term/n~te .01562 mmure .01622 deliver 
[Table 7, continued: verb and noun probability columns for further cluster types, including type 18 (probability .021) and type 20 (probability .015). The individual entries are too garbled to reconstruct.]

Table 7: Another part.

[Table 8: verb and noun probability columns for cluster types 22 (probability .025), 23 (.083), 24 (.088), 25 (.031), 26 (.054), 27 (.029), 28 (.023), 29 (.052), and 30 (.037). The individual entries are too garbled to reconstruct.]
[Table 8, continued: cluster types 31 and 32 (probability .029 each). The leading entries of type 31 are the verb make (.83180) and the noun decision (.06421); those of type 32 are the verb be (.11257) and the noun problem (.14799). The remaining entries are too garbled to reconstruct.]

Table 8: Still another part.

5 Application to parsing

A casual examination of clusters can at most suggest that the right sort of thing is going on; the point is to use such representations to do something that we want to do for an independent reason. In the introduction, I said that resolving many parsing ambiguities comes down to evaluating selectional compatibility between two lexical items. Given a probabilistic selection model, we can assign a probability to any verb-object pair drawn from the lexicon of 992 verbs and 1027 nouns. In order to evaluate parses, we need to include other grammatical relations. In some cases, such as subjects, this is fairly straightforward. In others, such as second complements of verbs, getting access to frequency counts is not straightforward, since identifying second complements, such as prepositional phrases, requires resolving attachment ambiguities.
This cannot be done systematically without the kind of lexical information we are trying to induce. Presumably, a procedure learning selectional restrictions for second complements would have to initially consider several attachments, and iteratively learn the lexical information required for disambiguation. This method is applied to simpler data (paying attention just to prepositions and not the heads of their objects) in Hindle and Rooth [Hindle and Rooth, 1993].

To investigate the possibility of resolving syntactic ambiguities with selectional information, I considered the past participle vs. tensed verb ambiguity of sold in positions immediately following a noun phrase. In the first example below, sold is a tensed verb, and the preceding noun administration is the head of its subject, describing the agent. In the second example, sold is a participial post-modifier, and the preceding noun phrase denotes the sold object.

(11) a. evidence that the Reagan administration sold arms to Iran
     b. represented payments for arms sold to Nicaraguan insurgents

The situation is actually more complex, since sold occurs frequently in the middle construction, with a syntactic subject filling the semantic role of a sold object, rather than the seller:

(12) The franchise sold in 1979 for $11 million

To sidestep this complication, I defined the task as one of identifying the semantic relation (seller or sold object) of the noun phrase, rather than identifying a grammatical relation or part of speech. In other words, the middle constructions are grouped with the postmodifiers. By hand, I classified the first relevant occurrence of each noun-sold pair in the full Wall Street Journal corpus from [Liberman, 1992]. Of the 1027 nouns in the model, 220 were represented in this configuration: 87 preceded sold as a past tense verb, 118 preceded sold as a participial postmodifier, and 15 occurred in middle constructions.
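The selection model referred to in this section assigns a probability to any verb-noun pair. As a rough illustration of how such a probability might be computed from cluster tables, the sketch below assumes a latent-class form in which each cluster type t has a prior p(t), a verb distribution p(v | t), and a noun distribution p(n | t), so that p(v, n) is the sum over t of p(t) p(v | t) p(n | t). This form, the name pair_probability, and all of the numbers are assumptions made for illustration; none of them are taken from the tables.

```python
from typing import Dict

# Hypothetical toy parameters (not from the paper's tables):
# two cluster types, each with a prior, a verb distribution,
# and a noun distribution.
TYPE_PRIOR: Dict[str, float] = {"t1": 0.6, "t2": 0.4}
VERB_GIVEN_TYPE: Dict[str, Dict[str, float]] = {
    "t1": {"sell": 0.7, "face": 0.3},
    "t2": {"sell": 0.1, "face": 0.9},
}
NOUN_GIVEN_TYPE: Dict[str, Dict[str, float]] = {
    "t1": {"car": 0.8, "problem": 0.2},
    "t2": {"car": 0.1, "problem": 0.9},
}


def pair_probability(verb: str, noun: str) -> float:
    """Return the sum over cluster types of p(t) * p(verb | t) * p(noun | t).

    Words missing from a cluster's distribution contribute probability 0.
    """
    total = 0.0
    for t, prior in TYPE_PRIOR.items():
        p_verb = VERB_GIVEN_TYPE[t].get(verb, 0.0)
        p_noun = NOUN_GIVEN_TYPE[t].get(noun, 0.0)
        total += prior * p_verb * p_noun
    return total


if __name__ == "__main__":
    for v in ("sell", "face"):
        for n in ("car", "problem"):
            print(v, n, round(pair_probability(v, n), 4))
```

Under this assumed form, the pair probabilities over the whole toy lexicon sum to one, and a pair scores highly only when some single cluster gives high probability to both the verb and the noun.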
An augmented selection model was obtained (in a somewhat ad hoc way) by adding two additional verbs, sold/VBN and sold/VBD, to the verb set. Training material consisted of similar but independent data.^7 The model was used to classify the 220 test examples by means of the ratio:

$$\frac{p_{\mathrm{sold/VBD},\,n}}{p_{\mathrm{sold/VBD},\,n} + p_{\mathrm{sold/VBN},\,n}}$$

Pairs with a score greater than 0.5 were assigned to the seller role, and those with a lower score to the sold object role. Of the 118 post-modifier pairs, 110 were correctly classified into the sold object role. Of the 87 ordinary subject-verb items, 69 were correctly classified into the seller role. Of the 15 instances of the middle construction, 11 were correctly classified into the sold object role. This gives a rate of correct classification of 86.4% (190 of 220).

The disambiguation scores are listed in tables 9, 10, and 11. Appositely, the least ambiguous instance of the seller role is the noun seller. The least ambiguous example of the sold object role is output. Many of the problematic nouns (those with scores below 0.5 in table 10) name institutions which can be agents, but can also be bought and sold, for instance company, airline, and store. Since companies are actually described in the Wall Street Journal both as being sold and as selling things, we would not expect a selectional approach to be uniformly successful in this case. However, we want to resolve clear cases correctly: artefacts, materials, and financial instruments are unambiguous sold objects, and people are unambiguous sellers. With few exceptions, such clear cases are resolved correctly.

^7 In writing this section, I found that the training data and my record of how they were constructed had been lost. Therefore, the evaluation will be redone, and final results may differ from those described here.
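The scoring rule and the accuracy figure above can be sketched as follows. The function names and the example probabilities are hypothetical; only the 0.5 threshold and the per-class counts come from the text. The text does not say how a score of exactly 0.5 is handled, so assigning it to the sold object role here is an assumption.

```python
from typing import Dict, Tuple


def seller_score(p_vbd: float, p_vbn: float) -> float:
    """Ratio p(sold/VBD, n) / (p(sold/VBD, n) + p(sold/VBN, n))."""
    denominator = p_vbd + p_vbn
    if denominator == 0.0:
        raise ValueError("both pair probabilities are zero; score undefined")
    return p_vbd / denominator


def classify(score: float) -> str:
    """Scores above 0.5 go to the seller role, all others to sold object.

    Treating a score of exactly 0.5 as 'sold object' is an assumption.
    """
    return "seller" if score > 0.5 else "sold object"


# (correct, total) per hand-assigned class, as reported in the text.
REPORTED: Dict[str, Tuple[int, int]] = {
    "postmodifier": (110, 118),
    "subject-verb": (69, 87),
    "middle": (11, 15),
}


def accuracy(counts: Dict[str, Tuple[int, int]]) -> float:
    """Overall fraction of correctly classified test items."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct / total


if __name__ == "__main__":
    # Hypothetical pair probabilities for a single noun.
    print(classify(seller_score(0.003, 0.001)))
    print(f"{100 * accuracy(REPORTED):.1f}%")
```

Pooling the three classes gives 190 correct out of 220 test items, which is the 86.4% reported above.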
output       .00000    volume       .00001    suit         .01225    gasoline     .044M
bill         .04526    missile      .08?26    movie        .0?033    system       .08970
technology   .09182    part         .09475    film         .09783    advertising  .11294
material     .11427    shoe         .11488    oil          .12449    product      .12479
chip         .12479    plant        .12868    device       .13161    merchandise  .13183
good         .13240    car          .13363    machine      .13475    operation    .13514
barrel       .13680    building     .13923    loan         .14066    document     .14141
copy         .14185    gold         .14160    debt         .14402    truck        .14454
debenture    .14468    bond         .14543    brand        .14633    share        .14641
Sis          .14673    food         .14708    certificate  .14839    coverage     .14758
inventory    .14886    ~            .14924    ~t           .14957    acre         .15149
phone        .15157    property     .15334    note         .15414    version      .15509
warrant      .15515    ticket       .13325    block        .13751    card         .15806
pound        .16148    business     .16485    home         .16962    contract     .17100
computer     .17181    steel        .17360    show         .17449    amount       .17630
dollar       .18039    program      .18111    aircraft     .13204    insurance    .18227
engine       .18332    book         .18543    site         .19478    line         .19487
land         .19868    drug         .19588    security     .19686    weapon       .19701
test         .19931    tape         .20481    item         .21203    house        .21368
rest         .21706    piano        .21825    station      .21854    piece        .22123
paper        .22272    mark         .22307    import       .22753    arm          .23139
model        .23296    magazine     .23470    call         .24358    billion      .24488
service      .25488    issue        .28688    work         .23714    policy       .26237
energy       .20286    supply       .20845    position     .29432    vehicle      .29797
thrift       .30081    collection   .30386    list         .33141    ad           .33409
million      .33923    |ace         .34198    one          .34246    shipment     .38118
art          .38336    article      .39738    newspaper    .39756    type         .41898
set          .43282    transaction  .45646
------------------------------------------------------------------------------------
hospital     .55241    kind         .68506    player       .69998    export       .71760
switch       .74617    animal       .84308    agent        .75588    premium      .84474

Table 9: Disambiguation scores for NP/postmodifier combinations. Items above the line are correctly classified.
trade           .00001    effort        .08683    plan          .11408    shop         .14468
ut&te           .16620    division      .17522    store         .17609    unit         .23840
subsidiary      .26404    account       .27786    network       .32662    fund         .32709
company         .36211    trust         .39613    partnership   .40965    concern      .43030
airline         .44064    institution   .46481
----------------------------------------------------------------------------------------
giant           .60438    team          .64480    parent        .63279    operator     .63752
utility         .69602    organization  .63796    insurer       .64974    bank         .66457
industry        .66728    couple        .678U     manager       .68160    maker        .70603
country         .71725    firm          .72093    manufacturer  .72643    shareholder  .74399
foreigner       .75349    room          .78649    lender        .73983    group        .77943
corporation     .78073    husband       .80379    publisher     .80629    developer    .81175
nation          .83127    stockholder   .83832    st&re         .87692    family       .88925
school          .89880    broker        .90178    individual    .90383    executive    .90404
producer        .90640    agency        .91013    partner       .93268    customer     .94037
wife            .94643    chairman      .94838    holder        .98723    board        .95960
a~n             .96147    department    .96479    father        .96875    dealer       .97063
people          .97064    member        .97166    investor      .97463    brother      .97580
employee        .97603    plaintiff     .97753    world         .97878    government   .98305
defendant       .98917    farmer        .99021    friend        .99417    lawyer       .99581
trader          .99737    official      .99880    client        .99883    owner       1.00000
buyer          1.00000    administration 1.00000  analyst      1.00000    director    1.00000
foundation     1.00000    officer      1.00000    participant  1.00000    president   1.00000
seller         1.00000

Table 10: Disambiguation scores for agent/verb combinations. Items below the line are correctly classified.

offering     .13354    metal        .14244    apartment    .14406    stock        .14880
franchise    .15312    membership   .16788    future       .19601    seat         .27562
price        .43832    hundred      .44699    conference   .49000
------------------------------------------------------------------------------------
other        .37470    thing        .65876    target       .99409    run          .99997

Table 11: Disambiguation scores for middle constructions. Items above the line are correctly classified.

References

[Church et al., 199x] Church, K. W., Gale, W. A., Hanks, P., and Hindle, D. (199x). Using statistics in lexical analysis. In Zernik, editor, Lexical acquisition: exploiting on-line resources to build a lexicon.

[Hindle, 1983] Hindle, D. (1983). User manual for Fidditch, a deterministic parser. Technical Memorandum 7590-142, Naval Research Laboratory.

[Hindle, 1994] Hindle, D. (1994). A parser for text corpora. In Atkins, B. and Zampolli, A., editors, Computational Approaches to the Lexicon. Oxford University Press, Oxford.

[Hindle and Rooth, 1993] Hindle, D. and Rooth, M. (1993). Structural ambiguity and lexical relations. Computational Linguistics, 18:xx-yy.

[Liberman, 1992] Liberman, M. (1992). ACL-DCI CD ROM 1. Association for Computational Linguistics.