From: AAAI Technical Report SS-95-01. Compilation copyright © 1995, AAAI (www.aaai.org). All rights reserved.

Towards the Acquisition and Representation of a Broad-Coverage Lexicon

Rebecca Bruce and Janyce Wiebe
Computing Research Laboratory and Department of Computer Science
New Mexico State University, Las Cruces, NM 88003
rbruce@crl.nmsu.edu, wiebe@cs.nmsu.edu

Abstract

Statistical techniques for NLP typically do not take advantage of existing domain knowledge and require large amounts of tagged training data. This paper presents a partial remedy to these shortcomings by introducing a richer class of statistical models, graphical models, along with techniques for: (1) establishing the form of the model in this class that best describes a given set of training data, (2) estimating the parameters of graphical models from untagged data, (3) combining constraints formulated in propositional logic with those derived from training data to produce a graphical model, and (4) simultaneously resolving interdependent ambiguities. The paper also describes how these tools can be used to produce a broad-coverage lexicon represented as a probabilistic model, and presents a method for using such a lexicon to simultaneously disambiguate all words in a sentence.

Introduction

The specification and acquisition of lexical knowledge is a central problem in natural language processing. Statistical techniques have been used to partially automate these tasks, but they typically do not take advantage of existing domain knowledge and require large amounts of tagged training data. These shortcomings have resulted in statistical models that are too impoverished to capture the kind of information that is typically thought to be necessary to resolve word meaning ambiguity. We present a partial remedy to both of these shortcomings by introducing a richer class of statistical models, graphical models, along with techniques for: (1) establishing the form of the graphical model that best describes a given set of training data, (2) estimating the parameters of graphical models from untagged data, (3) combining constraints formulated in propositional logic with those derived from training data to produce a graphical model, and (4) simultaneously resolving interdependent ambiguities. We describe how these tools can be used to produce a broad-coverage lexicon represented as a probabilistic model. In this model, information gathered empirically from training data is integrated with knowledge present in domain theory by expressing both in terms of conditional independence. We also describe how such a lexicon can be used to simultaneously disambiguate all words in a sentence.

While the class of models and the methods for formulating such models (specifically, techniques (1), (2), and (4)) are well known in the field of statistics, this particular application is largely unimplemented, and the lexicon that we describe has not yet been produced. The contribution of this paper is the presentation of a theoretically feasible approach to automatically acquiring and using a broad-coverage lexicon that integrates both empirically and analytically acquired information in a single formalism.

Throughout this presentation, it is assumed that there exists a predefined and finite set of sense distinctions for each word to be included in the lexicon, and that the desired level of syntactic analysis can be performed as a preprocessing step. Given these restrictions, the proposed representation is not specific to a particular set of sense distinctions or syntactic theory, as long as the two are compatible both with each other and with any domain theory to be incorporated in the lexicon.

We are not presently addressing either sense productivity or syntactic analysis. However, we are proposing a representation capable of expressing diverse types of information within a single formalism, and it might be possible to incorporate the information necessary to address such tasks within this representation. In fact, there already exists work supporting such a goal; a mapping of context-free grammars to statistical models similar in form to the models described here has been proposed by Eizirik et al. (1993), and mappings between theories in first-order predicate logic (such as might be required to express sense productivity) and the same class of models used by Eizirik et al. have been proposed by Charniak and Goldman (1993), Poole (1993), and Breese (1992).
The remainder of this paper is organized as follows. Graphical models are first described, along with tools for formulating them. A specific approach to producing a broad-coverage lexicon is then presented, which makes use of the tools described in the previous section. The paper concludes with a description of how semantic lexical disambiguation can be performed using such a lexicon.

The Tools

Graphical Models

Recall that the lexicon will be represented by a probabilistic model. We make use of a more general class of models than previously used in most NLP applications. Models in this class are capable of characterizing a rich set of relationships between a large number of variables, where these variables may be spatially separated and may also be of different types (e.g., class-based vs. specific collocations). The models we refer to are called "graphical models" (Whittaker 1990).

The probability distribution of a graphical model corresponds to a Markov field. This is because a graphical model is a probability model for multivariate random observations whose dependency structure has a valid graphical representation, a dependency graph. The dependency graph of a model is formed by mapping each variable in the model to a node in the graph, and then drawing undirected edges between the nodes corresponding to variables that are interdependent. A valid dependency graph has the property that all variables that are not directly connected in the graph are conditionally independent given the values of the variables mapping to the nodes connecting them. For example, if node A separates node B from node C in a dependency graph, then the variable mapping to node B is conditionally independent of the variable mapping to node C, given the value of the variable mapping to node A. Not only does a dependency graph provide a clear interpretation of the interactions among variables, it has also been shown (Pearl 1988, Lauritzen & Spiegelhalter 1988) to form the basis of a distributed computational architecture for probabilistic constraint propagation.
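To make the separation property concrete, the following sketch (ours, not part of the original paper; the graph and variable names are hypothetical) represents a small dependency graph as an undirected adjacency structure and tests whether a set of nodes separates two others. In a valid dependency graph, a blocked path means the corresponding variables are conditionally independent given the separator.

```python
from collections import deque

# Hypothetical dependency graph: an ambiguous word's sense node connected
# to two contextual features that are not directly connected to each other.
GRAPH = {
    "sense":     {"pos_left", "pos_right"},
    "pos_left":  {"sense"},
    "pos_right": {"sense"},
}

def separates(graph, separator, source, target):
    """Return True if every path from source to target passes through a node
    in `separator`, i.e., the source and target variables are conditionally
    independent given the separator in a valid dependency graph."""
    if source in separator or target in separator:
        raise ValueError("separator must not contain source or target")
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr in separator or nbr in seen:
                continue
            if nbr == target:
                return False          # found a path that avoids the separator
            seen.add(nbr)
            queue.append(nbr)
    return True

# "sense" separates the two features, so they are conditionally independent
# given the word's sense; with an empty separator they are not separated.
print(separates(GRAPH, {"sense"}, "pos_left", "pos_right"))   # True
print(separates(GRAPH, set(), "pos_left", "pos_right"))       # False
```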
Statistical Inference Via Stochastic Simulation

In this section we describe three tools for formulating and using a graphical model. Each of these tools makes use of a stochastic simulation technique to perform statistical inference, where statistical inference is the inference of population characteristics from a sample of training data. In all cases, the inference techniques have been chosen to meet the demands of large-scale applications by minimizing the amount of hand-tagged data required. The first tool is a method for selecting the form of a graphical model, given a sample of training data. The second tool is a method for estimating the parameters of a graphical model from untagged data once the form of the model is specified. The final tool is a technique for resolving multiple, mutually constraining ambiguities in graphical models.

Model Selection

Part of the lexicon will be derived statistically from training data. To do this, an appropriate probabilistic model must be formulated. In previous work (Bruce and Wiebe 1994a, 1994b), we used a method for model formulation that selects informative contextual features and identifies the decomposable model (a kind of graphical model) that is the best approximation to the joint distribution of word senses and these features. Here we describe improvements to our method that make large-scale applications feasible.

The objective of statistical modeling is to find the simplest model that fits the training data. Rather than making assumptions about how the variables are related, we explore this empirically. In our previous work (Bruce & Wiebe 1994a, 1994b), both feature selection and model formulation are accomplished via a process of hypothesis testing: patterns of interdependencies among variables are expressed by decomposable models, and then these models are evaluated as to how well they fit the training data. A model is said to fit the training data if the distribution it defines differs from the training data by an amount consistent with sampling variation.

The difference between the distribution in the training data and that defined by a model can be measured by the likelihood ratio statistic G². In order to know if a model fits the training data, one must know the probability of randomly selecting a data sample, having a given G² value, from the population defined by the model. The significance of the fit of a model is then equal to the probability of randomly selecting a data sample with a G² value that is as big as or bigger than that of the training data. In our previous work, large-sample approximations for the distribution of G² were used to determine the significance of a model; this meant that very large data samples were required.

Our new method for determining the significance of a model eliminates the use of large-sample approximations of the distribution of G². This is accomplished by stochastically simulating the exact conditional distribution of G² in order to determine the significance of the G² value describing the fit of a model. The exact conditional distribution of G² for each model is the distribution of G² values that is observed for comparable data samples randomly generated from the model being tested. A comparable data sample is one having the same sufficient statistics as the actual training data, where the sufficient statistics are the marginal distributions of the variables that are specified as interdependent in the model. Algorithms exist for generating comparable data samples from models describing conditional independence between two variables (Agresti 1992, Verbeek and Kroonenberg 1985). Using these algorithms, the significance of an arbitrary graphical model can be evaluated by determining the significance of each conditional independence relationship in that model. The advantage of simulating the exact conditional distribution of G² to assess the statistical significance of a model is that it eliminates (1) the need for parameter estimation in selecting the form of the model, and (2) the need to use large-sample approximations of the distribution of G². It can, therefore, be used to establish the form of a model with only a small amount of sense-tagged data.
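The sketch below is a simplified stand-in for the exact-sampling algorithms the paper cites (Agresti 1992; Verbeek and Kroonenberg 1985), not the authors' implementation. For a two-variable independence model, permuting one variable generates comparable samples with the same marginals (the sufficient statistics), so the fraction of simulated G² values at least as large as the observed one estimates the significance of the fit. The function names and toy data are our own assumptions.

```python
import numpy as np

def crosstab(x, y):
    """Contingency table of two label sequences."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    table = np.zeros((len(xs), len(ys)), dtype=int)
    np.add.at(table, (xi, yi), 1)
    return table

def g_squared(table):
    """Likelihood ratio statistic G² = 2 * sum(obs * ln(obs / exp)),
    with expected counts taken from the independence model."""
    obs = table.astype(float)
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    mask = obs > 0                      # cells with zero counts contribute nothing
    return 2.0 * np.sum(obs[mask] * np.log(obs[mask] / expected[mask]))

def significance_by_simulation(senses, feature, n_samples=1000, rng=None):
    """Estimate the significance of the independence model for two variables
    by simulating its exact conditional distribution of G²: permuting one
    variable preserves both marginals while breaking any association."""
    rng = np.random.default_rng(rng)
    observed = g_squared(crosstab(senses, feature))
    hits = 0
    for _ in range(n_samples):
        permuted = rng.permutation(feature)
        if g_squared(crosstab(senses, permuted)) >= observed:
            hits += 1
    return hits / n_samples             # P(G² >= observed | model)

# Hypothetical training data: one sense tag and one contextual feature per instance.
senses  = ["bank/river", "bank/money", "bank/money", "bank/river", "bank/money"]
feature = ["NN", "VB", "VB", "NN", "VB"]
print(significance_by_simulation(senses, feature, n_samples=200, rng=0))
```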
Estimating Model Parameters from Untagged Data

Once the form of the model has been established as described above, it should be possible to obtain reliable estimates of the model parameters from untagged text. The idea is to treat the missing tags as "missing data" and draw on the wealth of statistical techniques for augmenting incomplete data given some knowledge of the form of the complete data density.

The scheme we will describe is a stochastic simulation technique referred to as "the Gibbs sampler" (Geman & Geman 1984); it should be noted that there are many possible variations on this general algorithm (Rubin 1991). The procedure is a Bayesian approach to data augmentation that is very similar in spirit to the EM algorithm (Dempster et al. 1977). The basic idea is straightforward. The observed data, y, is augmented by z, the missing data (i.e., the missing tags), and the problem is set up so that: (1) if the missing data z were known, then the posterior density P(θ | y, z) could be easily calculated (where θ is the vector of model parameters), and (2) if the model parameters θ were known, then the posterior density P(z | θ, y) could be easily calculated. An initial value for z, z1, is assumed, and the process iterates between drawing θi from P(θ | y, zi) and drawing zi+1 from P(z | y, θi). The process continues to iterate until the average value for θ, the estimate of the model parameters, converges.

In the Gibbs sampler, the simulation of P(z | y, θ) is further simplified. This is done by partitioning the missing data as z = (z1, z2, ..., zn), in order to make the drawing of each part of z (i.e., zi) easier. Specifically, z1 is sampled from P(z1 | z2, ..., zn, θ, y), z2 from P(z2 | z1, z3, ..., zn, θ, y), and so on up to zn. The Gibbs sampler uses some starting value for z and then generates a value for θ, followed by new values for z1 through zn. Each time a new value for a partition of z, say zj, is generated, the updated values of all other partitions of z are used in selecting that value. As described above, the process iterates between generating θ and generating z1 through zn until the average value of θ converges.

The process of stochastic simulation is potentially slow. Hundreds of cycles may be required to achieve reasonable accuracy, with each cycle requiring time on the order of the sum of the number of variables and the number of interdependencies between variables (Pearl 1988). Fortunately, the algorithm can be easily executed in parallel by concurrent processors. A further reduction in computation time can be realized by incorporating into the algorithm the simulated annealing heuristic of Kirkpatrick, Gelatt and Vecchi (1983), as demonstrated by Geman and Geman (1984).
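As a deliberately simplified illustration of this data-augmentation scheme (our own construction, not the authors' system), the sketch below applies the Gibbs sampler to a toy model with a single observed categorical feature per instance and a missing sense tag. The flat Dirichlet priors, function name, and data layout are assumptions made for the example.

```python
import numpy as np

def gibbs_estimate(features, n_senses, n_iter=500, burn_in=100, rng=0):
    """Data-augmentation Gibbs sampler for a tiny model: one observed
    categorical feature y per instance and one missing sense tag z per
    instance.  theta = (sense priors, per-sense feature distributions).
    Alternates drawing theta | y, z and z | y, theta."""
    rng = np.random.default_rng(rng)
    values, y = np.unique(features, return_inverse=True)
    n_values, n = len(values), len(y)

    z = rng.integers(n_senses, size=n)          # arbitrary starting tags
    prior_sum = np.zeros(n_senses)
    cond_sum = np.zeros((n_senses, n_values))
    kept = 0

    for it in range(n_iter):
        # --- draw theta from P(theta | y, z): Dirichlet posteriors with
        #     a flat Dirichlet(1) prior over each distribution ---
        sense_counts = np.bincount(z, minlength=n_senses)
        prior = rng.dirichlet(1.0 + sense_counts)
        cond = np.vstack([
            rng.dirichlet(1.0 + np.bincount(y[z == s], minlength=n_values))
            for s in range(n_senses)
        ])

        # --- draw each tag z_i from P(z_i | y_i, theta) ---
        probs = prior[None, :] * cond[:, y].T   # shape (n, n_senses)
        probs /= probs.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(n_senses, p=p) for p in probs])

        if it >= burn_in:                       # average theta after burn-in
            prior_sum += prior
            cond_sum += cond
            kept += 1

    return prior_sum / kept, cond_sum / kept
```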
Resolution of Interdependent Ambiguities

In this section, we present a method for the disambiguation of multiple words when the senses of those words are interdependent. Stated another way, we are describing a method for finding the set of word-sense classifications that, when taken as a whole, most completely satisfy the constraints among variables. Methods for doing this have been developed for various classes of models. The Viterbi algorithm (Forney 1973) can be used to find the classification sequence that maximizes the probability of a Markov chain. In this work, we are not restricted to Markov chain models and therefore use techniques that are applicable to the broader class of graphical models. Specifically, we use the Gibbs sampler, presented earlier, to approximate the desired distribution. The difference is that now we know the values of the model parameters and are only interested in estimating the values of the missing data. In this case, the missing data are the sense tags of all words in a sentence and the values of the variables that specify relationships among word senses. These relationships among word senses can be formulated from propositional logic, as described in the next section.

Combining Empirical and Analytical Constraints

There are many domain theories describing relationships among word senses, and many of these theories can be expressed in propositional logic. In this section, we describe methods for (1) representing, as conditional independence relationships, constraints that are formulated in propositional logic, (2) assigning probabilities to these relationships, and (3) combining such relationships with those derived from training data, to produce a graphical model. The methods outlined are specific to the type of theories used in this application. These are existing concept taxonomies, such as WordNet (Miller 1990), that specify interdependencies among the senses of the ambiguous words in a sentence.

Formulating a probabilistic model involves three major sub-tasks: defining the random variables in the model, identifying the dependencies among those variables (i.e., specifying the form of the model), and quantifying those dependency relationships (i.e., estimating the parameters of the model). In keeping with previous approaches (Bacchus 1990), propositions are, in general, mapped to discrete random variables. An exception to this mapping is made for propositions corresponding to word senses. Each ambiguous word will map to a single random variable, with values corresponding to the possible senses of that word.¹ This treatment is necessary to capture the implicit relationship that exists between a word and its senses, i.e.,

word → wordSense1 ∨ wordSense2 ∨ ... ∨ wordSenseN

as well as the mutual exclusion relationship that exists among the senses of a word, i.e.,

wordSense1 → ¬wordSense2 ∧ ¬wordSense3 ∧ ... ∧ ¬wordSenseN.

¹ The possible values for each word could also include the null value corresponding to the absence of that word.

In logic, connectives such as conjunction and implication are used to describe relationships among propositions; in a probabilistic model, these relationships are expressed as statements of conditional independence among the variables. We require that the axioms for each proposition be covering, and that the set of axioms, as a whole, be acyclic. Given that these properties hold, all propositions (random variables) appearing in a single axiom can be treated as interdependent, while those propositions that do not appear together in any single axiom can be treated as not interdependent. For example, an axiom of the form biology → science, where biology and science are subject designations in a hierarchy of subject classifications, would be viewed as specifying a dependency between the variables corresponding to biology and science. If there is also an axiom of the form geology → science, with geology referring to a subject designation, then there would also exist a dependency between the variables corresponding to geology and science. But based on this information alone, no direct dependency between geology and biology would be established.
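The following sketch (a minimal illustration using the hypothetical axioms from the example above, not a component of the authors' system) shows how such implication axioms can be read off as dependency edges: any two propositions that share an axiom become interdependent, and no other direct edges are created.

```python
from itertools import combinations

# Hypothetical taxonomy axioms of the form antecedent -> consequent,
# e.g. biology -> science, geology -> science.
AXIOMS = [
    ("biology", "science"),
    ("geology", "science"),
    ("science", "subject"),
]

def dependency_edges(axioms):
    """Every pair of propositions that appear together in a single axiom is
    treated as interdependent; propositions that never share an axiom get no
    direct edge (they may still be connected indirectly)."""
    edges = set()
    for axiom in axioms:
        for a, b in combinations(sorted(set(axiom)), 2):
            edges.add((a, b))
    return edges

# Yields edges (biology, science), (geology, science), (science, subject);
# note that no direct biology-geology edge is created.
print(dependency_edges(AXIOMS))
```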
Once the dependencies among variables have been identified, they must be quantified. The theory of Markov fields (Pearl 1988, Isham 1981, Darroch et al. 1980) provides a method for assigning complete and consistent probabilities given the dependency structure of the model and numerical measures of the compatibility of the values of interdependent variables, called "compatibility measures." As described in Pearl (1988), the method is referred to as "Gibbs potential." The advantages of this method are that the compatibility measures need only quantify the local interactions within the sets of completely interdependent variables, and that the numerical values assigned to these local interactions need not be globally consistent.

Compatibility measures can be assigned in accordance with the semantics of propositional logic. Returning to the previous example of biology → science, a high compatibility measure would be assigned to the co-occurrences of biology = true and science = true, biology = false and science = false, and biology = false and science = true, while a low compatibility measure would be assigned to the pair biology = true and science = false.

The Gibbs potential can also be used to merge the model derived from logical constraints with those developed from the training data. In merging multiple models, the compatibility measures are formulated from the parameters of the models to be merged.
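A small sketch of how compatibility measures for an implication axiom might be written down and combined in Gibbs-potential style (the specific numeric values are illustrative only, and the scoring function is our own simplification, not the paper's formulation):

```python
# Illustrative compatibility measures for an implication axiom a -> b:
# every value combination is compatible except a = true, b = false.
# The numbers are arbitrary; only their relative sizes matter, and they
# need not be globally consistent across cliques.
HIGH, LOW = 1.0, 0.01

def implication_compatibility(a_true, b_true):
    return LOW if (a_true and not b_true) else HIGH

def unnormalized_score(assignment, cliques):
    """Gibbs-potential style score of a full truth assignment: the product of
    the compatibility measures of the individual cliques (here, the variable
    pairs introduced by implication axioms)."""
    score = 1.0
    for a, b in cliques:
        score *= implication_compatibility(assignment[a], assignment[b])
    return score

cliques = [("biology", "science"), ("geology", "science")]
print(unnormalized_score({"biology": True, "science": True, "geology": False}, cliques))   # 1.0
print(unnormalized_score({"biology": True, "science": False, "geology": False}, cliques))  # 0.01
```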
Construction and Application of a Broad-Coverage Lexicon

We now describe the development of a broad-coverage lexicon that integrates information gathered empirically from training data with analytically derived domain knowledge. This lexicon can be used to disambiguate a large vocabulary of words with respect to the fine-grained sense distinctions made in a standard dictionary. The statistical techniques described in the first part of this paper can be used to automatically formulate a probabilistic model describing the relationship between each ambiguous word and a select set of unambiguous contextual features, without requiring a large amount of disambiguated corpora. The individual models, formulated from training data, can then be interconnected by a constraint structure specifying the semantic relationships that exist among the senses of words, as described earlier. Fortunately, there are existing concept taxonomies that describe the semantic relationships among word senses. Examples of such theories are the "subject code" and "box code" hierarchies in the Longman Dictionary of Contemporary English (Procter et al. 1978), the hyponymy and meronymy taxonomies in WordNet (Miller 1990), and the various taxonomies derived from machine-readable dictionaries (Knight 1993, Bruce et al. 1992, Vossen 1990, Chodorow et al. 1985, Michiels & Noel 1982, Amsler 1980).

In the remainder of this section, we present a specific approach to producing a broad-coverage lexicon, along with a description of how such a lexicon can be used to disambiguate all targeted content words in a single sentence simultaneously, using the Gibbs sampler described in the previous section.

Proposed Implementation

The first phase of construction is to formulate, from training data, a decomposable model for each word to be disambiguated. The candidate contextual features could include the following: the morphology of the ambiguous word, the part-of-speech (POS) categories of the immediately surrounding words, specific collocations, sentence position, and aspects of the phrase structure of the sentence. Once a set of contextual features has been chosen, the form of the model for each word is selected from the class of decomposable models, and estimates of the model parameters are then obtained from untagged data. As mentioned earlier, selection of the form of a model requires a small amount of tagged data indicating the senses of the ambiguous word. It may be possible to further reduce the total amount of sense-tagged data needed for this phase of construction by identifying a parametric model or models applicable to a wide range of content words, as described in our previous work (Bruce & Wiebe 1994a).

In the second phase of construction, the statistical models for the individual words are interconnected by a constraint structure specifying the semantic relationships that exist among the senses of words. Several concept taxonomies appear to be applicable, and it should be possible to include multiple theories in a single model, provided that they are based on the same sense distinctions.

Figure 1 depicts a small fraction of a dependency graph of a probabilistic model that combines hypothetical but representative propositional axioms with models similar in form to those that we developed from training data in previous work. As can be seen in Figure 1, the constraints derived from the training data are connected to the global constraint structure, which in this case is derived from a subject classification hierarchy, only at the nodes corresponding to ambiguous words. The sparsity of the graph, as a whole, is a result of the large number of conditional independence relationships manifest in the model. Such a sparse dependency graph is computationally advantageous in that it reduces the number of parameters that need to be estimated and reduces the complexity of the stochastic simulation of the model.

[Figure 1: a fragment of the dependency graph. Leaf nodes correspond to contextual features of an ambiguous word, such as the POS of the word to the left, the presence or absence of a plural form (values: true, false), and the POS of the word to the right of the ambiguous word.]

Resolution of Ambiguity Through Simultaneous Satisfaction of all Constraints

Using the model constructed as described above, the disambiguation of all ambiguous words in a sentence will be accomplished via a stochastic simulation of that model using the Gibbs sampler. The tag sequence identified for each sentence via the stochastic simulation will be the maximum a posteriori probability (MAP) solution to the satisfaction of all probabilistic dependencies among variables.

In performing a stochastic simulation, one is in effect randomly sampling from the probability distribution defined by the model and the known variable values. During word-sense disambiguation, the values of the contextual features, which are the leaves of the dependency graph, are observable and are therefore known; these values are fixed during the stochastic simulation.
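A minimal sketch of this kind of sampling, assuming a generic potential-based model (our own construction, not the authors' system): observed contextual features are clamped, each unobserved variable is resampled in turn from its local conditional, and the most frequently sampled joint assignment approximates the MAP solution. The potentials, variable names, and toy data are hypothetical.

```python
import random
from collections import Counter

def gibbs_disambiguate(values, potentials, observed, n_iter=2000, burn_in=500, seed=0):
    """Approximate-MAP disambiguation by Gibbs sampling.

    values:     dict variable -> list of candidate values (senses, taxonomy nodes, ...)
    potentials: list of (vars, fn) pairs; fn maps those variables' values to a
                positive compatibility score
    observed:   dict of clamped variables (the contextual features)
    """
    rng = random.Random(seed)
    state = dict(observed)
    hidden = [v for v in values if v not in observed]
    for v in hidden:
        state[v] = rng.choice(values[v])

    counts = Counter()
    for it in range(n_iter):
        for v in hidden:
            weights = []
            for candidate in values[v]:
                state[v] = candidate
                w = 1.0
                for vars_, fn in potentials:
                    if v in vars_:
                        w *= fn(*(state[u] for u in vars_))
                weights.append(w)
            state[v] = rng.choices(values[v], weights=weights)[0]
        if it >= burn_in:
            counts[tuple(state[v] for v in hidden)] += 1

    best = counts.most_common(1)[0][0]
    return dict(zip(hidden, best))

# Hypothetical fragment: one ambiguous word, one observed feature, and one
# shared taxonomy node; each potential rewards compatible combinations.
values = {"sense_bank": ["river", "money"], "finance": [True, False], "pos_right": ["NN", "VB"]}
observed = {"pos_right": "NN"}
potentials = [
    (("sense_bank", "pos_right"), lambda s, p: 2.0 if (s, p) == ("money", "NN") else 1.0),
    (("sense_bank", "finance"),   lambda s, f: 2.0 if (s == "money") == f else 0.5),
]
print(gibbs_disambiguate(values, potentials, observed))
```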
As discussed earlier, the value of each unobservable variable in the model (i.e., each variable whose value is not known) is sequentially updated using the values of the model parameters and the current values of the interdependent variables. This process of sequential updating (or sampling) continues until convergence or near-convergence in the average of the sampled values is achieved.

During the sampling process, the value of a variable changes in response to changes in the current values of all dependent variables (i.e., the directly connected variables in the dependency graph). As depicted in Figure 1, the sense tag of an ambiguous word is dependent on the values of the contextual features as well as the values of the variables in the hierarchy structure that connects all ambiguous words. While the values of the contextual features for each input word are observable and therefore known, the values of the variables in the hierarchy are not observable. Their values change in response to changes in the probabilities of the various word senses and the values of other hierarchical variables. If the senses of two or more words in a sentence are directly connected to (dependent on) a common hierarchical variable, then the probability of that variable being "true" increases, which in turn increases the probability of each of the connected word senses. The more indirect the connection (dependency) between the common hierarchical variable and the word sense, the weaker the support. The process is much like spreading activation, but the relationships between variables are expressed as probabilities and the process returns the MAP solution to the constraint satisfaction problem.

There is one restriction on the application of the Gibbs sampler to models formulated from logical constraints. In order for the proof of convergence to hold, the joint distribution of all variables must be strictly positive.² This means that logical constraints may not be assigned absolute certainty (i.e., given a probability of 1.0), so that alternative values do not have a zero probability. And because the rate of convergence is affected by small probabilities (Chin & Cooper 1987), the probabilities assigned to logical constraints may have to be adjusted to speed convergence.

² Technically, the requirement is that the Gibbs sampler be irreducible, but the simplest way of assuring irreducibility is to assure that the joint distribution is strictly positive.

The last issue to be addressed in this section is the size of the model and the corresponding complexity of the stochastic simulation. As described so far, the model would contain nodes corresponding to one or more concept taxonomies, in addition to nodes corresponding to a large number of ambiguous words and their contextual features. But, as stated earlier, the time required for the stochastic simulation is proportional to the number of nodes plus the number of edges in the network. In order to assure that the time requirements for the stochastic simulation are not extreme, the model can be limited to only the nodes needed to disambiguate the current input sentence. This can be done by dynamically creating a network corresponding to each input sentence using "marker passing," as described in Charniak and Goldman (1993), to select the portions of the complete model needed to process each specific input sentence.
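A crude stand-in for that dynamic network construction (a sketch under our own assumptions; Charniak and Goldman's marker-passing scheme is more elaborate than a bounded breadth-first search): starting from the nodes for the words in the current sentence, spread markers outward through the full model graph and keep only the nodes reached.

```python
from collections import deque

def sentence_subgraph(full_graph, seed_nodes, max_depth=3):
    """Restrict the full model (an undirected adjacency dict) to the nodes
    reachable within max_depth hops of the seed nodes, i.e., the nodes for
    the words in the current input sentence."""
    reached = set(seed_nodes)
    frontier = deque((node, 0) for node in seed_nodes)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nbr in full_graph.get(node, ()):
            if nbr not in reached:
                reached.add(nbr)
                frontier.append((nbr, depth + 1))
    return {node: full_graph.get(node, set()) & reached for node in reached}
```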
References

Agresti, A. 1992. A Survey of Exact Inference for Contingency Tables. Statistical Science 7(1): 131-177.

Amsler, R. 1980. The Structure of the Merriam-Webster Pocket Dictionary. Technical Report TR-164, University of Texas at Austin.

Bacchus, F. 1990. Representing and Reasoning with Uncertain Knowledge. Cambridge, MA: MIT Press.

Breese, J. 1992. Construction of Belief and Decision Networks. Computational Intelligence 8(4): 624-644.

Bruce, R. and Wiebe, J. 1994a. A New Approach to Word-Sense Disambiguation. Proc. ARPA Human Language Technology Workshop. Plainsboro, NJ.

Bruce, R. and Wiebe, J. 1994b. Word-Sense Disambiguation Using Decomposable Models. Proceedings of the 32nd Annual Meeting of the ACL. Las Cruces, NM.

Bruce, R.; Wilks, Y.; Guthrie, L.; Slator, B.; and Dunning, T. 1992. NounSense - A Disambiguated Noun Taxonomy with a Sense of Humour. Memorandum in Computer and Cognitive Science MCCS-92-246, Computing Research Laboratory, New Mexico State University.

Charniak, E. and Goldman, R. 1993. A Bayesian Model of Plan Recognition. Artificial Intelligence (64): 53-79.

Chin, H. and Cooper, G. 1987. Stochastic Simulation of Bayesian Belief Networks. Procs. 3rd Workshop on Uncertainty in AI, Seattle, WA. 106-113.

Chodorow, M.; Byrd, R.; and Heidorn, G. 1985. Extracting Semantic Hierarchies from a Large On-Line Dictionary. Proceedings of the 23rd Annual Meeting of the ACL, Chicago, IL, 299-304.

Darroch, J.; Lauritzen, S.; and Speed, T. 1980. Markov Fields and Log-Linear Interaction Models for Contingency Tables. Annals of Statistics 8(3): 522-539.

Dempster, A.; Laird, N.; and Rubin, D. 1977. Maximum Likelihood from Incomplete Data Via the EM Algorithm. Journal of the Royal Statistical Society B (39): 1-38.

Eizirik, L.; Barbosa, V.; and Mendes, S. 1993. A Bayesian-Network Approach to Lexical Disambiguation. Cognitive Science (17): 257-283.

Forney, G. D. 1973. The Viterbi Algorithm. Proc. IEEE (61): 268-278.

Geman, S. and Geman, D. 1984. Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence (6): 721-741.

Isham, V. 1981. An Introduction to Spatial Processes and Markov Random Fields. International Statistical Review (49): 21-43.

Kirkpatrick, S.; Gelatt, C.; and Vecchi, M. 1983. Optimization by Simulated Annealing. Science (220): 671-680.

Knight, K. 1993. Building a Large Ontology for Machine Translation. Proc. ARPA Human Language Technology Workshop. Plainsboro, NJ.

Lauritzen, S. and Spiegelhalter, D. 1988. Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems. J. R. Statist. Soc. B 50(2): 157-224.

Michiels, A. and Noel, J. 1982. Approaches to Thesaurus Production. Proceedings of the 9th International Conference on Computational Linguistics (COLING-82), Prague, Czechoslovakia, 227-232.

Miller, G. 1990. Special Issue, WordNet: An On-Line Lexical Database. International Journal of Lexicography 3(4).

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.

Procter, P. et al. 1978. Longman Dictionary of Contemporary English.

Poole, D. 1993. Probabilistic Horn Abduction and Bayesian Networks. Artificial Intelligence (64): 81-129.

Rubin, D. 1991. EM and Beyond. Psychometrika 56(2): 241-254.

Verbeek, A. and Kroonenberg, P. 1985. A Survey of Algorithms for Exact Distributions of the Test Statistics in r x c Contingency Tables with Fixed Marginals. Computational Statistics and Data Analysis (3): 159-185.

Vossen, P. 1990. The End of the Chain: Where Does Decomposition of Lexical Knowledge Lead Us Eventually? Proceedings of the 4th Conference on Functional Grammar, Copenhagen.

Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. New York, NY: John Wiley & Sons.