Towards the Acquisition and Representation of a Broad-Coverage Lexicon

From: AAAI Technical Report SS-95-01. Compilation copyright © 1995, AAAI (www.aaai.org). All rights reserved.
Rebecca Bruce and Janyce Wiebe
Computing Research Laboratory and Department of Computer Science
New Mexico State University
Las Cruces, NM 88003
rbruce@crl.nmsu.edu, wiebe@cs.nmsu.edu
Abstract
Statistical techniques for NLP typically do not take advantage of existing domain knowledge and require large amounts of tagged training data. This paper presents a partial remedy to these shortcomings by introducing a richer class of statistical models, graphical models, along with techniques for: (1) establishing the form of the model in this class that best describes a given set of training data, (2) estimating the parameters of graphical models from untagged data, (3) combining constraints formulated in propositional logic with those derived from training data to produce a graphical model, and (4) simultaneously resolving interdependent ambiguities. The paper also describes how these tools can be used to produce a broad-coverage lexicon represented as a probabilistic model, and presents a method for using such a lexicon to simultaneously disambiguate all words in a sentence.

Introduction

The specification and acquisition of lexical knowledge is a central problem in natural language processing. Statistical techniques have been used to partially automate these tasks, but they typically do not take advantage of existing domain knowledge and require large amounts of tagged training data. These shortcomings have resulted in statistical models that are too impoverished to capture the kind of information that is typically thought to be necessary to resolve word meaning ambiguity. We present a partial remedy to both of these shortcomings by introducing a richer class of statistical models, graphical models, along with techniques for: (1) establishing the form of the graphical model that best describes a given set of training data, (2) estimating the parameters of graphical models from untagged data, (3) combining constraints formulated in propositional logic with those derived from training data to produce a graphical model, and (4) simultaneously resolving interdependent ambiguities. We describe how these tools can be used to produce a broad-coverage lexicon represented as a probabilistic model. In this model, information gathered empirically from training data is integrated with knowledge present in a domain theory by expressing both in terms of conditional independence. We also describe how such a lexicon can be used to simultaneously disambiguate all words in a sentence.

While the class of models and the methods for formulating such models (specifically, techniques (1), (2), and (4)) are well known in the field of statistics, this particular application is largely unimplemented, and the lexicon that we describe has not yet been produced. The contribution of this paper is the presentation of a theoretically feasible approach to automatically acquiring and using a broad-coverage lexicon that integrates both empirically and analytically acquired information in a single formalism.

Throughout this presentation, it is assumed that there exists a predefined and finite set of sense distinctions for each word to be included in the lexicon and that the desired level of syntactic analysis can be performed as a preprocessing step. Given these restrictions, the proposed representation is not specific to a particular set of sense distinctions or syntactic theory as long as the two are compatible both with each other and with any domain theory to be incorporated in the lexicon.

We are not presently addressing either sense productivity or syntactic analysis. However, we are proposing a representation capable of expressing diverse types of information within a single formalism, and it might be possible to incorporate the information necessary to address such tasks within this representation. In fact, there already exists work supporting such a goal; a mapping of context free grammars to statistical models similar in form to the models described here has been proposed by Eizirik et al. (1993), and mappings between theories in first-order predicate logic (such as might be required to express sense productivity) and the same class of models used by Eizirik et al. have been proposed by Charniak and Goldman (1993), Poole (1993), and Breese (1992).

The remainder of this paper is organized as follows. Graphical models are first described along with tools for formulating them. A specific approach to producing a broad-coverage lexicon is then presented, which makes use of the tools described in the previous section. The paper concludes with a description of how semantic lexical disambiguation can be performed using such a lexicon.
The Tools

Graphical Models
Recall that the lexicon will be represented by a probabilistic model. We make use of a more general class of models than previously used in most NLP applications. Models in this class are capable of characterizing a rich set of relationships between a large number of variables, where these variables may be spatially separated and may also be of different types (e.g., class-based vs. specific collocations). The models we refer to are called "graphical models" (Whittaker 1990).

The probability distribution of a graphical model corresponds to a Markov field. This is because a graphical model is a probability model for multivariate random observations whose dependency structure has a valid graphical representation, a dependency graph. The dependency graph of a model is formed by mapping each variable in the model to a node in the graph, and then drawing undirected edges between the nodes corresponding to variables that are interdependent. A valid dependency graph has the property that all variables that are not directly connected in the graph are conditionally independent given the values of the variables mapping to nodes connecting them. For example, if node A separates node B from node C in a dependency graph, then the variable mapping to node B is conditionally independent of the variable mapping to node C, given the value of the variable mapping to node A. Not only does a dependency graph provide a clear interpretation of the interactions among variables, it also has been shown (Pearl 1988, Lauritzen & Spiegelhalter 1988) to form the basis of a distributed computational architecture for probabilistic constraint propagation.
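To make the separation property concrete, the following minimal sketch (Python, not from the paper) tests whether two nodes of an undirected dependency graph are separated by a conditioning set; the toy graph and the helper name are illustrative stand-ins for the A, B, C example above.

```python
from collections import deque

def separated(graph, source, target, given):
    """True if every path from source to target passes through a node in
    `given`, i.e. the two variables are conditionally independent given
    `given` in an undirected dependency graph."""
    blocked = set(given)
    if source in blocked or target in blocked:
        return True
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr == target:
                return False
            if nbr not in seen and nbr not in blocked:
                seen.add(nbr)
                queue.append(nbr)
    return True

# Toy dependency graph: B -- A -- C, so A separates B from C.
graph = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
print(separated(graph, "B", "C", given={"A"}))   # True
print(separated(graph, "B", "C", given=set()))   # False
```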
Statistical Inference Via Stochastic Simulation
In this section we describe three tools for formulating and using a graphical model. Each of these tools makes use of a stochastic simulation technique to perform statistical inference, where statistical inference is the inference of population characteristics from a sample of training data. In all cases the inference techniques have been chosen to meet the demands of large-scale applications by minimizing the amount of hand-tagged data required. The first tool is a method for selecting the form of a graphical model, given a sample of training data. The second tool is a method for estimating the parameters of a graphical model from untagged data once the form of the model is specified. The final tool is a technique for resolving multiple, mutually constraining ambiguities in graphical models.
Model Selection
Part of the lexicon will be derived statistically from training data. To do this, an appropriate probabilistic model must be formulated. In previous work (Bruce and Wiebe 1994ab), we used a method for model formulation that selects informative contextual features and identifies the decomposable model (a kind of graphical model) that is the best approximation to the joint distribution of word senses and these features. Here we describe improvements to our method that make large-scale applications feasible.
The objective of statistical modeling is to find the simplest model that fits the training data. Rather than making assumptions about how the variables are related, we explore this empirically. In our previous work (Bruce & Wiebe 1994ab), both feature selection and model formulation are accomplished via a process of hypothesis testing: patterns of interdependencies among variables are expressed by decomposable models, and then these models are evaluated as to how well they fit the training data. A model is said to fit the training data if the distribution it defines differs from the training data by an amount consistent with sampling variation. The difference between the distribution in the training data and that defined by a model can be measured by the likelihood ratio statistic G². In order to know if a model fits the training data, one must know the probability of randomly selecting a data sample, having a given G² value, from the population defined by the model. The significance of the fit of a model is then equal to the probability of randomly selecting a data sample with a G² value that is as big or bigger than that of the training data. In our previous work, large-sample approximations for the distribution of G² were used to determine the significance of a model; this meant that very large data samples were required.
Our new method for determining the significance of a model eliminates the use of large-sample approximations of the distribution of G². This is accomplished by stochastically simulating the exact conditional distribution of G² in order to determine the significance of the G² value describing the fit of a model. The exact conditional distribution of G² for each model is the distribution of G² values that is observed for comparable data samples randomly generated from the model being tested. A comparable data sample is one having the same sufficient statistics as the actual training data, where the sufficient statistics are the marginal distributions of the variables that are specified as interdependent in the model. Algorithms exist for generating comparable data samples from models describing conditional independence between two variables (Agresti 1992, Verbeek and Kroonenberg 1985). Using these algorithms, the significance of an arbitrary graphical model can be evaluated by determining the significance of each conditional independence relationship in that model. The advantage of simulating the exact conditional distribution of G² to assess the statistical significance of a model is that it eliminates (1) the need for parameter estimation in selecting the form of the model, and (2) the need to use large-sample approximations of the distribution of G². It can, therefore, be used to establish the form of a model with only a small amount of sense-tagged data.
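As a rough illustration of the idea (not the authors' implementation), the sketch below computes G² for a two-way sense-by-feature contingency table and estimates the significance of its fit to the independence model by Monte Carlo simulation of tables sharing the observed row and column margins; the permutation scheme, the function names, and the toy table are all assumptions, and cover only the two-variable case.

```python
import numpy as np

def g2(table):
    """Likelihood-ratio statistic G^2 for independence in a 2-way table."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(1), table.sum(0)) / table.sum()
    mask = table > 0
    return 2.0 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))

def exact_conditional_p(table, n_sim=10000, seed=0):
    """Estimate P(G^2 >= observed) over tables with the same row and
    column margins (the sufficient statistics of the independence model),
    generated by randomly permuting the column labels of the raw data."""
    rng = np.random.default_rng(seed)
    table = np.asarray(table, dtype=int)
    rows = np.repeat(np.arange(table.shape[0]), table.sum(1))
    cols = np.repeat(np.arange(table.shape[1]), table.sum(0))
    observed = g2(table)
    hits = 0
    for _ in range(n_sim):
        sim = np.zeros_like(table)
        np.add.at(sim, (rows, rng.permutation(cols)), 1)
        hits += g2(sim) >= observed
    return hits / n_sim

# Toy table: rows = senses of one word, columns = one contextual feature.
table = [[12, 3], [4, 11]]
print("G2 =", round(g2(table), 2), " p ~", exact_conditional_p(table))
```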
Estimating Model Parameters from Untagged Data
Once the form of the model has been established as described above, it should be possible to obtain reliable estimates of the model parameters from untagged text. The idea is to treat the missing tags as "missing data" and draw on the wealth of statistical techniques for augmenting incomplete data given some knowledge of the form of the complete data density. The scheme we will describe is a stochastic simulation technique referred to as "the Gibbs sampler" (Geman & Geman 1984); it should be noted that there are many possible variations on this general algorithm (Rubin 1991).
The procedure
is a Bayesian
approach
to dataaugmentation
thatis verysimilar
in spiritto theEM algorithm(Dempster
et al. 1977).The basicidea
straightforward.
The observed
data,y, is augmented
by z, themissingdata(i.e.,themissingtags),and
theproblem
is setup so that:(1)if themissing
data,
z were.known,
thentheposterior
densityP(01y,z),
couldbe easilycalculated
(where0 is the vectorof
modelparameters),
and (2) if the modelparameters,
0 wereknown,thentheposterior
densityP(z[0,y),
couldbeeasily
calculated.
Aninitial
valueforz,zl,is
assumedand theprocessiterates
betweendrawing0i
from P(O[y, zi) and drawing Zi+l from P(z[y, Oi). The
process continues to iterate until the average value for
0, the estimate of the model parameters, converges.
In the Gibbs sampler, the simulation of P(z | y, θ) is further simplified. This is done by partitioning the missing data as z = (z₁, z₂, ..., zₙ), in order to make the drawing of each part of z (i.e., zᵢ) easier. Specifically, what is done for z is to sample z₁ from P(z₁ | z₂, ..., zₙ, θ, y), z₂ from P(z₂ | z₁, z₃, ..., zₙ, θ, y), and so on up to zₙ. The Gibbs sampler uses some starting value for z and then generates a value for θ, followed by new values for z₁ through zₙ. Each time a new value for a partition of z, say zⱼ, is generated, the updated values of all other partitions of z are used in selecting that value. As described above, the process iterates between generating θ and generating z₁ through zₙ until the average value of θ converges.
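The alternation can be seen in the following toy sketch, which substitutes a simple Bernoulli model with missing observations for the sense-tagging model purely to show the θ / z iteration; the model, the uniform prior, and the function name are invented for illustration.

```python
import numpy as np

def gibbs_augment(y, n_missing, n_iter=2000, burn_in=500, seed=0):
    """Data augmentation for a Bernoulli model with missing observations.

    Alternates between drawing theta from P(theta | y, z) (a Beta
    posterior under a uniform prior) and drawing each missing outcome z_j
    from P(z_j | theta), then averages the sampled theta values."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    z = rng.integers(0, 2, size=n_missing)          # arbitrary start for z
    theta_draws = []
    for it in range(n_iter):
        heads = y.sum() + z.sum()
        tails = (len(y) + n_missing) - heads
        theta = rng.beta(1 + heads, 1 + tails)       # draw theta | y, z
        z = rng.binomial(1, theta, size=n_missing)   # draw z | theta
        if it >= burn_in:
            theta_draws.append(theta)
    return float(np.mean(theta_draws))

print(gibbs_augment(y=[1, 1, 1, 0, 1, 0, 1, 1], n_missing=4))
```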
The process of stochastic simulation is potentially slow. Hundreds of cycles may be required to achieve reasonable accuracy, with each cycle requiring time on the order of the sum of the number of variables and the number of interdependencies between variables (Pearl 1988). Fortunately, the algorithm can be easily executed in parallel by concurrent processors. A further reduction in computation time can be realized by incorporating into the algorithm the simulated annealing heuristic of Kirkpatrick, Gelatt and Vecchi (1983), as demonstrated by Geman and Geman (1984).
Resolution of Interdependent Ambiguities
In this section, we present a method for disambiguation of multiple words when the senses of those words are interdependent. Stated another way, we are describing a method for finding the set of word-sense classifications that, when taken as a whole, most completely satisfy the constraints among variables. Methods for doing this have been developed for various classes of models. The Viterbi algorithm (Forney 1973) can be used to find the classification sequence that maximizes the probability of a Markov chain. In this work, we are not restricted to Markov chain models and therefore use techniques that are applicable to the broader class of graphical models. Specifically, we use the Gibbs sampler, presented earlier, to approximate the desired distribution. The difference is that now we know the values of the model parameters and are only interested in estimating the values of the missing data. In this case, the missing data are the sense tags of all words in a sentence and the values of the variables that specify relationships among word senses. These relationships among word senses can be formulated from propositional logic, as described in the next section.
Combining Empirical and Analytical Constraints
There are many domain theories describing relationships among word senses, and many of these theories can be expressed in propositional logic. In this section, we describe methods for (1) representing, as conditional independence relationships, constraints that are formulated in propositional logic, (2) assigning probabilities to these relationships, and (3) combining such relationships with those derived from training data, to produce a graphical model. The methods outlined are specific to the type of theories used in this application. These are existing concept taxonomies, such as WordNet (Miller 1990), that specify interdependencies among the senses of the ambiguous words in a sentence.
Formulating a probabilistic model involves three major sub-tasks: defining the random variables in the model, identifying the dependencies among those variables (i.e., specifying the form of the model), and quantifying those dependency relationships (i.e., estimating the parameters of the model). In keeping with previous approaches (Bacchus 1990), propositions are, in general, mapped to discrete random variables. An exception to this mapping is made for propositions corresponding to word senses. Each ambiguous word will map to a single random variable, with values corresponding to the possible senses of that word¹. This treatment is necessary to capture the implicit relationships that exist between a word and its senses, i.e.,

word → wordSense₁ ∨ wordSense₂ ∨ ... ∨ wordSenseₙ,
as well as the mutual exclusion relationship that exists among the senses of a word, i.e., wordSense₁ → ¬wordSense₂ ∧ ¬wordSense₃ ∧ ... ∧ ¬wordSenseₙ.

¹The possible values for each word could also include the null value corresponding to the absence of that word.
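One straightforward way to realize this mapping (a sketch, with a hypothetical sense inventory) is to encode each ambiguous word as a single multi-valued variable: because exactly one value holds at a time, both the disjunction over senses and their mutual exclusion come for free.

```python
# Hypothetical sense inventory; each ambiguous word becomes one categorical
# random variable whose values are its possible senses, optionally including
# a null value for the absence of the word.
WORD_SENSES = {
    "bank": ["bank/finance", "bank/river"],
    "interest": ["interest/money", "interest/attention"],
}

def sense_variable(word, include_null=False):
    values = list(WORD_SENSES[word])
    if include_null:
        values.append(None)          # word absent from the sentence
    return {"name": word, "values": values}

print(sense_variable("bank", include_null=True))
```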
In logic, connectives, such as conjunction and implication, are used to describe relationships among propositions; in a probabilistic model, these relationships are expressed as statements of conditional independence among the variables. We require that the axioms for each proposition be covering, and that the set of axioms, as a whole, be acyclic. Given that these properties hold, all propositions (random variables) in a single axiom can be treated as interdependent, while those propositions not together in any single axiom can be treated as not interdependent. For example, an axiom of the form biology → science, where biology and science are subject designations in a hierarchy of subject classifications, would be viewed as specifying a dependency between the variables corresponding to biology and science. If there is also an axiom of the form geology → science, with geology referring to a subject designation, then there would also exist a dependency between the variables corresponding to geology and science. But based on this information alone, no direct dependency between geology and biology would be established.
Once the dependencies among variables have been identified, they must be quantified. The theory of Markov fields (Pearl 1988, Isham 1981, Darroch et al. 1980) provides a method for assigning complete and consistent probabilities given the dependency structure of the model and numerical measures of the compatibility of the values of interdependent variables, called "compatibility measures." As described in Pearl (1988), the method is referred to as "Gibbs potential." The advantages of this method are that the compatibility measures need only quantify the local interactions within the sets of completely interdependent variables, and that the numerical values assigned to these local interactions need not be globally consistent. Compatibility measures can be assigned in accordance with the semantics of propositional logic. Returning to the previous example of biology → science, a high compatibility measure would be assigned to the co-occurrence of biology = true and science = true, biology = false and science = false, as well as biology = false and science = true, while a low compatibility measure would be assigned to the pair biology = true and science = false.
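A minimal sketch of this construction, with invented compatibility values: each axiom contributes a local compatibility function, and the joint distribution is taken proportional to the product of these local functions (the Gibbs potential idea), so the local values need not be globally consistent.

```python
from itertools import product

HIGH, LOW = 1.0, 0.01   # illustrative compatibility values; only ratios matter

def implies(a, b):
    """Compatibility measure for an axiom a -> b: only the violating
    assignment (a true, b false) gets the low value."""
    return LOW if a and not b else HIGH

def unnormalized(biology, geology, science):
    # Two axioms, biology -> science and geology -> science, each quantified
    # locally; the joint is proportional to the product of the local measures.
    return implies(biology, science) * implies(geology, science)

states = list(product([True, False], repeat=3))
z = sum(unnormalized(*s) for s in states)          # normalizing constant
for s in states:
    print(s, round(unnormalized(*s) / z, 4))
```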
The Gibbs potential can also be used to merge the model derived from logical constraints with those developed from the training data. In merging multiple models, the compatibility measures are formulated from the parameters of the models to be merged.
Construction and Application of a Broad-Coverage Lexicon

We now describe the development of a broad-coverage lexicon that integrates information gathered empirically from training data with analytically derived domain knowledge. This lexicon can be used to disambiguate a large vocabulary of words with respect to the fine-grained sense distinctions made in a standard dictionary. The statistical techniques described in the first part of this paper can be used to automatically formulate a probabilistic model describing the relationship between each ambiguous word and a select set of unambiguous contextual features, without requiring a large amount of disambiguated corpora. The individual models, formulated from training data, can then be interconnected by a constraint structure specifying the semantic relationships that exist among the senses of words, as described earlier. Fortunately, there are existing concept taxonomies that describe the semantic relationships among word senses. Examples of such theories are: the "subject code" and "box code" hierarchies in the Longman's Dictionary of Contemporary English (Procter et al. 1978), the hyponymy and meronymy taxonomies in WordNet (Miller 1990), and the various taxonomies derived from machine readable dictionaries (Knight 1993, Bruce et al. 1992, Vossen 1990, Chodorow et al. 1985, Michiels & Noel 1982, Amsler 1980). In the remainder of this section, we present a specific approach to producing a broad-coverage lexicon, along with a description of how such a lexicon can be used to disambiguate all targeted content words in a single sentence simultaneously using the Gibbs sampler described in the previous section.

Proposed Implementation
The first phase of construction is to formulate, from training data, a decomposable model for each word to be disambiguated. The candidate contextual features could include the following: the morphology of the ambiguous word, the part-of-speech (POS) categories of the immediately surrounding words, specific collocations, sentence position, and aspects of the phrase structure of the sentence.

Once a set of contextual features has been chosen, the form of the model for each word is selected from the class of decomposable models, and estimates of the model parameters are then obtained from untagged data. As mentioned earlier, selection of the form of a model requires a small amount of tagged data indicating the senses of the ambiguous word. It may be possible to further reduce the total amount of sense-tagged data needed for this phase of construction by identifying a parametric model or models applicable to a wide range of content words, as described in our previous work (Bruce & Wiebe 1994a).

In the second phase of construction, the statistical models for the individual words are interconnected by a constraint structure specifying the semantic relationships that exist among the senses of words. Several concept taxonomies appear to be applicable, and it should be possible to include multiple theories in a single model, provided that they are based on the same sense distinctions.
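As an illustration of the first phase, the following sketch extracts some of the candidate contextual features listed above for one target word in a POS-tagged sentence; the input format, feature names, and collocation list are assumptions, not the authors' specification.

```python
def contextual_features(tokens, tags, i, collocates=("money", "river")):
    """tokens/tags are parallel lists; i indexes the ambiguous word."""
    word = tokens[i]
    return {
        "plural_morphology": word.lower().endswith("s"),
        "pos_left": tags[i - 1] if i > 0 else "BOS",
        "pos_right": tags[i + 1] if i + 1 < len(tags) else "EOS",
        "sentence_position": "initial" if i == 0 else
                             "final" if i == len(tokens) - 1 else "medial",
        **{f"colloc_{c}": (c in tokens) for c in collocates},
    }

tokens = ["he", "opened", "an", "account", "at", "the", "bank"]
tags   = ["PRP", "VBD", "DT", "NN", "IN", "DT", "NN"]
print(contextual_features(tokens, tags, i=6))
```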
Figure 1 depicts a small fraction of a dependency graph of a probabilistic model that combines hypothetical but representative propositional axioms with models similar in form to those that we developed from training data in previous work. As can be seen in Figure 1, the constraints derived from the training data are connected to the global constraint structure, which in this case is derived from a subject classification hierarchy, only at the nodes corresponding to ambiguous words. The sparsity of the graph, as a whole, is a result of the large number of conditional independence relationships manifest in the model. Such a sparse dependency graph is computationally advantageous in that it reduces the number of parameters that need to be estimated and reduces the complexity of the stochastic simulation of the model.
[Figure 1. A fraction of the dependency graph. The recoverable node labels identify contextual-feature leaves such as the POS of the word to the left of the ambiguous word, the presence or absence of plural form (values: true/false), and the POS of the word to the right of the ambiguous word.]
Resolution of Ambiguity Through Simultaneous Satisfaction of all Constraints
Using the model constructed as described above, the disambiguation of all ambiguous words in a sentence will be accomplished via a stochastic simulation of that model using the Gibbs sampler. The tag sequence identified for each sentence via the stochastic simulation will be the maximum a posteriori probability (MAP) solution to the satisfaction of all probabilistic dependencies among variables.

In performing a stochastic simulation, one is in effect randomly sampling from the probability distribution defined by the model and the known variable values.
During word-sense disambiguation, the values of the contextual features, which are the leaves of the dependency graph, are observable and are therefore known; these values are fixed during the stochastic simulation. As discussed earlier, the value of each unobservable variable in the model (i.e., variables whose values are not known) is sequentially updated using the values of the model parameters and the current values of the interdependent variables. This process of sequential updating (or sampling) continues until convergence or near-convergence in the average of the sampled values is achieved.
During the sampling process, the value of a variable changes in response to changes in the current values of all dependent variables (i.e., the directly connected variables in the dependency graph). As depicted in Figure 1, the sense tag of an ambiguous word is dependent on the values of the contextual features as well as the values of the variables in the hierarchy structure that connects all ambiguous words. While the values of the contextual features for each input word are observable and therefore known, the values of the variables in the hierarchy are not observable. Their values change in response to changes in the probabilities of the various word senses and the values of other hierarchical variables. If the senses of two or more words in a sentence are directly connected to (dependent on) a common hierarchical variable, then the probability of that variable being "true" increases, which in turn increases the probability of each of the connected word senses. The more indirect the connection (dependency) between the common hierarchical variable and the word sense, the weaker the support. The process is much like spreading activation, but the relationships between variables are expressed as probabilities and the process returns the MAP solution to the constraint satisfaction problem.
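The following sketch shows the flavor of this procedure on an invented toy network: observed contextual-feature variables are clamped, the hidden sense and hierarchy variables are resampled from their full conditionals, and the per-variable modes of the sampled values are returned (recovering the paper's MAP solution would additionally require something like annealing). Every variable, factor, and number here is hypothetical.

```python
import random
from collections import Counter

def gibbs_disambiguate(variables, factors, evidence,
                       n_iter=5000, burn_in=1000, seed=0):
    """variables: name -> list of values; factors: list of (scope, fn)
    where fn maps a full assignment dict to a positive compatibility value;
    evidence: observed variables, kept clamped throughout sampling."""
    rng = random.Random(seed)
    state = dict(evidence)
    hidden = [v for v in variables if v not in evidence]
    for v in hidden:                                   # arbitrary start
        state[v] = rng.choice(variables[v])
    counts = {v: Counter() for v in hidden}
    for it in range(n_iter):
        for v in hidden:
            relevant = [f for scope, f in factors if v in scope]
            weights = []
            for val in variables[v]:                   # full conditional of v
                state[v] = val
                w = 1.0
                for f in relevant:
                    w *= f(state)
                weights.append(w)
            state[v] = rng.choices(variables[v], weights=weights)[0]
        if it >= burn_in:
            for v in hidden:
                counts[v][state[v]] += 1
    return {v: counts[v].most_common(1)[0][0] for v in hidden}

# Toy network: two word-sense variables tied to a shared topic node, plus one
# observed contextual feature; all names and numbers are invented.
variables = {
    "sense_plant": ["factory", "flora"],
    "sense_crane": ["machine", "bird"],
    "topic_nature": [True, False],
    "feat_near_water": [True, False],
}
factors = [
    (("sense_plant", "topic_nature"),
     lambda s: 3.0 if (s["sense_plant"] == "flora") == s["topic_nature"] else 1.0),
    (("sense_crane", "topic_nature"),
     lambda s: 3.0 if (s["sense_crane"] == "bird") == s["topic_nature"] else 1.0),
    (("sense_crane", "feat_near_water"),
     lambda s: 4.0 if s["feat_near_water"] and s["sense_crane"] == "bird" else 1.0),
]
print(gibbs_disambiguate(variables, factors, evidence={"feat_near_water": True}))
```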
There is one restriction on the application of the Gibbs sampler to models formulated from logical constraints. In order for the proof of convergence to hold, the joint distribution of all variables must be strictly positive². This means that logical constraints may not be assigned absolute certainty (i.e., given a probability of 1.0), so that alternative values do not have a zero probability. And because the rate of convergence is affected by small probabilities (Chin & Cooper 1987), the probabilities assigned to logical constraints may have to be adjusted to speed convergence.

²Technically, the requirement is that the Gibbs sampler be irreducible, but the simplest way of assuring irreducibility is to assure that the joint distribution is strictly positive.
The last issue to be addressed in this section is the size of the model and the corresponding complexity of the stochastic simulation. As described so far, the model would contain nodes corresponding to one or more concept taxonomies, in addition to nodes corresponding to a large number of ambiguous words and their contextual features. But, as stated earlier, the time required for the stochastic simulation is proportional to the number of nodes plus the number of edges in the network. In order to assure that the time requirements for the stochastic simulation are not extreme, the model can be limited to only the nodes needed to disambiguate the current input sentence. This can be done by dynamically creating a network corresponding to each input sentence using "marker-passing," as described in Charniak and Goldman (1993), to select the portions of the complete model needed to process each specific input sentence.
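A much cruder stand-in for such marker-passing (purely illustrative; Charniak and Goldman's scheme is more selective) is to keep only the nodes of the full graph reachable within a few hops of the sense nodes of the words that actually occur in the sentence:

```python
def sentence_subgraph(full_graph, seed_nodes, max_hops=3):
    """full_graph: node -> set of neighbouring nodes (undirected).
    Returns the subgraph induced by nodes within max_hops of the seeds."""
    keep = set(seed_nodes)
    frontier = list(seed_nodes)
    for _ in range(max_hops):
        nxt = []
        for node in frontier:
            for nbr in full_graph.get(node, ()):
                if nbr not in keep:
                    keep.add(nbr)
                    nxt.append(nbr)
        frontier = nxt
    return {n: full_graph.get(n, set()) & keep for n in keep}

taxonomy = {"biology": {"science"}, "geology": {"science"},
            "science": {"biology", "geology"},
            "art": {"culture"}, "culture": {"art"}}
print(sentence_subgraph(taxonomy, seed_nodes=["biology"]))
```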
References
Agresti, A. 1992. A Survey of Exact Inference for Contingency Tables. Statistical Science 7(1): 131-177.

Amsler, R. 1980. The Structure of the Merriam-Webster Pocket Dictionary. Technical Report TR-164, University of Texas at Austin.

Bacchus, F. 1990. Representing and Reasoning with Uncertain Knowledge. Cambridge, MA: MIT Press.

Breese, J. 1992. Construction of Belief and Decision Networks. Computational Intelligence 8(4): 624-644.

Bruce, R. and Wiebe, J. 1994a. A New Approach to Word-Sense Disambiguation. Proc. ARPA Human Language Technology Workshop. Plainsboro, NJ.

Bruce, R. and Wiebe, J. 1994b. Word-Sense Disambiguation Using Decomposable Models. Proceedings of the 32nd Annual Meeting of the ACL. Las Cruces, NM.

Bruce, R.; Wilks, Y.; Guthrie, L.; Slator, B.; and Dunning, T. 1992. NounSense - A Disambiguated Noun Taxonomy with a Sense of Humour. Memorandum in Computer and Cognitive Science MCCS-92-246, Computing Research Laboratory, New Mexico State University.

Charniak, E. and Goldman, R. 1993. A Bayesian model of plan recognition. Artificial Intelligence (64): 53-79.

Chin, H. and Cooper, G. 1987. Stochastic Simulation of Bayesian Belief Networks. Procs. 3rd Workshop on Uncertainty in AI, Seattle, WA, 106-113.

Chodorow, M.; Byrd, R.; and Heidorn, G. 1985. Extracting Semantic Hierarchies from a Large On-Line Dictionary. Proceedings of the 23rd Annual Meeting of the ACL, Chicago, IL, 299-304.

Darroch, J.; Lauritzen, S.; and Speed, T. 1980. Markov Fields and Log-Linear Interaction Models for Contingency Tables. Annals of Statistics 8(3): 522-539.

Dempster, A.; Laird, N.; and Rubin, D. 1977. Maximum Likelihood from Incomplete Data Via the EM Algorithm. Journal of the Royal Statistical Society B (39): 1-38.

Eizirik, L.; Barbosa, V.; and Mendes, S. 1993. A Bayesian-Network Approach to Lexical Disambiguation. Cognitive Science (17): 257-283.

Forney, G. D. 1973. The Viterbi Algorithm. Proc. IEEE (6): 268-278.

Geman, S. and Geman, D. 1984. Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence (6): 721-741.

Isham, V. 1981. An Introduction to Spatial Processes and Markov Random Fields. International Statistical Review (49): 21-43.

Kirkpatrick, S.; Gelatt, C.; and Vecchi, M. 1983. Optimization by Simulated Annealing. Science (220): 671-680.

Knight, K. 1993. Building a Large Ontology for Machine Translation. Proc. ARPA Human Language Technology Workshop. Plainsboro, NJ.

Lauritzen, S. and Spiegelhalter, D. 1988. Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems. J. R. Statist. Soc. B 50(2): 157-224.

Michiels, A. and Noel, J. 1982. Approaches to Thesaurus Production. Proceedings of the 9th International Conference on Computational Linguistics (COLING-82), Prague, Czechoslovakia, 227-232.

Miller, G. 1990. Special Issue, WordNet: An on-line lexical database. International Journal of Lexicography 3(4).

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.

Procter, Paul et al. 1978. Longman Dictionary of Contemporary English.

Poole, D. 1993. Probabilistic Horn abduction and Bayesian Networks. Artificial Intelligence (64): 81-129.

Rubin, D. 1991. EM and Beyond. Psychometrika 56(2): 241-254.

Verbeek, A. and Kroonenberg, P. 1985. A survey of algorithms for exact distributions of the test statistics in r x c contingency tables with fixed marginals. Computational Statistics and Data Analysis (3): 159-185.

Vossen, P. 1990. The End of the Chain: Where Does Decomposition of Lexical Knowledge Lead Us Eventually? Proceedings of the 4th Conference on Functional Grammar, Kopenhagen.

Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. New York, NY: John Wiley & Sons.