MISSING ARGUMENT REFERENT IDENTIFICATION IN NATURAL LANGUAGE
by
Jeffrey Foran
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the Degrees of
Bachelor of Science in Computer Science and Engineering
and Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February, 1999
© Jeffrey Foran, MCMXCIX. All rights reserved.
The author hereby grants to MIT permission to reproduce and
distribute publicly paper and electronic copies of this thesis
document in whole or in part, and to grant others the right to do so.
Author: Department of Electrical Engineering and Computer Science
February 3, 1999

Certified by: Clifford J. Weinstein
Group Leader, MIT Lincoln Laboratory
Thesis Supervisor

Certified by: Young-Suk Lee
Staff, MIT Lincoln Laboratory
Thesis Supervisor

Accepted by: Smith
Chairman, Departmental Committee on Graduate Studies
MISSING ARGUMENT REFERENT IDENTIFICATION IN NATURAL LANGUAGE
by
Jeffrey Foran
Submitted to the Department of Electrical Engineering and Computer Science
on February 3, 1999, in partial fulfillment of the requirements of the
degrees of Bachelor of Science in Computer Science and
Master of Engineering in Electrical Engineering and Computer Science
ABSTRACT
Missing arguments are common in spoken language and in telegraphic messages. Identifying the correct linguistic interpretation of a sentence with a missing argument often requires identifying a coreferent of that missing argument. Consequently, automated natural language understanding systems that do not correctly identify a replacement for missing arguments within a sentence will often derive an incorrect semantic or pragmatic interpretation of that sentence. The solution lies in identifying a correct referent of the missing argument and replacing the missing argument with that referent before generating the output semantic representation. The technical challenge remains in calculating the appropriate referent of the missing argument. This thesis describes methods for automatically identifying coreferents of missing arguments within a discourse before the completion of semantic and pragmatic evaluation, allowing for the insertion of these referents into the discourse in a manner that resolves ambiguities while minimizing the number of inaccurate insertions. Applying these methods to a discourse with multiple missing arguments results in a discourse that is more comprehensible to both man and machine.
THESIS SUPERVISOR:
Clifford J. Weinstein
TITLE:
Group Leader, MIT Lincoln Laboratory
THESIS SUPERVISOR:
Young-Suk Lee
TITLE:
Staff, MIT Lincoln Laboratory
ACKNOWLEDGMENTS
I would first like to thank Young-Suk Lee, my thesis advisor who has directly guided me
through the past year of my research. The many discussions that we have had over this last year
concerning my research have been invaluable to my experience as a student. Her drive and her
love of her work has rubbed off on me, and I now look forward to working in the exciting field of
computational linguistics.
Secondly, I would like to thank Clifford Weinstein, without whom I would never have
been able to work on this project.
I would also like to thank Linda Kukolich, who has provided an invaluable and
unforgettable atmosphere during the past several months as an office mate. I could not hope for a
more congenial and understanding friend to share a workspace. No neurons were wasted in our
many discussions on many topics.
Next, I would like to thank the many other people in Group 49 with whom I have had the
pleasure of working and interacting.
I would especially like to thank the members of the
Translation Project whose work provided the foundation for this thesis.
I would like to thank my closest friends and colleagues who have consistently provided a
unique and positive college experience. Working, playing, partying and arguing with you have
made my life enjoyable. Without you, my studies at MIT would have been a trivial
accomplishment.
I also thank MIT for maintaining such an intriguing and worthwhile atmosphere for
learning and for growing up. I have accomplished more at MIT than I ever thought was mentally
and physically possible.
I must also thank Beer, as it has kept me sane through my extreme experiences at MIT.
And in thanking Beer, I thank Cornwalls and its kind and generous staff as well; may they serve good beer for the rest of eternity.
I finally thank my parents, as I would be nothing without them. They always guided me
to become whatever I dreamed, and now I am here. I thank you.
TABLE OF CONTENTS
1. INTRODUCTION ................................................. 6
   1.1 Missing Argument Referent Identification ................. 7
   1.2 The CCLINC System ........................................ 8
   1.3 Research Goals ........................................... 11
   1.4 Thesis Goals ............................................. 11
   1.5 Research Domain .......................................... 12
   1.6 Chapter Summary .......................................... 12
2. SYSTEM COMPONENTS ............................................ 13
   2.1 Structure of the Missing Argument Module ................. 14
   2.2 Seven Components of the System ........................... 17
       2.2.1 Sentence and Paragraph Distance .................... 18
       2.2.2 Coreferent Grammatical Function .................... 21
       2.2.3 Verb Form Correlation .............................. 24
       2.2.4 Coreferent Action Capability ....................... 27
       2.2.5 Coreferent Description Capability .................. 31
       2.2.6 Coreferent Noun Category ........................... 35
       2.2.7 Coreferent Action Recency .......................... 36
   2.3 Chapter Summary .......................................... 38
3. SYSTEM TRAINING .............................................. 39
   3.1 A Method for Combining Weights ........................... 39
   3.2 Results from Training the Algorithms ..................... 43
   3.3 Training Summary ......................................... 49
   3.4 Chapter Summary .......................................... 49
4. CONCLUSION ................................................... 50
   4.1 Summary of Research Work ................................. 50
   4.2 Future Work .............................................. 52
   4.3 Thesis Summary ........................................... 53
LIST OF FIGURES
FIGURE 1-1  CCLINC Translation System Overview
FIGURE 1-2  CCLINC Parse Tree
FIGURE 1-3  CCLINC Semantic Frame
FIGURE 2-1  Semantic Frame with Missing Argument
FIGURE 2-2  Semantic Frame with Empty Placeholder
LIST OF TABLES
TABLE 3-1  Training Results for the Sentence and Paragraph Distance Method
TABLE 3-2  Training Results for the Coreferent Grammatical Function Method
TABLE 3-3  Training Results for the Verb Form Correlation Method
TABLE 3-4  Training Results for the Coreferent Action Capability Method
TABLE 3-5  Training Results for the Coreferent Description Capability Method
TABLE 3-6  Training Results for the Coreferent Noun Category Method
TABLE 3-7  Training Results for the Coreferent Action Recency Method
CHAPTER 1
INTRODUCTION
Automated natural language understanding is the basis of present-day, state-of-the-art
machine translation systems.
One approach to machine translation is an interlingua based
approach as presented in An Introduction to Machine Translation, by Hutchins and Somers [4].
In this approach, the input language is first translated into a language neutral meaning
representation, which is then translated into the target language.
The Common Coalition
Language System (CCLINC) at Massachusetts Institute of Technology Lincoln Laboratory is one
such system.
The integration of a missing argument referent identification and replacement
module into such a system has the potential to enhance the system's translation accuracy on
discourses with missing arguments. This potential exists because the missing argument module
can resolve some of the ambiguity inherent to these discourses.
This chapter introduces automated missing argument referent identification and its
relation to the CCLINC system. It then provides an overview of the CCLINC system. After
the overview of the CCLINC system, the goals of the research described in this thesis are given,
as well as the goals of the thesis itself.
Finally, the domain on which this thesis and the
corresponding research work were implemented is described. In general, this chapter introduces
the concept of missing argument referent identification and explains why it may be an appropriate
extension of the CCLINC translator.
1.1 MISSING ARGUMENT REFERENT IDENTIFICATION
In order to best perform natural language translation, it is imperative to understand the
meaning of the input language. In order to accurately understand a sentence within a discourse, it
is often necessary to extract information that is found external to the sentence in question. For
example, if the sentence, Webster received a hit from one, appeared in a discourse, and its
translation was attempted without taking into account its meaning within a discourse, the
translation output could be incorrect.
The meaning of this sentence becomes clear when its
context in the discourse is examined:
Webster sighted three torpedoes. Webster received a hit from one.
The ambiguity in the sentence, Webster received a hit from one, arises due to the fact that
it cannot be intuitively determined that Webster is a ship by examining the sentence by itself.
Thus, determining an accurate meaning representation of a given sentence can require identifying
the context in which the sentence occurs.
Missing arguments (missing subjects or objects within a sentence) often provide similar
ambiguities to those seen in the example given above. This similarity can be easily recognized by
examining the effects of removing the argument, Webster, from the initial sentence. In the same
sense that the initial sentence was ambiguous because of the uncertain meaning of the word,
Webster, the new sentence with Webster omitted is ambiguous because it is difficult to discern the
precise meaning of the sentence, when examined by itself.
Determining that a ship is the
appropriate coreferent of this missing argument helps to disambiguate the meaning of the
sentence, as knowing that a ship received a hit from one enables the identification of the meaning
of the words, received, hit, and one.
Consequently, identifying the coreferents of missing
arguments within a discourse allows for a more accurate understanding of an input sentence.
One of the applications of automated natural language understanding is machine
translation. It has recently become clear that accurate machine translation of natural language is
difficult without adequate natural language understanding. Since much of machine translation
relies on accurately understanding an input discourse, the improved ambiguity resolution that
missing argument referent determination accomplishes may lead to an improvement in the
accuracy of machine translation.
To recapitulate, missing argument referent determination is the process of ascertaining
the appropriate entity that would fit in place of a missing argument in a given discourse in order
to resolve some of the ambiguity inherent to sentences with missing arguments.
Automated
natural language understanding and, more specifically, the CCLINC system's method of
interlingual machine translation has driven the research of missing argument referent
determination described in this thesis.
By incorporating a missing argument referent
determination and replacement module into a machine translation system such as the CCLINC
system, certain ambiguities that arise due to missing arguments can be resolved, potentially
resulting in improved machine translation accuracy.
1.2 THE CCLINC SYSTEM
The Information Systems Technology Group at MIT Lincoln Laboratory has developed
an automated English to Korean and Korean to English translation system, called the Common
Coalition Language System, or the CCLINC system.¹ A primary goal of the CCLINC system is
to provide enhanced C4I² for military forces by providing an aid for translation. The interlingual
approach used by the CCLINC system allows for a simple extension of the system in order to
translate other languages. The recent development of the system, however, has been focused on
performing translation between English and Korean.
The CCLINC system utilizes a semantic frame language as an interlingual representation
of natural language. All input texts are first translated into this semantic frame language through
the use of a language understanding module. After the semantic frame representation of the input
sentence is created, the system hands the work off to a generation module, which then produces
the output of the system.
¹ Continued work and improvement upon the CCLINC system is in progress.
sentence
  fullparse
    statement
      subject
        noun-phrase
          det: the
          nn head: man
      predicate
        verb-phrase
          verb:go: went
          v-dir prep-phrase
            verb:to: to
            dir object
              noun-phrase
                det: the
                nn head: store

FIGURE 1-2: Example CCLINC parse tree output for the sentence, "the man went to the store."
Once the parse tree for a given input sentence is built, the system generates the semantic
frame. The semantic frame for, the man went to the store, is given in figure 1-3. The importance
of the semantic frame is that, if it is constructed correctly, it captures the core meaning of the input
sentence. Prior to the incorporation of missing argument referent determination and replacement,
the CCLINC system created the semantic frames in a relatively context free manner with respect
to the discourse.
{c statement
   :topic {q person
      :name "man"
      :pred {p the
         :global 1} }
   :subject 1
   :pred {p go_direction_v
      :mode "past"
      :pred {p v_to_goal
         :topic {q location
            :name "store"
            :pred {p the
               :global 1} } } } }

FIGURE 1-3: CCLINC semantic frame construction for the sentence, the man went to the store.
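The frame notation shown in this figure is a nested structure of the form {class head :key value ...}, where a value may itself be a frame, a quoted string, or a number, and keys such as :pred may repeat. As a concrete illustration of how such frames can be read into a program, the following sketch parses the notation into nested Python dictionaries. It is a reconstruction from the printed examples only, not CCLINC's actual code; the token pattern and the `_type`/`_head` field names are assumptions.

```python
import re

# Tokens: braces, double-quoted strings, or bare words (e.g. ':topic', 'go_direction_v').
TOKEN = re.compile(r'\{|\}|"[^"]*"|[^\s{}]+')

def parse_frame(text):
    """Parse one semantic frame into nested dicts (illustrative only)."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                                        # consume '{'
        frame = {"_type": tokens[pos], "_head": tokens[pos + 1]}
        pos += 2
        while tokens[pos] != "}":
            key = tokens[pos].lstrip(":")               # ':pred' -> 'pred'
            pos += 1
            frame.setdefault(key, []).append(value())   # keys may repeat
        pos += 1                                        # consume '}'
        return frame

    def value():
        nonlocal pos
        tok = tokens[pos]
        if tok == "{":
            return parse()
        pos += 1
        if tok.startswith('"'):
            return tok.strip('"')
        return int(tok) if tok.lstrip("-").isdigit() else tok

    return parse()

# a fragment of the figure's frame, parsed into nested dictionaries
frame = parse_frame('{c statement :topic {q person :name "man" '
                    ':pred {p the :global 1} } :subject 1 }')
```

With the frame in this dictionary form, the traversal that a downstream module must perform (such as locating a :topic or a missing subject) becomes ordinary dictionary access.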
Figure 1-1 is a graphical representation of the CCLINC system
structure, as provided in Automated English / Korean Translation for Enhanced Coalition
Communications by Weinstein, Lee, Seneff, Tummala, Carlson, Lynch, Hwang, and Kukolich
[15].
Because of the interlingual approach that the CCLINC system takes, the system can be
extended to any new language by constructing the appropriate lexicon and grammar for a given
language, and by adapting that lexicon and grammar into the language understanding and
generation modules.
When given an input sentence, the understanding module of the CCLINC system first
syntactically labels input sentences.
This syntactic labeling consists of identifying the
structural representation of individual words and phrases through the creation of a hierarchical
tree called a 'parse tree'. Figure 1-2 is an example of a parse tree that CCLINC would create for
the input sentence, the man went to the store.
[Figure: English text or speech <-> language understanding <-> semantic frame (Common Coalition Language) <-> language generation <-> Korean text or speech]

FIGURE 1-1: CCLINC translation system overview.
² Command, Control, Communications, Computer, and Intelligence.
1.3 RESEARCH GOALS
One of the domains that the CCLINC system has been designed to translate is
telegraphic military text with many missing arguments. Other domains that may eventually need
to be translated may also have missing arguments. Translating individual sentences with missing
arguments in a discourse independent manner, even if the missing argument is readily discernible
from the input in the input language, can result in output sentences that are either semantically
incorrect or that have missing arguments whose referents cannot be easily determined in the
output language.
Consequently, the ability to accurately determine and replace missing
arguments during the creation of the intermediary semantic frame representation can potentially
lead to a higher translation accuracy in the CCLINC system.
One goal of the work described in this thesis is the development of a module that uses an
intermediary semantic frame representation of an input discourse to automatically and accurately
discern the referents of missing arguments within that discourse. The application of this module
will ideally alleviate some of the ambiguities that are caused by the occurrence of missing
arguments within a discourse.
This application could potentially enable better automated
understanding and consequent machine translation of input discourses with missing arguments.
A separate goal of the research work described in this thesis is related to the theoretical
aspects of natural language understanding.
By designing and developing a unique method to
determine the referents of missing arguments in a discourse, the work will hopefully extend the
field of missing argument referent determination in general.
1.4 THESIS GOALS
The goal of this thesis, then, is fourfold: to describe methods for automatic missing
argument replacement and referent determination; to evaluate the reasons behind these methods;
to describe the incorporation of these methods into a comprehensible module in the CCLINC
system; and to derive conclusions from the theoretical and implementational work performed.
The components of the missing argument referent determination and replacement module
are discussed in chapter two. The elements of training and the explanation of the development of
the missing argument module are explained in chapter three. Conclusions from the work as well
as a look into the future of automated missing argument replacement and referent determination,
both with relation to the CCLINC system and separate from the CCLINC system, are found in
chapter four.
1.5 RESEARCH DOMAIN
The work described in this thesis is grounded in the MUC-II military domain. The
MUC-II data is a primary basis of the work described herein for three reasons. First,
its accurate translation was the original interest of the CCLINC system. Second, it contains many
instances of missing arguments. Finally, sentences in the MUC-II domain have been readily
available during the development of the thesis work. The following is a paragraph that provides
an example of the data found in the MUC-II domain:
Friendly CAP aircraft splashed hostile MIG-22 proceeding inbound to
Independence at 95 NM. Last hostile acft in vicinity. Air warning red weapons
tight. Remaining alert for additional attacks.
In this example, note that the last three sentences have missing arguments.
By
determining the exact antecedents for sentences such as these, the machine and human
comprehensibility of the discourse can be improved.
It should be noted that MUC-II data cannot be directly published in this thesis.
Consequently, proper names, times, distances, and locations have been altered in the presentation
of this thesis only: test data and training data used by the system were not altered in this fashion.
1.6 CHAPTER SUMMARY
This thesis presents a new method for determining coreferents of missing arguments in an
input discourse and replacing those missing arguments with their appropriate coreferents. The
thesis also describes how this method is integrated into the CCLINC system at MIT Lincoln
Laboratory. Finally, this thesis explains the results of this implementation, and provides insight
into future work in this area.
CHAPTER 2
COMPONENTS OF THE MISSING ARGUMENT MODULE
Automatically identifying the appropriate referents of missing arguments requires a broad
range of linguistic information. Advances in automated natural language understanding have
allowed computers to make use of more and more linguistic information at higher and higher
accuracies. As a consequence, the development of the missing argument referent determination
system as described in this chapter is based on the present state of automated natural language
understanding, and, more specifically, on the CCLINC translation system at MIT Lincoln
Laboratory.
This chapter presents the structure of the missing argument referent determination and
replacement module and explains how this module fits into the CCLINC system. The chapter
then describes the seven components of the missing argument module that are used to identify the
appropriate referents of missing arguments in a given domain.
These components each
incorporate a unique factor that contributes to the final identification of the referent of a missing
argument.
The seven factors that correspond to these seven components are Sentence and
Paragraph Distance, Coreferent Grammatical Function, Verb Form Correlation, Coreferent
Action Capability, Coreferent Description Capability, Coreferent Noun Category, and Coreferent
Action Recency. The methods by which the seven components extract these factors from the
input discourse form the basis of the missing argument referent determination and replacement
module and are explained along with the description of the seven factors in this chapter.
2.1 STRUCTURE OF THE MISSING ARGUMENT MODULE
The basic structure of the missing argument referent determination and replacement
module was formulated from the characteristics of natural language and from those of the
CCLINC system itself. The CCLINC system has the characteristic that the intermediary stage of the
system outputs a semantic frame representation of input sentences, on a sentence by sentence
basis. One characteristic of natural language is that it provides the means to identify the correct
coreferent of a missing argument by the time the sentence that contains the missing argument is
understood by the listener or reader. Because of these two characteristics, the missing argument
module first takes as input a semantic frame that is output by the CCLINC system, while
remembering aspects of previous semantic frames. Then, the missing argument module performs
some computation on the semantic frames up to and including the most recent semantic frame,
attempting to identify the appropriate referent of the missing argument in the present frame, if one
exists. Finally, it produces a semantic frame output with the missing argument replaced by the
entity that the module determined to be its correct referent.
Given that the missing argument module accepts a semantic frame as its input, specifying
the method that the missing argument module uses to determine the appropriate referents of
missing arguments within a discourse is relatively simple. If there is a missing argument within
the present semantic frame, the missing argument module creates an 'empty' placeholder for that
missing argument. The entity that the missing argument module identifies as the correct referent
of the missing argument will eventually be inserted into that empty placeholder.
{c statement
   :subject 1
   :pred {p engage_tr_v
      :mode "past"
      :topic {q artifact
         :name "aircraft"
         :pred {p the
            :global 1}
         :pred {p nn_mod2
            :topic "red"
            :global 1} } } }

FIGURE 2-1: The semantic frame output of the CCLINC system on the phrase, engaged the red
aircraft, before the insertion of an empty placeholder for the subject.
For example, for the phrase³, engaged the red aircraft, the missing argument module first
accepts the output of the CCLINC system as its input, which, in this case, would resemble the
semantic frame seen in figure 2-1.
After the missing argument module inserts an 'empty'
placeholder in the location of the missing argument, the semantic frame representation of
engaged the red aircraft resembles figure 2-2.
After this task is complete, the missing argument module creates a record for each entity
in the present semantic frame that could possibly be a coreferent of a future missing argument. In
the sentence, engaged the red aircraft, the missing argument, the action represented by the verb
engaged, and the aircraft are all possible coreferents of future missing arguments. By examining
the three sentences found in examples 2-1 through 2-3, the fact that these entities are all possible
coreferents of future missing arguments becomes clear. The missing arguments in each of these
three sentences, were they to follow, engaged the red aircraft, in the input discourse, would
corefer with the missing argument, the 'engaging', and 'the aircraft', respectively.⁴
{c statement
   :topic {q empty-placeholder}
   :subject 1
   :pred {p engage_tr_v
      :mode "past"
      :topic {q artifact
         :name "aircraft"
         :pred {p the
            :global 1}
         :pred {p nn_mod2
            :topic "red"
            :global 1} } } }

FIGURE 2-2: The semantic frame output of the CCLINC system on the phrase, engaged the red
aircraft, after the insertion of an empty placeholder for the subject.
³ Example sentences given in this thesis are not directly from the MUC-II domain (as introduced in chapter
one) unless specifically noted. The example here is not a MUC-II sentence.
⁴ At first glance, the adjective 'red' may seem as if it could possibly be a coreferent of a future
missing argument, but this is not the case in the English language. In English, pronouns and missing
arguments cannot refer to adjectives. In general, missing arguments (and pronouns as
well) corefer only with nominal entities: noun phrases and nominal forms of verb phrases.
Also engaged the submarine.
EXAMPLE 2-1: Example sentence whose missing argument corresponds with the missing
argument of engaged the red aircraft.
Resulted in a brutal destruction.
EXAMPLE 2-2: Example sentence whose missing argument corresponds with the act of engaging
in the sentence engaged the red aircraft.
Looked as if it were orange from our perspective, however.
EXAMPLE 2-3: Example sentence whose missing argument corresponds with the aircraft in the
sentence engaged the red aircraft.
Once the records are created for each possible coreferent of future missing arguments, the
records are updated with all of the pertinent information about the corresponding entity. The exact
information that is inserted into these records depends upon the implementation of the seven
components that comprise the system.
After the records in the present semantic frame are filled in, if a missing argument occurs
within the present semantic frame, the missing argument module attempts to identify the correct
referent of this missing argument.
The missing argument module accomplishes this by
sequentially applying seven methods (described below) to each record that corresponds to a
potential coreferent of the present missing argument. Each of these methods weights each of the
potential coreferents based on the unique features of each potential coreferent and on the
characteristics of the missing argument itself.
Once all entities that are considered possible coreferents of the present missing argument
are weighted by each of the seven methods, the missing argument module combines the output of
the seven methods and assigns a final weight to each possible coreferent.
The module then
guesses that the highest-weighted possible coreferent is an actual coreferent of the missing
argument.
Once the missing argument module chooses a final coreferent, the module inserts a copy
of the chosen coreferent into the location of the missing argument in an appropriate lexical form.
The module then outputs the semantic frame with the missing argument replaced by the chosen
entity. This entire task is repeated until the end of the discourse is reached.
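The control loop just described, inserting a placeholder, weighting every recorded candidate with the seven methods, combining the weights, substituting the winner, and then recording the frame's own entities for later gaps, can be sketched as follows. The frame layout, the toy recency method, and the summation used for combine are hypothetical stand-ins for exposition; the thesis's actual weight-combination method is described in chapter three.

```python
def combine(weights):
    # Placeholder combination: sum the per-method weights. (The actual
    # combination method used by the module is described in chapter three.)
    return sum(weights)

def process_discourse(frames, methods):
    """Replace missing subjects frame by frame, as described above."""
    history = []   # records for every possible coreferent seen so far
    for frame in frames:
        if frame.get("subject") == "missing":
            frame["subject"] = {"empty_placeholder": True}
            if history:
                # weight each candidate with every method, combine, pick the best
                best = max(history,
                           key=lambda rec: combine(m(rec, frame) for m in methods))
                frame["subject"] = dict(best)   # insert a copy of the coreferent
        # every entity in this frame may corefer with a later missing argument
        history.extend(frame.get("entities", []))
    return frames

# toy discourse: two frames, the second with a missing subject
frames = [
    {"subject": {"name": "Webster"},
     "entities": [{"name": "Webster"}, {"name": "torpedo"}]},
    {"subject": "missing", "entities": []},
]
# a single toy weighting method standing in for the seven described below
recency = lambda rec, frame: 1.0 if rec["name"] == "Webster" else 0.5
out = process_discourse(frames, [recency])
```

Running the sketch fills the second frame's placeholder with a copy of the highest-weighted candidate, mirroring the module's replace-then-output behavior.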
2.2 SEVEN COMPONENTS OF THE SYSTEM
Seven primary components of the missing argument module work together toward the
goal of missing argument referent identification. These seven components are based on seven
general linguistic factors that differentiate possible coreferents of missing arguments.
These
seven factors and the implementation of their corresponding methods are described in this
section.
Four of these seven factors require the module to evaluate certain relations on words and
word senses.
For example, determining the Coreferent Action Recency factor (described in
section 2.2.7) for a given possible coreferent requires identifying the synonyms of various verbs.
The capability to produce various relations given a word or word sense is incorporated into the
missing argument module through the use of a lexical and semantic resource called WordNet.⁵
For the purposes of the missing argument module, WordNet provides natural language relations
between words and word senses, and is a major contributor to the methods that evaluate the
Coreferent Action Capability, the Coreferent Description Capability, the Coreferent Noun
Category, and the Coreferent Action Recency factors (described in subsections 2.2.4 through
2.2.7).
The application of the seven individual methods that evaluate the seven linguistic factors
is a simple task for the missing argument module, as each method is applied separately. This,
however, constrains each method to embody an algorithm that independently weights every
possible coreferent of each missing argument within a discourse.
Once the set of seven weights for each possible coreferent of a given missing argument is
⁵ WordNet v1.6, developed at the Cognitive Science Laboratory, Princeton, New Jersey; George Miller,
Principal Investigator.
determined by the system, the system optimally combines the weights and outputs a total overall
weight for each possible coreferent.
While the seven methods that incorporate the seven
linguistic factors are described below, the method used for this combination is described in
chapter three.
2.2.1 SENTENCE AND PARAGRAPH DISTANCE
The first factor that can be used to weight possible coreferents of a missing argument is
Sentence and Paragraph Distance. The method that encapsulates this factor is founded on
concepts presented in Description of the PIE System Used for MUC-6, by Dekang Lin [7]. The
basis of this factor is simple: those possible coreferents that are further back in the discourse from
the missing argument are less likely to corefer with the missing argument.
For example, one might expect that an actual coreferent of a missing argument is located
in the previous sentence more often than it is located eight sentences back. In addition, it is
expected that those possible coreferents of a missing argument that occur in a previous paragraph
also have a lower probability to be the actual coreferent of that missing argument than those that
occur in the same paragraph.
In general, possible coreferents of missing arguments that are
different distances from the missing argument should be weighted according to the probability
that the actual coreferent is going to be found at that distance.⁶ All that is left to determine is the
likelihood that the actual coreferent is going to be found at each possible sentence distance.
The appropriate weight for each sentence distance and paragraph distance can be
determined using a simple statistical calculation. This method requires a training discourse with
the missing arguments of the training discourse marked and their coreferents correctly filled in.
To determine the appropriate weights for each of the possible sentence distances, the algorithm
trains on this data and determines the percentage of actual coreferents found at each possible
sentence distance in the training data. Additionally, during this training, the percentage of actual
⁶ The fact that not all coreferents have the possibility of existing a certain number of sentences prior to,
and in the same paragraph as, the missing argument should also be taken into consideration. In the
extreme, for example, we may determine that whenever a coreferent has the possibility of occurring four
sentences back, it always does. Consequently, the appropriate weight for a sentence distance of four
would be lower than it should be, if this were not taken into consideration.
coreferents that occur in the present paragraph versus previous paragraphs can be calculated.
Pseudo-code for this training proceeds as follows:

For each missing argument in the training data:
    Increment counters SN[0] through SN[N], where there are N sentences
        preceding the missing argument and following the previous
        paragraph break.
    Increment counters PN[0] through PN[M], where there are M paragraphs
        preceding the missing argument and following the beginning of
        the discourse.
    If the actual coreferent of the missing argument is in the same
            paragraph as the missing argument:
        SD <- the number of sentences between the missing argument
              and its appropriate coreferent
        PC[0] <- PC[0] + 1
        SC[SD] <- SC[SD] + 1
    Else:
        PD <- the number of paragraphs between the missing argument
              and its appropriate coreferent
        PC[PD] <- PC[PD] + 1
Next missing argument

Separately calculate:
    SperP <- average number of sentences per paragraph in the training data.
Note that after the training is complete, the following variables are set:

SperP:  Average number of sentences per paragraph in the training data.
PN[X]:  Number of times a paragraph existed X paragraphs prior to a
        missing argument.
SN[X]:  Number of times a sentence existed in the training discourse X
        sentences prior to a missing argument, in the same paragraph as
        that missing argument.
PC[X]:  Number of times the actual coreferent of a missing argument was
        found X paragraphs prior to the missing argument itself.
SC[X]:  Number of times the actual coreferent of a missing argument was
        found X sentences prior to the missing argument itself, given it
        was in the same paragraph as that missing argument.
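As a concrete illustration, this training pass can be sketched in Python. The record fields used here (sentence and paragraph positions, and the marked coreferent's location) are illustrative stand-ins for the annotated training discourse the thesis assumes; they are not part of the original work:

```python
from collections import defaultdict

def train_distance_weights(missing_args):
    """Accumulate the SN, PN, SC, and PC counters described above.

    Each record in `missing_args` is a dict with illustrative fields:
      n_prev_sentences:     sentences preceding the missing argument
                            within its paragraph
      n_prev_paragraphs:    paragraphs preceding the missing argument
                            in the discourse
      coref_same_paragraph: True if the actual coreferent shares the
                            missing argument's paragraph
      coref_distance:       sentence distance (same paragraph) or
                            paragraph distance (otherwise)
    """
    SN = defaultdict(int)  # available sentence positions seen in training
    PN = defaultdict(int)  # available paragraph positions seen in training
    SC = defaultdict(int)  # actual coreferents at each sentence distance
    PC = defaultdict(int)  # actual coreferents at each paragraph distance

    for ma in missing_args:
        # every position that *could* have held a coreferent is counted,
        # so the weights can later be normalized by availability
        for s in range(ma["n_prev_sentences"] + 1):
            SN[s] += 1
        for p in range(ma["n_prev_paragraphs"] + 1):
            PN[p] += 1
        if ma["coref_same_paragraph"]:
            PC[0] += 1
            SC[ma["coref_distance"]] += 1
        else:
            PC[ma["coref_distance"]] += 1
    return SN, PN, SC, PC
```

Dividing SC[X] by SN[X] (and PC[X] by PN[X]) then yields the availability-adjusted percentages the footnote above calls for.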
Once the training is complete, the algorithm is ready to run on the actual data. For a possible
coreferent located in the same paragraph as the missing argument, the weight is the product of the
paragraph distance weight for the present paragraph and the sentence distance weight for that
coreferent's sentence distance. The weighting of possible coreferents located in previous
paragraphs ignores the sentence distance weights and instead uses the weights calculated for the
paragraphs. This is because empirical data suggests that sentence distances in previous
paragraphs minimally impact the probability that an actual coreferent is found in that previous
paragraph.
This combination of sentence distance and paragraph distance provides a reasonably
accurate representation of the expected location of the actual coreferent of a missing argument,
assuming the training data is a representative sample of the data that the system will be run on.
When a missing argument is encountered in the actual data (and its referent is unknown), the
algorithm weights each previous possible coreferent according to the percentages accumulated
during training. The following pseudo-code provides this function:
For each possible coreferent C of a given missing argument MA:
    P <- number of paragraph breaks between C and MA
    S <- number of sentences between C and MA (same sentence = 0)
    If P = 0:
        SPDWeight <- (PC[0]/PN[0]) * (SC[S]/SN[S])
    Else:
        SPDWeight <- (PC[P]/PN[P]) * (1 / SperP)
Next C
In essence, SPDWeight corresponds to the percentage of times in the training data the
correct coreferent was the same distance from the missing argument as the possible coreferent is
in the actual data. This weight only factors in sentence distance if the possible coreferent is in the
same paragraph as the missing argument.
The SPDWeight is divided by the number of times a sentence or paragraph distance was seen in
the training data because it is necessary to determine the percentage of times the potential
coreferent is the correct coreferent for a certain sentence distance, given that that sentence
distance is available. The SPDWeight is divided by the average number of sentences per
paragraph in order to convert all weights to a per-sentence basis, keeping them uniform across all
possible coreferents of a missing argument.7
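Given the trained counters, the SPDWeight computation above reduces to a small lookup. The following Python sketch assumes the counters are available as dictionaries keyed by distance (a simplification; a real implementation would also guard against distances never seen in training):

```python
def spd_weight(P, S, PC, PN, SC, SN, sper_p):
    """Sentence and Paragraph Distance weight for one possible coreferent.

    P:      paragraph breaks between the candidate and the missing argument
    PC, PN: paragraph-distance counters accumulated during training
    S:      sentence distance (used only when P == 0)
    SC, SN: sentence-distance counters accumulated during training
    sper_p: average number of sentences per paragraph in the training data
    """
    if P == 0:
        # same paragraph: combine paragraph and sentence distance evidence
        return (PC[0] / PN[0]) * (SC[S] / SN[S])
    # previous paragraph: sentence distance is ignored; the paragraph
    # weight is normalized to a per-sentence basis
    return (PC[P] / PN[P]) * (1.0 / sper_p)
```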
The Sentence and Paragraph Distance method works best for resolving ambiguities that
are otherwise not resolvable when possible coreferents are different distances from one another.
For example, in the following paragraph from the MUC-II data, the last sentence, shot down at
1112, is ambiguous as to what previous element in the paragraph was shot down; either the
MIG-22 or the friendly fighters could be the appropriate referent.
CVW-42 launched a strike against Land5 AFB. No bogey confrontation. TOT
for the A-2 was 1045. Ordnance expended. A hostile MIG-22 was intercepted
by friendly fighters. Shot down at 1112.
By applying this method to this example, the system can correctly determine the
appropriate coreferent.
In general, humans assume that the most recent reasonable entity that fits as a coreferent to
a missing argument is the correct coreferent. The Sentence and Paragraph Distance method
subsumes this natural human assumption for the missing argument module.
2.2.2 COREFERENT GRAMMATICAL FUNCTION
The second factor that can be used to weight possible coreferents of a missing argument
is the Coreferent Grammatical Function factor. The method that incorporates this factor into the
module is founded on ideas presented in Japanese Discourse and the Process of Centering by
7 It would be ideal to convert all weights to a per-possible-coreferent basis. However, the assumption that
the number of coreferents in a given sentence is independent of that sentence's location permits
simplification of the algorithm to a per-sentence basis. This is done solely for the purpose of simplifying
the analysis.
Walker, Iida, and Cote [14].
The idea behind this factor is that different coreferents that
correspond to different parts of the grammatical structure of a sentence have different likelihoods
of being referred to by a missing argument.
For example, one might expect that the actual
coreferent of a missing subject is more likely the subject of a previous sentence than it is the
indirect object. In general, possible missing argument coreferents that have different grammatical
functions in previous sentences should be weighted according to the probability that the actual
coreferent is going to have that grammatical function within a sentence.8
The appropriate weight assigned by the missing argument module for this factor for a
given possible coreferent of a missing argument can be determined by applying a method that is
similar to that which calculates the sentence distance heuristic. Like the Sentence and Paragraph
Distance factor, the Coreferent Grammatical Function factor requires a training discourse with
the missing arguments of the training discourse marked and their coreferents correctly filled in.
To determine the appropriate weights for each possible grammatical function, the algorithm trains
on data by determining the probability that an actual coreferent of a missing argument performs
each possible grammatical function. The training proceeds according to the following pseudo-code:

Let <GRAM FUNC> ∈ {Subject, Main Verb, Object, Other Noun, Other Verb, Other}

For each missing argument:
    If the actual coreferent acts as a <GRAM FUNC>:
        <GRAM FUNC>ActualCnt <- <GRAM FUNC>ActualCnt + 1
    For each possible coreferent of that missing argument9:
        If the possible coreferent acts as a <GRAM FUNC>:
            <GRAM FUNC>PossibleCnt <- <GRAM FUNC>PossibleCnt + 1
    Next possible coreferent
Next missing argument
8 Similar to the Sentence and Paragraph Distance heuristic, it is necessary to take into consideration the
fact that certain structural parts of a sentence, such as the indirect object, do not occur as often as other
structural parts, such as the subject of a sentence. Consequently, the appropriate weight for such structural
parts must be adjusted by the number of times they are seen relative to other structural parts of a sentence.
9 The possible coreferent of the missing argument is what the system would consider to be a possible
coreferent if it did not know of the actual coreferent, even though the actual coreferent is tagged in the
training data.
Note that after the training is complete, each 'PossibleCnt' variable represents the number of
times its corresponding grammatical function was encountered for a possible coreferent of a
missing argument, and each 'ActualCnt' variable represents the number of times its
corresponding grammatical function was encountered for an actual coreferent of a missing
argument. To run the algorithm on actual data, the following actions are performed:
Let <GRAM FUNC> ∈ {Subject, Main Verb, Object, Other Noun, Other Verb, Other}

For each possible coreferent C of a given missing argument MA:
    If C acts as a <GRAM FUNC>:
        CSTWeight <- <GRAM FUNC>ActualCnt / <GRAM FUNC>PossibleCnt
Next possible coreferent
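The training and weighting above can be sketched in Python as follows. The record layout (an `actual_func` label plus a list of `possible_funcs`) is an illustrative assumption about how the annotated training data might be represented, not something specified in the thesis:

```python
from collections import Counter

def train_gram_func(missing_args):
    """Count how often each grammatical function appears for actual
    versus possible coreferents of missing arguments."""
    actual_cnt = Counter()
    possible_cnt = Counter()
    for ma in missing_args:
        actual_cnt[ma["actual_func"]] += 1
        for func in ma["possible_funcs"]:
            possible_cnt[func] += 1
    return actual_cnt, possible_cnt

def cst_weight(func, actual_cnt, possible_cnt):
    """Weight a candidate by the fraction of training candidates with
    this grammatical function that were actual coreferents."""
    if possible_cnt[func] == 0:
        return 0.0  # function never seen as a candidate during training
    return actual_cnt[func] / possible_cnt[func]
```

Note the division by `possible_cnt`, which mirrors the availability adjustment described in the footnote: functions that rarely appear as candidates are not penalized merely for being rare.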
In general, this method disambiguates by weighting a possible coreferent to a missing
argument based on the percentage of possible coreferents of missing arguments with a specific
grammatical function that are actual coreferents.
The method is best suited for resolving
ambiguities that are otherwise not resolvable when possible coreferents perform different
grammatical functions within a discourse.
For example, in the following short discourse, the sentence is considered
a beautiful girl is ambiguous as to whether it means that Emily is considered a
beautiful girl or that Jill is considered a beautiful girl:
Emily went to the store with Jill. Is considered a beautiful girl.
In most normal discourses, the ambiguity is resolved because the subject of the previous sentence
is more likely to be referred to in future sentences than another noun in that sentence. For these
discourses, the coreferent grammatical function method would behave as expected and would
choose 'Emily' as the appropriate antecedent to the second sentence. This method also takes into
consideration discourses with abnormal sentence structure such as seen in the MUC-II corpus.
For example, if the portion of the discourse previous to the example discourse resembled the
23
following example, then it would not be unreasonable to assume that 'Jill' is the appropriate
antecedent to 'is considered a beautiful girl,' as this would continue the trend in the discourse,
however awkward it may be.
John went to the store with Jane. Is considered a tall woman.
Jenny went to the store with George. Is considered a big boy.
Billy went to the store with Sarah. Is considered a loud girl.
The structural type algorithm would recognize these types of trends because it would
train on similar sentences.
2.2.3 VERB FORM CORRELATION
One problem encountered when handling missing arguments is that commands in a
discourse cannot easily be distinguished from sentences with missing arguments. For example,
humans automatically recognize the subject of the sentence when presented with the well-formed
sentence, "go to the store." Speakers of English immediately recognize this as a command, and,
being a well-formed sentence, know that the implied subject of the sentence is "you" or the
listener of the sentence. When dealing with grammar that includes missing arguments, this is not
necessarily the case, as seen in the following paragraph from the MUC-II data:
USS Intrepid holding active subsurface contact fm unit. Contact tracking.
Consider contact to have executed counter attack. Contact is threat.
In this discourse, the sentence, consider contact to have executed counter attack, is not a
command, even though it is in the root form. Consequently, whenever the automated missing
argument module encounters a sentence in the present tense with a missing subject, it cannot assume that
it is a command. It must treat the sentence as if it had a missing argument until it determines
otherwise. This brings about the purpose of the third factor that we can use to weight possible
coreferents: verb form correlation.
Because human intuition assumes certain aspects about a missing argument based on the
verb form following a missing argument, it is possible to develop a method that weights possible
coreferents based on variations in these assumptions between different domains.
In many
domains, sentences that begin with a verb in the present tense can be assumed to corefer with the
listener of that sentence. In other domains, such as in MUC-II, it is necessary to come up with a
different system of weights that correspond to the different types of assumptions that readers
make based on verb form.
Thus, the third method for weighting possible coreferents measures the assumptions that
humans make based on the verb form of a verb that follows a missing argument. As with the first two
methods, this requires training on data that has missing arguments marked with their appropriate
coreferents. Once this is available, the system trains on this data and determines the percentage
of missing arguments whose actual coreferents fall into certain categories given the form of the
verb that follows the missing arguments. Then, when the system runs on the real data, each
possible coreferent is weighted based on the category into which it falls, the verb form that
follows the corresponding missing argument, and the percentage of times the training associated
that category with that verb form.
The appropriate categories of possible coreferents and different groupings of verb
formations are dependent upon the size of the training data as well as the domain itself. The
dependence upon the size of the training data arises because, if the proportion of the number of
categories or the number of groups of verb formations to the size of the training data is too large,
then there will not be enough trials from the training data to sufficiently populate the categories
and groups. If there is not enough data for each of the categories and each of the verb formation
groups, the statistical correspondence between the categories and the correct coreferent will be
weakened.
The reason why the exact categories of possible coreferents and different groupings of
verb formations are dependent upon the domain on which the system will run can also be easily
examined.
For example, the MUC-II domain is generally narrative. Because no references to a
listener are made, no 'listener' category is necessary. On the other hand, whenever commands are
given in a non-narrative discourse, the missing argument10 always refers to the listener, and the
commands are generally in the present tense. For domains with such sentences, it would likely be
useful to have 'listener' as a category of possible coreferents. Because the distribution of verb
forms within a given discourse is domain dependent, the algorithm must account for different
verb formation groupings depending upon the domain.
The categories of coreferents and groupings of verb formations determined by the
missing argument module were chosen by manually examining the data on which the system
trained.11
The three resultant categories of possible coreferents used by the system are the
speaker, people and items capable of action, and other possible coreferents. The four groups of
verb formations used by the system are the non-passive present tense formations, passive
formations, verb formations in the 'ing' form, and all others.
Like the Sentence and Paragraph Distance algorithm and the Coreferent Grammatical
Function algorithm, this algorithm requires a set of training data with the appropriate coreferents
to the missing arguments marked. The algorithm then performs the following steps to train on the
training data:
Domain specific:
    Let <FORM> ∈ {Present, Passive, Ing, Other}
    Let <CATEGORY> ∈ {Speaker, People and Items Capable of Action, Other}

Domain independent:
    For each missing argument MA:
        MAVerb <- the verb that follows MA (that MA performs)
        For each possible coreferent C of MA:
            If C ∈ <CATEGORY> and MAVerb ∈ <FORM>:
                Number<CATEGORY><FORM> <- Number<CATEGORY><FORM> + 1
        Next possible coreferent
        A <- actual coreferent of MA
        If A ∈ <CATEGORY> and MAVerb ∈ <FORM>:
            Count<CATEGORY><FORM> <- Count<CATEGORY><FORM> + 1
    Next missing argument
10 As previously noted, even when the subject, 'you', is absent in an English command, the sentence
remains grammatically well-formed; consequently, the lack of a subject is not technically a missing
argument. However, it must be treated as such when dealing with a discourse with actual missing
arguments.
11 Although it may be possible to automate the formation of coreferent categories and verb tense groupings,
it is beyond the scope of this work.
The 'number' variables represent the number of times a specific category and a specific
verb formation grouping occurs for each possible coreferent of each missing argument.
The
'count' variables represent the number of times a specific category and a specific verb formation
grouping corresponds to the actual coreferent of each missing argument.
Once the training is complete, the system uses the information accumulated to run on the
actual data. In general, for a given possible coreferent category and missing argument verb form,
the system outputs the percentage of times the category and form occurred together when the
possible coreferent was the actual coreferent.
This can be calculated by the following pseudo-code:

For each possible coreferent C of a given missing argument MA:
    <CATEGORY> <- the category into which C falls
    <FORM> <- the verb formation grouping of the verb that
              corresponds to MA
    VTCWeight <- Count<CATEGORY><FORM> / Number<CATEGORY><FORM>
Next possible coreferent
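A minimal Python sketch of this training and weighting follows. The per-record fields (the verb form grouping, the categories of the possible coreferents, and the category of the actual coreferent) are illustrative assumptions about the annotated data's representation:

```python
from collections import Counter

def train_verb_form(missing_args):
    """Accumulate the Number/Count tables keyed by (category, verb form)."""
    number = Counter()  # (category, form) over possible coreferents
    count = Counter()   # (category, form) over actual coreferents
    for ma in missing_args:
        form = ma["verb_form"]
        for cat in ma["possible_categories"]:
            number[(cat, form)] += 1
        count[(ma["actual_category"], form)] += 1
    return number, count

def vtc_weight(category, form, number, count):
    """Fraction of candidates in this category, under this verb form,
    that were actual coreferents in the training data."""
    if number[(category, form)] == 0:
        return 0.0  # pairing never seen during training
    return count[(category, form)] / number[(category, form)]
```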
The verb formation correlation algorithm is similar to the previous two algorithms in that
it determines characteristics of the training data with respect to each possible coreferent of a
missing argument, and then weights the possible coreferent according to the percentage of times
those characteristics correspond with the characteristics of the actual coreferents in the training
data.
The characteristics that this algorithm uses, however, are the verb formation that
corresponds to the missing argument and the category of the possible coreferent of that missing
argument.
2.2.4 COREFERENT ACTION CAPABILITY
The fourth factor that affects the output of the missing argument module is based on the
concept that any given noun is only capable of performing a subset of all possible actions. The
following example demonstrates a sentence where the system could use this method to determine
the referent of a missing argument:
1945 USS Intrepid sighted 2 torpedoes 3000 yds off stbd bow. Launched urgent
asroc attack. Sighted third torpedo and launched third asroc attack based on
wake locations. Washington in company with Intrepid not believed attacked.
Here, the second and third sentences have missing subjects. However, the only entity
mentioned prior to the third sentence that could possibly 'launch an attack' or
'sight a torpedo' is the ship, the USS Intrepid. If the module could recognize that ships can
perform the actions launching something and sighting something, it would determine that Intrepid
is a possible coreferent of the missing arguments in these sentences. Additionally, if the module
could recognize that neither torpedoes nor yds nor bow nor asroc attack could perform these
actions, then it could correctly identify the missing arguments above with this knowledge alone.
The method that we can use to evaluate the capability of a possible coreferent to perform
a given action is relatively straightforward. The method determines that a given noun is capable
of performing a given action by noticing that that noun performed that action in the training
discourse. As a consequence, the method first requires training on a set of input data, recording
noun-verb pairs when any noun performs any action in that training data. Then, as long as the
training data is similar to the actual data, the system can correctly determine some of the
capabilities of nouns that it has previously encountered. When the system is trained on similar
data as the test data, the fact that the module has encountered 'a ship' that 'sights' in the training
data, but has not encountered 'a torpedo' or 'an asroc attack' that 'sights', will cause the system
to correctly identify the missing argument coreferent in the case above.
One problem with this method is that for very diverse domains, the method may require a
large amount of training data. This problem arises because the method requires that the system
has encountered, with reasonably high probability, any given noun-verb pair from the test data
during training. This may require massive amounts of training data. Because the MUC-II corpus
is so small, an enhancement to this method might be useful.
To increase the coverage of this method without increasing the amount of training data
required, it is useful to note that if something can fly, then it can also move. Additionally, if 'a
plane' can perform some action, then knowing that 'an aircraft' is synonymous with 'a plane'
implies that 'an aircraft' can perform that action as well. To generalize this concept, if something
has the ability to perform a specific action, then it also has the ability to perform other more
general actions that subsume this action. Additionally, if some entity is capable of performing a
specific action, then other entities of which the original entity is a type can also perform that
specific action. When one entity is a type of a second entity, the second, more general entity is
considered a hypernym of the first (and the more specific entity is the hyponym of the more
general entity).
To accurately determine the pairs of words that should be included in the list, the missing
argument module can utilize hypernym and synonym relations provided by WordNet. The use
of the WordNet database allows the missing argument module to generate a list of hypernyms or
synonyms given a precise meaning of a word. The following is pseudo-code for the appropriate
implementation of the training:
For each sentence in the training data:
    For each noun N in the sentence that performs some action V:
        If N is a proper noun:
            N <- improper equivalent of N
        NList <- the list of nouns WordNet considers to be synonyms
                 of the appropriate sense of the noun N
        VList <- the list of verbs WordNet considers to be hypernyms
                 of the appropriate sense of the verb V
        For each verb VL in VList:
            For each noun NL in NList:
                increment the count associated with the NL-VL
                pair in the noun-action-capability database
            Next noun NL
        Next verb VL
    Next noun N
Next sentence
Unfortunately, appropriately determining the sense of the words in the output semantic
frame is necessary for the appropriate use of the hypernym and synonym relations. Additionally,
such sense disambiguation is beyond the scope of the work described in this thesis. This task
may be difficult in general due to the ambiguous nature of sentences with missing arguments.
Consequently, the implemented system disambiguates the word sense by defaulting to the most
common sense of the word according to WordNet. This will inevitably cause problems, as the
most common sense of the word according to WordNet will often be the incorrect sense of the
word, because WordNet's sense frequencies derive from domains very different from the MUC-II
domain. Future improvements upon this system may wish to focus on this issue. However,
even though the implemented missing argument module has this fault, this method can still be
applicable as some correlation will exist between the output of this method and the likelihood that
a given potential coreferent of a missing argument is a correct coreferent.
Once the training is complete, when a missing argument performs an action in the test
data, this method is capable of outputting a weight for each possible coreferent that is based on
that possible coreferent's capability to perform the action that the missing argument performs.
This capability is determined in the following manner from the list created by training:
For each possible coreferent C of a given missing argument MA:
    V <- the base verb corresponding to the action that MA performs
    If C is a noun:
        N <- an improper noun corresponding to the possible coreferent
    If C corresponds to a verb phrase:
        N <- the noun form of the main verb corresponding to C
    NVCount <- DatabaseCount(N, V)
    If NVCount > Threshold:
        CACWeight <- 1.0
    Else:
        CACWeight <- 0.0
Next possible coreferent
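The training and weighting for this factor can be sketched in Python as follows. The `NOUN_SYNONYMS` and `VERB_HYPERNYMS` maps are tiny hand-written stand-ins for the WordNet lookups; a real implementation would query WordNet for the appropriate sense of each word:

```python
from collections import defaultdict
from itertools import product

# Illustrative stand-ins for WordNet relations (not the real database).
NOUN_SYNONYMS = {"ship": ["ship", "vessel"], "plane": ["plane", "aircraft"]}
VERB_HYPERNYMS = {"sight": ["sight", "see"], "launch": ["launch", "propel"]}

def train_action_capability(noun_verb_pairs):
    """Record noun-verb capability counts, expanding each observed pair
    with noun synonyms and verb hypernyms to broaden coverage."""
    db = defaultdict(int)
    for noun, verb in noun_verb_pairs:
        nouns = NOUN_SYNONYMS.get(noun, [noun])
        verbs = VERB_HYPERNYMS.get(verb, [verb])
        for n, v in product(nouns, verbs):
            db[(n, v)] += 1
    return db

def cac_weight(candidate_noun, verb, db, threshold=0):
    """1.0 if the candidate was seen (above threshold) performing the
    verb in training, else 0.0."""
    return 1.0 if db[(candidate_noun, verb)] > threshold else 0.0
```

With this expansion, training on 'ship sights' also licenses 'vessel sees', which is the coverage gain the hypernym and synonym relations are meant to provide.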
The Coreferent Action Capability method is different than the previous methods in that
the data it trains on does not require the appropriate missing argument coreferent to be marked. It
is also unique in that training on any sort of data will likely be helpful for resolving coreferent
action capabilities in any domain, due to domain independent properties of nouns and verbs.
2.2.5 COREFERENT DESCRIPTION CAPABILITY
The fifth factor that can be used to weight a possible coreferent of a missing argument is
the possible coreferent's description capability.
The method for applying this factor is very
similar to that for the Coreferent Action Capability factor, as it shares the concept that certain
characteristics of the missing argument are likely to match the characteristics of its appropriate
coreferent.
In the description capability method, the algorithm first determines if the missing
argument is described by an adjective. If it is, the weight that the method outputs is based on the
likelihood that the potential coreferent would be described by the adjective that describes that
missing argument in the discourse.
To apply this method, as with the application of the missing argument Coreferent Action
Capability method, it is necessary to train the system on a set of data. In this method, however,
the training consists of marking down noun-adjective pairs whenever an adjective describes a
noun in the training data (instead of noun-verb pairs, as with Coreferent Action Capability.)
Once the training is complete, the system has a list of noun-adjective pairs such that all of the
adjectives in the list described their corresponding nouns somewhere in the training data. This
list can then be examined when the system comes to a missing argument that is described by
some adjective; if a potential coreferent of the missing argument appears on the list with that
same adjective, then that potential coreferent is a reasonable candidate as the actual coreferent,
because it has the capability of being described by the adjective. If the potential referent does not
appear on this list with the adjective that describes the missing argument, then, assuming the
training data was large enough, that potential coreferent is not likely capable of being described
by the adjective that describes the missing argument. Consequently, the potential coreferent in question is
not likely the actual coreferent of the missing argument.
The hypernym and synonym relations can be applied to this method in a similar manner
as to the Coreferent Action Capability method. This is done by not only testing whether or not
the adjective that describes the missing argument appears on the noun-adjective list with the
possible coreferent, but by also testing whether or not any of the synonyms of that adjective
appear on the noun-adjective list with the possible coreferent.
For example, if the word
'building' were described as being 'gargantuan' in the training data and a missing argument in the
input data was described as being 'enormous', then, since 'gargantuan' can describe 'building'
and 'gargantuan' is a synonym of 'enormous', 'enormous' can describe 'building' as well.
Consequently, whenever building-gargantuan is inserted into the noun-adjective list, building-enormous can be inserted as well.
A parallel concept is applicable to the nouns that are being described. By including in the
noun adjective list all of the hypernyms of the noun that is described by some adjective during
training, additional coverage of the algorithm can be achieved.
For example, if the word
'motorcycle' were described as being 'fast' in the training data, then, since 'vehicle' is a
hypernym of 'motorcycle', the system can infer that 'vehicles' can be described as being 'fast' as
well. Consequently, when the pair, motorcycle-fast is inserted into the noun-adjective list, the
entry, vehicle-fast, is inserted as well.
In general, whenever a noun is described by an adjective in the training data, a list of
hypernyms of that noun and a list of synonyms of that adjective are created. Each possible noun-
adjective pair in the product of these two lists is then inserted into the coreferent description
capability pair list.
The training for this method is very similar to that of the Coreferent Action Capability
method, as seen in the following pseudo-code:
For each noun N in the training data described by some adjective A:
    If N is a proper noun:
        N <- improper equivalent of N
    NList <- the list of nouns WordNet considers to be hypernyms
             of the most appropriate sense of the noun N
    AList <- the list of adjectives WordNet considers to be
             synonyms of the appropriate sense of the adjective A
    For each adjective AL in AList:
        For each noun NL in NList:
            increment the count associated with the NL-AL
            pair in the noun-adjective database
        Next noun NL
    Next adjective AL
Next noun N
As explained previously in the Coreferent Action Capability section, appropriately
determining the sense of the words in the output semantic frame is necessary for the appropriate
use of the hypernym and synonym relations. Such sense disambiguation is beyond the scope of
the work described in this thesis. Consequently, the implemented system disambiguates the word
sense in this method by defaulting to the most common sense of the word in WordNet. This will
inevitably cause problems, as the most common sense of a word according to WordNet will often
be the incorrect sense of the word, on account that WordNet was trained on a domain that is very
different than the MUC-II domain.
Once the noun-adjective-capability database is built, if enough training data is supplied,
the system should be able to determine if it is reasonable to describe a given noun with a given
adjective, by looking up the pair in the database.
After the database is built, this method weights each possible coreferent of each missing
argument that is described by some adjective. The system uses a rule-based method to determine
if a missing argument is described by an adjective. Such a rule might be:
X is considered Y and Y is an adjective → Y describes X
The automation of the creation of such rules is beyond the scope of the work described in
this thesis, as experimental results have determined that a few simple rules in the case of the
MUC-II domain cover many of the cases where a missing argument is described by some
adjective. Thus, the rules used in the system could be developed by manually examining the
training data. Once the rules determine that an adjective describes a missing argument, each
noun-adjective pair (where the noun corresponds to each possible coreferent of that missing
argument) is looked up in the noun-adjective-capability database. If the count for that noun-
adjective pair is above a certain threshold12, then that possible coreferent is given a weight of one;
if the count is not above that threshold, then the possible coreferent is given a weight of zero.
12 The exact location of the threshold will depend upon the size and variance of the training data. For
small amounts of training data, the appropriate threshold is one. Automating the determination of this
threshold for large quantities of training data is beyond the scope of the work presented in this paper.
The pseudo-code that the system follows for determining this weight for a possible coreferent of a
missing argument is as follows:
For each possible coreferent C of a missing argument MA:
    S <- the sentence in which MA is located
    Run through the set of rules on S
    If no rule matches S:
        CDCWeight <- 0.0
    Else:
        A <- the adjective that the rule returns that describes MA
        N <- the noun corresponding to the possible coreferent C
        NACount <- DatabaseCount(N, A)
        If NACount > Threshold:
            CDCWeight <- 1.0
        Else:
            CDCWeight <- 0.0
    End If-Else
Next possible coreferent
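The database construction and lookup for this factor parallel the Coreferent Action Capability sketch, and can be illustrated in Python. The `NOUN_HYPERNYMS` and `ADJ_SYNONYMS` maps are again hand-written stand-ins for WordNet queries:

```python
from collections import defaultdict
from itertools import product

# Illustrative stand-ins for WordNet relations (not the real database).
NOUN_HYPERNYMS = {"motorcycle": ["motorcycle", "vehicle"]}
ADJ_SYNONYMS = {"gargantuan": ["gargantuan", "enormous", "huge"]}

def train_description_capability(noun_adj_pairs):
    """Build the noun-adjective-capability database, expanding each
    observed pair with noun hypernyms and adjective synonyms."""
    db = defaultdict(int)
    for noun, adj in noun_adj_pairs:
        for n, a in product(NOUN_HYPERNYMS.get(noun, [noun]),
                            ADJ_SYNONYMS.get(adj, [adj])):
            db[(n, a)] += 1
    return db

def cdc_weight(candidate_noun, adjective, db, threshold=0):
    """1.0 if the candidate noun can plausibly be described by the
    adjective according to the training database, else 0.0."""
    return 1.0 if db[(candidate_noun, adjective)] > threshold else 0.0
```

Training on 'gargantuan motorcycle' thus also licenses 'enormous vehicle', matching the worked examples above.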
This algorithm is most useful when ambiguities arise that can be resolved by determining
that an adjective that describes the missing argument can only describe one of its possible
coreferents. This can be examined in the following short discourse from the MUC-II corpus:
Two unidentified air contacts were fired upon by USS Georgetown with birds at
2218. Both contacts not squawking ID modes. No Esm. Confirmed not friendly.
Picked up at brg 87 fm Louisiana. Considered popup. Held by Georgetown and
A-7. Last report.
Here, the missing argument in the sentence, considered popup, is described as being
popup. Since none of the previous words other than contacts can be described by the adjective,
popup, contacts must be the appropriate coreferent to the missing argument in the sentence. The
missing argument module can recognize this by discovering that, of the possible coreferents in
this sentence, only contacts was described in the training data as being popup.
For small amounts of training data, the appropriate threshold is one. Automating the
determination of this threshold for large quantities of training data is beyond the scope of the
work presented in this paper.
2.2.6 COREFERENT NOUN CATEGORY
The sixth factor that contributes to the identification of the actual coreferents of a missing
argument is applicable when the missing argument is compared with or equated to another noun.
In order to appropriately apply this factor to a discourse, the missing argument module must
recognize when two elements are compared with or equated to one another. The method used to
evaluate this factor applies the fact that, when a missing argument is compared with or equated to
a noun, the actual coreferent of the missing argument must also fit in the general category of that
noun. This can be seen in the following example MUC-II sentences:
Have positive confirmation that battleforce is targeted. Considered hostile act.
Here, the only possible coreferent of the missing argument in the second sentence is the
'targeting' that occurred in the first sentence, as the only entity that falls under the category of an
act is the 'targeting'.
Once again, the hypernym relation can be used to expand the missing argument module's
coverage of this factor. In this case the hypernym relation is used to determine whether one noun
falls into the general category of another.
Thus, to determine the appropriate weight for a
possible coreferent of a missing argument that is compared with a noun, the system lists the
hypernyms of all previous possible entities and determines if the noun to which the missing
argument is compared is in each of the lists. If that noun is in one of these lists, then the entity
that corresponds to that list has a reasonable chance of being the actual coreferent of the missing
argument.
Once again it must be noted that determining the sense of the words in the output
semantic frame is necessary for the appropriate use of the hypernym relation. Because such sense
disambiguation is beyond the scope of the work described in this thesis, the implementation of
this method defaults to the most common sense of the word in WordNet for determining the list of
hypernyms of the noun.
The pseudo-code that implements this algorithm is as follows:
For each possible coreferent C for a given missing argument MA:
    S <- the sentence in which MA is located
    run through the set of rules on S
    If no rule matches S:
        CNCWeight <- 0.0
    Else:
        E <- the noun that the rule returns that is equated to MA
        NList <- the list of hypernyms returned by WordNet on C
        If E is found in NList:
            CNCWeight <- 1.0
        Else:
            CNCWeight <- 0.0
        End If-Else
    End If-Else
Next possible coreferent
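The hypernym check can be sketched as follows. A small hand-built hypernym table stands in for WordNet, and the example chains are hypothetical; the real module would query the hypernym relation of the most common sense of each noun:

```python
# Sketch of the Coreferent Noun Category check. HYPERNYMS is a stand-in
# for WordNet's hypernym relation, restricted to the most common sense.

HYPERNYMS = {
    # hypothetical hypernym chains for illustration only
    "targeting": ["act", "activity", "event"],
    "battleforce": ["force", "group", "organization"],
}

def cnc_weight(coreferent, equated_noun):
    """1.0 if the noun equated to the missing argument appears among the
    possible coreferent's hypernyms, else 0.0."""
    return 1.0 if equated_noun in HYPERNYMS.get(coreferent, []) else 0.0

print(cnc_weight("targeting", "act"))    # 1.0
print(cnc_weight("battleforce", "act"))  # 0.0
```

In the battleforce example from the text, only the 'targeting' has act among its hypernyms, so only that entity receives a nonzero weight.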
This algorithm provides the most ambiguity resolution when actual coreferents can be
determined solely by the fact that the missing argument is compared to or equated with another
noun, and the correct coreferent falls in the category of this noun.
2.2.7 COREFERENT ACTION RECENCY
The final method described for weighting the possible coreferents of a missing argument
is applicable when the missing argument takes some action and that same or similar action was
recently taken by another entity. The factor that this method measures is the Coreferent Action
Recency factor. One of the properties that occasionally makes the referent of a missing argument
intuitively determinable is that the referent of the missing argument performs some action that
some other entity recently performed. Because of this, when an entity has recently performed the
same action as the missing argument, that entity is likely a coreferent of the missing argument.
To apply this method, only a few steps need to be taken. First, the 'recency' parameter
needs to be set. This parameter determines how close, in terms of sentence distance, the missing
argument has to be to a possible coreferent that performs the same or similar action as the missing
argument in order for the method to weight the possible coreferent as a likely candidate for being
the actual coreferent. This value can be set manually by estimating, or it can be determined by
performing a statistical evaluation of the training data and determining the optimal measure. The
missing argument module described in this thesis used the former, manual method.
After the recency parameter is set, when the system is running on input data, every time a
missing argument is performing some action, the system creates a list of synonymous actions.
This list is created through the use of WordNet. The system then runs through the 'recent'
possible coreferents to the missing argument, and determines if any of them are performing an
action that is in the list of synonymous actions.
If one or more of the possible coreferents is
performing an action on this list, then those coreferents should be given a weight that is higher
than if they were not performing an action on this list. The pseudo-code for this algorithm is as
follows:
For each possible coreferent C for a given missing argument MA:
    CARWeight <- 0.0
    If C is a proper noun:
        CV <- the verb that corresponds with the action C performs
        MAV <- the verb that corresponds with the action MA performs
        MAVList <- synonyms of MAV according to the WordNet database,
                   based on the appropriate sense of MAV
        If CV is found in MAVList:
            CARWeight <- 1.0
Next possible coreferent
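A minimal sketch of this weighting follows. The synonym lists are hand-built stand-ins for WordNet verb synsets, and the recency window is a manually set sentence distance, as the text describes:

```python
# Sketch of the Coreferent Action Recency weighting. SYNONYMS stands in
# for WordNet's synonym relation on the most common verb sense.

SYNONYMS = {"conducted": {"conducted", "performed", "carried_out"}}

def car_weight(coref_verb, ma_verb, coref_distance, recency=2,
               is_proper_noun=True):
    """1.0 when a proper-noun possible coreferent recently performed the
    same or a synonymous action as the missing argument."""
    if not is_proper_noun or coref_distance > recency:
        return 0.0
    synonyms = SYNONYMS.get(ma_verb, {ma_verb})
    return 1.0 if coref_verb in synonyms else 0.0

print(car_weight("conducted", "conducted", 1))  # 1.0  (the F-14 case)
print(car_weight("targeted", "conducted", 1))   # 0.0
```

The recency value of two sentences here is an assumed illustration; the thesis sets this parameter manually rather than specifying a particular value.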
Because of the use of the synonym relation, word sense ambiguity becomes a potential
problem. Because such sense disambiguation is beyond the scope of the work described in this
thesis, the implementation of this method defaults to the most common sense of the word in
WordNet for determining the list of synonyms of the verb.
This algorithm provides the most use when ambiguities in the input can be resolved by
the fact that the missing argument performs an action that is similar to the action performed by a
previous proper entity. This can be seen in the following example:
Two F-14 conducted a WAS against Kara. Conducted SSC to locate hostile ship
and attacked with missiles.
In this example, the correct coreferent of the missing argument in the second sentence is
F-14. This is intuitive because in the previous sentence, the F-14 performs the action conducted
and the missing argument performs the same action.
This algorithm will therefore correctly
disambiguate between F-14 and Kara as to which is the appropriate coreferent to the missing
argument.
2.3 CHAPTER SUMMARY
The structure of the missing argument module is oriented to accept as input the semantic
frame output of the CCLINC system. By identifying possible coreferents, and by keeping records
of previous possible coreferents, the system is capable of applying distinct methods that
individually weight each possible coreferent of each missing argument.
These distinct methods are categorized into seven components of the missing argument
referent determination and replacement module.
Each of these seven methods is capable of independently supplying a weight to each possible
coreferent of a missing argument based on a unique factor that contributes to the accurate
identification of the referents of missing arguments.
A combining algorithm merges the weights outputted by these methods into a single resultant
weight. The optimal combination of these weights will ideally enable the module to identify an
appropriate coreferent of a missing argument. The entity with the highest final weight is guessed
as being an actual coreferent of the missing argument. Once the coreferent of a missing argument
is guessed, the module inserts the appropriate form of that coreferent into the location of the
missing argument, and proceeds to the next semantic frame.
CHAPTER 3
SYSTEM TRAINING
For the missing argument referent determination and replacement system to best
recognize coreferents of missing arguments, many of the individual methods described in chapter
two require training.
Once the individual algorithms are trained, system training must be
performed in order to determine the appropriate weight of the various outputs of the individual
algorithms. By training the system in this manner, an estimation can be made of the likelihood
that a given possible coreferent of a missing argument is an actual coreferent.
The purpose of this chapter is to describe the method used for combining the outputs of
the individual algorithms. The reason why this method is appropriate for estimating the likelihood
that a possible coreferent of a missing argument is an actual coreferent of that missing argument
is also explained. The results from applying the individual algorithms to the training data are
presented in this chapter as well.
3.1 A METHOD FOR COMBINING WEIGHTS
In order to determine a method for combining the weights assigned to a possible
coreferent by the individual methods described in chapter two, it is first necessary to examine the
task that the system should be performing. Ideally, the system should determine the exact
probability that a given possible coreferent is an actual coreferent. With a perfect system, a
probability of one would be assigned to the correct referent, and a probability of zero would be
assigned to all of the other entities considered. The actual system, however, will only be able to
estimate the probabilities according to the outputs of the individual algorithms. Consequently, it
is desirable for the system to assign a weight to each possible coreferent that is proportional to the
probability that that possible coreferent is a correct coreferent, based on the knowledge available
to the system.
In order to analyze the appropriate method for combining the weights assigned by the
various methods, it will be useful to assume independence among the weights assigned by each of
the seven methods. This assumption provides for a simpler method for determining the optimal
combination of the weights generated for the various factors as described in chapter two. This
combination becomes simpler under the assumption of independence because adjustments of the
overall weight assigned by the system for a given possible coreferent can be made on the basis of
each method separately. This simplification and consequent mathematical separation is shown
through the use of equations 3-1 through 3-5 below.
To determine the weight of each individual method, first notice that equation 3-1 holds
true for a single algorithm, A, because of Bayes' law, where Pos is a possible coreferent, Act is
the actual coreferent, Weight(A) is the weight assigned to Pos by the given algorithm A, and K is
some constant:

EQUATION 3-1:

Pr[ Pos = Act | Weight(A) = K ]
    = Pr[ Pos = Act ∩ Weight(A) = K ] / Pr[ Weight(A) = K ]
Equation 3-2 extends the rule to all seven algorithms, A1 through A7, and corresponding
constants K1 through K7.

EQUATION 3-2:

Pr[ Pos = Act | Weight(Ai) = Ki ∀ i ]
    = Pr[ Pos = Act ∩ Weight(A1) = K1 ∩ Weight(A2) = K2 ∩ ... ∩ Weight(A7) = K7 ]
      / Pr[ Weight(Ai) = Ki ∀ i ]
(A more accurate system would take into consideration that the weights outputted by each
individual algorithm might not be independent. Such an analysis, and the construction of a
corresponding weight-combination algorithm, are beyond the scope of the material presented
here.)
Because of the assumed independence, the following equations 3-3 and 3-4 hold true:

EQUATION 3-3:

Pr[ Pos = Act ∩ Weight(A1) = K1 ∩ Weight(A2) = K2 ∩ ... ∩ Weight(A7) = K7 ]
    = ( Pr[ Pos = Act ∩ Weight(A1) = K1 ] * Pr[ Pos = Act ∩ Weight(A2) = K2 ]
        * ... * Pr[ Pos = Act ∩ Weight(A7) = K7 ] ) / ( Pr[ Pos = Act ] )^6

EQUATION 3-4:

Pr[ Weight(Ai) = Ki ∀ i ] = ∏_i Pr[ Weight(Ai) = Ki ]
Substituting equations 3-3 and 3-4 into equation 3-2 results in equation 3-5:

EQUATION 3-5:

Pr[ Pos = Act | Weight(Ai) = Ki ∀ i ]
    = ( Pr[ Pos = Act ∩ Weight(A1) = K1 ] * Pr[ Pos = Act ∩ Weight(A2) = K2 ]
        * ... * Pr[ Pos = Act ∩ Weight(A7) = K7 ] )
      / ( ( Pr[ Pos = Act ] )^6 * ∏_i Pr[ Weight(Ai) = Ki ] )
Through the use of equation 3-5, the system can estimate the probability that a given
coreferent is an actual coreferent if it can estimate each of the factors in the second half of
the equation. Fortunately, each of these factors can be estimated if the system trains on a set
of data that is similar to the actual data and records appropriate values according to the
outputs of the individual algorithms for each coreferent of each missing argument.
More specifically, Pr[ Weight(Ai) = Ki ] can be estimated by dividing the number of times algorithm
Ai outputs Ki by the number of coreferents seen, for all values of i. The value Pr[ Pos = Act ] is
constant across all possible coreferents of a given missing argument and can consequently be
disregarded, as long as it is disregarded for all possible coreferents of that missing argument.
Pr[ Pos = Act ∩ Weight(Ai) = Ki ] can be estimated by dividing the number of times algorithm
Ai rates the actual coreferents with the value Ki by the total number of possible coreferents. The
division by the total number of possible coreferents can be disregarded as long as it is disregarded
for all possible coreferents, as the system only needs to maintain proportionality in equation 3-5,
rather than equality. This relaxation is acceptable because only the relative distances among the
assigned weights matter.
Once the above equations are estimated, the result is proportional (in estimation) to the
likelihood that a given possible coreferent of a missing argument is an actual coreferent of that
missing argument. The remaining training work requires running the system on the training data,
to evaluate Pr[ Pos = Act n Weight(Ai) = Ki ] and Pr[ Weight(Ai) = Ki ] for all possible Ai and
Ki. This can be done one algorithm at a time.
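The proportional score from equation 3-5 can be sketched in Python. The probability tables below are hypothetical stand-ins for the trained values of tables 3-1 through 3-7 (and cover only two algorithms), and the constant factor (Pr[ Pos = Act ])^6 is dropped, as the text permits:

```python
from math import prod  # Python 3.8+

# Hypothetical per-algorithm trained tables mapping an output weight K to
# the pair (Pr[Pos=Act ∩ Weight=K], Pr[Weight=K]).
TABLES = [
    {1.0: (0.025, 0.297), 0.0: (0.007, 0.158)},  # e.g. a distance method
    {1.0: (0.019, 0.049), 0.0: (0.047, 0.953)},  # e.g. a capability method
]

def combined_score(weights):
    """Score proportional to Pr[Pos = Act | the observed weights]."""
    joint = prod(TABLES[i][k][0] for i, k in enumerate(weights))
    norm = prod(TABLES[i][k][1] for i, k in enumerate(weights))
    return joint / norm

# The possible coreferent with the highest score is chosen as the referent.
scores = {"contacts": combined_score([1.0, 1.0]),
          "georgetown": combined_score([0.0, 0.0])}
print(max(scores, key=scores.get))  # 'contacts' under these made-up tables
```

Because the dropped factors are identical for every possible coreferent of a given missing argument, the argmax over these scores is unchanged, which is all the module requires.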
To determine the amount of training data that should be used, it is necessary to examine
the change in the values for each algorithm as more training data is accumulated by the system.
As the system trains on more and more data, the values assigned by the above equations for each
individual method should level off. In other words, the system should be trained on enough data
so that the addition of more training data changes the constants set by training only minimally.
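This leveling-off criterion can be sketched as a simple stopping check; the tolerance value below is an assumption for illustration, not one specified in the text:

```python
def estimates_converged(prev_estimates, new_estimates, tol=0.01):
    """True when adding the latest batch of training data moved every
    trained probability estimate by less than `tol`."""
    return all(abs(p - n) < tol
               for p, n in zip(prev_estimates, new_estimates))

print(estimates_converged([0.025, 0.297], [0.024, 0.295]))  # True
print(estimates_converged([0.025, 0.297], [0.060, 0.200]))  # False
```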
3.2 RESULTS FROM TRAINING THE ALGORITHMS
For the training performed by the system, the total number of sentences used for the
MUC-II domain was 647. These sentences comprised 105 paragraphs, and included 41 instances
of missing arguments.
The analysis of training that appears in this chapter is based on the implementation of the
missing argument module. The present missing argument module is built into an older version of
the CCLINC system than that described in chapters one and two.
It is also important to note that the original form of the MUC-II sentences and paragraphs
was different than the form used by the system. The data that the missing argument module uses
went through two steps of modification. In the first step, the sentences were manually extracted
from telegraphic messages and set into a paragraph form with one paragraph per message. From
this state, the sentences were altered so that their meaning could be understood by the older
version of the CCLINC system. This transformation included removing the headers from the
paragraphs as well as altering the sentences so that they could be parsed by the system. In
general, the alterations were kept to a minimum. When they were necessary, they were made by
removing the cause of the parse failure while minimally changing the general meaning of the
paragraph. In some cases, the consequence of this was that large sentences were broken up into
two smaller sentences to allow parsing. In other instances of parse failure, sentence fragments
that did not contribute greatly to the meaning of a paragraph and did not parse correctly were
removed completely. If it was necessary to excessively alter a paragraph in order for the system
to parse the sentences, the entire paragraph was removed from the data.
The Sentence and Paragraph Distance algorithm requires training on a set of data with
the appropriate missing arguments marked.
Table 3-1 shows the results from training the
sentence and paragraph system on the MUC-II training data. The last two fields are the values
that the system will place in equation 3-5 when the system runs on the actual data in order to
estimate the probability of a given possible coreferent of a specific missing argument.
The Distance field in table 3-1 corresponds to the distances encountered during training
between the actual coreferents of the missing arguments and the missing arguments themselves.
For example, in the following discourse, the coreferent of the first missing argument would fall
in the category One Sentence, as the missing argument refers to the action of shooting down the
aircraft. The coreferent of the second missing argument would fall in the category Two Sentences
in table 3-1, as it refers to the aircraft, which is found two sentences prior to the missing argument.

The enemy ship shot down the aircraft. Considered hostile act. Was on a top-secret mission.
Training Results for the Sentence and Paragraph Distance Method on MUC-II Data

DISTANCE                   NUMBER OF     POSSIBLE        Pr[Pos=Act ∩   Pr[Weight=K]
                           OCCURRENCES   OCCURRENCES*    Weight=K]
One Sentence                   22            255             .025           .297
Two Sentences                   6            136             .007           .158
Three Sentences                 2             66             .002           .077
Four or More Sentences          3             57             .003           .067
More Than a Paragraph           5            277             .006           .323
Other**                        14             56             .016           .068

*  This number is based on the estimated number of possible occurrences per sentence.
** Other includes entities such as the listener or the speaker of the discourse.

TABLE 3-1: Training results for the Sentence and Paragraph Distance method on MUC-II data.
The Number of Occurrences field in table 3-1 is calculated by counting the number of
actual coreferents that were Distance from their corresponding missing argument in the training
data. The Possible Occurrences field is calculated by multiplying the estimated number of
possible coreferents per missing argument by the total number of missing arguments. This
estimation was calculated by randomly sampling a portion of the training data and
determining the number of possible coreferents of missing arguments that occurred Distance
away from each missing argument in that portion of the training data. This number was then
extrapolated to the entire data set. Pr[ Pos = Act ∩ Weight = K ] was calculated by dividing the
Number of Occurrences field by the total number of possible coreferents of missing arguments.
The rightmost field, Pr[ Weight = K ], was calculated by dividing the Possible Occurrences field
by the total number of possible coreferents of missing arguments in the training discourse.
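The two probability fields reduce to simple divisions by the total number of possible coreferents. In the sketch below, the counts are taken from the One Sentence row of table 3-1, while the total of 859 possible coreferents is a hypothetical figure for illustration:

```python
def probability_fields(num_occurrences, possible_occurrences, total_possible):
    """Compute the two table fields:
    (Pr[Pos=Act ∩ Weight=K], Pr[Weight=K])."""
    return (num_occurrences / total_possible,
            possible_occurrences / total_possible)

# One Sentence row of table 3-1, hypothetical total of possible coreferents.
p_joint, p_weight = probability_fields(22, 255, 859)
print(round(p_joint, 3), round(p_weight, 3))
```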
The Coreferent Grammatical Function method and the Verb Form Correlation method
also require training on a set of data. The Number of Occurrences and the Possible Occurrences
fields as seen in tables 3-2 and 3-3 are the results of this training. The last two fields of these
tables correspond with the values that will be entered into equation 3-5 depending upon the
grammatical function of the given possible coreferent of a missing argument in the actual data.
(15 out of a total of 58 instances of missing arguments were used in the random sample.)
Training Results for the Grammatical Function Method on MUC-II Data

GRAMMATICAL FUNCTION   NUMBER OF     POSSIBLE        Pr[Pos=Act ∩   Pr[Weight=K]
                       OCCURRENCES   OCCURRENCES*    Weight=K]
Subject                    26            205             .030           .239
Object                      5            128             .006           .149
Main Verb                   0            184             .000           .215
Other Noun                  4            214             .005           .250
Other Verb                  1             67             .001           .078
Other**                    14             58             .016           .068

*  This number is based on the estimated number of possible occurrences per sentence.
** Other includes entities such as the listener or the speaker of the discourse.

TABLE 3-2: Training results for the Coreferent Grammatical Function method on MUC-II data.
In table 3-2, the Grammatical Function column corresponds with the grammatical
function of the various entities in the training data with respect to each missing argument. For
example, in the example given previously, each of the actual coreferents performs a different
grammatical function:

The enemy ship shot down the aircraft. Considered hostile act. Was on a top-secret mission.

The referent of the first missing argument corresponds with shot down, and would fall
into the category Main Verb in table 3-2, as shot down is the action that the subject is taking in
the sentence. The referent of the second missing argument in the above example corresponds to
aircraft and would fall into the category Object in the table. Enemy ship would also be
considered a possible coreferent of the two missing arguments, and would fall in the category
Subject, as it is the subject of the sentence.
In table 3-3, the Possible Coreferent Category and Verb Form column corresponds to
the category of each possible coreferent and the verb form in the sentence of the missing
argument as determined by the Verb Form Correlation method explained in chapter two. (It
should be noted that all of the numbers in the table are with respect to each missing argument;
each entity is counted once for each missing argument of which it is a possible coreferent.) In
the example paragraph, each of the missing arguments corresponds to a different verb form in
table 3-3.
The enemy ship shot down the aircraft. Considered hostile act. Was on a top-secret mission.

The action that the first missing argument takes, considered, is in the passive tense.
Consequently, all possible coreferents counted in table 3-3 for this missing argument would fall
in a row that has the form Passive. To look at a specific example, the enemy ship is considered
a possible coreferent of that missing argument, and would fall into the coreferent category
Action Performer. Consequently, the enemy ship, as a possible coreferent of the first missing
argument in the example paragraph, would fall into the category Action Perf. / Passive.
Training Results for the Verb Form Correlation Method on MUC-II Data

POSSIBLE COREFERENT         NUMBER OF     POSSIBLE        Pr[Pos=Act ∩   Pr[Weight=K]
CATEGORY AND VERB FORM      OCCURRENCES   OCCURRENCES*    Weight=K]
Speaker / Pluperfect             7             10             .012           .017
Speaker / Present                2              2             .003           .003
Speaker / Passive                0             11             .000           .018
Speaker / Other                  1             18             .002           .030
Action Perf. / Pluperfect        3             30             .005           .050
Action Perf. / Present           0              3             .000           .005
Action Perf. / Passive           7             33             .012           .054
Action Perf. / Other            16             54             .026           .089
Other / Pluperfect               0             25             .000           .041
Other / Present                  0              2             .000           .003
Other / Passive                  4             28             .007           .046
Other / Other                    1             48             .002           .079

* This number is based on the estimated number of possible occurrences per sentence.

TABLE 3-3: Training results for the Verb Form Correlation method on MUC-II data.
Each of the other columns in tables 3-2 and 3-3 corresponds to the same calculation as
those in table 3-1. The Number of Occurrences for these tables corresponds with the number
of times an actual coreferent of a missing argument falls in the category corresponding with the
first column. The Possible Occurrences column corresponds with the number of times any
possible coreferent of a given missing argument falls into the category given in the first column.
The last two columns are calculated by dividing Number of Occurrences and Possible
Occurrences, respectively, by the total number of possible occurrences of all of the missing
arguments in the training data.
The result of the individual training of the Coreferent Action Capability method is a
lengthy list of noun-verb pairs. Included in each noun-verb entry in the output of the data is
whether or not the verb was in the passive tense. To determine the appropriate weight of the
Coreferent Action Capability algorithm, then, the system used the noun-verb list produced from
training to determine if a given possible coreferent of a missing argument could perform the same
action that the missing argument performed. The results of this training can be seen in table 3-4.
The YES row corresponds to possible coreferents that are represented as a noun in the noun-verb
list with a corresponding action that is the same as that which the missing argument performs.
The NO row corresponds to those possible coreferents that are not found with a corresponding
verb in the list.

The result of training the Coreferent Description Capability method is similar to that of
the training of the Coreferent Action Capability method. Instead of a list of noun-verb pairs, the
training results in a list of noun-adjective pairs, where, ideally, the list encapsulates the relation
between nouns found in the discourse and the possible adjectives that can describe them.
Training Results for the Action Capability Method on MUC-II Data

Is the Possible Coreferent Capable of Performing    Pr[Pos=Act ∩    Pr[Weight=K]
the Action that the Missing Argument Performs?      Weight=K]
YES                                                     .019            .049
NO                                                      .047            .953

TABLE 3-4: Training results for the Coreferent Action Capability method on MUC-II data.
Training Results for the Description Capability Method on MUC-II Data

Is the Possible Coreferent Capable of Being
Described by an Adjective Similar to that which     Pr[Pos=Act ∩    Pr[Weight=K]
Describes the Missing Argument?                     Weight=K]
YES                                                     .008            .027
NO                                                      .060            .973

TABLE 3-5: Training results for the Coreferent Description Capability method on MUC-II data.
Training Results for the Noun Category Method on MUC-II Data

Could the Possible Coreferent Fall into the Same    Pr[Pos=Act ∩    Pr[Weight=K]
Noun Category as the Missing Argument?              Weight=K]
YES                                                     .003            .008
NO                                                      .064            .992

TABLE 3-6: Training results for the Coreferent Noun Category method on MUC-II data.
Consequently, table 3-5 is similar to table 3-4 in terms of its entries.
The YES row
corresponds to those possible coreferents of missing arguments that the system determined were
capable of being described by an adjective that also described the missing argument. The NO
row corresponds to those possible coreferents for which the system did not find a noun-adjective
entry in the description capability list.
Neither the Coreferent Noun Category method nor the Coreferent Action Recency
method requires individual training. Tables 3-6 and 3-7 show the correspondence between the
output of these two methods and the values that will be inserted into equation 3-5.
Training Results for the Action Recency Method on MUC-II Data

Was a Similar Action Recently Performed             Pr[Pos=Act ∩    Pr[Weight=K]
by the Possible Coreferent?                         Weight=K]
YES                                                     .007            .008
NO                                                      .060            .992

TABLE 3-7: Training results for the Coreferent Action Recency method on MUC-II data.
For tables 3-4, 3-5, 3-6, and 3-7, the entries were calculated from the training of the
system. Pr[ Pos = Act ∩ Weight = K ] was determined by dividing the number of actual
coreferents that were assigned a YES or a NO (depending upon the row the entry is in) by the
corresponding algorithm by the total number of possible coreferents in the training data. The
value Pr[ Weight = K ] was determined by dividing the number of possible coreferents that were
assigned a YES or a NO (depending upon the row the entry is in) by the corresponding algorithm
by the total number of possible coreferents in the training data.
3.3 TRAINING SUMMARY
In general, it is not possible to determine which of the seven algorithms is best for
determining a possible coreferent. It should be noted, however, that a few of the algorithms are
extremely accurate when they output a certain way. The Coreferent Action Recency algorithm,
for example, correctly identifies the coreferent of a missing argument with 87% accuracy (on the
training data); however, it is only applicable about ten percent of the time.

In general, the first three methods (Sentence and Paragraph Distance, Coreferent Grammatical
Function, and Verb Form Correlation) seem to be applicable to every possible coreferent, but do
not strongly correspond with the actual coreferent. The last four methods (Coreferent Action
Capability, Coreferent Description Capability, Coreferent Noun Category, and Coreferent Action
Recency), however, seem to be applicable to only a small number of possible coreferents; but,
when they are applicable, their output corresponds strongly to the correct coreferent.
3.4 CHAPTER SUMMARY
Under the assumption of independence among the outputs of the seven algorithms
developed in chapter two, a simple equation for estimating the optimal combination of the outputs
is realizable. By applying the seven methods to a set of training data, information is gathered to
optimize this equation based on the outputs of the methods. This optimization is derived from a
simple probabilistic analysis. Using this training to optimize the combination algorithm should
optimize the system for when it is running on actual data, assuming the training data is similar to
the actual data.
CHAPTER 4
CONCLUSION
Missing argument referent determination and replacement can be a useful technique for
certain telegraphic military domains, and may extend into other areas of language. A missing
argument referent determination and replacement module has been integrated into the CCLINC
translation system at Massachusetts Institute of Technology Lincoln Laboratory. With additional
work, the technique can be improved and extended to other areas of interest.
This chapter first presents an overview of the accomplishments achieved and the
obstacles encountered in the research work that is presented in this thesis. Also discussed are
areas for improvement upon the present state of the missing argument system, as well as a few
possible applications of the work. A review of the goal of the thesis is presented, along with a
brief thesis summary.
4.1 SUMMARY OF RESEARCH WORK
The purpose of the work presented in this thesis is to develop a missing argument referent
determination and replacement module, and integrate the module into the CCLINC system. The
completion of this task constitutes an advancement in the field of missing argument referent
determination and replacement, and in automated natural language understanding as a whole.
These tasks have been accomplished primarily through the development and integration
of seven separate methods into a single missing argument module. Some of the methods were
developed individually; others are based on previous work done by other computational linguists.
However, the achievement attained through the evolution of any one individual method is
dwarfed by the accomplishment of the integration of each method into the whole missing
argument module.
Nonetheless, each individual method is, in itself, a necessary and important component of
the missing argument referent determination and replacement system.
The Sentence and
ParagraphDistance method weights possible coreferents based on their distance from the
missing argument. The Coreferent Grammatical Function method weights possible coreferents
based on their grammatical function. The Verb Form Correlation method provides a weight
based on both the category into which the possible coreferent falls and the form of the verb
following the missing argument. The Coreferent Action Capability method bases its output on
whether or not a given possible coreferent is capable of performing the action taken by the
missing argument.
The Coreferent Description Capability method weights the possible
coreferent based on whether or not it can be described by the adjectives that describe the missing
argument. The Coreferent Noun Category method determines the weight assigned by evaluating
whether or not the possible coreferent can fall into the same noun category as the missing
argument. Finally, the Coreferent Action Recency method weights a coreferent based on whether
or not it is performing an action that is similar to the action that the missing argument is
performing. These seven methods provide the backbone of the system.
These methods are combined into the missing argument module using weights derived
from a mathematical evaluation of the results of training the system on a set of data. Once
the missing argument module is trained, it takes as input semantic frames generated by the
CCLINC system. It stores each entity represented in the semantic frame input in a
separate data structure. When the module detects a missing argument in a semantic frame input,
each component of the missing argument module weights the entities that are considered
possible coreferents. These weights are combined based on the results from the training of the
system, and each possible coreferent of that missing argument is assigned an overall weight.
Then, the system chooses the possible coreferent with the highest overall weight for a given
missing argument as the appropriate coreferent of that missing argument, and inserts an
appropriate form of this entity into the location of the missing argument. The module then
proceeds with the next semantic frame input.
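The selection step described above can be sketched as follows. This is an illustrative reconstruction, not the thesis's implementation: the function names are hypothetical, each method is modeled as a scoring function over candidates, and the per-method combination weights stand in for the values learned during training.

```python
# Sketch of the weighted coreferent-selection step: each method weights
# every candidate, the trained combination weights scale each method's
# contribution, and the highest-scoring candidate is chosen.

def choose_coreferent(candidates, methods, method_weights):
    """Return the candidate entity with the highest overall weight.

    candidates:     entities stored from earlier semantic frames
    methods:        scoring functions, one per weighting method
    method_weights: per-method combination weights from training
    """
    best, best_score = None, float("-inf")
    for entity in candidates:
        score = sum(w * m(entity)
                    for m, w in zip(methods, method_weights))
        if score > best_score:
            best, best_score = entity, score
    return best
```

The chosen entity would then be inserted, in an appropriate form, at the location of the missing argument before the module proceeds to the next semantic frame.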
4.2 FUTURE WORK
As with many research systems, there is no conclusive stopping point to the missing
argument referent determination and replacement research. The system can be improved by
increasing accuracy or adding functionality. The purpose of this section is to point out some of
the strings that have been reluctantly left untied.
Clearly, as explained in chapter two, the way the algorithms are integrated in the missing
argument module of the CCLINC system allows for further algorithmic development. A new
algorithm merely needs to be developed and added to the system in the same way as all of the
other algorithms. The only requirement on the algorithm is that it weight every possible
coreferent of a given missing argument.
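That integration contract can be made concrete with a short sketch. The class and method names here are hypothetical; the only property it illustrates is the one stated above, that any new algorithm need only assign a weight to every candidate coreferent.

```python
# Illustrative sketch of the plug-in contract: a new method is any
# function mapping a candidate coreferent to a numeric weight, registered
# alongside a combination weight. All names are hypothetical.

class MissingArgumentModule:
    def __init__(self):
        self.methods = []  # (scoring function, combination weight) pairs

    def add_method(self, score_fn, combination_weight=1.0):
        """Register a method; score_fn(candidate) -> float."""
        self.methods.append((score_fn, combination_weight))

    def overall_weight(self, candidate):
        """Combine every registered method's weight for one candidate."""
        return sum(w * fn(candidate) for fn, w in self.methods)

# A new algorithm is added the same way as the existing seven:
module = MissingArgumentModule()
module.add_method(lambda cand: 1.0 if cand.istitle() else 0.5)
```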
The concept of dynamically training the system is briefly mentioned in chapter three.
Although dynamic training is potentially complex, the simple mathematical model used in
combining the various algorithms may provide a simple way to implement it. Dynamic training
would be extremely advantageous, as one major pitfall of the present missing argument system is
that it requires training on a set of data in which the correct coreferents of the missing
arguments have been manually marked. Removing this requirement would be useful indeed.
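One way such dynamic training might look is sketched below. The thesis does not specify an update rule, so the simple delta-rule update here is an assumption: after each resolution is confirmed correct or incorrect, the combination weights are nudged toward or away from the methods that scored the chosen candidate highly.

```python
# Hypothetical sketch of dynamic (online) training for the combination
# weights. The delta-rule update and learning rate are assumptions for
# illustration, not the thesis's mathematical model.

def update_weights(weights, method_scores, correct, learning_rate=0.1):
    """Adjust per-method combination weights after one resolution.

    weights:       current per-method combination weights
    method_scores: each method's weight for the chosen candidate
    correct:       whether the chosen coreferent was confirmed correct
    """
    sign = 1.0 if correct else -1.0
    return [w + sign * learning_rate * s
            for w, s in zip(weights, method_scores)]
```

A scheme along these lines would let the system refine its combination weights during use, rather than relying solely on a manually marked training set.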
Integrating a word sense disambiguation technique that would allow WordNet to reliably
create the correct list of synonyms and hypernyms for the methods that use these relations would
also improve the system. Such a technique would improve the Coreferent Action Capability,
Coreferent Description Capability, Coreferent Noun Category, and Coreferent Action Recency
methods.
The work presented in this thesis has focused primarily on resolving missing argument
ambiguities in sentences where only the argument is missing; in general, the methods do not
work well for stand-alone noun phrases. One possibility for future work would be to direct
attention to these constructs.
The techniques used in the missing argument referent determination and replacement
module are applicable to fields of research beyond translation. For example, applications in
which speech is converted to text by a computer often encounter noise. Whether this noise is
electrical, vocal, or static, portions of the speech may be unintelligible. Techniques developed
from this research could potentially recover the unintelligible portions of the input.
A similar application exists in scanning text from hard copies and identifying the actual
letters and words. With the advent of computers, there is a driving force to provide electronic
versions of all hard copy work, in part to allow useful texts to be accessible via the Internet. To
store the hard copies efficiently, however, it is necessary to convert the scanned images into
ASCII text. Even the best image-to-ASCII converters are only so accurate. One possible
improvement would be to incorporate a missing argument referent determination module into
the system so that unintelligible or questionable conversions can be verified.
4.3 THESIS SUMMARY
In chapter one, four goals of the thesis were presented. Automated natural language
understanding, automated natural language translation, and missing argument referent
determination and replacement were presented in chapter one. A description of the seven
methods for missing argument replacement and referent determination was given in chapter two.
Chapter three described the incorporation of these seven methods into a module of the CCLINC system.
Finally, this chapter presented conclusions from the implementation, as well as a view into the
future work of missing argument referent determination and replacement at MIT Lincoln
Laboratory in general.
BIBLIOGRAPHY
1. Allen, James. Natural Language Understanding, Second Edition. Benjamin/Cummings
Publishing Company Inc.: Redwood City, California, 1995.
2. Dahl, Deborah and Catherine Ball. Reference Resolution in PUNDIT. Unisys Corporation:
Chapter 8 in Logic and Logic Grammars for Language Processing, Saint-Dizier,
Patrick and Stan Szpakowicz, editors: From the Ellis Horwood Series in AI.
3. Fromkin, Victoria and Robert Rodman. An Introduction to Language, Sixth Edition.
Harcourt Brace College Publishers: Fort Worth, Texas, 1998.
4. Hutchins, John, and Harold Somers. An Introduction to Machine Translation. Academic
Press Limited: London, England, 1992.
5. Hwang, Jung-Taik. A Fragmentation Technique for Parsing Complex Sentences for
Machine Translation. Massachusetts Institute of Technology Lincoln Laboratory:
Lexington, Massachusetts, 1997.
6. Lee, Young-Suk, Clifford Weinstein, Stephanie Seneff, and Dinesh Tummala. Ambiguity
Resolution for Machine Translation of Telegraphic Messages. Proceedings of the 35th
Annual Meeting of the Association for Computational Linguistics: Madrid, Spain, 1997.
7. Lin, Dekang. Description of the PIE System Used for MUC-6. Department of Computer
Science, University of Manitoba: Manitoba, Canada, 1996.
8. Miller, George, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert Thomas.
Using a Semantic Concordance for Sense Identification. Cognitive Science
Laboratory: Princeton University, Princeton, 1994.
9. Palmer, Martha, Deborah Dahl, Rebecca Schiffman, Lynette Hirschman, Marcia
Linebarger, and John Dowding. Recovering Implicit Information. Unisys Corporation.
Initial printing; Proceedings of ACL, 1986.
10. Park, Hyun, Dania Egedi, and Martha Palmer. Recovering Empty Arguments in Korean.
Institute for Research in Cognitive Science, University of Pennsylvania: Philadelphia,
Pennsylvania, 1995.
11. Seneff, Stephanie. TINA: A Natural Language System for Spoken Language Applications.
Laboratory for Computer Science, Massachusetts Institute of Technology: Cambridge,
Massachusetts, 1992.
12. Seneff, Stephanie, Dave Goddeau, Christine Pao, and Joe Polifroni. Multimodal
Discourse Modeling in a Multi-User Multi-Domain Environment. Laboratory for
Computer Science, Massachusetts Institute of Technology: Cambridge, Massachusetts,
1996.
13. Tummala, Dinesh, Stephanie Seneff, Douglas Paul, Clifford Weinstein, and Dennis
Yang. CCLINC: System Architecture and Concept Demonstration of Speech-to-Speech
Translation for Limited-Domain Multilingual Applications. Proceedings of the 1995
ARPA Spoken Language Technology Workshop. Massachusetts Institute of Technology
Lincoln Laboratory: Lexington, Massachusetts, 1995.
14. Walker, Marilyn, Masayo Iida, and Sharon Cote. Japanese Discourse and the Process of
Centering. The Institute for Research in Cognitive Science, University of Pennsylvania:
Philadelphia, Pennsylvania, 1992.
15. Weinstein, Clifford, Young-Suk Lee, Stephanie Seneff, Dinesh Tummala, Beth Carlson,
John T. Lynch, Jung-Taik Hwang, and Linda Kukolich. Automated English/Korean
Translation for Enhanced Coalition Communications. Lincoln Laboratory Journal:
Volume 10, Number 1, Lexington, Massachusetts, 1997.