MISSING ARGUMENT REFERENT IDENTIFICATION IN NATURAL LANGUAGE

by Jeffrey Foran

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the Degrees of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, February 1999.

© Jeffrey Foran, MCMXCIX. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part, and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, February 3, 1999

Certified by: Clifford J. Weinstein, Group Leader, MIT Lincoln Laboratory, Thesis Supervisor

Certified by: Young-Suk Lee, Staff, MIT Lincoln Laboratory, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Departmental Committee on Graduate Studies

MISSING ARGUMENT REFERENT IDENTIFICATION IN NATURAL LANGUAGE

by Jeffrey Foran

Submitted to the Department of Electrical Engineering and Computer Science on February 3, 1999, in partial fulfillment of the requirements of the degrees of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT

Missing arguments are common in spoken language and in telegraphic messages. Identifying the correct linguistic interpretation of a sentence with a missing argument often requires identifying a coreferent of that missing argument. Consequently, automated natural language understanding systems that do not correctly identify a replacement for missing arguments within a sentence will often derive an incorrect semantic or pragmatic interpretation of that sentence. The solution lies in identifying a correct referent of the missing argument and replacing the missing argument with that referent before generating the output semantic representation. The technical challenge remains in calculating the appropriate referent of the missing argument. This thesis describes methods for automatically identifying coreferents of missing arguments within a discourse before the completion of semantic and pragmatic evaluation, allowing for the insertion of these referents into the discourse in a manner which resolves ambiguities while minimizing the number of inaccurate insertions. Applying these methods to a discourse with multiple missing arguments results in a discourse that is more comprehensible to both man and machine.

THESIS SUPERVISOR: Clifford J. Weinstein
TITLE: Group Leader, MIT Lincoln Laboratory

THESIS SUPERVISOR: Young-Suk Lee
TITLE: Staff, MIT Lincoln Laboratory

ACKNOWLEDGMENTS

I would first like to thank Young-Suk Lee, my thesis advisor, who has directly guided me through the past year of my research. The many discussions that we have had over this last year concerning my research have been invaluable to my experience as a student. Her drive and her love of her work have rubbed off on me, and I now look forward to working in the exciting field of computational linguistics.

Secondly, I would like to thank Clifford Weinstein, without whom I would never have been able to work on this project.

I would also like to thank Linda Kukolich, who has provided an invaluable and unforgettable atmosphere during the past several months as an office mate. I could not hope for a more congenial and understanding friend to share a workspace. No neurons were wasted in our many discussions on many topics.
Next, I would like to thank the many other people in Group 49 with whom I have had the pleasure of working and interacting. I would especially like to thank the members of the Translation Project whose work provided the foundation for this thesis.

I would like to thank my closest friends and colleagues who have consistently provided a unique and positive college experience. Working, playing, partying and arguing with you have made my life enjoyable. Without you, my studies at MIT would have been a trivial accomplishment. I also thank MIT for maintaining such an intriguing and worthwhile atmosphere for learning and for growing up. I have accomplished more at MIT than I ever thought was mentally and physically possible.

I must also thank Beer, as it has kept me sane through my extreme experiences at MIT. And in thanking Beer, I thank Cornwalls and its kind and generous staff as well; may it serve good beer for the rest of eternity.

I finally thank my parents, as I would be nothing without them. They always guided me to become whatever I dreamed, and now I am here. I thank you.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 Missing Argument Referent Identification
   1.2 The CCLINC System
   1.3 Research Goals
   1.4 Thesis Goals
   1.5 Research Domain
   1.6 Chapter Summary
2. SYSTEM COMPONENTS
   2.1 Structure of the Missing Argument Module
   2.2 Seven Components of the System
       2.2.1 Sentence and Paragraph Distance
       2.2.2 Coreferent Grammatical Function
       2.2.3 Verb Form Correlation
       2.2.4 Coreferent Action Capability
       2.2.5 Coreferent Description Capability
       2.2.6 Coreferent Noun Category
       2.2.7 Coreferent Action Recency
   2.3 Chapter Summary
3. SYSTEM TRAINING
   3.1 A Method for Combining Weights
   3.2 Results from Training the Algorithms
   3.3 Training Summary
   3.4 Chapter Summary
4. CONCLUSION
   4.1 Summary of Research Work
   4.2 Future Work
   4.3 Thesis Summary

LIST OF FIGURES

FIGURE 1-1 CCLINC Translation System Overview
FIGURE 1-2 CCLINC Parse Tree
FIGURE 1-3 CCLINC Semantic Frame
FIGURE 2-1 Semantic Frame with Missing Argument
FIGURE 2-2 Semantic Frame with Empty Placeholder

LIST OF TABLES

TABLE 3-1 Training Results for the Sentence and Paragraph Distance Method
TABLE 3-2 Training Results for the Coreferent Grammatical Function Method
TABLE 3-3 Training Results for the Verb Form Correlation Method
TABLE 3-4 Training Results for the Coreferent Action Capability Method
TABLE 3-5 Training Results for the Coreferent Description Capability Method
TABLE 3-6 Training Results for the Coreferent Noun Category Method
TABLE 3-7 Training Results for the Coreferent Action Recency Method

CHAPTER 1
INTRODUCTION

Automated natural language understanding is the basis of present-day, state-of-the-art machine translation systems. One approach to machine translation is an interlingua based approach as presented in An Introduction to Machine Translation, by Hutchins and Somers [4]. In this approach, the input language is first translated into a language neutral meaning representation, which is then translated into the target language.
The Common Coalition Language System (CCLINC) at Massachusetts Institute of Technology Lincoln Laboratory is one such system. The integration of a missing argument referent identification and replacement module into such a system has the potential to enhance the system's translation accuracy on discourses with missing arguments. This potential exists because the missing argument module can resolve some of the ambiguity inherent to these discourses.

This chapter introduces automated missing argument referent identification and its orientation to the CCLINC system. It then provides an overview of the CCLINC system. After the overview of the CCLINC system, the goals of the research described in this thesis are given, as well as the goals of the thesis itself. Finally, the domain on which this thesis and the corresponding research work was implemented is described. In general, this chapter introduces the concept of missing argument referent identification and explains why it may be an appropriate extension of the CCLINC translator.

1.1 MISSING ARGUMENT REFERENT IDENTIFICATION

In order to best perform natural language translation, it is imperative to understand the meaning of the input language. In order to accurately understand a sentence within a discourse, it is often necessary to extract information that is found external to the sentence in question. For example, if the sentence, Webster received a hit from one, appeared in a discourse, and its translation was attempted without taking into account its meaning within the discourse, the translation output could be incorrect. The meaning of this sentence becomes clear when its context in the discourse is examined:

    Webster sighted three torpedoes. Webster received a hit from one.

The ambiguity in the sentence, Webster received a hit from one, arises from the fact that it cannot be intuitively determined that Webster is a ship by examining the sentence by itself. Thus, determining an accurate meaning representation of a given sentence can require identifying the context in which the sentence occurs.

Missing arguments (missing subjects or objects within a sentence) often present ambiguities similar to those seen in the example given above. This similarity can be easily recognized by examining the effects of removing the argument, Webster, from the initial sentence. In the same sense that the initial sentence was ambiguous because of the uncertain meaning of the word, Webster, the new sentence with Webster omitted is ambiguous because it is difficult to discern the precise meaning of the sentence when examined by itself. Determining that a ship is the appropriate coreferent of this missing argument helps to disambiguate the meaning of the sentence, as knowing that a ship received a hit from one enables the identification of the meanings of the words received, hit, and one. Consequently, identifying the coreferents of missing arguments within a discourse allows for a more accurate understanding of an input sentence.

One of the applications of automated natural language understanding is machine translation. It has recently become clear that accurate machine translation of natural language is difficult without adequate natural language understanding. Since much of machine translation relies on accurately understanding an input discourse, the improved ambiguity resolution that missing argument referent determination accomplishes may lead to an improvement in the accuracy of machine translation.
To recapitulate, missing argument referent determination is the process of ascertaining the appropriate entity that would fit in place of a missing argument in a given discourse in order to resolve some of the ambiguity inherent to sentences with missing arguments. Automated natural language understanding and, more specifically, the CCLINC system's method of interlingual machine translation has driven the research of missing argument referent determination described in this thesis. By incorporating a missing argument referent determination and replacement module into a machine translation system such as the CCLINC system, certain ambiguities that arise due to missing arguments can be resolved, potentially resulting in improved machine translation accuracy.

1.2 THE CCLINC SYSTEM

The Information Systems Technology Group at MIT Lincoln Laboratory has developed an automated English to Korean and Korean to English translation system, called the Common Coalition Language System, or the CCLINC system¹. A primary goal of the CCLINC system is to provide enhanced C4I² for military forces by providing an aid for translation. The interlingual approach used by the CCLINC system allows for a simple extension of the system in order to translate other languages. The recent development of the system, however, has been focused on performing translation between English and Korean.

¹ Continued work and improvement upon the CCLINC system is in progress.
² Command, Control, Communications, Computer, and Intelligence.

The CCLINC system utilizes a semantic frame language as an interlingual representation of natural language. All input texts are first translated into this semantic frame language through the use of a language understanding module. After the semantic frame representation of the input sentence is created, the system hands the work off to a generation module, which then produces the output of the system. Figure 1-1 is a graphical representation of the CCLINC system structure, as provided in Automated English / Korean Translation for Enhanced Coalition Communications by Weinstein, Lee, Seneff, Tummala, Carlson, Lynch, Hwang, and Kukolich [15]. Because of the interlingual approach that the CCLINC system takes, the system can be extended to any new language by constructing the appropriate lexicon and grammar for a given language, and by adapting that lexicon and grammar into the language understanding and generation modules.

[FIGURE 1-1: CCLINC translation system overview. English text or speech and Korean text or speech are connected through understanding and generation modules to the semantic frame (Common Coalition Language).]

When given an input sentence, the understanding module of the CCLINC system first syntactically labels the input sentence. This syntactic labeling consists of identifying the structural representation of individual words and phrases through the creation of a hierarchical tree called a 'parse tree'. Figure 1-2 is an example of the parse tree that CCLINC would create for the input sentence, the man went to the store.

[FIGURE 1-2: Example CCLINC parse tree output for the sentence, "the man went to the store." The tree descends from sentence and full-parse nodes through statement, subject, and predicate nodes to the noun phrase and verb phrase constituents covering the individual words.]

Once the parse tree for a given input sentence is built, the system generates the semantic frame. The semantic frame for, the man went to the store, is given in figure 1-3. The importance of the semantic frame is that, if it is constructed correctly, it captures the core meaning of the input sentence. Prior to the incorporation of missing argument referent determination and replacement, the CCLINC system created the semantic frames in a relatively context free manner with respect to the discourse.

    {c statement
       :topic {q person :name "man"
               :pred {p the :global 1} }
       :subject 1
       :pred {p go_direction_v :mode "past"
              :pred {p v_to_goal
                     :topic {q location :name "store"
                             :pred {p the :global 1} } } } }

FIGURE 1-3: CCLINC semantic frame construction for the sentence, the man went to the store.
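The nested structure of a frame can be made concrete with a short sketch. The following Python fragment is illustrative only: it models a frame in the spirit of figure 1-3 as nested dictionaries, with key names that echo the frame notation (:topic, :subject, :pred). CCLINC's frames are expressions in its own frame language, not Python objects, and the missing-subject test anticipates the convention visible in figure 2-1 of chapter two, where a clause with a missing subject carries :subject 1 but no :topic entry.

    # Illustrative only: a frame like figure 1-3 as nested Python dicts.
    frame = {
        "clause": "statement",
        "topic": {"class": "person", "name": "man",
                  "pred": [{"name": "the", "global": 1}]},
        "subject": 1,
        "pred": {"name": "go_direction_v", "mode": "past",
                 "pred": {"name": "v_to_goal",
                          "topic": {"class": "location", "name": "store",
                                    "pred": [{"name": "the", "global": 1}]}}},
    }

    def has_missing_subject(frame):
        # A clause marked :subject 1 with no :topic entry (cf. figure 2-1)
        # is one whose subject is missing.
        return frame.get("subject") == 1 and "topic" not in frame

    print(has_missing_subject(frame))  # False: "man" fills the topic slot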
1.3 RESEARCH GOALS

One of the domains that the CCLINC system has been designed to translate includes telegraphic military text with many missing arguments. Other domains that may eventually need to be translated may also have missing arguments. Translating individual sentences with missing arguments in a discourse independent manner, even if the missing argument is readily discernible from the input in the input language, can result in output sentences that are either semantically incorrect or that have missing arguments whose referents cannot be easily determined in the output language. Consequently, the ability to accurately determine and replace missing arguments during the creation of the intermediary semantic frame representation can potentially lead to a higher translation accuracy in the CCLINC system.

One goal of the work described in this thesis is the development of a module that uses an intermediary semantic frame representation of an input discourse to automatically and accurately discern the referents of missing arguments within that discourse. The application of this module will ideally alleviate some of the ambiguities that are caused by the occurrence of missing arguments within a discourse. This application could potentially enable better automated understanding, and consequently better machine translation, of input discourses with missing arguments.

A separate goal of the research work described in this thesis is related to the theoretical aspects of natural language understanding. By designing and developing a unique method to determine the referents of missing arguments in a discourse, the work will hopefully extend the field of missing argument referent determination in general.

1.4 THESIS GOALS

The goal of this thesis, then, is fourfold: to describe methods for automatic missing argument replacement and referent determination; to evaluate the reasons behind these methods; to describe the incorporation of these methods into a comprehensible module in the CCLINC system; and to derive conclusions from the theoretical and implementational work performed. The components of the missing argument referent determination and replacement module are discussed in chapter two. The elements of training and the explanation of the development of the missing argument module are explained in chapter three. Conclusions from the work, as well as a look into the future of automated missing argument replacement and referent determination, both with relation to the CCLINC system and separate from the CCLINC system, are found in chapter four.

1.5 RESEARCH DOMAIN

The work described in this thesis maintains a basis in the MUC-II military domain. The MUC-II data is a primary basis of the work described herein for three reasons. First, its accurate translation was the original interest of the CCLINC system. Second, it contains many instances of missing arguments.
Finally, sentences in the MUC-II domain have been readily available during the development of the thesis work. The following is a paragraph that provides an example of the data found in the MUC-II domain:

    Friendly CAP aircraft splashed hostile MIG-22 proceeding inbound to Independence at 95 NM. Last hostile acft in vicinity. Air warning red weapons tight. Remaining alert for additional attacks.

In this example, note that the last three sentences have missing arguments. By determining the exact antecedents for sentences such as these, the machine and human comprehensibility of the discourse can be improved.

It should be noted that MUC-II data cannot be directly published in this thesis. Consequently, proper names, times, distances, and locations have been altered in the presentation of this thesis only: test data and training data used by the system were not altered in this fashion.

1.6 CHAPTER SUMMARY

This thesis presents a new method for determining coreferents of missing arguments in an input discourse and replacing those missing arguments with their appropriate coreferents. The thesis also describes how this method is integrated into the CCLINC system at MIT Lincoln Laboratory. Finally, this thesis explains the results of this implementation, and provides insight into future work in this area.

CHAPTER 2
COMPONENTS OF THE MISSING ARGUMENT MODULE

Automatically identifying the appropriate referents of missing arguments requires a broad range of linguistic information. Advances in automated natural language understanding have allowed computers to make use of more and more linguistic information at higher and higher accuracies. As a consequence, the development of the missing argument referent determination system as described in this chapter is based on the present state of automated natural language understanding, and, more specifically, on the CCLINC translation system at MIT Lincoln Laboratory.

This chapter presents the structure of the missing argument referent determination and replacement module and explains how this module fits into the CCLINC system. The chapter then describes the seven components of the missing argument module that are used to identify the appropriate referents of missing arguments in a given domain. These components each incorporate a unique factor that contributes to the final identification of the referent of a missing argument. The seven factors that correspond to these seven components are Sentence and Paragraph Distance, Coreferent Grammatical Function, Verb Form Correlation, Coreferent Action Capability, Coreferent Description Capability, Coreferent Noun Category, and Coreferent Action Recency. The methods by which the seven components extract these factors from the input discourse form the basis of the missing argument referent determination and replacement module and are explained along with the description of the seven factors in this chapter.

2.1 STRUCTURE OF THE MISSING ARGUMENT MODULE

The basic structure of the missing argument referent determination and replacement module was formulated from the characteristics of natural language and from those of the CCLINC system itself. The CCLINC system has the characteristic that the intermediary stage of the system outputs a semantic frame representation of input sentences, on a sentence by sentence basis.
One characteristic of natural language is that it provides the means to identify the correct coreferent of a missing argument by the time the sentence that contains the missing argument is understood by the listener or reader. Because of these two characteristics, the missing argument module first takes as input a semantic frame that is output by the CCLINC system, while remembering aspects of previous semantic frames. Then, the missing argument module performs some computation on the semantic frames up to and including the most recent semantic frame, attempting to identify the appropriate referent of the missing argument in the present frame, if one exists. Finally, it produces a semantic frame output with the missing argument replaced by the entity that the module determined to be its correct referent.

Given that the missing argument module accepts a semantic frame as its input, specifying the method that the missing argument module uses to determine the appropriate referents of missing arguments within a discourse is relatively simple. If there is a missing argument within the present semantic frame, the missing argument module creates an 'empty' placeholder for that missing argument. The entity that the missing argument module identifies as the correct referent of the missing argument will eventually be inserted into that empty placeholder.

    {c statement
       :subject 1
       :pred {p engage_tr_v :mode "past"
              :topic {q artifact :name "aircraft"
                      :pred {p the :global 1}
                      :pred {p nn_mod2 :topic "red" :global 1} } } }

FIGURE 2-1: The semantic frame output of the CCLINC system on the phrase, engaged the red aircraft, before the insertion of an empty placeholder for the subject.

For example, for the phrase³ engaged the red aircraft, the missing argument module first accepts the output of the CCLINC system as its input, which, in this case, would resemble the semantic frame seen in figure 2-1. After the missing argument module inserts an 'empty' placeholder in the location of the missing argument, the semantic frame representation of engaged the red aircraft resembles figure 2-2.

³ Example sentences given in this thesis are not directly from the MUC-II domain (as introduced in chapter one) unless specifically noted. The example here is not a MUC-II sentence.

After this task is complete, the missing argument module creates a record for each entity in the present semantic frame that could possibly be a coreferent of a future missing argument. In the sentence, engaged the red aircraft, the missing argument, the action represented by the verb engaged, and the aircraft are all possible coreferents of future missing arguments. By examining the three sentences found in examples 2-1 through 2-3, the fact that these entities are all possible coreferents of future missing arguments becomes clear. The missing arguments in each of these three sentences, were they to follow, engaged the red aircraft, in the input discourse, would corefer with the missing argument, the 'engaging', and 'the aircraft' respectively⁴.

⁴ At first glance, the adjective 'red' may seem as if it could possibly be a coreferent of a future missing argument, but this is not the case in the English language. In English, pronouns and missing arguments cannot refer to adjectives. In general, possible missing arguments (and pronouns as well) corefer only with nominal entities: noun phrases and nominal forms of verb phrases.

    {c statement
       :topic {q empty-placeholder}
       :subject 1
       :pred {p engage_tr_v :mode "past"
              :topic {q artifact :name "aircraft"
                      :pred {p the :global 1}
                      :pred {p nn_mod2 :topic "red" :global 1} } } }

FIGURE 2-2: The semantic frame output of the CCLINC system on the phrase, engaged the red aircraft, after the insertion of an empty placeholder for the subject.
    Also engaged the submarine.

EXAMPLE 2-1: Example sentence whose missing argument corresponds with the missing argument of engaged the red aircraft.

    Resulted in a brutal destruction.

EXAMPLE 2-2: Example sentence whose missing argument corresponds with the act of engaging in the sentence engaged the red aircraft.

    Looked as if it were orange from our perspective, however.

EXAMPLE 2-3: Example sentence whose missing argument corresponds with the aircraft in the sentence engaged the red aircraft.

Once the records are created for each possible coreferent of future missing arguments, the records are updated with all of the pertinent information about the corresponding entity. The exact information that is inserted into these records depends upon the implementation of the seven components that comprise the system.

After the records in the present semantic frame are filled in, if a missing argument occurs within the present semantic frame, the missing argument module attempts to identify the correct referent of this missing argument. The missing argument module accomplishes this by sequentially applying seven methods (described below) to each record that corresponds to a potential coreferent of the present missing argument. Each of these methods weights each of the potential coreferents based on the unique features of each potential coreferent and on the characteristics of the missing argument itself.

Once all entities that are considered possible coreferents of the present missing argument are weighted by each of the seven methods, the missing argument module combines the output of the seven methods and assesses a final weight to each possible coreferent. The module then guesses that the highest-weighted possible coreferent is an actual coreferent of the missing argument.

Once the missing argument module chooses a final coreferent, the module inserts a copy of the chosen coreferent into the location of the missing argument in an appropriate lexical form. The module then outputs the semantic frame with the missing argument replaced by the chosen entity. This entire task is repeated until the end of the discourse is reached.
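The per-frame loop just described can be summarized in a short, self-contained Python sketch. Everything in it is a toy stand-in: frames are reduced to plain dictionaries, the seven methods to a list of scoring callables, and the seven weights are combined by a simple sum rather than by the combination method described in chapter three.

    # A toy rendering of the module's per-frame loop; not CCLINC code.
    def process_discourse(frames, methods):
        records = []  # possible coreferents seen so far in the discourse
        for frame in frames:
            if frame.get("subject") is None:       # missing argument found
                frame["subject"] = "EMPTY_PLACEHOLDER"
                scored = [(sum(m(r, frame) for m in methods), r)
                          for r in records]
                if scored:  # replace the placeholder with the best guess
                    frame["subject"] = max(scored, key=lambda s: s[0])[1]["name"]
            # every nominal entity in the frame may corefer with a later gap
            records += [{"name": n} for n in frame.get("nouns", [])]
            yield frame

    # Toy run: the second frame's missing subject is filled from the first.
    frames = [{"subject": "Intrepid", "nouns": ["Intrepid", "torpedoes"]},
              {"subject": None, "nouns": ["attack"]}]
    constant = lambda record, frame: 1.0   # stand-in weighting method
    for out in process_discourse(frames, [constant]):
        print(out["subject"])              # Intrepid, then Intrepid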
2.2 SEVEN COMPONENTS OF THE SYSTEM

Seven primary components of the missing argument module work together toward the goal of missing argument referent identification. These seven components are based on seven general linguistic factors that differentiate possible coreferents of missing arguments. These seven factors and the implementation of their corresponding methods are described in this section.

Four of these seven factors require the module to evaluate certain relations on words and word senses. For example, determining the Coreferent Action Recency factor (described in section 2.2.7) for a given possible coreferent requires identifying the synonyms of various verbs. The capability to produce various relations given a word or word sense is incorporated into the missing argument module through the use of a lexical and semantic resource called WordNet⁵. For the purposes of the missing argument module, WordNet provides natural language relations between words and word senses, and is a major contributor to the methods that evaluate the Coreferent Action Capability, the Coreferent Description Capability, the Coreferent Noun Category, and the Coreferent Action Recency factors (described in subsections 2.2.4 through 2.2.7).

⁵ WordNet v1.6, developed at the Cognitive Science Laboratory, Princeton, New Jersey. George Miller, Principal Investigator.

The application of the seven individual methods that evaluate the seven linguistic factors is a simple task for the missing argument module, as each method is applied separately. This, however, constrains each method to embody an algorithm that independently weights every possible coreferent of each missing argument within a discourse. Once the set of seven weights for each possible coreferent of a given missing argument is determined by the system, the system optimally combines the weights and outputs a total overall weight for each possible coreferent. While the seven methods that incorporate the seven linguistic factors are described below, the method used for this combination is described in chapter three.

2.2.1 SENTENCE AND PARAGRAPH DISTANCE

The first factor that can be used to weight possible coreferents of a missing argument is Sentence and Paragraph Distance. The method that encapsulates this factor is founded on concepts presented in Description of the PIE System Used for MUC-6, by Dekang Lin [7]. The basis of this factor is simple: those possible coreferents that are further back in the discourse from the missing argument are less likely to corefer with the missing argument. For example, one might expect that an actual coreferent of a missing argument is located in the previous sentence more often than it is located eight sentences back. In addition, it is expected that those possible coreferents of a missing argument that occur in a previous paragraph also have a lower probability of being the actual coreferent of that missing argument than those that occur in the same paragraph. In general, possible coreferents of missing arguments that are different distances from the missing argument should be weighted according to the probability that the actual coreferent is going to be found at that distance.⁶

⁶ The fact that not all coreferents have the possibility of existing a certain number of sentences prior to, and in the same paragraph as, the missing argument should also be taken into consideration. In the extreme, for example, we may determine that whenever it is possible for the actual coreferent of a missing argument to occur four sentences back, it always does. Consequently, the appropriate weight for a sentence distance of four would be lower than it should be, if this were not taken into consideration.

All that is left to determine is the likelihood that the actual coreferent is going to be found at each possible sentence distance. The appropriate weight for each sentence distance and paragraph distance can be determined using a simple statistical calculation. This method requires a training discourse with the missing arguments of the training discourse marked and their coreferents correctly filled in. To determine the appropriate weights for each of the possible sentence distances, the algorithm trains on this data and determines the percentage of actual coreferents found at each possible sentence distance in the training data.
Additionally, during this training, the percentage of actual coreferents that occur in the present paragraph versus previous paragraphs can be calculated. Pseudo-code for this training proceeds as follows:

    For each missing argument in the training data:
        increment counters SN[0] through SN[N], where there are N sentences
            preceding the missing argument and following the previous
            paragraph break
        increment counters PN[0] through PN[M], where there are M paragraphs
            preceding the missing argument and following the beginning of
            the discourse
        If the actual coreferent of the missing argument is in the same
        paragraph as the missing argument:
            SD <- the number of sentences between the missing argument and
                  its appropriate coreferent
            PC[0] <- PC[0] + 1
            SC[SD] <- SC[SD] + 1
        Else:
            PD <- the number of paragraphs between the missing argument and
                  its appropriate coreferent
            PC[PD] <- PC[PD] + 1
    Next missing argument

    Separately calculate:
        SperP <- average number of sentences per paragraph in the training
                 data

Note that after the training is complete, the following variables are set:

    SperP: Average number of sentences per paragraph in the training data.
    PN[X]: Number of times a paragraph existed that was X paragraphs prior
           to a missing argument.
    SN[X]: Number of times a sentence existed in the training discourse that
           was X sentences prior to a missing argument, in the same paragraph
           as that missing argument.
    PC[X]: Number of times the actual coreferent of a missing argument was
           found X paragraphs prior to the missing argument itself.
    SC[X]: Number of times the actual coreferent of a missing argument was
           found X sentences prior to the missing argument itself, given it
           was in the same paragraph as that missing argument.

Once the training is complete, the algorithm is ready to run on the actual data. The product of the paragraph distance weight for the present paragraph and the sentence distance weight can be used as the weight for possible coreferents that are located in the same paragraph as the missing argument at that sentence distance. The weighting of possible coreferents that are located in previous paragraphs ignores the sentence distance weights and instead uses the weights that are calculated for the paragraphs. This is because empirical data suggests that sentence distances in previous paragraphs minimally impact the probability that an actual coreferent is found in that previous paragraph. This combination of sentence distance and paragraph distance provides a reasonably accurate representation of the expected location of the actual coreferent of a missing argument, assuming the training data is a representative sample of the data that the system will be run on. When a missing argument is encountered in the actual data (and the referent is unknown), the algorithm weights each previous possible coreferent according to the percentages accumulated during training.
The following pseudo-code provides this function:

    For each possible coreferent C of a given missing argument MA:
        P <- number of paragraph breaks between C and MA
        S <- number of sentences between C and MA (same sentence = 0)
        If P = 0:
            SPDWeight <- (PC[0]/PN[0]) * (SC[S]/SN[S])
        Else:
            SPDWeight <- (PC[P]/PN[P]) * (1 / SperP)
    Next C

In essence, SPDWeight corresponds to the percentage of times in the training data the correct coreferent was the same distance from the missing argument as the possible coreferent is in the actual data. This weight only factors in sentence distance if the possible coreferent is in the same paragraph as the missing argument. The necessity for dividing the SPDWeight by the number of times a sentence or paragraph distance was seen in the training data arises because it is imperative to determine the percentage of times the potential coreferent is the correct coreferent for a certain sentence distance given that that sentence distance is available. The reason for dividing the SPDWeight by the average number of sentences per paragraph is that it is necessary to convert all weights to a per-sentence basis in order to keep them uniform across all possible coreferents of a missing argument⁷.

⁷ It would be ideal to convert all weights to a per-possible-coreferent basis. However, the assumption that the number of coreferents in a given sentence is independent of the location of that sentence permits simplification of the algorithm to a per-sentence basis. This is done solely for the purposes of simplifying the analysis.

The Sentence and Paragraph Distance method works best for resolving ambiguities that are otherwise not resolvable when possible coreferents are different distances from one another. For example, in the following paragraph from the MUC-II data, the last sentence, shot down at 1112, is ambiguous as to what previous element in the paragraph was shot down; the A-2 or the MIG-22 could be the appropriate referent.

    CVW-42 launched a strike against Land5 AFB. No bogey confrontation. TOT for the A-2 was 1045. Ordinance expended. A hostile MIG-22 was intercepted by friendly fighters. Shot down at 1112.

By applying this method to this example, the system can correctly determine the appropriate coreferent. In general, humans assert that the most recent reasonable entity that fits as a coreferent to a missing argument is the correct coreferent. The Sentence and Paragraph Distance method subsumes this natural human assertion for the missing argument module.
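For concreteness, the two pieces of pseudo-code above can be rendered in Python roughly as follows. The counter names (SN, SC, PN, PC, SperP) follow the pseudo-code; the toy training tuples and the fixed SperP value are invented for illustration, whereas the thesis computes SperP from the training data.

    from collections import Counter

    SN, SC, PN, PC = Counter(), Counter(), Counter(), Counter()
    SperP = 4.0  # stand-in average number of sentences per paragraph

    def train(examples):
        # each example describes one missing argument: prior sentences in
        # its paragraph, prior paragraphs, whether the actual coreferent is
        # in the same paragraph, and the coreferent's distance
        for n_sents, n_pars, same_par, dist in examples:
            for s in range(n_sents + 1):
                SN[s] += 1
            for p in range(n_pars + 1):
                PN[p] += 1
            if same_par:
                PC[0] += 1
                SC[dist] += 1
            else:
                PC[dist] += 1

    def spd_weight(par_dist, sent_dist):
        # SPDWeight from the pseudo-code above
        if par_dist == 0:
            return (PC[0] / PN[0]) * (SC[sent_dist] / SN[sent_dist])
        return (PC[par_dist] / PN[par_dist]) * (1.0 / SperP)

    train([(2, 1, True, 1), (3, 1, True, 1), (3, 2, False, 1)])
    print(spd_weight(0, 1))  # same paragraph, one sentence back
    print(spd_weight(1, 2))  # previous paragraph; sentence distance ignored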
8 The appropriate weight assigned by the missing argument module for this factor for a given possible coreferent of a missing argument can be determined by applying a method that is similar to that which calculates the sentence distance heuristic. Like the Sentence and Paragraph Distance factor, the Coreferent GrammaticalFunction factor requires a training discourse with the missing arguments of the training discourse marked and their coreferents correctly filled in. To determine the appropriate weights for each possible grammatical function, the algorithm trains on data by determining the probability that an actual coreferent of a missing argument performs each possible grammatical function. The training proceeds according to the following pseudocode: Let <GRAM FUNC> E { Subject, Other } Main Verb, Object, Other Noun, For each missing argument: If the actual coreferent acts as a <GRAM FUNC>: <GRAM FUNC>ActualCount < <GRAM FUNC>ActualCount Other Verb, + 1 For each possible coreferent of that missing argument9: If the possible coreferent acts as a <GRAM FUNC>: <GRAM FUNC>PossibleCnt (- <GRAM FUNC>PossibleCnt + 1 Next possible coreferent Next missing argument simplification of the algorithm to a per-sentence basis. This is done solely for the purposes of simplifying the analysis. 8 Similar to the Sentence and ParagraphDistance heuristic, it is necessary to take into consideration the fact that a certain structural parts of a sentence, such as the indirect object, do not occur as often as other structural parts, such as the subject of a sentence. Consequently, the appropriate weight for such structural parts must be adjusted by the number of times the are seen relative to other structural parts of a sentence. 9 The possible coreferent of the missing argument is what the system would consider to be a possible coreferent if it did not know of the actual coreferent, even though the actual coreferent is tagged in the training data. 22 Note that after the training is complete, each 'number' variable represents the number of times its corresponding structural type was encountered as a possible coreferent of a missing argument, and each 'count' variable represents the number of times its corresponding structural type was encountered as an actual coreferent of a missing argument. To run the algorithm on actual data, the following actions are performed: Let <GRAM FUNC> E { Subject, Main Verb, Object, Other Noun, Other Verb, Other } For each possible coreferent C of a given missing argument MA: If C acts as a <GRAM FUNC>: CSTWeight (- ActualCnt<GRAM FUNC> / PossibleCnt<GRAM FUNC> Next possible coreferent In general, this method disambiguates by weighting a possible coreferent to a missing argument based on the percentage of possible coreferents of missing arguments with a specific grammatical function that are actual coreferents. The method is best suited for resolving ambiguities that are otherwise not resolvable when possible coreferents perform different grammatical functions within a discourse. For example, in the following short discourse, is considered a beautiful girl, is ambiguous as to whether it means that Emily is considered a beautiful girl or that Jill is considered a beautiful girl: Emily went to the store with Jill. Is considereda beautiful girl. In most normal discourses, the ambiguity is resolved because the subject of the previous sentence is more likely to be referred to in future sentences than another noun in that sentence. 
For these discourses, the coreferent grammatical function method would behave as expected and would choose 'Emily' as the appropriate antecedent to the second sentence. This method also takes into consideration discourses with abnormal sentence structure such as seen in the MUC-II corpus. For example, if the portion of the discourse previous to the example discourse resembled the 23 following example, then it would not be unreasonable to assume that 'Jill' is the appropriate antecedent to 'is considered a beautiful girl,' as this would continue the trend in the discourse, however awkward it may be. John went to the store with Jane. Is considered a tall woman. Jenny went to the store with George. Is considereda big boy. Billy went to the store with Sarah. Is considered a loud girl. The structural type algorithm would recognize these types of trends because it would train on similar sentences. 2.2.3 VERB FORM CORRELATION One problem encountered when handling missing arguments is that commands in a discourse cannot easily be distinguished from sentences with missing arguments. For example, humans automatically recognize the subject of the sentence when presented with the well-formed sentence, "go to the store." Speakers of English immediately recognize this as a command, and, being a well-formed sentence, know that the implied subject of the sentence is "you" or the listener of the sentence. When dealing with grammar that includes missing arguments, this is not necessarily the case, as seen in the following paragraph from the MUC-II data: USS Intrepid holding active subsurface contact fm unit. Contact tracking. Considercontact to have executed counter attack. Contact is threat. In this discourse, the sentence, consider contact to have executed counter attack, is not a command, even though it is in the root form. Consequently, whenever the automated missing argument encounters a sentence in the present tense with a missing subject, it cannot assume that it is a command. It must treat the sentence as if it had a missing argument until it determines otherwise. This brings about the purpose of the third factor that we can use to weight possible coreferents: verb form correlation. 24 Because human intuition assumes certain aspects about a missing argument based on the verb form following a missing argument, it is possible to develop a method that weights possible coreferents based on variations in these assumptions between different domains. In many domains, sentences that begin with a verb in the present tense can be assumed to corefer with the listener of that sentence. In other domains, such as in MUC-II, it is necessary to come up with a different system of weights that correspond to the different types of assumptions that readers make based on verb form. Thus, the third method for weighting possible coreferents measures the assumptions that humans make on the verb form of a verb that follows a missing argument. As with the first two methods, this requires training on data that has missing arguments marked with their appropriate coreferents. Once this is available, the system trains on this data and determines the percentage of missing arguments whose actual coreferents fall into certain categories given the form of the verb that follows the missing arguments. 
Then when the system runs on the real data, a possible coreferent that falls into one of these categories will be determined based on the verb form that follows the corresponding missing argument, and the percentage of times the training resulted in that category given that verb form. The appropriate categories of possible coreferents and different groupings of verb formations are dependent upon the size of the training data as well as the domain itself. The dependence upon the size of the training data arises because, if the proportion of the number of categories or the number of groups of verb formations to the size of the training data is too large, then there will not be enough trials from the training data to sufficiently populate the categories and groups. If there is not enough data for each of the categories and each of the verb formation groups, the statistical correspondence between the categories and the correct coreferent will be weakened. The reason why the exact categories of possible coreferents and different groupings of verb formations are dependent upon the domain on which the system will run can also be easily examined. For example, MUC-II domain is generally narrative. Because no referents to a listener are made no 'listener' category is necessary. On the other hand, whenever commands are 25 given in a non-narrative discourse, the missing argument 0 always refers to the listener, and the commands are generally in the present tense. For domains with such sentences, it would likely be useful to have 'listener' as a category of possible coreferents. Because the distribution of verb forms within a given discourse is domain dependent, the algorithm must account for different verb formation groupings depending upon the domain. The categories of coreferents and groupings of verb formations determined by the missing argument module were chosen by manually examining the data on which the system trained". The three resultant categories of possible coreferents used by the system are the speaker, people and items capable of action, and other possible coreferents. The four groups of verb formations used by the system are the non-passive present tense formations, passive formations, verb formations in the 'ing' form, and all others. Like the Sentence and ParagraphDistance algorithm and the Coreferent Grammatical Function algorithm, this algorithm requires a set of training data with the appropriate coreferents to the missing arguments marked. The algorithm then performs the following steps to train on the training data: Domain Specific: Let <FORM> e (Present, Passive, Pluperfect, Other) Let <CATEGORY> e (Speaker, People and Items Capable of Action, Other) Domain Independent: For each missing argument MA: MAVerb <- the verb that follows MA (that MA performs) For each possible coreferent C of MA: If C e <CATEGORY> And MAVerb E <FORM>: Number<CATEGORY><FORM> <- Number<CATEGORY><FORM> + 1 Next possible coreferent A <- actual coreferent of MA If A e <CATEGORY> and MAVerb E <FORM>: Count<CATEGORY><FORM> <- Count<CATEGORY><FORM> + 1 Next missing argument 10 As previously noted, even when the subject, 'you', is absent in an English command, the sentence remains grammatically well-formed, and consequently, the lack of a subject is not technically a missing argument. However it must be treated as such when dealing with a discourse with actual missing arguments. 
" Although it may be possible to automate the formation of coreferent categories and verb tense groupings, it is beyond the scope of this work. 26 The 'number' variables represent the number of times a specific category and a specific verb formation grouping occurs for each possible coreferent of each missing argument. The 'count' variables represent the number of times a specific category and a specific verb formation grouping corresponds to the actual coreferent of each missing argument. Once the training is complete, the system uses the information accumulated to run on the actual data. In general, for a given possible coreferent category and missing argument verb form, the system outputs the percentage of times the category and form occurred together when the possible coreferent was the actual coreferent. This can be calculated by the following pseudo- code: For each possible coreferent C given a missing argument MA: <CATEGORY> <- the category into which C falls <FORM> (- the verb formation grouping of the verb that corresponds to MA VTCWeight (- Count<CATEGORY><FORM> / Number<CATEGORY><FORM> Next possible coreferent The verb formation correlation algorithm is similar to the previous two algorithms in that it determines characteristics of the training data with respect to each possible coreferent of a missing argument, and then weights the possible coreferent according to the percentage of times those characteristics correspond with the characteristics of the actual coreferents in the training data. The characteristics that this algorithm uses, however, are the verb formation that corresponds to the missing argument and the category of the possible coreferent of that missing argument. 2.2.4 COREFERENT ACTION CAPABILITY The fourth factor that affects the output of the missing argument module is based on the concept that any given noun is only capable of performing a subset of all possible actions. The 27 following example demonstrates a sentence where the system could use this method to determine the referent of a missing argument: 1945 USS Intrepid sighted 2 torpedoes 3000 yds off stbd bow. Launched urgent asroc attack. Sighted third torpedo and launched third asroc attack based on wake locations. Washington in company with Intrepidnot believed attacked. Here, the second and third sentences have missing subjects. However, by noticing that the only entity mentioned prior to the third sentence that could possibly 'launch an attack' or 'sight a torpedo' is the ship, the USS Intrepid. If the module could recognize that ships can perform the actions launching something and sighting something, it would determine that Intrepid is a possible coreferent of the missing arguments in these sentences. Additionally, if the module could recognize that neither torpedoes nor yds nor bow nor asroc attack could perform these actions, then it could correctly identify the missing arguments above with this knowledge alone. The method that we can use to evaluate the capability of a possible coreferent to perform a given action is relatively straightforward. The method determines that a given noun is capable of performing a given action by noticing that that noun performed that action in the training discourse. As a consequence, the method first requires training on a set of input data, recording noun-verb pairs when any noun performs any action in that training data. 
Then, as long as the training data is similar to the actual data, the system can correctly determine some of the capabilities of nouns that it has previously encountered. When the system is trained on similar data as the test data, the fact that the module has encountered 'a ship' that 'sights' in the training data, but has not encountered 'a torpedo' or 'an asroc attack' that 'sights', will cause the system to correctly identify the missing argument coreferent in the case above. One problem with this method is that for very diverse domains, the method may require a large amount of training data. This problem arises because the method requires that the system encounter any given noun-verb pair in the test data with reasonably high probability. This may require massive amounts of test data. Because the MUC-II corpus is so small, an enhancement to this method might be useful. 28 To increase the coverage of this method without increasing the amount of training data required, it is useful to note that if something can fly, then it can also move. Additionally, if 'a plane' can perform some action, then knowing that 'an aircraft' is synonymous with 'a plane' implies that 'an aircraft' can perform that action as well. To generalize this concept, if something has the ability to perform a specific action, then it also has the ability to perform other more general actions that subsume this action. Additionally, if some entity is capable of performing a specific action, then other entities of which the original entity is a type can also perform that specific action. When one entity is a type of a second entity, the first entity is considered a hypernym of the second, more specific entity (and the specific entity is the hyponym of the more general entity.) To accurately determine the pairs of words that should be included in the list, the missing argument module can utilize hypernym and synonym relations provided for by WordNet. The use of the WordNet database allows the missing argument module to generate a list of hypernyms or synonyms given a precise meaning of a word. The following is pseudo-code for the appropriate implementation of the training: For each sentence in the training data: For each noun N in the sentence that performs some action V: If N is a proper noun: N <- improper equivalent of N of nouns WordNet considers to by synonyms NList <- the list of the appropriate sense of the noun N. of verbs WordNet considers to be hypernyms VList <- the list of the appropriate sense of the verb V. For each verb VL in VList: For each noun NL in Nlist: increment the count associated with the NL-VL pair in the noun-action-capability database Next noun Next verb Next noun Next sentence Unfortunately, appropriately determining the sense of the words in the output semantic frame is necessary for the appropriate use of the hypernym and synonym relations. Additionally, such sense disambiguation is beyond the scope of the work described in this thesis. This task 29 may be difficult in general due to the ambiguous nature of sentences with missing arguments. Consequently, the implemented system disambiguates the word sense by defaulting to the most common sense of the word according to WordNet. This will inevitably cause problems, as the most common sense of the word according to WordNet will often be the incorrect sense of the word, on account that WordNet was trained on a domain that is very different than the MUC-II domain. Future improvements upon this system may wish to focus in on this issue. 
However, even though the implemented missing argument module has this fault, this method can still be applicable as some correlation will exist between the output of this method and the likelihood that a given potential coreferent of a missing argument is a correct coreferent. Once the training is complete, when a missing argument performs an action in the test data, this method is capable of outputting a weight for each possible coreferent that is based on that possible coreferent's capability to perform the action that the missing argument performs. This capability is determined in the following manner from the list created by training: For each possible coreferent C of a given missing argument, MA: V <- the base verb corresponding to the action that ma performs If C is a noun N (- an improper noun corresponding to the possible coreferent If C corresponds to a verb phrase N <- the noun form of the main verb corresponding to C NVCount <- DatabaseCount(N,V) If NVCount > Threshold: CACWeight <- 1. 0 Else: CACWeight <- 0.0 Next possible coreferent The Coreferent Action Capability method is different than the previous methods in that the data it trains on does not require the appropriate missing argument coreferent to be marked. It is also unique in that training on any sort of data will likely be helpful for resolving coreferent action capabilities in any domain, due to domain independent properties of nouns and verbs. 30 2.2.5 COREFERENT DESCRIPTION CAPABILITY The fifth factor that can be used to weight a possible coreferent of a missing argument is the possible coreferent's description capability. The method for applying this factor is very similar to that for the Coreferent Action Capability factor, as it shares the concept that certain characteristics of the missing argument are likely to match the characteristics of its appropriate coreferent. In the description capability method, the algorithm first determines if the missing argument is described by an adjective. If it is, the weight that the method outputs is based on the likelihood that the potential coreferent would be described by the adjective that describes that missing argument in the discourse. To apply this method, as with the application of the missing argument Coreferent Action Capability method, it is necessary to train the system on a set of data. In this method, however, the training consists of marking down noun-adjective pairs whenever an adjective describes a noun in the training data (instead of noun-verb pairs, as with Coreferent Action Capability.) Once the training is complete, the system has a list of nouns-adjective pairs such that all of the adjectives in the list described their corresponding nouns somewhere in the training data. This list can then be examined when the system comes to a missing argument that is described by some adjective; if a potential coreferent of the missing argument appears on the list with that same adjective, then that potential coreferent is a reasonable candidate as the actual coreferent, because it has the capability of being described by the adjective. If the potential referent does not appear on this list with the adjective that describes the missing argument, then, assuming the training data was large enough, that potential coreferent is not likely capable of performing the action that the missing argument performs. Consequently, the potential coreferent in question is not likely the actual coreferent of the missing argument. 
The hypernym and synonym relations can be applied to this method in a manner similar to their application in the Coreferent Action Capability method. This is done by testing not only whether the adjective that describes the missing argument appears on the noun-adjective list with the possible coreferent, but also whether any of the synonyms of that adjective appear on the noun-adjective list with the possible coreferent. For example, if the word 'building' were described as being 'gargantuan' in the training data and a missing argument in the input data were described as being 'enormous', then, since 'gargantuan' can describe 'building' and 'gargantuan' is a synonym of 'enormous', 'enormous' can describe 'building' as well. Consequently, whenever building-gargantuan is inserted into the noun-adjective list, building-enormous can be inserted as well.

A parallel concept is applicable to the nouns that are being described. By including in the noun-adjective list all of the hypernyms of the noun that is described by some adjective during training, additional coverage of the algorithm can be achieved. For example, if the word 'motorcycle' were described as being 'fast' in the training data, then, since 'vehicle' is a hypernym of 'motorcycle', the system can infer that 'vehicles' can be described as being 'fast' as well. Consequently, when the pair motorcycle-fast is inserted into the noun-adjective list, the entry vehicle-fast is inserted as well.

In general, whenever a noun is described by an adjective in the training data, a list of hypernyms of that noun and a list of synonyms of that adjective are created. Each noun-adjective pair in the product of these two lists is then inserted into the coreferent description capability pair list. The training for this method is very similar to that of the Coreferent Action Capability method, as seen in the following pseudo-code:

For each noun N in the training data described by some adjective A:
    If N is a proper noun:
        N <- improper equivalent of N
    NList <- the list of nouns WordNet considers to be hypernyms of the most appropriate sense of the noun N
    AList <- the list of adjectives WordNet considers to be synonyms of the appropriate sense of the adjective A
    For each adjective AL in AList:
        For each noun NL in NList:
            increment the count associated with the NL-AL pair in the noun-adjective database
        Next noun in NList
    Next adjective in AList
Next noun

As explained previously in the Coreferent Action Capability section, appropriately determining the sense of the words in the output semantic frame is necessary for the appropriate use of the hypernym and synonym relations. Such sense disambiguation is beyond the scope of the work described in this thesis. Consequently, the implemented system disambiguates word sense in this method by defaulting to the most common sense of the word in WordNet. This will inevitably cause problems, as the most common sense of a word according to WordNet will often be the incorrect sense, because WordNet's sense rankings derive from material very different from the MUC-II domain.

Once the noun-adjective-capability database is built, if enough training data is supplied, the system should be able to determine whether it is reasonable to describe a given noun with a given adjective by looking up the pair in the database. After the database is built, this method weights each possible coreferent of each missing argument that is described by some adjective.
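The following is a sketch of the cross-product insertion just described, under the same NLTK and most-common-sense assumptions as before; adjective synonymy is approximated here by the lemma names of the adjective's first synset:

    from nltk.corpus import wordnet as wn

    def describable_pairs(noun, adjective):
        """All noun-adjective pairs licensed by crossing hypernyms of the
        noun with synonyms of the adjective, plus the literal pair itself."""
        nouns, adjs = {noun}, {adjective}
        noun_syns = wn.synsets(noun, pos=wn.NOUN)
        adj_syns = wn.synsets(adjective, pos=wn.ADJ)
        if noun_syns:
            for hyper in noun_syns[0].closure(lambda s: s.hypernyms()):
                nouns.update(hyper.lemma_names())
        if adj_syns:
            adjs.update(adj_syns[0].lemma_names())
        return {(n, a) for n in nouns for a in adjs}

    # e.g. describable_pairs('motorcycle', 'fast') should contain
    # ('vehicle', 'fast'), as in the example from the text.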
The system uses a rule-based method to determine whether a missing argument is described by an adjective. Such a rule might be:

X is considered Y and Y is an adjective → Y describes X

The automation of the creation of such rules is beyond the scope of the work described in this thesis, because experimental results determined that, in the MUC-II domain, a few simple rules cover many of the cases where a missing argument is described by some adjective. The rules used in the system could thus be developed by manually examining the training data.

Once the rules determine that an adjective describes a missing argument, each noun-adjective pair (where the noun corresponds to each possible coreferent of that missing argument) is looked up in the noun-adjective-capability database. If the count for that noun-adjective pair is above a certain threshold, then that possible coreferent is given a weight of one; if the count is not above that threshold, then the possible coreferent is given a weight of zero. (The exact location of the threshold will depend upon the size and variance of the training data. For small amounts of training data, the appropriate threshold is one; automating the determination of this threshold for large quantities of training data is beyond the scope of the work presented here.) The pseudo-code that the system follows for determining this weight for a possible coreferent of a missing argument is as follows:

For each possible coreferent C of a missing argument MA:
    S <- the sentence in which MA is located
    Run through the set of rules on S
    If no rule matches S:
        CDCWeight <- 0.0
    Else:
        A <- the adjective that the rule returns that describes MA
        N <- the noun corresponding to the possible coreferent C
        NACount <- DatabaseCount(N, A)
        If NACount > Threshold:
            CDCWeight <- 1.0
        Else:
            CDCWeight <- 0.0
    End If-Else
Next possible coreferent

This algorithm is most useful when ambiguities arise that can be resolved by determining that an adjective that describes the missing argument can describe only one of its possible coreferents. This can be examined in the following short discourse from the MUC-II corpus:

Two unidentified air contacts were fired upon by USS Georgetown with birds at 2218. Both contacts not squawking ID modes. No Esm. Confirmed not friendly. Picked up at brg 87fin Louisiana. Considered popup. Held by Georgetown and A-7. Last report.

Here, the missing argument in the sentence 'Considered popup' is described as being popup. Since none of the previous words other than contacts can be described by the adjective popup, contacts must be the appropriate coreferent of the missing argument in the sentence. The missing argument module can recognize this if it discovers that, of the possible coreferents in this sentence, only contacts was described in the training data as being popup.

2.2.6 COREFERENT NOUN CATEGORY

The sixth factor that contributes to the identification of the actual coreferents of a missing argument is applicable when the missing argument is compared with or equated to another noun. In order to appropriately apply this factor to a discourse, the missing argument module must recognize when two elements are compared with or equated to one another. The method used to evaluate this factor applies the fact that, when a missing argument is compared with or equated to a noun, the actual coreferent of the missing argument must also fit in the general category of that noun. This can be seen in the following example MUC-II sentences:

Have positive confirmation that battleforce is targeted.
Considered hostile act.

Here, the only possible coreferent of the missing argument in the second sentence is the 'targeting' that occurred in the first sentence, as the only entity that falls under the category of an act is the 'targeting'.

Once again, the hypernym relation can be used to expand the missing argument module's coverage of this factor. In this case the hypernym relation is used to determine whether one noun falls into the general category of another. Thus, to determine the appropriate weight for a possible coreferent of a missing argument that is compared with a noun, the system lists the hypernyms of all previous possible entities and determines whether the noun to which the missing argument is compared is in each of the lists. If that noun is in one of these lists, then the entity that corresponds to that list has a reasonable chance of being the actual coreferent of the missing argument. Once again it must be noted that determining the sense of the words in the output semantic frame is necessary for the appropriate use of the hypernym relation. Because such sense disambiguation is beyond the scope of the work described in this thesis, the implementation of this method defaults to the most common sense of the word in WordNet for determining the list of hypernyms of the noun. The pseudo-code that implements this algorithm is as follows:

For each possible coreferent C for a given missing argument MA:
    S <- the sentence in which MA is located
    Run through the set of rules on S
    If no rule matches S:
        CNCWeight <- 0.0
    Else:
        E <- the noun that the rule returns that is equated to MA
        NList <- the list of hypernyms returned by WordNet for the possible coreferent C
        If E is found in NList:
            CNCWeight <- 1.0
        Else:
            CNCWeight <- 0.0
    End If-Else
Next possible coreferent

This algorithm provides the most ambiguity resolution when actual coreferents can be determined solely by the fact that the missing argument is compared to or equated with another noun, and the correct coreferent falls in the category of this noun. (A concrete sketch of this hypernym test appears below, after the discussion of the recency parameter.)

2.2.7 COREFERENT ACTION RECENCY

The final method described for weighting the possible coreferents of a missing argument is applicable when the missing argument takes some action and that same or a similar action was recently taken by another entity. The factor that this method measures is the Coreferent Action Recency factor. One of the properties that occasionally makes the referent of a missing argument intuitively determinable is that the referent performs some action that another entity recently performed. Because of this, when an entity has recently performed the same action as the missing argument, that entity is likely a coreferent of the missing argument.

To apply this method, only a few steps need to be taken. First, the 'recency' parameter needs to be set. This parameter determines how close, in terms of sentence distance, the missing argument has to be to a possible coreferent that performs the same or a similar action in order for the method to weight that possible coreferent as a likely candidate for being the actual coreferent. This value can be set manually by estimation, or it can be determined by performing a statistical evaluation of the training data to find the optimal measure. The missing argument module described in this thesis used the former, manual method.
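Returning briefly to the noun-category test of section 2.2.6 (the sketch promised above), the hypernym membership check might look as follows, again defaulting to WordNet's most common sense; the function name and NLTK usage are illustrative assumptions:

    from nltk.corpus import wordnet as wn

    def in_noun_category(candidate_noun, category_noun):
        """True if the category noun appears in the hypernym closure of the
        candidate's most common sense (or is that sense itself)."""
        cand = wn.synsets(candidate_noun, pos=wn.NOUN)
        cat = wn.synsets(category_noun, pos=wn.NOUN)
        if not cand or not cat:
            return False
        closure = {cand[0]} | set(cand[0].closure(lambda s: s.hypernyms()))
        return cat[0] in closure

    # e.g. in_noun_category('motorcycle', 'vehicle') is expected to be True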
After the recency parameter is set, when the system is running on input data, every time a missing argument performs some action, the system creates a list of synonymous actions. This list is created through the use of WordNet. The system then runs through the 'recent' possible coreferents of the missing argument and determines whether any of them perform an action that is in the list of synonymous actions. If one or more of the possible coreferents performs an action on this list, then those coreferents should be given a weight that is higher than if they were not performing an action on this list. The pseudo-code for this algorithm is as follows:

For each possible coreferent C for a given missing argument MA:
    CARWeight <- 0.0
    If C is a proper noun:
        CV <- the verb that corresponds with the action C performs
        MAV <- the verb that corresponds with the action MA performs
        MAVList <- synonyms of MAV according to the WordNet database, based on the appropriate sense of MAV
        If CV is found in MAVList:
            CARWeight <- 1.0
Next possible coreferent

Because of the use of the synonym relation, word sense ambiguity becomes a potential problem. Because such sense disambiguation is beyond the scope of the work described in this thesis, the implementation of this method defaults to the most common sense of the word in WordNet for determining the list of synonyms of the verb.

This algorithm provides the most use when ambiguities in the input can be resolved by the fact that the missing argument performs an action that is similar to the action performed by a previous proper entity. This can be seen in the following example:

Two F-14 conducted a WAS against Kara. Conducted SSC to locate hostile ship and attacked with missiles.

In this example, the correct coreferent of the missing argument in the second sentence is F-14. This is intuitive because in the previous sentence the F-14 performs the action conducted, and the missing argument performs the same action. This algorithm will therefore correctly disambiguate between F-14 and Kara as the appropriate coreferent of the missing argument.

2.3 CHAPTER SUMMARY

The structure of the missing argument module is oriented to accept as input the semantic frame output of the CCLINC system. By identifying possible coreferents, and by keeping records of previous possible coreferents, the system is capable of applying distinct methods that individually weight each possible coreferent of each missing argument. These distinct methods are categorized into seven components of the missing argument referent determination and replacement module. Each of these seven methods is capable of independently supplying a weight to each possible coreferent of a missing argument based on a unique factor that contributes to the accurate identification of the referents of missing arguments. A combining algorithm merges the weights outputted by these methods into a single resultant weight. The optimal combination of these weights will ideally enable the module to identify an appropriate coreferent of a missing argument. The entity with the highest final weight is guessed as being an actual coreferent of the missing argument. Once the coreferent of a missing argument is guessed, the module inserts the appropriate form of that coreferent into the location of the missing argument, and proceeds to the next semantic frame.
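Before turning to training, one last sketch makes the chapter's remaining WordNet test concrete: the verb-synonym comparison used by the Coreferent Action Recency method of section 2.2.7. As with the earlier sketches, the NLTK interface, the function name, and the most-common-sense default are assumptions rather than the thesis implementation:

    from nltk.corpus import wordnet as wn

    def performs_similar_action(candidate_verb, ma_verb):
        """True if the candidate's verb matches the missing argument's verb
        or appears among the synonyms of that verb's most common sense."""
        if candidate_verb == ma_verb:
            return True
        ma_syns = wn.synsets(ma_verb, pos=wn.VERB)
        return bool(ma_syns) and candidate_verb in set(ma_syns[0].lemma_names())

    # e.g. in the F-14 example above, both entities' actions reduce to the
    # base verb 'conduct', so the check succeeds trivially.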
CHAPTER 3
SYSTEM TRAINING

For the missing argument referent determination and replacement system to best recognize coreferents of missing arguments, many of the individual methods described in chapter two require training. Once the individual algorithms are trained, system training must be performed in order to determine the appropriate weight of the various outputs of the individual algorithms. By training the system in this manner, an estimation can be made of the likelihood that a given possible coreferent of a missing argument is an actual coreferent. The purpose of this chapter is to describe the method used for combining the outputs of the individual algorithms and to explain why this method is appropriate for estimating the likelihood that a possible coreferent of a missing argument is an actual coreferent of that missing argument. The results from applying the individual algorithms to the training data are presented in this chapter as well.

3.1 A METHOD FOR COMBINING WEIGHTS

In order to determine a method for combining the weights assigned to a possible coreferent by the individual methods described in chapter two, it is first necessary to examine the task that the system should be performing. Ideally, the system should determine the exact probability that a given possible coreferent is an actual coreferent. With a perfect system, a probability of one would be assigned to the correct referent, and a probability of zero would be assigned to all of the other entities considered. The actual system, however, will only be able to estimate these probabilities according to the outputs of the individual algorithms. Consequently, it is desirable for the system to assign a weight to each possible coreferent that is proportional to the probability that that possible coreferent is a correct coreferent, based on the knowledge available to the system.

In order to analyze the appropriate method for combining the weights assigned by the various methods, it will be useful to assume independence among the weights assigned by each of the seven methods. This assumption provides a simpler method for determining the optimal combination of the weights generated for the various factors described in chapter two, because adjustments of the overall weight assigned by the system for a given possible coreferent can be made on the basis of each method separately. (A more accurate system would take into consideration that the weights outputted by each individual algorithm might not be independent; such an analysis, and the construction of a corresponding weight-combination algorithm, are beyond the scope of the material presented here.) This simplification and the consequent mathematical separation are shown through equations 3-1 through 3-5 below.

To determine the weight of each individual method, first notice that equation 3-1 holds true for a single algorithm A because of Bayes' Law, where Pos is a possible coreferent, Act is the actual coreferent, Weight(A) is the weight assigned to Pos by the given algorithm A, and K is some constant:

EQUATION 3-1:
Pr[ Pos = Act | Weight(A) = K ] = Pr[ Pos = Act ∩ Weight(A) = K ] / Pr[ Weight(A) = K ]

Equation 3-2 extends the rule to all seven algorithms, A_1 through A_7, and corresponding constants K_1 through K_7:

EQUATION 3-2:
Pr[ Pos = Act | Weight(A_i) = K_i for all i ]
    = Pr[ Pos = Act ∩ Weight(A_1) = K_1 ∩ Weight(A_2) = K_2 ∩ ... ∩ Weight(A_7) = K_7 ]
      / Pr[ Weight(A_i) = K_i for all i ]
Because of the assumed independence, the following equations 3-3 and 3-4 hold true:

EQUATION 3-3:
Pr[ Pos = Act ∩ Weight(A_1) = K_1 ∩ Weight(A_2) = K_2 ∩ ... ∩ Weight(A_7) = K_7 ]
    = ( Pr[ Pos = Act ∩ Weight(A_1) = K_1 ] * Pr[ Pos = Act ∩ Weight(A_2) = K_2 ] * ...
        * Pr[ Pos = Act ∩ Weight(A_7) = K_7 ] ) / ( Pr[ Pos = Act ] )^6

EQUATION 3-4:
Pr[ Weight(A_i) = K_i for all i ]
    = Pr[ Weight(A_1) = K_1 ] * Pr[ Weight(A_2) = K_2 ] * ... * Pr[ Weight(A_7) = K_7 ]

Substituting equations 3-3 and 3-4 into equation 3-2 results in equation 3-5:

EQUATION 3-5:
Pr[ Pos = Act | Weight(A_i) = K_i for all i ]
    = ( Pr[ Pos = Act ∩ Weight(A_1) = K_1 ] * ... * Pr[ Pos = Act ∩ Weight(A_7) = K_7 ] )
      / ( ( Pr[ Pos = Act ] )^6 * Pr[ Weight(A_1) = K_1 ] * ... * Pr[ Weight(A_7) = K_7 ] )

Through the use of equation 3-5, the system can estimate the probability that a given coreferent is an actual coreferent if it can estimate each of the factors on the right-hand side of the equation. Fortunately, each of these factors can be estimated if the system trains on a set of data that is similar to the actual data and records appropriate values according to the outputs of the individual algorithms for each coreferent of each missing argument. More specifically, Pr[ Weight(A_i) = K_i ] can be estimated by dividing the number of times algorithm A_i outputs K_i by the number of possible coreferents seen, for each value of i. The value Pr[ Pos = Act ] is constant across all possible coreferents of a given missing argument and can consequently be disregarded, as long as it is disregarded for all possible coreferents of that missing argument. Pr[ Pos = Act ∩ Weight(A_i) = K_i ] can be estimated by dividing the number of times algorithm A_i rates the actual coreferents with the value K_i by the total number of possible coreferents. The division by the total number of possible coreferents can also be disregarded as long as it is disregarded for all possible coreferents, as the system only needs to maintain proportionality in equation 3-5, rather than equality. This relaxation is acceptable because only the relative sizes of the assigned weights matter.

Once the above factors are estimated, the result is proportional (in estimation) to the likelihood that a given possible coreferent of a missing argument is an actual coreferent of that missing argument. The remaining training work requires running the system on the training data to evaluate Pr[ Pos = Act ∩ Weight(A_i) = K_i ] and Pr[ Weight(A_i) = K_i ] for all possible A_i and K_i. This can be done one algorithm at a time. To determine the amount of training data that should be used, it is necessary to examine the change in these values for each algorithm as more training data is accumulated by the system. As the system trains on more and more data, the values assigned by the above equations for each individual method should level off. In other words, the system should be trained on enough data that the addition of more training data changes the constants set by training only minimally.

3.2 RESULTS FROM TRAINING THE ALGORITHMS

For the training performed by the system, the total number of sentences used for the MUC-II domain was 647. These sentences comprised 105 paragraphs and included 41 instances of missing arguments. The analysis of training that appears in this chapter is based on the implementation of the missing argument module. The present missing argument module is built into an older version of the CCLINC system than that described in chapters one and two.
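To illustrate how equation 3-5 might be applied at run time, the following sketch scores a candidate by multiplying per-method ratios of the joint and marginal estimates, dropping the constant ( Pr[ Pos = Act ] )^6 since only relative scores matter. The table layout (one dictionary of joint estimates and one of marginal estimates per method) is an assumption made for illustration:

    def combined_score(outputs, joint, marginal):
        """outputs[i] is the value K output by method i for this candidate;
        joint[i][K] estimates Pr[Pos=Act and Weight(Ai)=K] and marginal[i][K]
        estimates Pr[Weight(Ai)=K], both read off the training tables."""
        score = 1.0
        for i, k in enumerate(outputs):
            if marginal[i].get(k, 0) == 0:
                return 0.0  # output value never seen in training: no evidence
            score *= joint[i].get(k, 0) / marginal[i][k]
        return score

    def best_coreferent(candidate_outputs, joint, marginal):
        """candidate_outputs maps each possible coreferent to its tuple of
        per-method outputs; the highest-scoring candidate is guessed."""
        return max(candidate_outputs,
                   key=lambda c: combined_score(candidate_outputs[c],
                                                joint, marginal))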
It is also important to note that the original form of the MUC-II sentences and paragraphs was different from the form used by the system. The data that the missing argument module uses went through two steps of modification. In the first step, the sentences were manually extracted from telegraphic messages and set into a paragraph form with one paragraph per message. From this state, the sentences were altered so that their meaning could be understood by the older version of the CCLINC system. This transformation included removing the headers from the paragraphs as well as altering the sentences so that they could be parsed by the system. In general, the alterations were kept to a minimum. When they were necessary, they were made by removing the cause of the parse failure while minimally changing the general meaning of the paragraph. In some cases, the consequence was that large sentences were broken up into two smaller sentences to allow parsing. In other instances of parse failure, sentence fragments that did not contribute greatly to the meaning of a paragraph and did not parse correctly were removed completely. If it was necessary to excessively alter a paragraph in order for the system to parse the sentences, the entire paragraph was removed from the data.

The Sentence and Paragraph Distance algorithm requires training on a set of data with the appropriate missing arguments marked. Table 3-1 shows the results from training the sentence and paragraph system on the MUC-II training data. The last two fields are the values that the system will place in equation 3-5 when the system runs on the actual data in order to estimate the probability of a given possible coreferent of a specific missing argument. The Distance field in table 3-1 corresponds to the distances, encountered during training, of the actual coreferents of the missing arguments relative to the missing arguments themselves. For example, the coreferent of the first missing argument in the following discourse would fall in the category One Sentence, as the missing argument refers to the action of shooting down the aircraft. The coreferent of the second missing argument would fall in the category Two Sentences, as it refers to the aircraft, which is found two sentences prior to the missing argument.

The enemy ship shot down the aircraft. Considered hostile act. Was on a top-secret mission.

TABLE 3-1: Training Results for the Sentence and Paragraph Distance Method on MUC-II Data

Distance                   Number of     Possible        Pr[Pos=Act ∩   Pr[Weight=K]
                           Occurrences   Occurrences*    Weight=K]
One Sentence               22            255             .025           .297
Two Sentences              6             136             .007           .158
Three Sentences            2             66              .002           .077
Four or More Sentences     3             57              .003           .067
More Than a Paragraph      5             277             .006           .323
Other**                    14            56              .016           .068

*  This number is based on the estimated number of possible occurrences per sentence.
** Other includes entities such as the listener or the speaker of the discourse.

The Number of Occurrences field in table 3-1 is calculated by counting the number of actual coreferents that were Distance from their corresponding missing argument in the training data. The Possible Occurrences field is calculated by multiplying the estimated number of possible coreferents per missing argument by the total number of missing arguments.
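As a quick consistency check on table 3-1 (a derivation from the tabulated counts only, not an additional result): the Possible Occurrences column sums to

255 + 136 + 66 + 57 + 277 + 56 = 847

possible coreferents, so for the One Sentence row,

Pr[ Pos = Act ∩ Weight = K ] ≈ 22 / 847 ≈ .026 and Pr[ Weight = K ] ≈ 255 / 847 ≈ .301,

which agrees with the tabulated .025 and .297 up to rounding and the sampling estimate behind the Possible Occurrences column.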
This estimate was calculated by randomly sampling a portion of the training data (15 out of a total of 58 instances of missing arguments were used in the random sample) and determining the number of possible coreferents that occurred Distance away from each missing argument in that portion. This number was then extrapolated to the entire data set. Pr[ Pos = Act ∩ Weight = K ] was calculated by dividing the Number of Occurrences field by the total number of possible coreferents of missing arguments. The rightmost field, Pr[ Weight = K ], was calculated by dividing the Possible Occurrences field by the total number of possible coreferents of missing arguments in the training discourse.

The Coreferent Grammatical Function method and the Verb Form Correlation method also require training on a set of data. The Number of Occurrences and Possible Occurrences fields seen in tables 3-2 and 3-3 are the results of this training. The last two fields of these tables correspond to the values that will be entered into equation 3-5, depending upon the grammatical function of the given possible coreferent of a missing argument in the actual data.

TABLE 3-2: Training Results for the Coreferent Grammatical Function Method on MUC-II Data

Grammatical Function    Number of     Possible        Pr[Pos=Act ∩   Pr[Weight=K]
                        Occurrences   Occurrences*    Weight=K]
Subject                 26            205             .030           .239
Object                  5             128             .006           .149
Main Verb               0             184             .000           .215
Other Noun              4             214             .005           .250
Other Verb              1             67              .001           .078
Other**                 14            58              .016           .068

*  This number is based on the estimated number of possible occurrences per sentence.
** Other includes entities such as the listener or the speaker of the discourse.

In table 3-2, the Grammatical Function column corresponds to the grammatical function of the various entities in the training data with respect to each missing argument. For example, in the example given previously, the actual coreferents perform different grammatical functions:

The enemy ship shot down the aircraft. Considered hostile act. Was on a top-secret mission.

The referent of the first missing argument corresponds to shot down and would fall into the category Main Verb in table 3-2, as shot down is the action that the subject takes in the sentence. The referent of the second missing argument corresponds to aircraft and would fall into the category Object in the table. Enemy ship would also be considered a possible coreferent of the two missing arguments, and would fall in the category Subject, as it is the subject of the sentence. (Note that all of the numbers in the table are with respect to each missing argument: each entity is counted once for each missing argument of which it is a possible coreferent.)

In table 3-3, the Possible Coreferent Category and Verb Form column corresponds to the category of each possible coreferent and the verb form in the sentence of the missing argument, as determined by the Verb Form Correlation method explained in chapter two. In the example paragraph, the missing arguments correspond to different verb forms in table 3-3:

The enemy ship shot down the aircraft. Considered hostile act. Was on a top-secret mission.

The action that the first missing argument takes, considered, is in the passive form. Consequently, all possible coreferents counted in table 3-3 for this missing argument would fall in a row that has the form Passive.
To look at a specific example, the enemy ship is considered a possible coreferent of that missing argument and would fall into the coreferent category Action Performer. Consequently, the enemy ship as a possible coreferent of the first missing argument in the example paragraph would fall into the category Action Perf. / Passive.

TABLE 3-3: Training Results for the Verb Form Correlation Method on MUC-II Data

Possible Coreferent          Number of     Possible        Pr[Pos=Act ∩   Pr[Weight=K]
Category and Verb Form       Occurrences   Occurrences*    Weight=K]
Speaker / Pluperfect         7             10              .012           .017
Speaker / Present            2             2               .003           .003
Speaker / Passive            0             11              .000           .018
Speaker / Other              1             18              .002           .030
Action Perf. / Pluperfect    3             30              .005           .050
Action Perf. / Present       0             3               .000           .005
Action Perf. / Passive       7             33              .012           .054
Action Perf. / Other         16            54              .026           .089
Other / Pluperfect           0             25              .000           .041
Other / Present              0             2               .000           .003
Other / Passive              4             28              .007           .046
Other / Other                1             48              .002           .079

* This number is based on the estimated number of possible occurrences per sentence.

Each of the other columns in tables 3-2 and 3-3 corresponds to the same calculations as those in table 3-1. The Number of Occurrences field for these tables corresponds to the number of times an actual coreferent of a missing argument falls in the category given in the first column. The Possible Occurrences column corresponds to the number of times any possible coreferent of a given missing argument falls into the category given in the first column. The last two columns are calculated by dividing Number of Occurrences and Possible Occurrences, respectively, by the total number of possible coreferents of all of the missing arguments in the training data.

The result of the individual training of the Coreferent Action Capability method is a lengthy list of noun-verb pairs. Included in each noun-verb entry in the output is whether or not the verb was in the passive form. To determine the appropriate weight for the Coreferent Action Capability algorithm, the system used the noun-verb list produced from training to determine whether a given possible coreferent of a missing argument could perform the same action that the missing argument performed. The results of this training can be seen in table 3-4. The YES row corresponds to possible coreferents that are represented as a noun in the noun-verb list with a corresponding action that is the same as that which the missing argument performs. The NO row corresponds to those possible coreferents that are not found with a corresponding verb in the list.

The result of training the Coreferent Description Capability method is similar to that of the Coreferent Action Capability method. Instead of a list of noun-verb pairs, the training results in a list of noun-adjective pairs, where, ideally, the list encapsulates the relation between nouns found in the discourse and the possible adjectives that can describe them.

TABLE 3-4: Training Results for the Coreferent Action Capability Method on MUC-II Data

Is the Possible Coreferent Capable of Performing    Pr[Pos=Act ∩   Pr[Weight=K]
the Action that the Missing Argument Performs?      Weight=K]
YES                                                 .019           .049
NO                                                  .047           .953
TABLE 3-5: Training Results for the Coreferent Description Capability Method on MUC-II Data

Is the Possible Coreferent Capable of Being         Pr[Pos=Act ∩   Pr[Weight=K]
Described by an Adjective Similar to that which     Weight=K]
Describes the Missing Argument?
YES                                                 .008           .027
NO                                                  .060           .973

TABLE 3-6: Training Results for the Coreferent Noun Category Method on MUC-II Data

Could the Possible Coreferent Fall into the Same    Pr[Pos=Act ∩   Pr[Weight=K]
Noun Category as the Missing Argument?              Weight=K]
YES                                                 .003           .008
NO                                                  .064           .992

Consequently, table 3-5 is similar to table 3-4 in terms of its entries. The YES row corresponds to those possible coreferents of missing arguments that the system determined were capable of being described by an adjective that also described the missing argument. The NO row corresponds to those possible coreferents for which the system did not find a noun-adjective entry in the description capability list.

Neither the Coreferent Noun Category method nor the Coreferent Action Recency method requires individual training. Tables 3-6 and 3-7 show the correspondence between the output of these two methods and the values that will be inserted into equation 3-5.

TABLE 3-7: Training Results for the Coreferent Action Recency Method on MUC-II Data

Was a Similar Action Recently Performed             Pr[Pos=Act ∩   Pr[Weight=K]
by the Possible Coreferent?                         Weight=K]
YES                                                 .007           .008
NO                                                  .060           .992

For tables 3-4, 3-5, 3-6, and 3-7, the entries were calculated from the training of the system. Pr[ Pos = Act ∩ Weight = K ] was determined by dividing the number of actual coreferents that were assigned a YES or a NO (depending upon the row of the entry) by the corresponding algorithm by the total number of possible coreferents in the training data. The value Pr[ Weight = K ] was determined by dividing the number of possible coreferents that were assigned a YES or a NO (depending upon the row of the entry) by the corresponding algorithm by the total number of possible coreferents in the training data.

3.3 TRAINING SUMMARY

In general, it is not possible to determine which of the seven algorithms is best for determining a possible coreferent. It should be noted, however, that a few of the algorithms are extremely accurate when they output a certain way. The Coreferent Action Recency algorithm, for example, correctly identifies the coreferent of a missing argument with 87% accuracy on the training data (from table 3-7, .007 / .008 ≈ 87.5%); however, it is only applicable about ten percent of the time. In general, the first three methods (Sentence and Paragraph Distance, Coreferent Grammatical Function, and Verb Form Correlation) are applicable to every possible coreferent, but do not strongly correspond with the actual coreferent. The last four methods (Coreferent Action Capability, Coreferent Description Capability, Coreferent Noun Category, and Coreferent Action Recency) are applicable only to a small number of possible coreferents, but, when they are applicable, their output corresponds strongly to the correct coreferent.

3.4 CHAPTER SUMMARY

Under the assumption of independence among the outputs of the seven algorithms developed in chapter two, a simple equation for estimating the optimal combination of the outputs is realizable.
By applying the seven methods to a set of training data, information is gathered to optimize this equation based on the outputs of the methods. This optimization is derived from a simple probabilistic analysis. Using this training to optimize the combination algorithm should optimize the system for when it is running on actual data, assuming the training data is similar to the actual data.

CHAPTER 4
CONCLUSION

Missing argument referent determination and replacement can be a useful technique for certain telegraphic military domains, and may extend into other areas of language. A missing argument referent determination and replacement module has been integrated into the CCLINC translation system at MIT Lincoln Laboratory. With additional work, the technique can be improved and extended to other areas of interest. This chapter first presents an overview of the accomplishments achieved and the difficulties encountered in the research work presented in this thesis. Also discussed are areas for improvement upon the present state of the missing argument system, as well as a few possible applications of the work. A review of the goals of the thesis is presented, along with a brief thesis summary.

4.1 SUMMARY OF RESEARCH WORK

The purpose of the work presented in this thesis is to develop a missing argument referent determination and replacement module and integrate the module into the CCLINC system. The completion of this task constitutes an advance in the field of missing argument referent determination and replacement, and in automated natural language understanding as a whole.

These tasks have been accomplished primarily through the development and integration of seven separate methods into a single missing argument module. Some of the methods were developed individually; others are based on previous work done by other computational linguists. However, the achievement attained through the evolution of any one individual method is dwarfed by the accomplishment of integrating every method into the whole missing argument module. Nonetheless, each individual method is, in itself, a necessary and important component of the missing argument referent determination and replacement system. The Sentence and Paragraph Distance method weights possible coreferents based on their distance from the missing argument. The Coreferent Grammatical Function method weights possible coreferents based on their grammatical function. The Verb Form Correlation method provides a weight based on both the category into which the possible coreferent falls and the form of the verb following the missing argument. The Coreferent Action Capability method bases its output on whether or not a given possible coreferent is capable of performing the action taken by the missing argument. The Coreferent Description Capability method weights the possible coreferent based on whether or not it can be described by the adjectives that describe the missing argument. The Coreferent Noun Category method determines the weight assigned by evaluating whether or not the possible coreferent can fall into the same noun category as the missing argument. Finally, the Coreferent Action Recency method weights a coreferent based on whether or not it is performing an action that is similar to the action that the missing argument is performing. These seven methods provide the backbone of the system.
These methods are combined into the missing argument module based on the mathematical evaluation of the results of training the system on a set of data. Once the missing argument module is trained, it takes as input semantic frames generated by the CCLINC system. It then stores each entity represented in the semantic frame input in a separate data structure. When a missing argument in the semantic frame input is noticed by the module, each component of the missing argument module weights the entities that are considered possible coreferents. These weights are combined based on the results from the training of the system, and each possible coreferent of that missing argument is assigned an overall weight. The system then chooses the possible coreferent with the highest overall weight for a given missing argument as the appropriate coreferent of that missing argument, and inserts an appropriate form of this entity into the location of the missing argument. The module then proceeds to the next semantic frame input.

4.2 FUTURE WORK

As with many research systems, there is no conclusive stopping point to missing argument referent determination and replacement research. The system can be improved by increasing accuracy or adding functionality. The purpose of this section is to point out some of the strings that have been reluctantly left untied.

Clearly, as explained in chapter two, the implementation of the integration of the algorithms in the missing argument module of the CCLINC system allows for further algorithmic development. A new algorithm merely needs to be developed and added into the system in the same way as all of the other algorithms. The only requirement on the algorithm is that it assign a weight to every possible coreferent of a given missing argument.

The concept of dynamically training the system is briefly mentioned in chapter three. Although potentially complex, the simple mathematical model used in combining the various algorithms may provide a straightforward path to dynamically training the system. Dynamic training would certainly be advantageous, as one of the major pitfalls of the present missing argument system is that it requires training on a set of data with the correct coreferents of the missing arguments manually marked. Removing this requirement would be useful indeed.

Integrating a word sense disambiguation technique that would allow WordNet to reliably create the correct list of synonyms and hypernyms for the methods that use these relations would also improve the system. In doing this, the Coreferent Action Capability, Coreferent Description Capability, Coreferent Noun Category, and Coreferent Action Recency methods would all be improved.

The work presented in this thesis has primarily been directed toward resolving missing argument ambiguities in sentences where only the argument is missing; in general, the methods do not work well for stand-alone noun phrases. A unique possibility for future work would be to direct attention solely to these constructs.

The techniques used in the missing argument referent determination and replacement module are applicable in other fields of research as well as in translation. For example, applications where speech is converted to text by a computer may often encounter noise. Whether this is electrical, vocal, or static noise, portions of speech may be unintelligible. Techniques developed from this research could potentially recover unintelligible portions of the input.
A similar application exists for scanning in text from hard copies and identifying the actual letters and words. With the advent of computers, there is a driving force to provide electronic versions of all hard-copy work. One purpose of this is to allow useful texts to be accessible via the Internet. To store the hard copies efficiently, however, it is necessary to convert the scanned images into ASCII text. Even the best image-to-ASCII converters are only so accurate. A possible improvement upon their accuracy would be to incorporate a missing argument referent determination module into the system so that unintelligible or questionable conversions can be verified.

4.3 THESIS SUMMARY

In chapter one, the goals of the thesis were presented, along with introductions to automated natural language understanding, automated natural language translation, and missing argument referent determination and replacement. A description of the seven methods for missing argument referent determination and replacement was given in chapter two. Chapter three described the incorporation of these seven methods into the CCLINC system module. Finally, this chapter presented conclusions from the implementation, as well as a view into the future work of missing argument referent determination and replacement at MIT Lincoln Laboratory in general.

BIBLIOGRAPHY

1. Allen, James. Natural Language Understanding, Second Edition. Benjamin/Cummings Publishing Company Inc.: Redwood City, California, 1995.

2. Dahl, Deborah and Catherine Ball. Reference Resolution in PUNDIT. Unisys Corporation. Chapter 8 in Logic and Logic Grammars for Language Processing, Saint-Dizier, Patrick and Stan Szpakowicz, editors. Ellis Horwood Series in AI.

3. Fromkin, Victoria and Robert Rodman. An Introduction to Language, Sixth Edition. Harcourt Brace College Publishers: Fort Worth, Texas, 1998.

4. Hutchins, John, and Harold Somers. An Introduction to Machine Translation. Academic Press Limited: London, England, 1992.

5. Hwang, Jung-Taik. A Fragmentation Technique for Parsing Complex Sentences for Machine Translation. Massachusetts Institute of Technology Lincoln Laboratory: Lexington, Massachusetts, 1997.

6. Lee, Young-Suk, Clifford Weinstein, Stephanie Seneff, and Dinesh Tummala. Ambiguity Resolution for Machine Translation of Telegraphic Messages. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics: Madrid, Spain, 1997.

7. Lin, Dekang. Description of the PIE System Used for MUC-6. Department of Computer Science, University of Manitoba: Manitoba, Canada, 1996.

8. Miller, George, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert Thomas. Using a Semantic Concordance for Sense Identification. Cognitive Science Laboratory, Princeton University: Princeton, 1994.

9. Palmer, Martha, Deborah Dahl, Rebecca Schiffman, Lynette Hirschman, Marcia Linebarger, and John Dowding. Recovering Implicit Information. Unisys Corporation. Initial printing: Proceedings of ACL, 1986.

10. Park, Hyun, Dania Egedi, and Martha Palmer. Recovering Empty Arguments in Korean. Institute for Research in Cognitive Science, University of Pennsylvania: Philadelphia, Pennsylvania, 1995.

11. Seneff, Stephanie. TINA: A Natural Language System for Spoken Language Applications. Laboratory for Computer Science, Massachusetts Institute of Technology: Cambridge, Massachusetts, 1992.

12. Seneff, Stephanie, Dave Goddeau, Christine Pao, and Joe Polifroni. Multimodal Discourse Modeling in a Multi-User Multi-Domain Environment.
Laboratory for Computer Science, Massachusetts Institute of Technology: Cambridge, Massachusetts, 1996.

13. Tummala, Dinesh, Stephanie Seneff, Douglas Paul, Clifford Weinstein, and Dennis Yang. CCLINC: System Architecture and Concept Demonstration of Speech-to-Speech Translation for Limited-Domain Multilingual Applications. Proceedings of the 1995 ARPA Spoken Language Technology Workshop. Massachusetts Institute of Technology Lincoln Laboratory: Lexington, Massachusetts, 1995.

14. Walker, Marilyn, Masayo Iida, and Sharon Cote. Japanese Discourse and the Process of Centering. The Institute for Research in Cognitive Science, University of Pennsylvania: Philadelphia, Pennsylvania, 1992.

15. Weinstein, Clifford, Young-Suk Lee, Stephanie Seneff, Dinesh Tummala, Beth Carlson, John T. Lynch, Jung-Taik Hwang, and Linda Kukolich. Automated English/Korean Translation for Enhanced Coalition Communications. Lincoln Laboratory Journal: Volume 10, Number 1, Lexington, Massachusetts, 1997.