Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10)

Towards a Robust Deep Language Understanding System

Mehdi H. Manshadi
University of Rochester, Rochester, NY 14627
mehdih@cs.rochester.edu

Abstract

We propose a system that bridges the gap between the two major approaches to natural language processing: robust shallow text processing and domain-specific (often linguistically based) deep understanding. We propose to use an existing linguistically motivated deep understanding system as the core, and to leverage statistical techniques and external resources such as world knowledge to broaden coverage and increase robustness. We will also develop a semantic representation framework that supports underspecification, granularity, and incrementality, the critical factors of robustness in representing natural language semantics.

The proposal and the plan

The proposal has three main parts, to be carried out in three phases:

1) A framework for robust semantic representation: Very often, fully specifying the semantics of a sentence requires deeper levels of analysis, such as discourse and pragmatics, or vast external resources such as world knowledge. Fortunately, for many real-world applications an underspecified semantic representation (e.g., a scope-underspecified representation) suffices to accomplish the task. This, however, requires the semantic representation to support underspecification, to allow different levels of granularity, and to be built incrementally. Although there has been much work on scope underspecification, to the best of our knowledge no semantic framework in computational linguistics systematically combines all three of these properties.

2) Constructing an annotated corpus: The next step is to provide a corpus of sentences labeled with semantic representations within the framework designed in the first phase.
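To make the phase-1 notion of scope underspecification concrete, here is a minimal toy sketch, invented for this writeup and not the formalism developed in this work: a single underspecified description of a two-quantifier sentence, from which the fully scoped readings are enumerated only when an application actually needs them.

```python
# Toy illustration (not the paper's formalism): a scope-underspecified
# representation for "Every hungry dog chases a cat" that stays neutral
# between the two quantifier-scope readings until one is required.
from itertools import permutations

# One underspecified description covers both readings.
quantifiers = ["every(dog)", "a(cat)"]
body = "chases(dog, cat)"

def readings():
    """Enumerate fully specified readings from the underspecified form."""
    for order in permutations(quantifiers):
        # Nest the quantifiers around the body, outermost first.
        formula = body
        for q in reversed(order):
            formula = f"{q}[{formula}]"
        yield formula

for r in readings():
    print(r)
# Two readings: every > a (each dog chases a possibly different cat)
# and a > every (one specific cat is chased by every dog).
```

The point of the underspecified form is that a single compact object stands in for all readings; with n quantifiers the fully specified readings can multiply factorially, while the underspecified description stays linear in the sentence.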
This corpus will be small and will be used only for development and evaluation purposes.

3) Leveraging external resources: The lack of robustness in linguistically motivated systems such as TRIPS comes from the fact that most resources are manually built and hence limited. We propose to use external resources such as world knowledge and/or shallow processing techniques to overcome this weakness. Recently, some reasonably large, automatically extracted bodies of world knowledge have become available (e.g., Clark and Harrison, 2009; Schubert, 2002). The idea is to use this knowledge to improve robustness through techniques such as generalizing the existing disambiguation methods, which are based on hand-built selectional restrictions; creating new word senses justified by world knowledge; and developing knowledge-based methods for interpreting fragments when a full parse cannot be found. We also plan to use bootstrapping techniques, based on the TRIPS parser, to create additional bodies of world knowledge from sources such as WordNet glosses and other online sources.

Motivation

TRIPS (The Rochester Interactive Planning System) (Ferguson and Allen, 1998) is a linguistically motivated language understanding system that has been used in several real-world applications such as PLOW (Procedure Learning On the Web) (Allen et al., 2007). The system uses an augmented (unification-based) phrase structure grammar; the semantic representation is based on a flat version of the Alshawi-style Quasi Logical Form (Allen, 1995); and the ontology is motivated by FrameNet (Johnson et al., 2003) and VerbNet (Kipper et al., 2000). The core resources of the parser (lexicon, grammar, ontology, interpretation rules) have been manually built. Recent extensions to the parser (Allen et al., 2008), however, use external resources such as WordNet (Fellbaum, 1998) for cases where a word cannot be found in the TRIPS lexicon.
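As a toy illustration of the phase-3 idea above, the sketch below replaces hand-built selectional restrictions with mined world-knowledge counts that vote for a word sense. The tuple format, scores, and sense names are invented for illustration; they are not the actual representations used by TRIPS or by knowledge bases such as DART.

```python
# Hypothetical sketch of knowledge-based sense disambiguation: mined
# (verb, role) -> {concept class: frequency} counts stand in for
# hand-built selectional restrictions. All data here is invented.

WORLD_KNOWLEDGE = {
    ("serve", "object"): {"food": 5, "customer": 3, "ball": 1},
}

SENSES = {  # candidate senses of "dish", mapped to their ontology classes
    "dish%1": "food",       # a prepared meal
    "dish%2": "container",  # a plate or bowl
}

def disambiguate(verb, role, word_senses):
    """Pick the sense whose ontology class has the most knowledge support."""
    support = WORLD_KNOWLEDGE.get((verb, role), {})
    return max(word_senses, key=lambda s: support.get(word_senses[s], 0))

print(disambiguate("serve", "object", SENSES))  # prefers the "food" sense
```

Because the counts come from corpus-extracted knowledge rather than a hand-coded lexicon, coverage grows with the knowledge base instead of with manual engineering effort, which is exactly the robustness gain the proposal is after.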
Falling back on WordNet in this way especially helps to broaden the coverage of the ontology when the corresponding concept cannot be found in the TRIPS ontology. Another step taken to improve the quality of the core parser is to apply shallow text processing modules, such as statistical parsers and named-entity recognizers, to the text and use their outputs to guide the parser. For example, if a sequence of words is recognized as a named entity, this information helps the parser pick the right parse. In spite of these improvements, like almost all other deep understanding systems, the parser still suffers from a lack of robustness. We propose to take further steps to broaden the coverage and increase the robustness of TRIPS.

Accomplished work

Although there has been a lot of work on scope underspecification in semantic representation, previous scope-underspecification frameworks suffered from either limited expressivity or intractability. Therefore, as the first step toward developing an underspecified representation, we developed a framework that transcends the limited expressivity of the previous formalisms and yet remains tractable (Manshadi et al., 2008, 2009, 2010). We presented polynomial-time algorithms for satisfiability and enumeration in our formalism. Our framework also bridges the gap between previous scope-underspecification formalisms and brings them under a universal representation.

As the next step, we developed a general framework for semantic representation, called semantic graphs. In a semantic graph, nodes specify words or concepts, solid edges represent structural information, and dotted edges provide argument (semantic role) information. For example, figure (1) shows the semantic graph for the sentence "Every hungry dog chases a cat." Semantic graphs not only allow for underspecification of quantifier scoping but also support underspecification of many other complex (and often controversial) semantic phenomena, such as word senses and semantic roles. At the same time, the framework supports different levels of granularity in underspecification. For example, semantic roles form a hierarchical structure in which upper nodes represent coarse-grained roles and lower nodes represent fine-grained roles. Granularity is also achieved by collapsing and expanding the nodes of a graph. For example, if no syntactic tree is available for a sentence but chunking information is provided by an NP-chunking module, then the whole noun phrase can be represented as a composite node, which can later be expanded if further processing becomes available. Semantic graphs provide incrementality by supporting different depths of analysis, from part-of-speech tagging to discourse and pragmatics. In fact, every processing level incrementally adds new information to the graph by building new nodes, adding new edges, and updating labels (of nodes/edges) to more fine-grained ones.

Ongoing work

At the moment we are annotating a small part of the Penn Treebank (Marcus et al., 1993) with semantic graphs. Our approach is to design interpretation rules which convert Penn Treebank trees into highly underspecified semantic graphs, and then to incorporate all the existing annotations of the Penn Treebank, such as verb senses, semantic roles, and discourse information, to incrementally add more semantic information to the graphs. Human annotators will then revise these automatically generated semantic graphs in order to build gold-standard semantic representations for the selected portion of the Treebank.

Future work

The next step will be to incorporate external resources such as world knowledge into TRIPS, as mentioned above. We will use knowledge bases such as DART (Clark and Harrison, 2009) for this purpose. We will also explore using statistical processing techniques, such as semi-supervised learning methods applied in a bootstrapping fashion to cheaply available unlabeled data, to improve the quality of the system. All these systems will be evaluated on the gold-standard annotated data provided in phase 2.

References

Allen, J. (1995) Natural Language Understanding. Benjamin/Cummings Publishing Co., Inc.
Allen, J. F., Swift, M., and Beaumont, W. (2008) Deep Semantic Analysis for Text Processing. Semantics in Systems for Text Processing (STEP 2008).
Allen, J., Chambers, N., Ferguson, G., Galescu, L., Jung, H., Swift, M., and Taysom, W. (2007) PLOW: A Collaborative Task Learning Agent. AAAI-07.
Clark, P., and Harrison, P. (2009) Large-Scale Extraction and Use of Knowledge From Text. KCap-09.
Fellbaum, C. (1998) WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA.
Ferguson, G., and Allen, J. (1998) TRIPS: An Intelligent Integrated Problem-Solving Assistant. AAAI-98.
Johnson, C., Petruck, M., et al. (2003) FrameNet: Theory and Practice. Berkeley, California.
Kipper, K., Dang, H. T., and Palmer, M. (2000) Class-based Construction of a Verb Lexicon. AAAI-2000.
Manshadi, M., Allen, J., and Swift, M. (2008) Toward a Universal Underspecified Semantic Representation. Formal Grammar (FG 2008), Hamburg, Germany.
Manshadi, M., Allen, J., and Swift, M. (2009) An Efficient Enumeration Algorithm for Underspecified Semantic Representations. Formal Grammar, Hamburg, Germany.
Manshadi, M., Allen, J., and Swift, M. (2010) A Universal Framework for Underspecified Semantic Representation. Forthcoming.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993) Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2).
Schubert, L. K. (2002) Can We Derive General World Knowledge from Texts? Human Language Technology Conference (HLT 2002), San Diego, CA.

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.