From: AAAI-00 Proceedings. Copyright © 2000, AAAI (www.aaai.org). All rights reserved.

Interfacing Issues for Information Extraction

Peter Vanderheyden and Robin Cohen
Department of Computer Science, University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
{pbvander, rcohen}@uwaterloo.ca

Traditional approaches to information extraction implicitly assume that many elements of the task are static: for example, the user’s query and the descriptions of the domain and corpus. We believe that in many real situations, however, this assumption does not hold, and it is important to consider how the system could best support interaction with the user when the assumption breaks down.

Current goals in the information extraction community are for the system to produce accurate results while being easy to retrain and port to a new domain. We seek to extend current approaches to handle dynamic elements of the problem. “Evolving queries”, discussed in the information retrieval (IR) literature, need to be supported by information extraction (IE) systems; IR and IE are both, after all, tools for gathering information from documents in response to a user query.

When a casual user (an expert neither in the use of the system nor in the domain) engages in an information gathering task, there will be an initial phase of investigation and discovery. During this phase the user becomes familiar with the system, the domain, and the documents in the corpus, and the user’s query may change over time, or evolve. For example, a user may have a query about terrorist activities, asking for the names of perpetrators and the locations of targets; an interim system output prompts the user to refine the query, redefining terrorist activities as involving only a subset of weapons while generalizing to allow for additional (e.g., government) perpetrators.

The query is not the only element that may change over time; certainly the domain evolves as additional documents are processed.
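The terrorist-activity example above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the authors' system: the class names (`Query`, `EvolvingQuery`), the helper `in_scope`, and the specific weapon and perpetrator values are all hypothetical. It shows one way a query could be kept as a sequence of versions, so that a refinement can narrow one constraint while generalizing another.

```python
from dataclasses import dataclass, field, replace
from typing import FrozenSet, List


@dataclass(frozen=True)
class Query:
    """One version of the user's query (all names here are hypothetical)."""
    event_type: str
    slots: FrozenSet[str]              # items to extract from each document
    weapons: FrozenSet[str]            # weapons that count as in scope
    perpetrator_types: FrozenSet[str]  # kinds of perpetrator in scope


@dataclass
class EvolvingQuery:
    """Keeps every version of a query so earlier versions stay available."""
    history: List[Query] = field(default_factory=list)

    @property
    def current(self) -> Query:
        return self.history[-1]

    def refine(self, **changes) -> Query:
        """Record a new version that differs from the current one."""
        self.history.append(replace(self.current, **changes))
        return self.current


def in_scope(query: Query, weapon: str, perpetrator_type: str) -> bool:
    """Check whether a candidate event satisfies the query's constraints."""
    return (weapon in query.weapons
            and perpetrator_type in query.perpetrator_types)


# Initial query: names of perpetrators and locations of targets.
# The weapon and perpetrator values are invented for illustration.
initial = Query(
    event_type="terrorist activity",
    slots=frozenset({"perpetrator_name", "target_location"}),
    weapons=frozenset({"bomb", "firearm", "arson"}),
    perpetrator_types=frozenset({"non-government group"}),
)
query = EvolvingQuery(history=[initial])

# After seeing interim output, the user narrows the weapons and
# generalizes the perpetrators to include governments.
query.refine(
    weapons=frozenset({"bomb"}),
    perpetrator_types=frozenset({"non-government group", "government"}),
)

print(in_scope(initial, "bomb", "government"))         # False
print(in_scope(query.current, "bomb", "government"))   # True
print(in_scope(query.current, "arson", "government"))  # False
```

Keeping each version immutable and appending to a history is one possible design choice; it would let a system compare results obtained under earlier and later versions of the query, although the abstract does not say how such versions should be stored.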
As well, when the corpus is very large or dynamic (e.g., the Internet), the corpus itself may be seen as evolving: rules for mapping text patterns to query items that apply at one time or for one portion of the corpus no longer apply for another.

To provide more robust support for information extraction in a dynamic environment, we consider such issues as:

• appropriate “modalities” for an interface to large amounts of natural language text: the user and system need shared access to all information in order to interact about the current problem state. In existing systems, these include document text, that text annotated with system-specific markups (Bird & Liberman 1999), and possibly a syntax of markup rules; we add further modalities (e.g., heuristics in the user model; a tripartite model of the query, domain, and corpus).

• appropriate opportunities when the system or user should take initiative to interact with one another or to modify the current state: we have proposed a “mixed-initiative” approach (Vanderheyden & Cohen 1999), expanding the opportunities for interaction and shared control of modalities; the system continuously evaluates the need for, and cost-benefits of, its actions. Regularities in the document text not yet identified by the user, or interim results produced by the system (e.g., a partial answer to the user’s query recognized in a document), may prompt the system to initiate interaction with the user.

• features for knowledge representation: we allow the user multiple views of the data, and direct manipulation of the domain and query representations. Modifying or navigating through one modality will update the others.
• features for machine learning: while the rule of thumb has been that larger increments (batch learning) give better accuracy, smaller increments (online learning) allow better adaptivity; and active learning (Engelson & Dagan 1996) lets the system control its own training examples.

We intend to investigate how choices made on these and related issues affect performance. Specifically, our initial focus will be on how best to support evolving information models (query and domain) interactively.

References

Bird, S., and Liberman, M. 1999. Annotation graphs as a framework for multidimensional linguistic data analysis. In Towards Standards and Tools for Discourse Tagging, Proc. of the Workshop, 1–10. ACL.

Engelson, S. P., and Dagan, I. 1996. Sample selection in natural language learning. In Wermter, S.; Riloff, E.; and Scheler, G., eds., Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Lecture Notes in AI. Springer, Berlin. 230–245.

Vanderheyden, P., and Cohen, R. 1999. Designing a mixed-initiative information extraction system. In Working Notes of the AAAI’99 Workshop on Mixed-Initiative Intelligence, 142–146.