Responsive Information Architect: Enabling Context-Sensitive Information Seeking

Michelle X. Zhou, Keith Houck, Shimei Pan, James Shaw, Vikram Aggarwal, Zhen Wen
IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532
{mzhou, khouck, shimei, shawjc, vikram, zhenwen}@watson.ibm.com

Abstract

Information seeking is an important but often difficult task, especially when it involves large and complex data sets. We hypothesize that a context-sensitive interaction paradigm can greatly assist users in their information seeking. Such a paradigm allows a system to both understand user data requests and present the requested information in context. Driven by this hypothesis, we have developed a suite of intelligent user interaction technologies and integrated them in a full-fledged, context-sensitive information system. In this paper, we review two sets of key technologies: context-sensitive multimodal input interpretation and automated multimedia output generation. We also share our evaluation results, which indicate that our approaches are capable of supporting context-sensitive information seeking for practical applications.

Figure 1: A recorded user-RIA conversation.
  U1 (text): Show colonials over $3M in cities along Hudson
  R1 (speech): I found two houses. (graphics: display (a))
  U2 (text): Tell me more about the cities
  R2 (speech): Here is the information on Briarcliff Manor and Irvington. (graphics: display (b))

Introduction

Information seeking, such as finding Christmas gifts or examining hotel options online, is a ubiquitous task. However, such a task can often be difficult and time consuming for two main reasons. First, users often cannot use today's tools to directly express their information needs. For example, using a conventional GUI, users may need to approximate U1 in Figure 1 by filling in multiple forms and then manually piecing together data gathered at different steps (e.g., finding the desired cities first, then using the cities to find houses). Moreover, most existing systems cannot directly respond to follow-on queries like U2. Without being able to easily revise their queries in context, users may need to start over. Second, users cannot easily digest the retrieved information from scattered, one-size-fits-all presentations. As a result, users themselves must integrate relevant information found in multiple places (e.g., separate displays of houses and cities) and manually extract the information they are interested in.

To address these issues, we envision context-sensitive information systems that allow users to express their requests and receive the requested information in context. Driven by this vision, we have built Responsive Information Architect (RIA), an intelligent information system that aids users in their information seeking in two ways. First, RIA lets users specify and revise their requests in context using mixed modalities, including natural language and GUIs. Second, RIA automatically creates a system response that is tailored to a user's interaction context, including the user's interests and interaction history. Currently, RIA is embodied in four applications, including a residential real-estate application.

In the rest of the paper, we present an overview of RIA technologies, namely context-sensitive input interpretation and automated output generation, and describe their use in support of realistic applications.
We also highlight how to exploit the synergy between the input and output technologies to enable intelligent and robust user interaction, which is otherwise very difficult to achieve.

RIA Overview

Figure 2 shows RIA's core components. Given a user request, a multimodal interpreter produces an interpretation result that captures both the intention and the attention of the input. In Figure 1, the interpretation of U1 is to seek (intention) a set of houses satisfying three constraints (attention): a style, a price, and a location constraint. Given an interpretation result, the conversation facilitator suggests a set of conversation acts on the fly. Assume that U1 (Figure 1) results in a large data set. The facilitator may formulate two acts: Apologize (too much data to present) and DescribeTopN (e.g., showing the first N houses). Depending on the context, the facilitator may Ask the user for additional constraints, such as her preferred house size, instead of showing any houses.

Since conversation acts often do not specify the exact content (e.g., house attributes) or the form of a response, they are passed to the presentation broker for refinement. The presentation broker handles content selection/organization, media allocation, and media coordination, similar to the functions of the content layer defined in [2]. Consequently, it produces the outline of a response, which defines the intended content and the media usage (e.g., using speech to express the number of retrieved houses and graphics to depict house locations). Based on this outline, the visual designer and speech designer work together to create a coordinated multimedia response [9, 13]. The final response is then sent to the media producer for rendering.

Figure 2: Overview of the RIA architecture. (Language and gesture inputs flow through the input interpreter, conversation facilitator, presentation broker, and visual and speech designers to the media producer, which renders speech and graphics output; all components draw on the application data, conversation context, user model, and environment model.)

RIA manages different types of information, including application data (e.g., houses in the real-estate domain), the conversation history (exchanges between RIA and a user), a user model (e.g., user preferences), and an environment model (e.g., media capabilities). As described above, RIA relies on two sets of technologies: context-sensitive input interpretation [4, 6, 10, 9] and automated output generation [12, 14, 13]. Next, we highlight these technologies in an end-to-end RIA system.
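As a concrete illustration of this dataflow, the sketch below models the interpreter-facilitator handoff in Python. Every class name, function name, and threshold here is our own illustrative assumption, not RIA's actual API.

```python
# Illustrative sketch of the RIA dataflow described above; all names and
# thresholds are our own assumptions, not RIA's actual implementation.
from dataclasses import dataclass, field

@dataclass
class Interpretation:
    intention: str                                  # e.g., "seek"
    attention: dict = field(default_factory=dict)   # e.g., {"style": "colonial"}

def interpret(request: str, history: list) -> Interpretation:
    """Multimodal interpreter: capture intention/attention, inheriting
    constraints from the previous turn for follow-on requests."""
    attention = dict(history[-1].attention) if history else {}
    if "colonial" in request.lower():               # stand-in for real parsing
        attention["style"] = "colonial"
    return Interpretation("seek", attention)

def facilitate(results: list, max_show: int = 10) -> list[str]:
    """Conversation facilitator: suggest conversation acts on the fly."""
    if not results:
        return ["Ask"]                              # ask for more constraints
    if len(results) > max_show:
        return ["Apologize", "DescribeTopN"]        # too much data to present
    return ["Describe"]

def respond(request: str, history: list, retrieve) -> list[str]:
    """One turn: interpret, retrieve, and choose acts; the presentation
    broker and visual/speech designers would then refine these acts."""
    interp = interpret(request, history)
    results = retrieve(interp.attention)
    history.append(interp)
    return facilitate(results)

# usage: respond("Show colonials over $3M", [], retrieve=my_search_fn)
```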
Context-Sensitive Input Interpretation

To enable robust and accurate interpretation of diverse user data requests in context, we employ three complementary strategies. First, we use a context-driven approach that optimizes RIA's interpretation by exploiting various contexts simultaneously, including data semantics, linguistic cues, and conversation history. Second, we provide system guidance in context, allowing users and RIA to adapt to each other's expressions and capabilities over time. Third, we leverage the strengths of multiple modalities to achieve robust interpretation.

Context-driven interpretation. RIA focuses on identifying the semantic constructs of an input (e.g., data concepts and constraints) and the relationships between those constructs (e.g., relationships between two data concepts). Since RIA allows users to incrementally revise their requests in context instead of specifying a complete query every time, it must derive the full meaning of incomplete (e.g., "the cheapest") or imprecise requests (e.g., "what about Pleasantville"). To do so, RIA exploits various contexts, including data semantics, linguistic cues, and conversation history [3, 6]. For example, to handle the follow-on query in Figure 3, RIA uses the conversation history to infer that the term "hawthorne" refers to a city rather than a train station, and uses the data semantics to attach the basement constraint to the house concept. As a result, RIA can handle a wide range of user expressions regardless of their syntactic form, from keywords (e.g., "colonials 3+ bedrooms") to complete English sentences, all in context. Such flexibility is critical in a practical application, where RIA must cope with diverse user input styles and tolerate imperfect input (e.g., abbreviated and ungrammatical expressions). In addition, our approach is easily portable, since it does not require a large training corpus or a large set of interpretation rules.

Figure 3: Partial semantic constructs of an input. (Previous request: "Show colonials in Armonk"; current request: "in Hawthorne with basement". The House concept carries the constraints style = 'colonial' and basement = 'yes'; the related City concept carries the constraint city = 'Hawthorne', with "Hawthorne" resolved as a city rather than a train station.)

Adaptive interpretation. Despite the efforts described above to achieve more accurate and robust interpretation, RIA's interpretation capability may still be insufficient for many real-world applications. We therefore built a two-way adaptation engine that allows users and RIA to dynamically adapt to each other's expressions over the course of an interaction [10]. For example, RIA initially failed to understand U3 in Figure 4 because of vague expressions like "big" and "good". It suggested five alternative queries to approximate the user's original request. Among the recommended queries, the user may select one from the list (such as U3') and then edit it to suit her needs. By accepting the revised query and observing such pairs of original and revised queries over time, RIA can use these associations to handle similar unsupported requests, such as "Find big ranches in good school districts". Consequently, the adaptation enhances the usability of RIA by training a novice user to work within RIA's capability. In addition, RIA improves its own interpretation capability through self-adaptation, minimizing the overall effort of developing an effective interaction system.

Figure 4: An example of adaptive interpretation.
  U3: Big houses in good school districts
  R3: I am sorry... There are certain words I don't understand. Here are the possible queries: ...
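To ground the idea, here is a toy sketch of how learned query revisions might be reapplied to later requests. The rewrite patterns are our own invented stand-ins for the pairs RIA would learn from user edits (e.g., U3 to U3'); they are not the actual mechanics of the adaptation engine in [10].

```python
# Toy sketch of reusing learned query revisions. Each pattern maps a vague
# expression (previously unparsable) to the concrete constraint a user
# substituted for it; the entries here are invented examples.
import re

learned_rewrites = [
    (re.compile(r"\bbig (\w+)"), r"\1 with 3000+ square feet"),
    (re.compile(r"\bgood school districts\b"), "school districts rated 8 or higher"),
]

def adapt(request: str) -> str:
    """Rewrite remembered vague phrases so the interpreter can handle them."""
    out = request.lower()
    for pattern, replacement in learned_rewrites:
        out = pattern.sub(replacement, out)
    return out

print(adapt("Find big ranches in good school districts"))
# -> find ranches with 3000+ square feet in school districts rated 8 or higher
```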
Leveraging GUIs and language inputs. To leverage the strengths of different input modalities, we have explored the use of GUIs to complement language inputs. In particular, users can use a GUI to explicitly control the flow of a conversation. By default, RIA interprets a user request in the context of previous requests. However, users may break from the previous conversation flow without explicitly signaling it in their language input. Consider the user input "show houses in Armonk" after U2 in Figure 1. It is unclear whether RIA should inherit the previous house constraints or simply start fresh. While RIA can detect some of these breaks, it also lets users press a GUI button to explicitly signal the start of a new flow. Moreover, users can press other GUI buttons to interrupt a RIA response (barging in), to start over (wiping out the entire conversation history), and to go back to the previous turn.

Automated, Customized Output Generation

RIA is designed to support a highly dynamic user conversation, where it is impractical to plan the content and form of all possible RIA responses in advance. To tailor RIA responses to a user's interaction context, we developed a suite of automated response generation technologies. Specifically, we devised an optimization-based framework to select response content [12] and allocate suitable media [14], and we combine machine learning with other approaches to dynamically synthesize verbal and visual responses [9, 13].

Optimization-based content and media selection. A user interaction context consists of a number of factors, including query expressions and conversation history. Subtle variations in these factors, such as changes in data volume or query patterns, often require different response content or presentation media. Figure 5 shows how variations in user preferences and data volume impact content selection; Figure 6 shows that data volume may also impact the use of different media. Since handling all such situations would require an exhaustive set of rules/plans, conventional rule/plan-based approaches are impractical in RIA. Instead, we developed an optimization-based framework for content selection and media allocation [12, 14]. In this framework, we uniformly model all factors as presentation desirability/cost constraints (e.g., a presentation cost constraint derived from device properties). We then use optimization-based algorithms to maximize the satisfaction of all constraints. For example, we use a graph-matching algorithm to allocate media by maximizing the satisfaction of all allocation constraints [14]. As a result, RIA optimizes content and media selection by dynamically balancing all relevant factors (e.g., Figures 5–6). Moreover, our approaches can easily be extended to cover new situations, since adding a new factor does not require modifying the underlying algorithms.

Figure 5: An example of context-sensitive presentation content selection.
  U4: Show ranches under $800K in Armonk. (user interests: financial and exterior)
  U5: Show ranches under $800K in Armonk. (user interests: size and amenity)
  U6: Show houses under $1M in Armonk. (data volume impacts content selection)

Figure 6: Data volume impacts media usage.
  (a) Multiple prices are conveyed visually and referred to verbally.
      U7 (speech): How much are those?
      RIA (speech): Their prices are shown below.
  (b) A single price is conveyed both verbally and visually.
      U8 (speech): How much is this? <points to a house on the screen>
      RIA (speech): The price of this house is $659,900.
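As a rough illustration of the optimization view, the sketch below casts media allocation as a weighted bipartite matching and solves it with an off-the-shelf assignment solver. The desirability scores and the one-item-per-medium framing are our simplifications for illustration, not the actual graph-matching formulation of [14].

```python
# Toy media allocation: model item-medium desirability as edge weights in a
# bipartite graph and pick the allocation maximizing total desirability.
import numpy as np
from scipy.optimize import linear_sum_assignment

items = ["house count", "house locations", "price list"]
media = ["speech", "map graphics", "table graphics"]

# desirability[i][j]: how well medium j expresses item i (invented scores;
# e.g., locations suit a map, a count is easy to say).
desirability = np.array([
    [0.9, 0.2, 0.4],   # house count
    [0.1, 0.9, 0.3],   # house locations
    [0.3, 0.2, 0.8],   # price list
])

rows, cols = linear_sum_assignment(desirability, maximize=True)
for i, j in zip(rows, cols):
    print(f"{items[i]} -> {media[j]} (desirability {desirability[i, j]:.1f})")
```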
Example-based output design. For reasons similar to those above, it is also impractical to use a rule/plan-based approach [1, 5, 8, 7] to create media-specific outputs in RIA. Instead, we employ case-based learning to create visual and verbal responses from a set of graphics examples and a set of English sentence examples, respectively. Our learning engines not only reuse suitable examples, but also compose new forms of output by dynamically combining different example fragments [9, 13]. As a result, RIA can cover a wide range of interaction situations using only a small number of examples. For example, it uses about 20 visual examples and 200 sentence examples for our real-estate application, which covers 25+ concepts, each with a number of attributes (e.g., a house has 40 attributes). The small example set helps to set up a system quickly; moreover, we can easily extend RIA's capability by adding new examples.

Nonetheless, a case-based learning engine alone cannot meet all of RIA's needs. For example, it is inefficient to use case-based learning to abstract sentence aggregation rules, since doing so would require a large number of examples [9]. Similarly, case-based learning is not suitable for learning precise visual arrangements (e.g., exact positions and sizes), since such parameters must be recomputed for each specific visual scene. Therefore, we use other approaches to fine-tune presentation details. Specifically, our speech designer uses a rule-based approach to handle sentence revision [9], and our visual designer uses various other means, including constraint-based and optimization-based methods, to fine-tune visual layout and manage visual context [11].

Context-sensitive response design using input features. A better understanding of a user input helps create a more tailored response. To tailor its responses to a specific user interaction flow, RIA leverages its fine-grained interpretation results, especially the meta features derived from a user request. Here we illustrate the use of one such feature. The feature followup is derived during RIA's input interpretation and signals whether a given user request is new or a continuation of a previous request. In Figure 1, U2 is a follow-on of U1, since it inherits certain data constraints specified in U1. To maintain the desired level of semantic continuity between follow-up requests, the visual designer uses this feature to compute the amount of visual content overlap between two successive visual responses. In general, RIA maximizes the overlap for follow-on requests, while reducing the overlap when a new flow starts [11].
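A minimal sketch of how such a flag might drive scene updates follows. The all-or-nothing retention policy is our simplification of the visual context management described in [11].

```python
# Minimal sketch: the interpreter's followup flag decides how much of the
# current scene to keep (a simplification of the policy in [11]).

def plan_visual_content(new_items: set[str], on_screen: set[str],
                        followup: bool) -> set[str]:
    """Select the next scene's content from the followup feature."""
    if followup:
        # Follow-on request: maximize overlap by keeping the current scene.
        return on_screen | new_items
    # New conversation flow: reduce overlap and show only the new items.
    return new_items

# U2 follows U1 (Figure 1): city details are added while the houses stay visible.
scene = plan_visual_content({"city:Irvington", "city:Briarcliff Manor"},
                            {"house:1", "house:2"}, followup=True)
print(sorted(scene))
```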
Evaluation and Conclusions

We have evaluated RIA in realistic usage scenarios. In a recent study, we tested two scenarios: we asked power users to perform open-ended information search and browsing tasks using search criteria they dynamically specified, and we asked novice users to complete a target search task satisfying 9 criteria. We compared RIA's performance with that of multiple online real-estate websites (e.g., realtor.com). Overall, RIA performed adequately in both tasks. Of the 181 search criteria that users dynamically specified, RIA satisfied 79% (143/181), compared to a 35% satisfaction rate achieved by any combination of the online real-estate sites. Moreover, RIA satisfied 97% of the top five categories of user-specified criteria, versus 61% achieved by the online websites. In addition, 16 out of 18 novice users completed their target search tasks using RIA, whereas none of the existing sites can directly satisfy all 9 constraints without manually stitching together information requested at different steps.

Based on our evaluation and user feedback, we conclude that a RIA-like context-sensitive interaction paradigm greatly aids users in their information seeking. As demonstrated in our study, RIA especially empowers power users, who can use it to accomplish open-ended, rather challenging tasks under real-world conditions that are extremely difficult to handle using existing tools. Our work also demonstrates the practicality of our approaches for real-world scenarios. As a result, we expect more RIA-like systems to be built for a wide range of applications, significantly improving today's information search and browsing paradigm.

References

[1] E. Andre and T. Rist. Generating coherent presentations employing textual and visual material. AI Review, 9:147–165, 1995.
[2] M. Bordegoni, G. Faconti, S. Feiner, M. Maybury, T. Rist, R. Ruggieri, P. Trahanias, and M. Wilson. A standard reference model for intelligent multimedia presentation systems. Computer Standards and Interfaces, 18(6-7):477–496, 1997.
[3] J. Chai, P. Hong, and M. Zhou. A probabilistic approach to reference resolution in multimodal user interfaces. In Proc. IUI 04, pp 70–77, 2004.
[4] J. Chai, S. Pan, M. Zhou, and K. Houck. Context-based multimodal input understanding in conversation systems. In Proc. ICMI 02, pp 87–92, 2002.
[5] S. Feiner and K. McKeown. Automating the generation of coordinated multimedia explanations. IEEE Computer, 24(10):33–41, 1991.
[6] K. Houck. Contextual revision in information-seeking conversation systems. In Proc. ICSLP 04, 2004.
[7] S. Kerpedjiev, G. Carenini, S. Roth, and J. Moore. Integrating planning and task-based design for multimedia presentation. In Proc. IUI 97, pp 145–152, 1997.
[8] M. Maybury. Planning multimedia explanations using communicative acts. In M. Maybury, editor, Intelligent Multimedia Interfaces, chapter 2, pages 60–74. AAAI, 1993.
[9] S. Pan and J. Shaw. SEGUE: A hybrid case-based natural language generator. In Proc. INLG 04, pp 130–140, 2004.
[10] S. Pan, S. Shen, M. Zhou, and K. Houck. Two-way adaptation for robust input interpretation in practical multimodal conversation systems. In Proc. IUI 05, pp 25–32, 2005.
[11] Z. Wen, M. Zhou, and V. Aggarwal. An optimization-based approach to dynamic visual context management. In Proc. InfoVis 05, pp 187–194, 2005.
[12] M. Zhou and V. Aggarwal. An optimization-based approach to dynamic data content selection in intelligent multimedia interfaces. In Proc. UIST 04, pp 227–236, 2004.
[13] M. Zhou and M. Chen. Automated generation of graphic sketches by examples. In Proc. IJCAI 03, pp 65–71, 2003.
[14] M. Zhou, Z. Wen, and V. Aggarwal. A graph matching approach to dynamic media allocation in intelligent multimedia interfaces. In Proc. IUI 05, pp 114–121, 2005.