Responsive Information Architect: Enabling Context-Sensitive
Information Seeking
Michelle X. Zhou Keith Houck Shimei Pan James Shaw Vikram Aggarwal Zhen Wen
IBM T. J. Watson Research Center
19 Skyline Drive, Hawthorne, NY 10532
{mzhou, khouck, shimei, shawjc, vikram, zhenwen}@watson.ibm.com
Abstract
Information seeking is an important but often difficult task
especially when involving large and complex data sets. We
hypothesize that a context-sensitive interaction paradigm
can greatly assist users in their information seeking. Such a
paradigm allows a system to both understand user data
requests and present the requested information in context.
Driven by this hypothesis, we have developed a suite of
intelligent user interaction technologies and integrated
them in a full-fledged, context-sensitive information system.
In this paper, we review two sets of key technologies:
context-sensitive multimodal input interpretation and automated
multimedia output generation. We also share our
evaluation results, which indicate that our approaches are
capable of supporting context-sensitive information seeking
for practical applications.

[Figure 1 transcript; displays (a) and (b) are not reproduced]
U1 Text: Show colonials over $3M in cities along Hudson
R1 Speech: I found two houses.
Graphics: Display (a)
U2 Text: Tell me more about the cities
R2 Speech: Here is the information of Briarcliff Manor and Irvington
Graphics: Display (b)

Introduction
Information seeking, such as finding Christmas gifts or
examining hotel options online, is a ubiquitous task. However, such a task can often be difficult and time-consuming
for two main reasons. First, users often cannot use today’s
tools to directly express their information needs. For example, using a conventional GUI, users may need to approximate U1 in Figure 1 by filling in multiple forms and then manually
piecing together data gathered at different steps (e.g., finding
the desired cities first, then using the cities to find houses).
Moreover, most existing systems cannot directly respond to
follow-on queries like U2. Without being able to easily revise
their queries in context, users may need to start over. Second,
users cannot easily digest the retrieved information from
scattered, one-size-fits-all presentations. As a result, users
themselves must integrate relevant information found at
multiple places (e.g., separate displays of houses and cities)
and manually extract information that they are interested in.
To address the issues mentioned above, we envision
context-sensitive information systems that allow users to
express their requests and receive the requested information
in context. Driven by this vision, we have built Responsive
Information Architect (RIA), an intelligent information system
that aids users in their information seeking in two
ways. First, RIA lets users specify and revise their requests
in context using mixed modalities, including natural language
and GUIs. Second, RIA automatically creates a system response
that is tailored to a user's interaction context,
including the user’s interests and interaction history. Currently,
RIA is embodied in four applications, including a residential
real-estate application. In the rest of the paper, we
present an overview of RIA technologies: context-sensitive
input interpretation and automated output generation, and
describe their use in support of realistic applications. We also
highlight how to exploit the synergy between the input and
output technologies to enable intelligent and robust user
interaction, which is otherwise very difficult to achieve.

Figure 1: A recorded user-RIA conversation (displays (a) and (b)).

Copyright 2006, American Association for Artificial Intelligence
(www.aaai.org). All rights reserved.

RIA Overview
Figure 2 shows RIA’s core components. Given a user
request, a multimodal interpreter produces an interpretation
result that captures both the intention and attention of the
input. In Figure 1, the interpretation of U1 is to seek (intention) a set of houses satisfying three constraints (attention),
including style, price, and location constraints.
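
As a rough illustration of what such an interpretation result might hold, the sketch below encodes U1 by hand as an intention plus a list of attention constraints. The class names, field names, attribute names, and operators are assumptions made for illustration only; the paper does not specify RIA's internal representation, and no parsing is attempted here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Constraint:
    """One attention item: a restriction on a data attribute (hypothetical schema)."""
    attribute: str
    operator: str
    value: object


@dataclass
class Interpretation:
    """Hypothetical container for an interpretation result."""
    intention: str                                  # what the user wants to do
    concept: str                                    # the data concept requested
    attention: list = field(default_factory=list)   # constraints on that concept


# U1: "Show colonials over $3M in cities along Hudson", encoded by hand.
u1 = Interpretation(
    intention="seek",
    concept="House",
    attention=[
        Constraint("style", "==", "colonial"),
        Constraint("price", ">", 3_000_000),
        Constraint("location", "in", "cities along Hudson"),
    ],
)

print(u1.intention, [c.attribute for c in u1.attention])
```

Keeping intention and attention separate means that a follow-on request such as U2 could, in principle, reuse the same constraints while changing only the requested concept.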
Given an interpretation result, the conversation facilitator suggests a set of conversation acts on the fly. Assume
that U1 (Figure 1) results in a large data set. The facilitator
may formulate two acts: Apologize (too much data to present)
and DescribeTopN (e.g., showing the first N houses). Depending
on the context, the facilitator may Ask the user for additional constraints, such as her preferred house size, instead
of showing any houses.

Figure 2: Overview of RIA architecture. [Diagram not reproduced. Labeled elements:
input (language, gesture, ...); interpreter; conversation facilitator; presentation
broker; visual designer; speech designer; media producer; output (speech, graphics, ...);
data; conversation context; user model; environment model.]

Figure 3: Partial semantic constructs of an input. [Diagram not reproduced. Previous
request: "Show colonials in Armonk"; current request: "in Hawthorne with basement".
Labeled constructs: House with constraint style = ‘colonial’; candidate readings of
Constraint 1, city == Hawthorne or trainStation == ‘Hawthorne’; Constraint 2,
basement == ‘yes’.]

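
The facilitator behavior described above can be sketched as a small decision function. This is only an illustrative policy, not RIA's actual logic: the thresholds `top_n` and `min_constraints`, the function name, and the `Describe` act are assumptions. Only `Apologize`, `DescribeTopN`, and `Ask` are named in the text.

```python
def suggest_acts(num_results: int, num_constraints: int,
                 top_n: int = 10, min_constraints: int = 2) -> list:
    """Return conversation acts for a retrieval result (illustrative policy only)."""
    if num_results <= top_n:
        # Small enough to present in full; "Describe" is a hypothetical act name.
        return ["Describe"]
    if num_constraints < min_constraints:
        # Under-specified request: ask for more constraints instead of showing houses.
        return ["Ask"]
    # Well-specified request that still returns too much data to present.
    return ["Apologize", "DescribeTopN"]


print(suggest_acts(num_results=250, num_constraints=3))
print(suggest_acts(num_results=250, num_constraints=1))
```

In this sketch the choice depends only on result size and the number of constraints; the context consulted by a real facilitator would be richer (for example, the conversation history and user model).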
Since conversation acts often do not specify the exact
content (e.g., house attributes) or the form of a response,
they are passed to the presentation broker for refinement.
The presentation broker handles content selection/organization, media allocation, and media coordination, similar to the
functions of the content layer defined in [2]. Consequently, it
produces the outline of a response that defines the intended
content and the media usage (e.g., using speech to express
the number of retrieved houses and using graphics to depict
house locations). Based on this outline, the visual designer
and speech designer work together to create a coordinated
multimedia response [9, 13]. The final response will then be
sent to the media producer for rendering.
RIA manages different types of information, including
application data (e.g., houses in the real-estate domain), conversation history (exchanges between RIA and a user), user
model (e.g., user preferences), and environment model (e.g.,
media capabilities).
As described above, RIA relies on two sets of technologies: context-sensitive input interpretation [4, 6, 10, 9] and
automated output generation [12, 14, 13]. Next, we highlight
these technologies in an end-to-end RIA system.
Context-Sensitive Input Interpretation
To enable robust and accurate interpretation of diverse user
data requests in context, we have employed three complementary strategies. First, we use a context-driven approach
to optimize RIA interpretation by exploiting various contexts simultaneously, including data semantics, linguistic
cues, and conversation history. Second, we provide system
guidance in context to allow users and RIA to adapt to each
other’s expressions and capabilities over time. Third, we
leverage the strengths of multiple modalities to achieve
robust interpretation.
Context-driven interpretation. RIA focuses on identifying
the semantic constructs of an input (e.g., data concepts and
constraints) and the relationships between the constructs
(e.g., relationships between two data concepts). Since RIA
allows users to incrementally revise their requests in context
instead of specifying a complete query every time, it must
derive the full meanings of incomplete (e.g., “the cheapest”)
or imprecise requests (“what about Pleasantville”). To do so,
RIA exploits various contexts, including data semantics, linguistic
cues, and conversation history [3, 6]. For example, to
handle the follow-on query in Figure 3, RIA uses the conversation
history to infer that the term “Hawthorne” refers to a
city instead of a train station, and uses the data semantics to
attach the basement constraint to the house concept.
As a result, RIA can handle a wide range of user expressions
regardless of their syntactic forms, ranging from keywords
(e.g., “colonials 3+ bedrooms”) to complete English
sentences, all in context. Such flexibility is critical in a practical
application, where RIA must cope with diverse user
input styles, and tolerate imperfect user input (e.g., abbreviated
and ungrammatical expressions). In addition, our
approach is easily portable, since it does not require a large
training corpus or a large set of interpretation rules.
Adaptive interpretation. Despite the efforts described above
to achieve more accurate and robust interpretation,
RIA’s interpretation capability may still be insufficient for
many real-world applications. Thus, we build a two-way
adaptation engine that allows users and RIA to dynamically
adapt to each other’s expressions over the course of interaction
[10]. For example, RIA initially failed to understand U3 in
Figure 4 due to vague expressions like big and good.
It suggested five alternative queries to approximate the
user’s original request. Among the recommended queries,
the user may select one from the list (such as U3’) and then
edit it to suit her needs. Accepting the revised query and
observing the pair of the original and the revised queries
over time, RIA can then use this association to handle a similar
unsupported request like “Find big ranches in good
school districts”. Consequently, the adaptation enhances the
usability of RIA by training a novice user to work within
RIA’s capability. In addition, RIA improves its own interpretation
capability through self-adaptation, minimizing the
overall effort of developing an effective interaction system.

Figure 4: An example of adaptive interpretation. [Display not reproduced.
U3: Big houses in good school districts. R3: I am sorry... There are certain
words I don’t understand. Here are the possible queries: (list of suggested
queries, including the selected U3’, not reproduced).]

Figure 5. An example of context-sensitive presentation content selection.
[Displays not reproduced. U4: Show ranches under $800K in Armonk (user interests:
financial and exterior). U5: Show ranches under $800k in Armonk (user interests:
size and amenity). U6: Show houses under $1M in Armonk (data volume impacts
content selection).]

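
To make the idea of reusing original/revised query pairs concrete, here is a minimal sketch. It is not the adaptation engine of [10]: matching by token-set (Jaccard) overlap, the 0.5 threshold, the class and method names, and the revised query string are all assumptions chosen for illustration.

```python
def tokens(text: str) -> set:
    """Lowercased whitespace tokens of a query."""
    return set(text.lower().split())


class AdaptationMemory:
    """Stores (original, revised) query pairs and reuses them for similar input."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold   # assumed similarity cutoff
        self.pairs = []              # list of (original query, revised query)

    def learn(self, original: str, revised: str) -> None:
        self.pairs.append((original, revised))

    def suggest(self, query: str):
        """Return the revision paired with the most similar stored query, or None."""
        best_revision, best_score = None, 0.0
        q = tokens(query)
        for original, revised in self.pairs:
            union = q | tokens(original)
            if not union:
                continue
            score = len(q & tokens(original)) / len(union)  # Jaccard overlap
            if score > best_score:
                best_revision, best_score = revised, score
        return best_revision if best_score >= self.threshold else None


memory = AdaptationMemory()
# The revised query below is hypothetical; the paper does not give the text of U3'.
memory.learn("Big houses in good school districts",
             "houses with 4+ bedrooms in Armonk school district")
print(memory.suggest("Find big ranches in good school districts"))
```

The returned revision is only a starting point that the user would still edit (here it mentions houses, not ranches), which mirrors the select-then-edit interaction described above.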
Leveraging GUIs and language inputs. To leverage the
strengths of different input modalities, we have explored the
usage of GUIs to complement language inputs. In particular,
users can use a GUI to explicitly control the flow of a conversation. By default, RIA interprets a user request in the
context of previous requests. However, users may break
from the previous conversation flow without explicitly signaling it in their language input. Consider a user input “show
houses in Armonk” after U2 in Figure 1. It is unclear whether
RIA should inherit the previous house constraints or simply
start fresh. While RIA is able to detect some of these breaks,
it also lets users use a GUI button to explicitly signal the start
of a new flow. Moreover, users can also use different GUI
buttons to interrupt a RIA response (barging in), to start over
(wiping out the entire conversation history), and to go back
to the previous turn.
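
The flow controls described above can be sketched with a small history object. This is an assumed simplification: constraints are modeled as a flat dictionary, the class and method names are hypothetical, and barging in on a response is not modeled.

```python
class Conversation:
    """Keeps the accumulated constraints of each turn (illustrative only)."""

    def __init__(self):
        self._turns = []  # one dict of active constraints per turn

    def current(self) -> dict:
        """Constraints active after the latest turn (empty if no turns)."""
        return dict(self._turns[-1]) if self._turns else {}

    def request(self, constraints: dict, new_flow: bool = False) -> dict:
        """Interpret a request in context, or from scratch if new_flow is set."""
        base = {} if new_flow else self.current()
        merged = {**base, **constraints}  # new constraints override inherited ones
        self._turns.append(merged)
        return dict(merged)

    def go_back(self) -> dict:
        """Drop the latest turn and return to the previous one."""
        if self._turns:
            self._turns.pop()
        return self.current()

    def start_over(self) -> None:
        """Wipe out the entire conversation history."""
        self._turns.clear()


conv = Conversation()
conv.request({"style": "colonial"})
print(conv.request({"city": "Armonk"}))                 # inherits the style constraint
conv.go_back()
print(conv.request({"city": "Armonk"}, new_flow=True))  # starts fresh
```

The `new_flow` flag plays the role of the GUI button: it removes the ambiguity about whether "show houses in Armonk" should inherit earlier house constraints.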
Automated, Customized Output Generation
RIA is designed to support a highly dynamic user conversation,
where it is impractical to plan in advance the content
and form of all possible RIA responses. To tailor RIA
responses to a user’s interaction context, we develop a suite
of automated response generation technologies. Specifically,
we devise an optimization-based framework to select
response content [12] and allocate suitable media [14]. We
combine machine learning with other approaches to dynamically
synthesize verbal and visual responses [9, 13].
Optimization-based content and media selection. A user
interaction context consists of a number of factors, including
query expressions and conversation history. Any subtle variations
of these factors, such as changes in data volume or
query patterns, often require different response content or
presentation media to be used. Figure 5 shows that the variations
in user preferences and data volume impact content
selection. Figure 6 shows that data volume may also impact
the use of different media. Since it may require an exhaustive
set of rules/plans to handle all situations, it is impractical to
use conventional rule/plan-based approaches in RIA.
Instead, we develop an optimization-based framework for
content selection and media allocation [12, 14]. In this
framework, we uniformly model all factors as presentation
desirability/cost constraints (e.g., a presentation cost constraint
derived from device properties). We then use optimization-based
algorithms to maximize the satisfaction of all
constraints. For example, we use a graph-matching algorithm
to allocate media by maximizing the satisfaction of all
allocation constraints [14]. As a result, RIA optimizes the
content and media selection by dynamically balancing all
relevant factors (e.g., Figures 5–6). Moreover, our
approaches can be easily extended to cover new situations,
since adding a new factor does not require modification of
the underlying algorithms.
Example-based output design. For reasons similar to those listed
above, it is also impractical to use a rule/plan-based approach
[1, 5, 8, 7] to create media-specific outputs in RIA. Instead,
we employ case-based learning to create both visual and verbal
responses from a set of graphics and English sentence
examples, respectively. Our learning engines not only can
reuse suitable examples, but they can also compose new
forms of outputs by dynamically combining different example
fragments [9, 13]. As a result, RIA can cover a wide
range of interaction situations using only a small number of
examples. For example, it uses about 20 visual examples and
200 sentence examples for our real estate application that
covers 25+ concepts, each with a number of attributes (e.g., a
house has 40 attributes). The usage of a small example set
helps to set up a system quickly. Moreover, we can easily
extend RIA capability by adding new examples.
Nonetheless, a case-based learning engine alone is inadequate
to meet all of RIA’s needs. For example, it is inefficient
to use case-based learning to abstract sentence
aggregation rules, since it would require a large number of
examples [9]. Similarly, case-based learning is not suitable
for learning precise visual arrangements (e.g., exact positions
and sizes). Such parameters must be recomputed for
specific visual scenes. Therefore, we use other approaches to
fine-tune presentation details. Specifically, our speech
designer uses a rule-based approach to handle sentence revision
[9]; and our visual designer uses various other means,
including constraint-based and optimization-based methods,
to fine-tune visual layout and manage visual context [11].

Figure 6. Data volume impacts media usage. [Displays not reproduced.]
(a) Multiple prices are conveyed visually and referred to verbally.
U7 Speech: How much are those?
RIA Speech: Their prices are shown below.
(b) A single price is conveyed both verbally and visually.
U8 Speech: How much is this? <Point to a house on the screen>
RIA Speech: The price of this house is $659900.

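
To illustrate the flavor of optimization-based content selection described above, the sketch below picks the set of attributes with the highest total desirability whose total presentation cost fits a budget. It is a toy stand-in, not the algorithm of [12]: the exhaustive search, the single cost budget, and all desirability and cost numbers are assumptions.

```python
from itertools import combinations


def select_content(candidates: dict, budget: float) -> set:
    """Choose attributes maximizing total desirability within a cost budget.

    candidates maps an attribute name to a (desirability, cost) pair.
    Exhaustive search is used, so this only suits small candidate sets.
    """
    best, best_score = (), 0.0
    names = sorted(candidates)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(candidates[n][1] for n in subset)
            if cost > budget:
                continue  # violates the presentation cost constraint
            score = sum(candidates[n][0] for n in subset)
            if score > best_score:
                best, best_score = subset, score
    return set(best)


# Hypothetical desirability/cost values for a few house attributes.
house_attributes = {
    "price": (0.9, 2),
    "style": (0.6, 1),
    "bedrooms": (0.5, 1),
    "photo": (0.8, 3),
}
print(select_content(house_attributes, budget=4))
print(select_content(house_attributes, budget=2))
```

Changing the desirability values (for instance, to reflect different user interests) or the budget (for instance, to reflect data volume or device limits) changes the selection without touching the search code, which is the extensibility property the framework is meant to provide.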
Context-sensitive response design using input features. A
better understanding of a user input helps to create a more
tailored response. To tailor its responses to a specific user
interaction flow, RIA leverages its fine-grained interpretation
results, especially the meta features derived from a user
request. Here we illustrate the use of one such feature.
Feature followup is derived during RIA input interpretation
to signal whether a given user request is new or a continuation
of a previous request. In Figure 1, U2 is a follow-on to
U1, since it inherits certain data constraints specified in U1.
To maintain the desired level of semantic continuity between
follow-up requests, the visual designer uses this feature to
compute the amount of visual content overlap between two
successive visual responses. In general, RIA maximizes the
overlap between follow-on requests, while reducing the
overlap when a new flow starts [11].

Evaluation and Conclusions
We have evaluated RIA in realistic usage scenarios. In a
recent study, we tested two scenarios. We asked power
users to perform open-ended information search and browsing
tasks using search criteria that they specified dynamically,
and asked novice users to complete a target search task
satisfying 9 criteria. We compared RIA performance with that of
multiple online real-estate websites (e.g., realtor.com).
Overall, RIA performed adequately in both tasks.
Among the 181 search criteria that users dynamically specified,
RIA satisfied 79% of them (143/181), compared with a 35%
satisfaction rate achieved by any combination of online
real-estate sites. Moreover, RIA satisfied 97% of the top 5 categories
of user-specified criteria versus 61% achieved by the online
websites. In addition, 16 out of 18 novice users completed
their target search tasks using RIA. In contrast, none of the
existing sites can directly satisfy all 9 constraints; users must
manually stitch together information gathered at different steps.
Based on our evaluation and user feedback, we conclude
that a RIA-like context-sensitive interaction paradigm
greatly aids users in their information seeking. As demonstrated
in our study, RIA especially empowers “power
users”, who can use RIA to accomplish open-ended, rather
challenging tasks in real-world conditions, which are
extremely difficult to achieve using existing tools. Our work
also demonstrates the practicality of our approaches for
real-world scenarios. As a result, we expect more RIA-like
systems to be built for a wide range of applications, which will
significantly improve today’s information search and browsing
paradigm.

References
[1] E. Andre and T. Rist. Generating coherent presentations
employing textual and visual material. AI Review, 9:147–165, 1995.
[2] M. Bordegoni, G. Faconti, S. Feiner, M. Maybury, T. Rist,
R. Ruggieri, P. Trahanias, and M. Wilson. A standard reference model for
intelligent multimedia presentation systems. Computer Standards and
Interfaces, 18(6-7):477–496, 1997.
[3] J. Chai, P. Hong, and M. Zhou. A probabilistic approach to reference
resolution in multimodal user interfaces. In Proc. IUI 04, pp 70–77, 2004.
[4] J. Chai, S. Pan, M. Zhou, and K. Houck. Context-based multimodal
input understanding in conversation systems. In Proc. ICMI 02, pp 87–92.
[5] S. Feiner and K. McKeown. Automating the generation of coordinated
multimedia. IEEE Computer, 24(10):33–41, 1994.
[6] K. Houck. Contextual revision in information-seeking conversation
systems. In Proc. ICSLP 04, 2004.
[7] S. Kerpedjiev, G. Carenini, S. Roth, and J. Moore. Integrating
planning and task-based design for multimedia presentation. In IUI
97, pp 145–152.
[8] M. Maybury. Planning multimedia explanations using communicative
acts. In M. Maybury, editor, Intelligent Multimedia Interfaces,
chapter 2, pp 60–74. AAAI, 1993.
[9] S. Pan and J. Shaw. SEGUE: A hybrid case-based natural language
generator. In Proc. INLG 04, pp 130–140, 2004.
[10] S. Pan, S. Shen, M. Zhou, and K. Houck. Two-way adaptation
for robust input interpretation in practical multimodal conversation
systems. In IUI 05, pp 25–32.
[11] Z. Wen, M. Zhou, and V. Aggarwal. An optimization-based
approach to dynamic visual context management. In Proc. InfoVis
05, pp 187–194.
[12] M. Zhou and V. Aggarwal. An optimization-based approach to
dynamic data content selection in intelligent multimedia interfaces.
In Proc. UIST 04, pp 227–236.
[13] M. Zhou and M. Chen. Automated generation of graphic
sketches by examples. In IJCAI 03, pp 65–71.
[14] M. Zhou, Z. Wen, and V. Aggarwal. A graph matching
approach to dynamic media allocation in intelligent multimedia
interfaces. In IUI 05, pp 114–121.