Audiovisual Attentive User Interfaces

Attending to the needs and actions
of the user
Paulina Modlitba
T-121.900 Seminar on User Interfaces and Usability
What is an
Attentive User Interface? (1/2)
• Negotiate the timing and volume of
communication with the user
• Use specific input, output and turn-taking
techniques to determine what task, device
or person a user is attending to
• Detect the user’s presence, orientation, speech
activity and gaze, and statistically model
attention and interaction
What is an
Attentive User Interface? (2/2)
• Four characteristic components
– visual attention
– turn-taking techniques
– modeling techniques for the attention
– focus and context displays and visualisation
• Dürsteler (2003)
Why are they needed?
• Roel Vertegaal (2003)
• Multiple ubiquitous computing devices lead to
growing demands on users’ attention
• Metaphor: modern traffic light system
– Sensors
– Statistical models of traffic volume
– Peripheral displays (traffic lights)
• Disruptive effect of interruptions can be avoided
Evolution of human-machine
1960s-1980s: many-one
1980s-1990s: one-one
1990s-2000s: one-many
2000s-2010s: many-many
Visual attention
• Eye-gaze tracking: detecting the user’s visual
focus of attention
• Eye trackers operate by directing infrared light
toward the user’s eye
• Provides information about the context
• Central I/O channel in communication
• Limitations in existing hardware/software
• Biological limitations
Reasons for implementing gaze
• Kaur et al. (2003)
• The gaze location is the only reliable predictor of
the locus of visual attention
• Gaze can be used as a “natural” mode of input
that avoids the need for learned hand-eye coordination
• Gaze selection of screen objects is expected to
be significantly faster than traditional hand-eye coordination
• Gaze allows for hands-free interaction
Current issues
• Limited size of fovea (1-3°)
• Subconscious eye movements
• Eyes are not control organs (Zhai et al.,
• No natural analogy to current input
devices, e.g. mouse
• Gaze is always active (Kaur et al., 2003)
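To get a feel for the fovea limit listed above, the visual angle can be converted to an on-screen size. This is a minimal sketch, assuming a viewing distance of 60 cm and a display density of 38 pixels per cm (roughly 96 dpi). Both values and the function name `foveal_span_px` are illustrative assumptions, not figures from the cited work.

```python
import math

def foveal_span_px(angle_deg, distance_cm=60.0, px_per_cm=38.0):
    """On-screen width in pixels covered by a visual angle.

    distance_cm and px_per_cm are assumed example values.
    """
    # Width subtended by an angle theta at distance d: 2 * d * tan(theta / 2)
    width_cm = 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)
    return width_cm * px_per_cm

for angle in (1.0, 3.0):
    print(f"{angle:.0f} deg covers about {foveal_span_px(angle):.0f} px")
```

Under these assumptions the fovea covers roughly 40 to 120 pixels, which suggests why gaze alone is too coarse to pick out small screen targets.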
Current state
• Eye-gaze control used as an additional
input channel
• Provides context to the action
• Combined with manual input, gaze tracking
can improve the robustness and reliability
of a system
EASE Chinese Input (1/2)
• Zhai et al. (2002)
• Supports pinyin typing
– pinyin is the official Chinese phonetic alphabet, based on
Roman characters
– Many Chinese characters are homophones: each
syllable corresponds to several Chinese characters
– When the user types the pinyin of a character, a
number of possible characters with the same
pronunciation are displayed
EASE Chinese Input (2/2)
• Normally, the user chooses a character by pressing a
number on the keyboard
• With EASE, the user only has to press the spacebar as soon
as he or she sees the desired character in the list
• The system selects the character closest to the user’s
current gaze location
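The selection step described above can be sketched as a nearest-neighbour lookup. This is only an illustration and not the EASE implementation. The function name `select_on_spacebar`, the candidate list and the pixel coordinates are hypothetical.

```python
import math

def select_on_spacebar(candidates, gaze_xy):
    """Return the candidate character whose on-screen centre is nearest the gaze point.

    candidates: list of (character, (x, y)) pairs, with the centre in pixels.
    """
    character, _centre = min(candidates, key=lambda c: math.dist(c[1], gaze_xy))
    return character

# Hypothetical candidate row for the pinyin syllable "ma"
candidates = [("妈", (100, 400)), ("马", (140, 400)), ("吗", (180, 400)), ("骂", (220, 400))]

# Called when the spacebar is pressed, with the latest gaze sample
print(select_on_spacebar(candidates, (150, 395)))
```

The keyboard still triggers the selection, so a stray glance alone never commits a character.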
Speech recognition (1/2)
• Limited technology, despite extensive
research and progress
• Crucial issues
– error rate of speech recognition engines and
how these errors can be reduced
– the effort required to port the speech
technology applications between different
application domains or languages (Deng &
Huang, 2004)
Speech recognition (2/2)
• Three directions for enhancing the
– improve the microphone ergonomics for
enhancing the signal-to-noise ratio
– equip speech recognizers with the ability
to learn and to correct errors
– add semantic (meaning) and pragmatic
(application context) knowledge (Deng &
Huang, 2004)
Multimodal interfaces
• Can provide more natural human-machine interaction
• Improves the robustness of the interaction
by using redundant or complementary
• Today: usually gaze/speech + manual
control (e.g. mouse)
• Future: gaze + speech, gaze, speech
Main issue
• Shumin Zhai (2003)
• “We need to design unobtrusive,
transparent and subtle turn-taking
processes that coordinate attentive input
with the user’s explicit input in order to
contribute to the user’s goal without the
burden of explicit dialogues.”
Manual and Gaze Input
Cascaded (MAGIC) Pointing
• Interaction technique that uses eye
movement to assist the control task
• Zhai et al. have constructed two MAGIC
pointing techniques, one liberal and one
conservative (Zhai et al., 1999)
Liberal approach (1/2)
• The cursor is warped to every new object that
the user looks at
• The user can then manually take control of the
cursor near (or on) the target, or ignore it and
search for the next target
• New target defined by distance (e.g. 120 pixels)
from the current cursor position
• Issues: pro-active (cursor waits readily);
overactive (gaze enough to move cursor)
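The liberal rule above can be sketched in a few lines. This is a simplified illustration, not the published implementation. It uses the 120-pixel example threshold from the slide, and the function name `liberal_cursor` is hypothetical.

```python
import math

NEW_TARGET_DISTANCE_PX = 120  # example threshold from the slide

def liberal_cursor(cursor_xy, gaze_xy, threshold=NEW_TARGET_DISTANCE_PX):
    """Return the cursor position after a gaze sample under the liberal rule.

    The cursor is warped whenever the gaze lands farther than `threshold`
    pixels from the current cursor position. Otherwise it stays put, so the
    user can finish the movement manually.
    """
    if math.dist(cursor_xy, gaze_xy) > threshold:
        return gaze_xy  # gaze alone is enough to move the cursor
    return cursor_xy

print(liberal_cursor((0, 0), (300, 0)))   # far glance: cursor warps
print(liberal_cursor((0, 0), (50, 50)))   # nearby glance: cursor stays
```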
Liberal approach (2/2)
Conservative approach (1/2)
• Warps the cursor to a target when the manual
input device has been actuated
• Once moved, the cursor appears in motion
towards the target
• Hence, the cursor never jumps directly to a
target that the user does not intend to obtain
• May be slower than the liberal approach
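A corresponding sketch of the conservative rule follows. It makes several assumptions: the 120-pixel threshold is borrowed from the liberal example, and the 40-pixel `offset` that places the cursor short of the gaze point (so that it appears to travel towards the target) is an illustrative value. The function name `conservative_cursor` is hypothetical.

```python
import math

def conservative_cursor(cursor_xy, gaze_xy, manual_moved, threshold=120, offset=40):
    """Return the cursor position under the conservative rule.

    The cursor is warped only when the manual input device has been actuated
    and the gaze is farther than `threshold` pixels away. threshold and
    offset are assumed example values.
    """
    distance = math.dist(cursor_xy, gaze_xy)
    if not manual_moved or distance <= threshold:
        return cursor_xy  # gaze alone never moves the cursor
    # Place the cursor `offset` pixels short of the gaze point, on the line
    # from the old cursor position, so it seems to approach the target.
    dx = (gaze_xy[0] - cursor_xy[0]) / distance
    dy = (gaze_xy[1] - cursor_xy[1]) / distance
    return (gaze_xy[0] - dx * offset, gaze_xy[1] - dy * offset)

print(conservative_cursor((0, 0), (300, 0), manual_moved=False))
print(conservative_cursor((0, 0), (300, 0), manual_moved=True))
```

Compared with the liberal rule, the extra `manual_moved` condition is what prevents unintended jumps, at the cost of one more user action.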
Conservative approach (2/2)
eyeCOOK
• Bradbury et al. (2003)
• Multimodal attentive cookbook that helps inexperienced
computer users cook a meal
• The user interacts with the eyeCOOK system using eye gaze
and speech commands
• The system responds visually and verbally
• The system substitutes the object of the user’s gaze for
the spoken word “this”
• If the eyeCOOK system cannot track the user’s gaze,
the user has to specify the target verbally
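One way to read the gaze-plus-speech bullets above is that the spoken word "this" stands for whatever the user is looking at. The sketch below illustrates that reading only. The function name `resolve_command`, the command strings and the fallback behaviour are hypothetical and not taken from eyeCOOK itself.

```python
def resolve_command(utterance, gaze_target=None):
    """Fill in the spoken word "this" with the object under the user's gaze.

    Returns the resolved command, or None when "this" was spoken but the gaze
    could not be tracked, in which case the user must name the target verbally.
    """
    words = utterance.split()
    if "this" not in words:
        return utterance  # target already named verbally
    if gaze_target is None:
        return None  # gaze lost: ask the user to say the target
    return " ".join(gaze_target if word == "this" else word for word in words)

print(resolve_command("define this", gaze_target="saute"))
print(resolve_command("define this"))
print(resolve_command("define saute"))
```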
EyeCOOK in
Page Display Mode
GAZE-2
• Vertegaal et al. (2003)
• A group video conferencing system that
uses gaze-controlled cameras to convey eye contact
• Consists of a video tunnel that makes it possible
to place cameras behind the participant images
on the screen
• The system automatically directs the video cameras
in this tunnel using a gaze tracker, selecting
the camera closest to the user’s current focus of
attention (gaze location)
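The camera selection step above amounts to a nearest-point lookup. The sketch below assumes three cameras placed behind three participant images. The screen coordinates and the function name `select_camera` are illustrative assumptions, not details of the GAZE-2 system.

```python
import math

def select_camera(camera_positions, gaze_xy):
    """Return the index of the camera closest to the current gaze location."""
    return min(
        range(len(camera_positions)),
        key=lambda i: math.dist(camera_positions[i], gaze_xy),
    )

# Hypothetical screen positions (pixels) of cameras behind three participant images
cameras = [(200, 300), (640, 300), (1080, 300)]

print(select_camera(cameras, (610, 320)))
```

Because the chosen camera sits behind the image the user is looking at, the transmitted video shows the user looking into the lens, which is how eye contact is conveyed.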
GAZE-2 system structure
3D rendering
• The 2D video images of the participants are
displayed in a 3D virtual meeting room and are
automatically rotated to face the participant each
user is looking at.
• In the picture below, everyone is looking at the left
person, whose image is broadcast in a higher
Turn-taking in video conferencing
• Misunderstandings cause interruptions
• Eye contact plays an important role in
turn-taking (Vertegaal et al., 2003)
Vertegaal et al. (2003)
Bradbury et al. (2003)
Zhai et al. (1999)
Dürsteler (2003)
Vertegaal (2003)
Kaur et al. (2003)
Zhai (2003)
Zhai et al. (2002)
Deng & Huang (2004)
Things missing
Are attentive user interfaces better at following the user in
order to "capture his/her context" and take proactive actions on his/her behalf,
or are they better used as input devices (the approach you take)?
The distinction between explicit and implicit input, as presented by
Horvitz (you can find a link on the seminar homepage), is thus important
here and could benefit your presentation.
Please include some real-world examples of prototypes and real situations
in your presentation. This makes the idea easier to grasp and the argument
more concrete. You might consider presenting other application ideas as
well as the ones already in the paper.
I think you would benefit from considering in more detail, for each
particular application, why attention and preferences are tracked and
how they might be combined effectively to minimize disruption and make
interaction more fluent. Binding the presentation more tightly to the
"let's make interruptions go away" theme of the seminar is important here.
Consequently, in the presentation, it would be nice to see your analysis of
"how things were" and "how things are" (now with AUIs).
Working memory
Long-term memory
Task resumptions
Social interaction