MULTI-LEVEL INPUT FUSION IN MULTI-MODAL SYSTEMS
A.A. Khan, A. Badii
Intelligent Media Systems and Services Research Laboratory (IMSS)
School of Systems Engineering
University of Reading, UK
{a.a.khan, atta.badii}@reading.ac.uk
ABSTRACT
Context-aware multimodal interactive systems aim to
adapt to the needs and behavioural patterns of users and
offer a way forward for enhancing the efficacy and
quality of experience (QoE) in human-computer
interaction. The various modalities that contribute to
such systems each provide a specific uni-modal response;
these responses are integrated and presented through a
multi-modal interface capable of interpreting multi-modal
user input and responding to it appropriately via
dynamically adapted multi-modal interactive flow
management. This paper presents an initial background
study undertaken in the first phase of a PhD research
programme on the optimisation of data fusion techniques
to serve multimodal interactive systems, their applications
and requirements.
1. INTRODUCTION
The widespread commercial availability of technology
has tended to shift the human-computer interaction
paradigm from traditional PC-based usage (keyboard and
mouse input) towards more natural and flexible ways for
users to interact with the system.
Such systems comprise a number of modalities through
which they may acquire user input in order to perform a
function requested by the user. Collecting input from
various modalities enables the development of intelligent,
context-aware systems and aids the shift from traditional
systems to a more natural approach to interaction and
usability [4]. For example, a mix of gestures and speech
can provide a more user-friendly way for users to interact
with the system.
2. MULTI-MODAL SYSTEMS
Multimodal systems provide increased accuracy and
precision (and in turn improved reliability) in
context-awareness and situation assessment by
incorporating information from a number of input
modalities. This approach offers a fault-tolerant way of
managing modalities: if one modality fails or delivers
noisy data, information from the other modalities can be
used to minimise the resulting ambiguity in situation
assessment. More reliable context-sensing, and thus more
appropriately responsive behaviour by the interactive
system, is said to be a likely outcome of multimodal
fusion [2].
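As a minimal illustration of this fault tolerance, the sketch below (our own, not taken from [2]) combines per-modality estimates of a context attribute using confidence weights, so that a failed or noisy modality contributes little to the fused result; all modality names and values are hypothetical.

```python
# Illustrative sketch (not from [2]): combine per-modality estimates of a
# context attribute, weighted by each recogniser's confidence, so that a
# failed or noisy modality contributes little to the fused assessment.
from typing import Dict, Optional

def fuse_context_estimates(estimates: Dict[str, Optional[str]],
                           confidences: Dict[str, float]) -> Optional[str]:
    """Return the context label with the highest summed confidence.

    `estimates` maps modality name -> label (None if the modality failed);
    `confidences` maps modality name -> confidence in [0, 1].
    """
    scores: Dict[str, float] = {}
    for modality, label in estimates.items():
        if label is None:                 # failed modality: ignore it
            continue
        scores[label] = scores.get(label, 0.0) + confidences.get(modality, 0.0)
    if not scores:
        return None                       # every modality failed
    return max(scores, key=scores.get)

# Example: the noisy gesture channel is out-voted by speech and gaze.
print(fuse_context_estimates(
    {"speech": "select_door", "gesture": "select_window", "gaze": "select_door"},
    {"speech": 0.8, "gesture": 0.3, "gaze": 0.6}))   # -> "select_door"
```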
Various application areas exist for such multimodal
systems, which can make use of readily available hardware
devices to acquire input not only at the neuro-motor level
but also at the physiological level, for instance monitoring
heart rate, perspiration and blood pressure in a patient
monitoring system. Further examples of multimodal
systems include anomaly detection, assisted living, and
PDAs with pen-gesture, speech and graphics input
capabilities.
2.1. Situation Assessment
According to Lambert [9, 20, 21, 22, 23], situation
assessment involves assessing situations, where situations
are “fragments of the world represented by a set of
assertions”. This differs from object recognition in that it
requires a shift from numeric to symbolic representations,
a problem termed the semantic challenge for information
fusion in [23].
Lambert [9] proposes an interesting approach to
semantic fusion using a formal theory by following a
development path that involves sequential construction
of the problem in terms of philosophy, mathematics and
computation. This approach is illustrated using a formal
theory for existence in Lambert’s work [9].
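By way of illustration only, the following sketch represents a situation in Lambert’s sense as a set of symbolic assertions; the predicates and entities (e.g. patient_7, ward_B) are hypothetical and serve merely to show the shift from numeric to symbolic reasoning.

```python
# Illustrative only: a situation in Lambert's sense, i.e. a fragment of the
# world represented by a set of assertions, sketched as predicate tuples.
from typing import NamedTuple, Set

class Assertion(NamedTuple):
    predicate: str
    arguments: tuple

# Hypothetical assertion set for a patient monitoring scenario.
situation: Set[Assertion] = {
    Assertion("located_at", ("patient_7", "ward_B")),
    Assertion("heart_rate", ("patient_7", 112)),
    Assertion("alarm_state", ("ward_B", "raised")),
}

# Symbolic queries over the assertion set replace purely numeric matching,
# which is the shift from object recognition to situation assessment.
tachycardic = any(a.predicate == "heart_rate" and a.arguments[1] > 100
                  for a in situation)
```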
3. INPUT FUSION
According to Corradini et al. [2], there are two main
approaches in the literature for the integration and fusion
of multimodal input, namely:
• integration of signals at the feature level;
• integration of information at the semantic level.
Feature-level signal fusion, also referred to as lower-level
fusion, relates to “closely coupled and synchronised” [2]
modalities such as speech and lip movements. It does not
scale easily, requires extensive training data sets and
incurs high computational costs [2].
Higher-level symbolic or semantic fusion, on the other
hand, relates to modalities that “differ in time scale
characteristics of their features” [2]. This entails
time-stamping of all modalities to aid the fusion process.
Semantic fusion offers a number of benefits such as
off-the-shelf usage, reusability and simplicity [2].
Semantic fusion is a process that unifies input at a
“meaning level” from various modalities that are part of
a multimodal system [3]. It is said to occur in two steps:
a) input events for a user’s command are taken from
various modalities and fused at a low level to form a
single multimodal input event that signifies the user’s
command; b) this multimodal input event is then
semantically interpreted at a higher level to deduce the
intended meaning of the user’s action by “extracting and
combining information chunks” [3].
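A minimal sketch of this two-step view, assuming time-stamped input events (our illustration rather than an interface defined in [3]), might look as follows.

```python
# A minimal sketch (our illustration, not an interface defined in [3]) of the
# two-step view of semantic fusion: (a) time-stamped mono-modal events are
# merged into one multimodal input event; (b) that event is interpreted at the
# meaning level by combining the information chunks it carries.
from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class InputEvent:
    modality: str           # e.g. "speech", "gesture"
    start: float            # time stamp (seconds)
    end: float
    chunks: Dict[str, str]  # partial semantic content, e.g. {"action": "move"}

@dataclass
class MultimodalEvent:
    events: List[InputEvent] = field(default_factory=list)

def fuse_low_level(events: List[InputEvent], window: float = 1.5) -> MultimodalEvent:
    """Step (a): group events whose time stamps fall within one command window."""
    anchor = min(e.start for e in events)
    return MultimodalEvent([e for e in events if e.start - anchor <= window])

def interpret(event: MultimodalEvent) -> Dict[str, str]:
    """Step (b): combine the information chunks into one intended meaning."""
    meaning: Dict[str, str] = {}
    for e in sorted(event.events, key=lambda e: e.start):
        meaning.update(e.chunks)   # later chunks refine earlier, vaguer ones
    return meaning

command = interpret(fuse_low_level([
    InputEvent("speech", 0.0, 0.9, {"action": "move", "object": "that"}),
    InputEvent("gesture", 0.4, 0.6, {"object": "chair_3"}),
]))
# -> {"action": "move", "object": "chair_3"}
```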
4. MULTI-MODAL FUSION
The main aim of multimodal semantic input fusion is
to make sense of the user’s intended meaning by fusing
the partial streams of input that constitute a user’s
command from various input modalities of the system.
Hurtig and Jokinen [4] cite the Put-that-there system
proposed by Bolt [13] as a “classical example of
coordinated multimodal input”. This system presented a
scenario in which the user specified input commands in a
natural manner by way of speech and hand gestures.
According to [4], multimodal input fusion can be
categorised as a process that occurs over three levels: i)
signal level, ii) feature level and iii) semantic level, where
semantic fusion integrates the understanding acquired
from the inputs of the various individual modalities into a
single comprehension set that signifies the intended
meaning of the user’s input.
Hurtig and Jokinen [4] propose a three-level semantic
fusion component that involves temporal fusion to create
combinations of input data, statistically motivated
weighting of these combinations, and a discourse-level
phase that selects the best candidate. The weighting
process uses three parameters, namely the overlap,
proximity and concept type of the multimodal input
events [4]. Overlap and proximity relate to the temporal
placement of the input events, with proximity said to play
a particularly important role in modalities such as speech
and tactile data [4]. Concept type refers to user commands
that would be incomplete if the system were to consider
only one input modality, e.g. a user pointing to a location
while speaking a phrase that only partially describes it [4].
Once the weighted ranking has been performed, the list of
candidates is passed on to the discourse-level phase,
which selects the best-ranked final candidate for
constructing the system response. If no candidate fits, the
system asks the user to repeat the command [4].
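The following sketch illustrates the spirit of this weighting and discourse-level selection; the linear scoring formula, weight values and threshold are our own placeholders and not the parameters used by Hurtig and Jokinen [4].

```python
# Illustrative scoring of candidate input combinations in the spirit of [4];
# the weighting formula and numeric values are placeholders, not those used
# by Hurtig and Jokinen.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    inputs: List[dict]      # the combined mono-modal input events
    overlap: float          # temporal overlap measure in [0, 1]
    proximity: float        # temporal closeness measure in [0, 1]
    concept_bonus: float    # reward when modalities complete each other's concepts

def score(c: Candidate) -> float:
    # Hypothetical linear weighting of the three parameters named in [4].
    return 0.4 * c.overlap + 0.4 * c.proximity + 0.2 * c.concept_bonus

def discourse_select(candidates: List[Candidate],
                     threshold: float = 0.5) -> Optional[Candidate]:
    """Discourse-level phase: pick the best-ranked candidate, or None so the
    dialogue manager can ask the user to repeat the command."""
    ranked = sorted(candidates, key=score, reverse=True)
    if ranked and score(ranked[0]) >= threshold:
        return ranked[0]
    return None
```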
Multimodal fusion encapsulates the combination of data
from a number of input channels in a multimodal
interactive system [12], where each channel may represent
a distinct modality through which a user may interact
actively or passively with the system. Active input can be
characterised as direct use of the system by the user, for
example through speech or gestures, whereas passive
input may be understood as input acquired from the user
as a result of a monitoring activity on the part of the
system.
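A simple sketch of the active/passive distinction for input channels could be expressed as follows; the channel names are hypothetical examples.

```python
# Sketch of the active/passive distinction for input channels; the channel
# names below are hypothetical examples only.
from dataclasses import dataclass
from typing import Literal

@dataclass
class InputChannel:
    name: str
    mode: Literal["active", "passive"]

channels = [
    InputChannel("speech", "active"),        # user addresses the system directly
    InputChannel("pen_gesture", "active"),
    InputChannel("heart_rate", "passive"),   # acquired by monitoring the user
    InputChannel("gaze_tracking", "passive"),
]
```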
According to Landragin [12], multimodal fusion is
divided into several sub-processes, namely: a) multimodal
coordination, b) content fusion and c) event fusion.
Multimodal coordination associates two activities
acquired through different modalities to form a “complete
utterance” [12]. The output of this sub-process is a set of
paired “hypotheses”, each with an associated confidence
level, which are taken up by the multimodal content
fusion sub-process to develop an improved comprehension
of otherwise partial information [12]. The last sub-process,
multimodal event fusion, then unifies the “pragmatic
forces of mono-modal acts” to create a complete
understanding of the user’s input.
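A rough pipeline sketch of these three sub-processes is given below; the data shapes, the fixed placeholder confidence and the function bodies are illustrative assumptions on our part, not taken from [12].

```python
# Rough pipeline sketch of Landragin's three sub-processes [12]; the data
# shapes and function bodies are illustrative placeholders only.
from typing import Dict, List, Tuple

Hypothesis = Tuple[dict, dict, float]   # (activity A, activity B, confidence)

def coordinate(activities: List[dict]) -> List[Hypothesis]:
    """Multimodal coordination: pair activities from different modalities that
    may form one complete utterance (confidence is a fixed placeholder)."""
    pairs: List[Hypothesis] = []
    for i, a in enumerate(activities):
        for b in activities[i + 1:]:
            if a["modality"] != b["modality"]:
                pairs.append((a, b, 0.5))
    return pairs

def fuse_content(hypotheses: List[Hypothesis]) -> List[Dict]:
    """Content fusion: merge each pair's partial information into a fuller one."""
    return [{**a.get("content", {}), **b.get("content", {}), "confidence": c}
            for a, b, c in hypotheses]

def fuse_events(contents: List[Dict]) -> Dict:
    """Event fusion: keep the highest-confidence interpretation as the unified
    understanding of the user's input."""
    return max(contents, key=lambda d: d["confidence"], default={})
```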
Landragin [12] uses the general communication categories
considered in the study of classical natural language, and
lists them as: 1) inform, 2) demand, 3) question and 4) act.
Inform, demand and question are fairly easy to understand:
the user may be providing information to the system,
requiring the system to do something, or querying the
system for some information. Act is the general category
that comes into play when the system is unable to label
the user’s input as any of the aforementioned three
categories [12].
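As a small illustration, the fallback behaviour of the act category could be sketched as follows; the keyword rules are purely illustrative and not proposed by Landragin [12].

```python
# Sketch of the four communication categories listed by Landragin [12], with
# "act" as the fallback label; the keyword rules are purely illustrative.
from enum import Enum

class CommunicationAct(Enum):
    INFORM = "inform"
    DEMAND = "demand"
    QUESTION = "question"
    ACT = "act"

def label_utterance(text: str) -> CommunicationAct:
    lowered = text.lower().strip()
    if lowered.endswith("?"):
        return CommunicationAct.QUESTION
    if lowered.startswith(("please", "do ", "move", "open")):
        return CommunicationAct.DEMAND
    if lowered.startswith(("the ", "it is", "there is")):
        return CommunicationAct.INFORM
    return CommunicationAct.ACT   # fallback when no other category fits
```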
5. RELEVANT WORK IN MULTI-MODAL SYSTEMS
Corradini et al. [2] list a number of approaches and
architectures for multimodal fusion in multimodal
systems, such as carrying out multimodal fusion in a
maximum likelihood estimation framework, using
distributed agent architectures (e.g. the Open Agent
Architecture, OAA [16]) with inter-agent communication
taking place through a blackboard, and identifying
individuals via “physiological and/or behavioural
characteristics”, e.g. biometric security systems using
fingerprints, iris, face, voice, hand shape etc. [2]. Corradini
et al. [2] state that modality fusion in such systems
involves less complicated processing, as these systems fall
largely under a “pattern recognition framework” [2], and
that the process may use techniques for integrating
“biometric traits” [2] such as the weighted sum rule as in
[19], Fisher discriminant analysis [19], decision trees [18],
a decision fusion scheme [17], etc. [2]
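A minimal sketch of the weighted sum rule for combining biometric match scores is given below; the trait names, weights and acceptance threshold are illustrative assumptions rather than values from the cited work.

```python
# Minimal sketch of the weighted sum rule for combining biometric match
# scores, in the spirit of [19]; trait names, weights and threshold are
# illustrative assumptions, not values from the cited work.
from typing import Dict

def weighted_sum_decision(scores: Dict[str, float],
                          weights: Dict[str, float],
                          threshold: float = 0.7) -> bool:
    """Accept the identity claim if the weighted sum of normalised match
    scores (each in [0, 1]) exceeds the threshold."""
    total_weight = sum(weights[t] for t in scores)
    fused = sum(weights[t] * s for t, s in scores.items()) / total_weight
    return fused >= threshold

accept = weighted_sum_decision(
    scores={"face": 0.82, "iris": 0.91, "voice": 0.55},
    weights={"face": 0.3, "iris": 0.5, "voice": 0.2})   # -> True
```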
Corradini et al. [2] also list a number of systems fusing
speech and lip movements using techniques such as
histograms [25], multivariate Gaussians [25], artificial
neural networks [24, 26] and hidden Markov models [25].
According to [2], some systems use independent
individual modality-processing modules, such as a speech
recognition module, a gesture recognition module, gaze
localisation etc. Each module carries out mono-modal
processing and presents its output to the multimodal
processing module, which handles the semantic fusion.
Such systems lend themselves to a component-based
framework in which showcases for different application
domains may be developed from re-usable off-the-shelf
components, each handling a single modality in full.
Other systems listed by Corradini et al. [2] include
QuickSet [12], which offers the user the freedom to
interact with a map-based application using a
pen-and-speech cross-modal input capability. The system
presented in [27] enables the user to specify a command
by way of speech, pointing gestures and input from a
graphical user interface, within a “pipelined architecture”
[2]. The system put forward by Wahlster et al. [28] is a
multimodal dialogue system that fuses speech, facial
expression and gesture for “both input and output via an
anthropomorphic and affective user interface”, i.e. an
Embodied Conversational Agent (ECA) or avatar [2].
In all such systems, input events are assigned a weight or
“confidence score”, which is considered during high-level
fusion to create a set of situation interpretations ranked by
their combined score [2].
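As an illustration of such combined-score ranking (the summation of confidence scores here is our own assumption, not a scheme prescribed in [2]):

```python
# Illustrative only: rank situation interpretations by the combined confidence
# scores of the input events that support them; summation is our assumption.
from typing import Dict, List, Tuple

def rank_interpretations(support: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """`support` maps each candidate interpretation to the confidence scores of
    the input events backing it; return interpretations ranked by the sum."""
    combined = {interp: sum(scores) for interp, scores in support.items()}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_interpretations({
    "user_requests_navigation": [0.8, 0.6],
    "user_requests_information": [0.7],
})   # -> [("user_requests_navigation", 1.4), ("user_requests_information", 0.7)]
```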
6. CONCLUSION
The study of the research literature, including technical
reports on various systems devised to offer multimodal
interactivity, indicates that further research is required
into the optimisation of multimodal human-behaviour
interpretation and responsive multi-modal interactivity
support. Hurtig and Jokinen [4] found that multimodal
system performance improves only for spatial domains,
and it would be interesting to see how such systems could
be enhanced and scaled up for other domains, e.g.
information-based domains. Both low-level and high-level
fusion techniques could be made more effective and
efficient than present technology allows. There is scope
for improvement in the way the various modalities are
merged, including room for new modalities to be
introduced to enhance semantic fusion and interpretation.
7. REFERENCES
[1] K. Kraiss, Advanced Man-Machine Interaction: Fundamentals and Implementation, Springer, 2006, pp. 168-172.
[2] A. Corradini, M. Mehta, N.O. Bernsen, J.-C. Martin, S. Abrilian, “Multimodal Input Fusion in Human-Computer Interaction: On the Example of the NICE Project”, in Data Fusion for Situation Monitoring, Incident Detection, Alert and Response Management, E. Shahbazian et al., Eds., IOS Press, 2005.
[3] D. Gibbon, I. Mertins, R. Moore, Handbook of Multimodal and Spoken Dialogue Systems: Resources, Terminology and Product Evaluation, Springer, 2000.
[4] T. Hurtig, K. Jokinen, “Modality Fusion in a Route Navigation System”, Workshop on Effective Multimodal Dialogue Interfaces, International Conference on Intelligent User Interfaces, Sydney, Australia, 2006.
[5] J.A. Jacko, A. Sears, The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, 2nd ed., CRC, 2002.
[6] J. Martin, S. Buisine, G. Pitel, N.O. Bernsen, “Fusion of children's speech and 2D gestures when conversing with 3D characters”, Signal Processing, vol. 86, issue 12 (special issue on multimodal human-computer interfaces), pp. 3596-3624, 2006.
[7] O. Vybornova, M. Gemo, B. Macq, “Multimodal Multi-Level Fusion using Contextual Information”, ERCIM News, no. 70, pp. 61-62, 2007.
[8] D. Racoceanu, C. Lacoste, R. Teodorescu, N. Vuillemenot, “A Semantic Fusion Approach Between Medical Images and Reports Using UMLS”, Lecture Notes in Computer Science, Springer, Berlin/Heidelberg, 2006.
[9] D.A. Lambert, “Formal Theories for Semantic Fusion”, 9th International Conference on Information Fusion, pp. 1-8, 2006.
[10] A. Oermann, T. Scheidat, C. Vielhauer, J. Dittmann, “Semantic Fusion for Biometric User Authentication as Multimodal Signal Processing”, Lecture Notes in Computer Science, Springer, Berlin/Heidelberg, 2006.
[11] T. Pham, N.E. Maillot, J. Lim, J. Chevallet, “Latent Semantic Fusion Model for Image Retrieval and Annotation”, Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, Lisbon, Portugal, pp. 439-444, 2007.
[12] F. Landragin, “Physical, semantic and pragmatic levels for multimodal fusion and fission”, Seventh International Workshop on Computational Semantics (IWCS-7), Tilburg, The Netherlands, 2007.
[13] R.A. Bolt, “Put-that-there: Voice and gesture at the graphics interface”, Computer Graphics, 14(3), pp. 262-270, 1980.
[14] M. Johnston, P.R. Cohen, D. McGee, S. Oviatt, J. Pittman, I. Smith, “Unification-based multimodal integration”, Proceedings of the 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, pp. 281-288, 1997.
[15] M. Johnston, “Unification-based multimodal parsing”, Proceedings of the 36th Conference of the Association for Computational Linguistics, Montreal, Canada, pp. 624-630, 1998.
[16] A. Cheyer, D. Martin, “The Open Agent Architecture”, Journal of Autonomous Agents and Multi-Agent Systems, vol. 4, no. 1, pp. 143-148, March 2001.
[17] A. Jain, L. Hong, Y. Kulkarni, “A Multimodal Biometric System Using Fingerprint, Face and Speech”, Proceedings of the 2nd International Conference on Audio- and Video-based Biometric Person Authentication, Washington D.C., pp. 182-187, 1999.
[18] A. Ross, A. Jain, “Information Fusion in Biometrics”, Pattern Recognition Letters, vol. 24, no. 13, pp. 2115-2121, 2003.
[19] Y. Wang, T. Tan, A.K. Jain, “Combining Face and Iris Biometrics for Identity Verification”, Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication, Guildford, UK, 2003.
[20] D.A. Lambert, “Assessing Situations”, Proceedings of Information, Decision and Control 1999, IEEE, pp. 503-508, 1999.
[21] D.A. Lambert, “Situations for Situation Awareness”, Proceedings of the Fourth International Conference on Information Fusion, Montreal, 2001.
[22] D.A. Lambert, “An Exegesis of Data Fusion”, in Soft Computing in Measurement and Information Acquisition, L. Reznik and V. Kreinovich, Eds., Studies in Fuzziness and Soft Computing, Physica-Verlag, 2003.
[23] D.A. Lambert, “Grand Challenges of Information Fusion”, Proceedings of the Sixth International Conference on Information Fusion, Cairns, Australia, pp. 213-219, 2003.
[24] U. Meier, R. Stiefelhagen, J. Yang, A. Waibel, “Towards Unrestricted Lip Reading”, International Journal of Pattern Recognition and Artificial Intelligence, vol. 14, no. 5, pp. 571-585, 2000.
[25] H.J. Nock, G. Iyengar, C. Neti, “Assessing Face and Speech Consistency for Monologue Detection in Video”, Proceedings of ACM Multimedia, Juan-les-Pins, France, 2002.
[26] G.J. Wolff, K.V. Prasad, D.G. Stork, M. Hennecke, “Lipreading by neural networks: visual processing, learning and sensory integration”, Proceedings of Neural Information Processing Systems (NIPS-6), J. Cowan, G. Tesauro, J. Alspector, Eds., pp. 1027-1034, 1994.
[27] C. Elting, M. Strube, G. Moehler, S. Rapp, J. Williams, “The Use of Multimodality within the EMBASSI System”, M&C 2002 Usability Engineering Multimodaler Interaktionsformen Workshop, Hamburg, Germany, 2002.
[28] W. Wahlster, N. Reithinger, A. Blocher, “SmartKom: Multimodal Communication with a Life-Like Character”, Proceedings of Eurospeech, Aalborg, Denmark, 2001.