MULTI-LEVEL INPUT FUSION IN MULTI-MODAL SYSTEMS

A.A. Khan, A. Badii
Intelligent Media Systems and Services Research Laboratory (IMSS)
School of Systems Engineering
University of Reading, UK
{a.a.khan, atta.badii}@reading.ac.uk

ABSTRACT

Context-aware multimodal interactive systems aim to adapt to the needs and behavioural patterns of users, and offer a way forward for enhancing the efficacy and quality of experience (QoE) of human-computer interaction. The various modalities that contribute to such systems each provide a specific uni-modal response that is integratively presented as a multi-modal interface capable of interpreting multi-modal user input and responding to it appropriately through dynamically adapted multi-modal interactive flow management. This paper presents an initial background study in the context of the first phase of a PhD research programme on the optimisation of data fusion techniques to serve multimodal interactive systems, their applications and requirements.

1. INTRODUCTION

The widespread commercial availability of technology has tended to shift the human-computer interaction paradigm from traditional PC-based usage (keyboard and mouse input) to a more natural and flexible manner in which users may interact with a system. Such systems comprise a number of modalities by which they may acquire user input in order to perform a function requested by the user. Collecting input from various modalities enables the development of intelligent systems that are context-aware, and supports the shift from traditional systems to a more natural approach to interaction and usability [4]. For example, a mix of gestures and speech can provide a more user-friendly manner in which users may interact with the system.

2. MULTI-MODAL SYSTEMS

Multimodal systems provide increased accuracy and precision (and in turn improved reliability) in context-awareness and situation assessment by incorporating information from a number of input modalities. This approach offers a fault-tolerant way of managing modalities: if one of the modalities fails or contains noisy data, information from the other modalities can be used to minimise the resulting ambiguity in situation assessment. More reliable context-sensing, and thus more appropriately responsive behaviour by the interactive system, is said to be a likely outcome of multimodal fusion [2]. Various application areas exist for such multimodal systems, which can make use of feasible hardware devices to acquire input not only at a neuro-motor but also at a physiological level, for instance monitoring heart rate, perspiration, blood pressure etc. in a patient monitoring system. Further examples of multimodal systems include anomaly detection, assisted living, and PDAs with pen-gesture, speech and graphics input capability.

2.1. Situation Assessment

According to Lambert [9, 20, 21, 22, 23], situation assessment involves the assessment of situations, where situations are "fragments of the world represented by a set of assertions". This differs from object recognition in that it requires a shift from numeric to symbolic representations, a problem coined the semantic challenge for information fusion in [23]. Lambert [9] proposes an interesting approach to semantic fusion using a formal theory, following a development path that involves the sequential construction of the problem in terms of philosophy, mathematics and computation. This approach is illustrated using a formal theory for existence in Lambert's work [9].
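To make the shift from numeric to symbolic representations more concrete, the following minimal Python sketch derives a small set of assertions about a monitored patient from raw sensor readings, loosely in the spirit of the patient-monitoring example above. It is illustrative only: the modality names, thresholds and predicates are assumptions of ours, not taken from Lambert's formal theory [9].

# Illustrative sketch: map raw (numeric) sensor readings to symbolic assertions,
# mimicking the numeric-to-symbolic shift required by situation assessment.
# All names and thresholds are assumptions, not values from [9].
from dataclasses import dataclass

@dataclass
class Reading:
    modality: str      # e.g. "heart_rate", "blood_pressure"
    value: float
    confidence: float  # the sensor's own reliability estimate, 0..1

def assess_situation(readings):
    """Return a set of symbolic assertions ("fragments of the world")."""
    assertions = set()
    for r in readings:
        if r.confidence < 0.3:
            continue  # noisy or failed modality: rely on the remaining ones
        if r.modality == "heart_rate" and r.value > 120:
            assertions.add("patient(heart_rate, elevated)")
        elif r.modality == "blood_pressure" and r.value > 140:
            assertions.add("patient(blood_pressure, high)")
    # A higher-level assertion inferred from the lower-level ones.
    if {"patient(heart_rate, elevated)", "patient(blood_pressure, high)"} <= assertions:
        assertions.add("situation(patient, at_risk)")
    return assertions

if __name__ == "__main__":
    print(assess_situation([Reading("heart_rate", 135, 0.9),
                            Reading("blood_pressure", 150, 0.2)]))

Note that the noisy blood-pressure reading is simply skipped, so the remaining modality still yields a (partial) situation description, echoing the fault-tolerance argument of Section 2.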
3. INPUT FUSION

According to Corradini et al. [2], there are two main approaches in the literature for the integration and fusion of multimodal input: integration of signals at the feature level and integration of information at the semantic level. Feature-level signal fusion, also referred to as lower-level fusion, applies to "closely coupled and synchronised" [2] modalities such as speech and lip movements. It does not scale easily, requires extensive training data sets and incurs high computational costs [2]. Higher-level symbolic or semantic fusion, on the other hand, applies to modalities that "differ in time scale characteristics of their features" [2]. This entails time-stamping the input of all modalities to aid the fusion process. Semantic fusion offers a number of benefits, such as the use of off-the-shelf components, reusability and simplicity [2]. Semantic fusion unifies input from the various modalities of a multimodal system at a "meaning level" [3]. It is said to occur in two steps: a) the input events that make up a user's command are taken from the various modalities and fused at a low level to form a single multimodal input event that signifies the user's command; b) this multimodal input event is then semantically interpreted at a higher level to deduce the intended meaning of the user's action by "extracting and combining information chunks" [3].

4. MULTI-MODAL FUSION

The main aim of multimodal semantic input fusion is to make sense of the user's intended meaning by fusing the partial streams of input that constitute a user's command across the various input modalities of the system. Hurtig and Jokinen [4] cite the Put-That-There system proposed by Bolt [13] as a "classical example of coordinated multimodal input". In that system the user specified input commands in a natural manner by way of speech and hand gestures. According to [4], multimodal input fusion can be categorised as a process that occurs over three levels: i) signal level, ii) feature level and iii) semantic level, where semantic fusion integrates the understanding acquired from the inputs of the individual modalities into a single comprehension set that signifies the intended meaning of the user's input. Hurtig and Jokinen [4] propose a three-level semantic fusion component which involves temporal fusion that creates combinations of the input data, statistically motivated weighting of these combinations, and a discourse-level phase that selects the best candidate. The weighting process uses three parameters, namely the overlap, proximity and concept type of the multimodal input events [4]. Overlap and proximity relate to the temporal placement of the input events, where proximity is said to play a particularly important role in modalities such as speech and tactile data [4]. Concept type refers to user commands that would be incomplete if the system considered only one input modality, e.g. a user pointing to a location while speaking a phrase that only partially describes that location [4]. Once the weighted ranking has been performed, the list of candidates is passed on to the discourse-level phase, which selects the best-ranked candidate for system response construction. If no candidate fits, the system requests the user to repeat his/her command [4].
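The weighting and ranking stage described by Hurtig and Jokinen [4] can be pictured with the following Python sketch. The data layout, scoring function and weights are illustrative assumptions of ours, not the authors' actual implementation; the sketch only shows how overlap, proximity and concept type could be combined into a single score per candidate.

# Illustrative sketch of weighted ranking over combinations of input events.
# Weights and the scoring formula are assumptions, not the scheme of [4].
from dataclasses import dataclass
from typing import List

@dataclass
class InputEvent:
    modality: str   # e.g. "speech", "tactile"
    start: float    # timestamps in seconds
    end: float
    content: str

@dataclass
class Candidate:
    events: List[InputEvent]  # one temporal combination of input events

def overlap(a: InputEvent, b: InputEvent) -> float:
    """Length of the temporal overlap between two events."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def proximity(a: InputEvent, b: InputEvent) -> float:
    """Temporal gap between two events; 0 when they overlap."""
    return max(0.0, max(a.start, b.start) - min(a.end, b.end))

def concept_bonus(c: Candidate) -> float:
    """Reward combinations whose parts complete each other, e.g. deictic
    speech ("put that there") paired with a pointing/tactile event."""
    modalities = {e.modality for e in c.events}
    return 1.0 if {"speech", "tactile"} <= modalities else 0.0

def score(c: Candidate, w_overlap=1.0, w_prox=2.0, w_concept=1.5) -> float:
    # Assumes a pair of events per candidate, for brevity.
    a, b = c.events[0], c.events[1]
    return (w_overlap * overlap(a, b)
            - w_prox * proximity(a, b)
            + w_concept * concept_bonus(c))

def best_candidate(candidates: List[Candidate]) -> Candidate:
    """Discourse-level selection: return the highest-ranked combination."""
    return max(candidates, key=score)

In a fuller treatment, a score falling below some threshold would trigger the fallback described in [4], i.e. asking the user to repeat the command.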
Multimodal fusion encompasses the combination of data from a number of input channels in a multimodal interactive system [12], where each channel may represent a distinct modality through which a user may interact actively or passively with the system. Active input can be characterised as direct use of the system by the user, by way of speech, gestures etc., whereas passive input is acquired from the user as a result of a monitoring activity on the part of the system. According to Landragin [12], multimodal fusion is distinguished in terms of several sub-processes, namely: a) multimodal coordination, b) content fusion and c) event fusion. Multimodal coordination associates two activities acquired through different modalities to form a "complete utterance" [12]. The output of this sub-process is a set of paired "hypotheses", each with an associated confidence level, which are ingested by the multimodal content fusion sub-process to develop an improved comprehension of otherwise partial information [12]. The last sub-process, multimodal event fusion, then unifies the "pragmatic forces of monomodal acts" to create a complete understanding of the user's input. Landragin [12] adopts the general communication categories considered in the study of classical natural language and lists them as: 1) inform, 2) demand, 3) question and 4) act. Inform, demand and question are fairly easy to understand as scenarios in which the user provides information to the system, requires the system to do something, or queries the system to obtain some information. Act is the general category that comes into play when the system is unable to label the user's input as any of the three aforementioned categories [12].

5. RELEVANT WORK IN MULTI-MODAL SYSTEMS

Corradini et al. [2] list a number of approaches and architectures for multimodal fusion in multimodal systems, such as carrying out multimodal fusion within a maximum likelihood estimation framework; using distributed agent architectures (e.g. the Open Agent Architecture, OAA [16]) with inter-agent communication taking place through a blackboard; and identifying individuals via "physiological and/or behavioural characteristics", e.g. biometric security systems using fingerprints, iris, face, voice, hand shape etc. [2]. Corradini et al. [2] state that modality fusion in such systems involves less complicated processing, as they fall largely under a "pattern recognition framework" [2], and that this process may use techniques for integrating "biometric traits" [2] such as the weighted sum rule [19], Fisher discriminant analysis [19], decision trees [18] and decision fusion schemes [17]. Corradini et al. [2] also list a number of systems fusing speech and lip movements, using histograms [25], multivariate Gaussians [25], artificial neural networks [24, 26] and hidden Markov models [25]. According to [2], some systems use independent individual modality processing modules, such as a speech recognition module, a gesture recognition module, gaze localisation etc. Each module carries out mono-modal processing and presents its output to a multimodal processing module that handles the semantic fusion. Such systems lend themselves to a framework in which showcases for different application domains can be developed from re-usable off-the-shelf components, each handling a single modality in full.
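As a concrete illustration of the simplest of these integration techniques, the weighted sum rule, the following Python sketch combines the match scores produced by independent mono-modal recognisers (e.g. face, iris and voice) into a single decision. The weights and threshold are illustrative assumptions rather than values from [19].

# Illustrative weighted-sum fusion of mono-modal match scores, as used for
# combining biometric traits; the weights and threshold are assumptions.
def weighted_sum_fusion(scores: dict, weights: dict) -> float:
    """scores and weights map modality name -> value in [0, 1];
    the weights are normalised so the fused score also lies in [0, 1]."""
    total_weight = sum(weights[m] for m in scores)
    return sum(weights[m] * scores[m] for m in scores) / total_weight

def accept(scores: dict, weights: dict, threshold: float = 0.7) -> bool:
    """Decision rule: accept the identity claim if the fused score
    exceeds the threshold."""
    return weighted_sum_fusion(scores, weights) >= threshold

# Example: a weak face match is compensated by strong iris and voice matches.
print(accept({"face": 0.55, "iris": 0.90, "voice": 0.80},
             {"face": 0.3, "iris": 0.5, "voice": 0.2}))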
Some other systems listed by Corradini et al. [2] include QuickSet [12], which offers the user the freedom to interact with a map-based application using pen-and-speech cross-modal input. The system presented in [27] allows the user to specify a command by way of speech, pointing gestures and input from a graphical user interface, combined within a "pipelined architecture" [2]. The system put forward by Wahlster et al. [28] is a multimodal dialogue system that fuses speech, facial expression and gesture for "both input and output via an anthropomorphic and affective user interface", i.e. an Embodied Conversational Agent (ECA) or avatar [2]. In all such systems, input events are assigned a weight or "confidence score" that is considered during high-level fusion to create a set of situation interpretations ranked by their combined score [2].

6. CONCLUSION

The study of the research literature, including technical reports on various systems devised to offer multimodal interactivity, indicates that further research is required into the optimisation of multimodal human behaviour interpretation and responsive multi-modal interactivity support. Hurtig and Jokinen [4] found that the performance of multimodal systems improves only in spatial domains, and it would be interesting to see how such systems could be enhanced and scaled up for other domains, e.g. information-based domains. Both low-level and high-level fusion techniques could be made more effective and efficient than those offered by present technology. There is scope for improvement in the way the various modalities are merged, including room for new modalities to be introduced to enhance semantic fusion and interpretation.

7. REFERENCES

[1] K. Kraiss, "Advanced Man-Machine Interaction: Fundamentals and Implementation", Springer, 2006, pp. 168-172.
[2] A. Corradini, M. Mehta, N.O. Bernsen, J.-C. Martin, S. Abrilian, "Multimodal Input Fusion in Human-Computer Interaction: On the Example of the NICE Project", in Data Fusion for Situation Monitoring, Incident Detection, Alert and Response Management, E. Shahbazian et al. (Eds.), IOS Press, 2005.
[3] D. Gibbon, I. Mertins, R. Moore, "Handbook of Multimodal and Spoken Dialogue Systems: Resources, Terminology and Product Evaluation", Springer, 2000.
[4] T. Hurtig, K. Jokinen, "Modality Fusion in a Route Navigation System", Workshop on Effective Multimodal Dialogue Interfaces, International Conference on Intelligent User Interfaces, Sydney, Australia, 2006.
[5] J.A. Jacko, A. Sears, "The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications", 2nd ed., CRC, 2002.
[6] J.-C. Martin, S. Buisine, G. Pitel, N.O. Bernsen, "Fusion of Children's Speech and 2D Gestures when Conversing with 3D Characters", Signal Processing, vol. 86, no. 12 (special issue on multimodal human-computer interfaces), 2006, pp. 3596-3624.
[7] O. Vybornova, M. Gemo, B. Macq, "Multimodal Multi-Level Fusion using Contextual Information", ERCIM News, no. 70, pp. 61-62, 2007.
[8] D. Racoceanu, C. Lacoste, R. Teodorescu, N. Vuillemenot, "A Semantic Fusion Approach Between Medical Images and Reports Using UMLS", Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 2006.
[9] D.A. Lambert, "Formal Theories for Semantic Fusion", 9th International Conference on Information Fusion, pp. 1-8, 2006.
[10] A. Oermann, T. Scheidat, C. Vielhauer, J. Dittmann, "Semantic Fusion for Biometric User Authentication as Multimodal Signal Processing", Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 2006.
[11] T. Pham, N.E. Maillot, J. Lim, J. Chevallet, "Latent Semantic Fusion Model for Image Retrieval and Annotation", Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, Lisbon, Portugal, pp. 439-444, 2007.
[12] F. Landragin, "Physical, Semantic and Pragmatic Levels for Multimodal Fusion and Fission", Seventh International Workshop on Computational Semantics (IWCS-7), Tilburg, The Netherlands, 2007.
[13] R.A. Bolt, "Put-That-There: Voice and Gesture at the Graphics Interface", Computer Graphics, 14(3), pp. 262-270, 1980.
[14] M. Johnston, P.R. Cohen, D. McGee, S. Oviatt, J. Pittman, I. Smith, "Unification-Based Multimodal Integration", Proceedings of the 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, pp. 281-288, 1997.
[15] M. Johnston, "Unification-Based Multimodal Parsing", Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Canada, pp. 624-630, 1998.
[16] A. Cheyer, D. Martin, "The Open Agent Architecture", Journal of Autonomous Agents and Multi-Agent Systems, vol. 4, no. 1, pp. 143-148, March 2001.
[17] A. Jain, L. Hong, Y. Kulkarni, "A Multimodal Biometric System Using Fingerprint, Face and Speech", Proceedings of the 2nd International Conference on Audio- and Video-Based Biometric Person Authentication, Washington D.C., pp. 182-187, 1999.
[18] A. Ross, A. Jain, "Information Fusion in Biometrics", Pattern Recognition Letters, vol. 24, no. 13, pp. 2115-2121, 2003.
[19] Y. Wang, T. Tan, A.K. Jain, "Combining Face and Iris Biometrics for Identity Verification", Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication, Guildford, UK, 2003.
[20] D.A. Lambert, "Assessing Situations", Proceedings of Information, Decision and Control 1999, IEEE, pp. 503-508, 1999.
[21] D.A. Lambert, "Situations for Situation Awareness", Proceedings of the Fourth International Conference on Information Fusion, Montreal, 2001.
[22] D.A. Lambert, "An Exegesis of Data Fusion", in Soft Computing in Measurement and Information Acquisition, L. Reznik and V. Kreinovich (Eds.), Studies in Fuzziness and Soft Computing, Physica-Verlag, 2003.
[23] D.A. Lambert, "Grand Challenges of Information Fusion", Proceedings of the Sixth International Conference on Information Fusion, Cairns, Australia, pp. 213-219, 2003.
[24] U. Meier, R. Stiefelhagen, J. Yang, A. Waibel, "Towards Unrestricted Lip Reading", International Journal of Pattern Recognition and Artificial Intelligence, vol. 14, no. 5, pp. 571-585, 2000.
[25] H.J. Nock, G. Iyengar, C. Neti, "Assessing Face and Speech Consistency for Monologue Detection in Video", Proceedings of ACM Multimedia, Juan-les-Pins, France, 2002.
[26] G.J. Wolff, K.V. Prasad, D.G. Stork, M. Hennecke, "Lipreading by Neural Networks: Visual Processing, Learning and Sensory Integration", Proceedings of Neural Information Processing Systems (NIPS-6), J. Cowan, G. Tesauro, J. Alspector (Eds.), pp. 1027-1034, 1994.
[27] C. Elting, M. Strube, G. Moehler, S. Rapp, J. Williams, "The Use of Multimodality within the EMBASSI System", M&C 2002 - Usability Engineering Multimodaler Interaktionsformen Workshop, Hamburg, Germany, 2002.
[28] W. Wahlster, N. Reithinger, A. Blocher, "SmartKom: Multimodal Communication with a Life-Like Character", Proceedings of Eurospeech, Aalborg, Denmark, 2001.