Notes on Multimodal Interfaces, Sharon Oviatt and suggestions for SKEME

Multimodal systems process two or more combined user input modes – such as speech, pen, touch, manual gestures, gaze, and head and body movements – in a coordinated manner with multimedia system output.

The earliest multimodal interface, “Put That There”, allowed the user to say “create a blue square there”, and a blue square would be created at the location the user was pointing to.

This is similar to what we have in SKEME right now: we use the mouse cursor, and the object is created at the cursor position when the user speaks.
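As a rough illustration of that behavior, here is a minimal Python sketch (the names and callback are placeholders of my own, not SKEME's actual code) of binding a recognized phrase to the current cursor position:

# A minimal sketch (not SKEME's actual code) of binding a recognized spoken
# command to the current mouse-cursor position at the moment speech arrives.

from dataclasses import dataclass

@dataclass
class CursorState:
    x: float
    y: float

def on_speech_recognized(phrase: str, cursor: CursorState, canvas: list) -> None:
    """Create an object at the cursor location when a creation phrase is heard."""
    if phrase.startswith("create"):
        # e.g. "create a blue square there" -> place the object where the cursor is
        canvas.append({"command": phrase, "x": cursor.x, "y": cursor.y})

# Example: the speech callback fires while the cursor is at (120, 80).
canvas = []
on_speech_recognized("create a blue square there", CursorState(120, 80), canvas)
print(canvas)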

We can consider parsing words such as “here” and “there”, or “this”, so that if the user gets lazy and does not want to say a long expression, the handwriting recognition will “take over”. For example, y = (x + x²)/2 may be spoken as “fraction this [pause] over 2”, where “this” refers to the handwritten numerator.

One of the challenges of multimodal input is segmenting and interpreting continuous manual movements.

We may have a similar challenge with SKEME, because the voice recognition comes in “phrases”, and the handwriting recognition comes in “strokes”. Maybe there can be an unobtrusive method for the user to signal to the handwriting recognition that they are done writing a word (like a double tap, or lifting the pen off the tablet for a set amount of time). This will help parse each symbol correctly.
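A minimal sketch of the lift-off idea in Python, assuming a timeout value that we would tune from user testing (none of these names come from the Tablet PC SDK):

# A hypothetical sketch of the "lift off for a set amount of time" idea: if no
# new stroke arrives within WORD_TIMEOUT seconds of the last pen-up, treat the
# accumulated strokes as one finished word/symbol and hand them to recognition.

WORD_TIMEOUT = 0.8  # seconds; an assumed value, to be tuned with user testing

class StrokeSegmenter:
    def __init__(self):
        self.pending_strokes = []
        self.last_pen_up = None

    def on_stroke_end(self, stroke, timestamp):
        self.pending_strokes.append(stroke)
        self.last_pen_up = timestamp

    def poll(self, now):
        """Call periodically; returns a finished word's strokes, or None."""
        if self.last_pen_up is not None and now - self.last_pen_up >= WORD_TIMEOUT:
            word, self.pending_strokes = self.pending_strokes, []
            self.last_pen_up = None
            return word
        return None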

There is only a marginal improvement in efficiency (10%) when a user uses multimodal input, but there is a 19-41% improvement in accuracy, according to some studies (pp. 6-8).

In the design of a multimodal system, a low-fidelity mockup should be used to construct tentative design plans, followed by a high-fidelity simulation to tweak the system.

Even without working handwriting recognition right now (it is only supported on a Tablet PC), it is feasible to construct a “testbed” that records the strokes as well as the recognized phrases, and plots the data in a time graph. This “testbed” may be useful to determine how the average user will speak and write mathematical symbols.
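A sketch of what the testbed logger could record, assuming a simple CSV file; the class and field names here are my own, not an existing SKEME API:

# Log every stroke and every recognized phrase with a timestamp, so the two
# streams can later be plotted on a common time axis.

import csv, time

class TestbedLog:
    def __init__(self, path="session.csv"):
        self.path = path
        self.rows = []

    def log_stroke(self, stroke_id):
        self.rows.append((time.time(), "stroke", stroke_id))

    def log_phrase(self, text):
        self.rows.append((time.time(), "phrase", text))

    def save(self):
        with open(self.path, "w", newline="") as f:
            csv.writer(f).writerows([("timestamp", "kind", "data")] + self.rows)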

Humans issue multimodal commands 86 percent of the time when they have to add, move, modify, or calculate the distance between objects on the map.

Pen precedes speech 99 percent of the time, but the amount of time that it precedes speech is different for each person and application.

The “testbed” can be used to determine the average time elapsed between pen input and speech input. Since the Microsoft Speech SDK parses speech in phrases, this may not be as big of an issue as with a speech parser that parses character by character.
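Given a log in that format, the average pen-to-speech lag could be computed along these lines (assuming the (timestamp, kind, data) rows produced by the logger sketch above):

# For each recognized phrase, find the most recent preceding stroke and
# average the gaps between them.

def average_pen_speech_lag(rows):
    last_stroke_time = None
    lags = []
    for timestamp, kind, _ in sorted(rows):
        if kind == "stroke":
            last_stroke_time = timestamp
        elif kind == "phrase" and last_stroke_time is not None:
            lags.append(timestamp - last_stroke_time)
    return sum(lags) / len(lags) if lags else None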

Multimodal interfaces that process two or more recognition-based input streams require time-stamping.

Multimodal input can be constructed using a multi-agent architecture, where each component is constructed separately and the results and related time-stamp information are routed to the multimodal integrator for further processing.

This describes the architecture we have currently. Ideally, the application running the Microsoft Speech SDK will recognize the sound and send the result to SKEME along with its time stamp, the Tablet PC SDK will recognize the handwriting, and SKEME will combine these two inputs.
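A toy sketch of how such an integrator might pair the two timestamped streams; the time window and the pairing rule are placeholders of mine, not anything prescribed by the chapter or the SDKs:

# Each recognizer posts its result plus a timestamp to a shared integrator,
# which pairs a phrase with any handwriting that arrived within a time window.

PAIR_WINDOW = 4.0  # seconds; assumed, to be tuned from testbed data

class MultimodalIntegrator:
    def __init__(self):
        self.pending_ink = []   # (timestamp, recognized handwriting)

    def on_ink_result(self, timestamp, text):
        self.pending_ink.append((timestamp, text))

    def on_speech_result(self, timestamp, phrase):
        # Pair the phrase with ink results that fall inside the time window.
        matched = [t for ts, t in self.pending_ink if abs(timestamp - ts) <= PAIR_WINDOW]
        self.pending_ink = [(ts, t) for ts, t in self.pending_ink
                            if abs(timestamp - ts) > PAIR_WINDOW]
        return {"speech": phrase, "ink": matched, "time": timestamp}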

Various research groups have independently converged on a strategy of recursively matching and merging attribute/value data structures. These techniques include frame-based integration, unification-based integration, and hybrid symbolic/statistical integration.
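As a toy illustration of the recursive matching/merging idea (loosely in the spirit of unification; this is my own simplification, not the algorithm from the cited papers):

# Merge two partial interpretations, represented as attribute/value dicts,
# if their shared attributes do not conflict.

def unify(a, b):
    """Merge two attribute/value dicts; return None on conflict."""
    result = dict(a)
    for key, value in b.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            merged = unify(result[key], value)      # recurse into sub-structures
            if merged is None:
                return None
            result[key] = merged
        elif result[key] != value:
            return None                             # conflicting values: fail
    return result

# Example: speech supplies the command, pen supplies the location.
speech_frame = {"command": "create", "object": {"shape": "square", "color": "blue"}}
pen_frame = {"location": {"x": 120, "y": 80}, "object": {"shape": "square"}}
print(unify(speech_frame, pen_frame))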

This document does not describe the implementation of these types of integration in detail. However, I will explore the reference sources cited.

Documents I would like to read:

(“Fusion” techniques with multimodal input)

Cohen, P. R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L., & Clow, J. (1997). QuickSet: Multimodal interaction for distributed applications. Proceedings of the Fifth ACM International Multimedia Conference, 31-40. New York: ACM Press.

Vo, M. T., & Wood, C. (1996). Building an application framework for speech and pen input integration in multimodal learning interfaces. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (IEEE-ICASSP), Vol. 6, 3545-3548. IEEE Press.

Carpenter, R. (1990). Typed feature structures: Inheritance, (in)equality, and extensionality. Proceedings of the ITK Workshop: Inheritance in Natural Language Processing, 9-18. Tilburg: Institute for Language Technology and Artificial Intelligence, Tilburg University.

Carpenter, R. (1992). The logic of typed feature structures. Cambridge, U.K.: Cambridge University Press.

Johnston, M., Cohen, P. R., McGee, D., Oviatt, S. L., Pittman, J. A., & Smith, I. (1997). Unification-based multimodal integration. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 281-288. San Francisco: Morgan Kaufmann.

Wu, L., Oviatt, S., & Cohen, P. (1999). Multimodal integration – A statistical view. IEEE Transactions on Multimedia, 1(4), 334-341.

Other Documents:

Oviatt, S. L., Bernard, J., & Levow, G. (1999). Linguistic adaptation during error resolution with spoken and multimodal systems. Language and Speech, 41(3-4), 415-438 (special issue on "Prosody and Speech").

Oviatt, S. L., & van Gent, R. (1996). Error resolution during multimodal human-computer interaction. Proceedings of the International Conference on Spoken Language Processing, 2, 204-207. University of Delaware Press.
