Component Description
Multimodal Interface
Carnegie Mellon University
Prepared by: Michael Bett mbett@cs.cmu.edu
3/26/99
Description of the Multimodal Toolkit (MMI)
What MMI is ...
Integrated Speech, Handwriting, and Gesture Recognizers
Java-Based API
Integrated Recording Feature
Plug-n-Play Recognizer Interface: allows recognizers to be replaced
Internet-Enabled Interface: recognizers may run remotely over the internet (see the sketch below)
Simultaneous Multiple User Support
Supports Natural Interface Development
• MMI is a toolkit that allows multiple modalities to be easily integrated into applications.
• Applications can mix modalities (speech, gesture, and handwriting).
[Architecture diagram: a Multimodal Applet sends speech, handwriting, and gesture input to the Multimodal Server, which routes each stream to the Janus speech recognizer, the handwriting recognizer, and the gesture recognizer.]
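To make the plug-n-play, internet-enabled recognizer interface concrete, here is a minimal sketch of a recognizer process registering with the Multimodal Server over a TCP socket. The host name, port, and line-oriented handshake are assumptions for illustration only; this document does not specify the actual MMI wire protocol.

    // Hypothetical sketch: a recognizer registers with the Multimodal Server.
    // Host, port, and message format are illustrative assumptions.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;

    public class RecognizerClient {
        public static void main(String[] args) throws Exception {
            // The server may run anywhere on the internet (address assumed).
            try (Socket socket = new Socket("mmi-server.example.edu", 7000)) {
                PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream()));

                // Announce the modality this recognizer handles so the server
                // can plug it into the pipeline in place of another recognizer.
                out.println("REGISTER modality=handwriting");
                System.out.println("Server replied: " + in.readLine());
            }
        }
    }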
Sample Application Which Uses Multimodal: Error Repair
[Diagram: the error-repair application; the speech recognizer draws on an acoustic model, a vocabulary, and a language model.]
The Java-based API communicates directly with each recognizer.
The multimodal applet is the user interface; the applet window presents a view onto a domain-dependent representation of application data and state, in the form of objects to be manipulated.
The following modalities have the following levels of support in the multimodal toolkit, for two types of task (data entry and command):

Speech
Handwriting
Pen gestures
3-D gestures (experimental)
Lip-reading (experimental)
Gaze tracking (experimental)
Keyboard
Mouse
Facial expressions (experimental)

Table 1. Supported Applications. Support levels range from strongly supported, through supported, to not precluded; entries marked experimental are still under development.
The user defines their grammar using six probabilistically weighted node types (a construction sketch follows this list):
A Toplevel represents an entire input model and contains one or more sequences, each of which contains exactly one AFrame;
An AFrame represents an action frame and contains one or more sequences, each of which consists of one or more PSlots;
A PSlot represents a parameter slot and contains one or more UnimodalNodes (at most one for each input modality);
A UnimodalNode specifies a sub-grammar for a single input modality and has the same structure as a NonTerm, with the addition of a label specifying the modality;
A NonTerm is a non-terminal node consisting of one or more sequences, each of which contains zero or more NonTerms or Literals;
A Literal is a terminal node containing a text string representing one or more input tokens.
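The sketch below shows how these six node types might be assembled into a grammar for the distance query used later in this document ("how far is it from here to there"). The class definitions are simplified stand-ins written for illustration; they are not the actual MMI API.

    import java.util.Arrays;
    import java.util.List;

    // Simplified stand-ins for the six node types (not the real MMI classes).
    class Node {
        final String name;
        final double weight;          // probabilistic weight on this node
        final List<Node> children;
        Node(String name, double weight, Node... children) {
            this.name = name;
            this.weight = weight;
            this.children = Arrays.asList(children);
        }
    }
    class Literal extends Node {      // terminal: one or more input tokens
        Literal(String tokens) { super("Literal:" + tokens, 1.0); }
    }
    class NonTerm extends Node {      // sequences of NonTerms or Literals
        NonTerm(double w, Node... seq) { super("NonTerm", w, seq); }
    }
    class UnimodalNode extends Node { // sub-grammar for a single modality
        UnimodalNode(String modality, double w, Node... seq) {
            super("Unimodal:" + modality, w, seq);
        }
    }
    class PSlot extends Node {        // at most one UnimodalNode per modality
        PSlot(String slot, double w, Node... perModality) {
            super("PSlot:" + slot, w, perModality);
        }
    }
    class AFrame extends Node {       // action frame of PSlots (sequences elided)
        AFrame(String frame, double w, Node... slots) {
            super("AFrame:" + frame, w, slots);
        }
    }
    class Toplevel extends Node {     // entire input model
        Toplevel(Node... frames) { super("Toplevel", 1.0, frames); }
    }

    public class GrammarExample {
        public static void main(String[] args) {
            // "how far is it from <Src> to <Dst>": each location slot can be
            // filled by speech ("here"/"there") or by a pen gesture.
            PSlot src = new PSlot("Src", 0.5,
                    new UnimodalNode("speech", 0.5, new Literal("here")),
                    new UnimodalNode("pen", 0.5, new Literal("arrow_start")));
            PSlot dst = new PSlot("Dst", 0.5,
                    new UnimodalNode("speech", 0.5, new Literal("there")),
                    new UnimodalNode("pen", 0.5, new Literal("arrow_end")));
            PSlot cmd = new PSlot("Cmd", 1.0,
                    new UnimodalNode("speech", 1.0,
                            new Literal("how far is it from")));
            Toplevel grammar = new Toplevel(
                    new AFrame("QueryDistance", 1.0, cmd, src, dst));
            System.out.println("Built grammar rooted at: " + grammar.name);
        }
    }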
The Multimodal Server sends a series of points to the pen and gesture recognizers. The audio is sent to the speech recognizer. The pen, gesture, and speech recognizers return their hypotheses to the multimodal toolkit, which is responsible for integrating the results in an optimizing search, as shown below [Minh Tue Vo, PhD dissertation, CMU, 1998].
[Figure: Output Path Over Multidimensional Inputs. The SPEECH hypothesis "how far is it from here to there" and the PEN hypothesis (arrow_start, arrow_end) are aligned; the optimal path through the joint input space yields the interpretation Query Distance with source (Src) and destination (Dst) slots.]
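As a toy illustration of the integration step, the sketch below binds the deictic words in the speech hypothesis ("here", "there") to the pen events closest in time, producing the joint hypothesis shown in the figure above. The real toolkit performs an optimizing search over the weighted multimodal grammar [Vo 1998]; this simple timestamp-based binding is only a stand-in.

    import java.util.ArrayDeque;
    import java.util.Queue;

    public class IntegrationSketch {
        // A recognized token with its timestamp in seconds (illustrative).
        static final class Token {
            final String text;
            final double time;
            Token(String text, double time) { this.text = text; this.time = time; }
        }

        public static void main(String[] args) {
            Token[] speech = {
                new Token("how", 0.0), new Token("far", 0.2), new Token("is", 0.4),
                new Token("it", 0.5), new Token("from", 0.7), new Token("here", 1.0),
                new Token("to", 1.3), new Token("there", 1.6)
            };
            Queue<Token> pen = new ArrayDeque<>();
            pen.add(new Token("arrow_start", 0.9));
            pen.add(new Token("arrow_end", 1.5));

            // Replace each deictic word with the next unconsumed pen event,
            // provided the two fall within a one-second window.
            StringBuilder joint = new StringBuilder();
            for (Token word : speech) {
                boolean deictic = word.text.equals("here") || word.text.equals("there");
                if (deictic && !pen.isEmpty()
                        && Math.abs(pen.peek().time - word.time) < 1.0) {
                    joint.append(pen.poll().text);
                } else {
                    joint.append(word.text);
                }
                joint.append(' ');
            }
            // Prints: how far is it from arrow_start to arrow_end
            System.out.println("Joint hypothesis: " + joint.toString().trim());
        }
    }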
The multimodal toolkit provides a Java API that allows applets or applications to incorporate multimodal functionality, as sketched below.
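A minimal usage sketch follows. The class and method names (MultimodalClient, connect, loadGrammar, addListener) are hypothetical stand-ins chosen for illustration; consult the MMI API documentation for the real interface.

    public class MultimodalDemo {
        // Assumed shape of the callback delivering integrated hypotheses.
        interface HypothesisListener {
            void onHypothesis(String jointResult);
        }

        // Stand-in for the real MMI client API (stubbed out here).
        static class MultimodalClient {
            void connect(String host, int port)    { /* open server connection */ }
            void loadGrammar(String path)          { /* send multimodal grammar */ }
            void addListener(HypothesisListener l) { /* register callback */ }
        }

        public static void main(String[] args) {
            MultimodalClient client = new MultimodalClient();
            client.connect("localhost", 7000);            // Multimodal Server
            client.loadGrammar("distance-query.grammar"); // hypothetical file
            client.addListener(result ->
                    System.out.println("Integrated hypothesis: " + result));
        }
    }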
Part 1 - Specify how other CPOF components can send and receive data to your system - Please be explicit
Components may interface directly with the Multimodal Server
Part 2 - What are the inputs to your system - Please specify formats and protocol - provide details
Multimodal grammar
Part 3 - What are the outputs of your system -
Please specify format and protocol - provide details
Hypotheses according to the multimodal grammar
Part 1 - Please present a diagram that shows how your components interact with other CPOF components. We have not currently identified how our components interact with other CPOF components. TBD
Part 2 - Are there components in your system that are functionally “similar” to another CPOF component? TBD
Part 3 - Are any of your components complementing other CPOF components? (e.g., ZUI and Sage/Visage) TBD
Component Name      Required Hardware   Operating System              Language    Required COTS
Multimodal Server   PC or Sun           Independent                   Java        JDK 1.1.*
Janus               Sun Ultra 60        Solaris 2.5.1                 Tcl/Tk, C   Tcl/Tk
NPen++              Sun or PC           Solaris 2.5.1 or Windows NT   C++         None
Gesture Recognizer  Sun or PC           Solaris 2.5.1 or Windows NT   C++         None
Specify the hardware required to support your system:
MMI can run on a PC with a minimum of 32 MB of RAM and a 200 MHz processor.
The speech recognizer requires a dual-processor Sun Ultra 60 with a minimum of 500 MB of RAM. (The recognizer currently under development will require a 500 MHz Pentium III with 128 MB minimum, 256 MB preferred.)
Video capture cards, SoundBlaster-compatible sound cards, tabletop and lapel microphones, and pan-tilt and stationary cameras are also required.