
Component Description

Multimodal Interface

Carnegie Mellon University

Prepared by: Michael Bett mbett@cs.cmu.edu

3/26/99


1 - Overview

 Description of the Multimodal Toolkit (MMI)

What MMI is ...

 Integrated Speech, Handwriting, and Gesture Recognizers

 Java-Based API

 Integrated Recording Feature

 Plug-n-Play Recognizer Interface: allows recognizers to be replaced

 Internet-Enabled Interface: recognizers may run remotely over the Internet

 Simultaneous Multiple-User Support

 Supports Natural Interface Development

2 - Architecture Overview

 MMI is a toolkit that allows multiple modalities to be easily integrated into applications.

 Applications can mix modalities (speech, gesture, and handwriting).

[Architecture diagram: a Multimodal Applet sends speech, handwriting, and gesture input to the Multimodal Server, which routes each stream to its recognizer: Janus (speech), a handwriting recognizer, and a gesture recognizer.]
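The routing the diagram shows can be sketched in a few lines of Java. This is an illustration only: Recognizer and MultimodalServer below are invented stand-ins, not the toolkit's actual classes, but they show how a plug-n-play recognizer interface lets recognizers be registered or replaced per modality.

```java
// Minimal, self-contained sketch of the routing the diagram shows.
// All names here are illustrative, not the real MMI classes.
import java.util.HashMap;
import java.util.Map;

interface Recognizer {
    // Raw input (audio samples, pen points) in; hypothesis string out.
    String recognize(byte[] rawInput);
}

class MultimodalServer {
    private final Map<String, Recognizer> recognizers = new HashMap<>();

    // Plug-n-play: the recognizer for a modality can be replaced.
    void register(String modality, Recognizer r) {
        recognizers.put(modality, r);
    }

    // Route each modality's input to its recognizer; collect hypotheses.
    Map<String, String> process(Map<String, byte[]> inputs) {
        Map<String, String> hypotheses = new HashMap<>();
        for (Map.Entry<String, byte[]> e : inputs.entrySet()) {
            Recognizer r = recognizers.get(e.getKey());
            if (r != null) {
                hypotheses.put(e.getKey(), r.recognize(e.getValue()));
            }
        }
        return hypotheses;
    }
}
```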


Sample Application Which Uses Multimodal: Error Repair

[Diagram: an error-repair application built on MMI; the speech recognizer draws on an acoustic model, a vocabulary, and a language model.]

The Java-based API communicates directly with each recognizer.

The multimodal applet is the user interface; the applet window presents a view onto a domain-dependent representation of application data and state in the form of objects to be manipulated.
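Continuing the illustrative sketch from Section 2 (names still invented, not the real API), an application would register one recognizer per modality and hand raw input to the server:

```java
// Hypothetical usage of the Section 2 sketch; the recognizers here are
// stubs returning canned hypotheses in place of real recognition.
import java.util.HashMap;
import java.util.Map;

public class Demo {
    public static void main(String[] args) {
        MultimodalServer server = new MultimodalServer();
        server.register("speech",  in -> "how far is it from here to there");
        server.register("gesture", in -> "arrow_start arrow_end");

        Map<String, byte[]> inputs = new HashMap<>();
        inputs.put("speech",  new byte[0]);   // stand-in for audio samples
        inputs.put("gesture", new byte[0]);   // stand-in for pen points

        System.out.println(server.process(inputs));
    }
}
```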


3 - Component Description

The following modalities have the following levels of support in the multimodal toolkit:

 Speech

 Handwriting

 Pen gestures

 3-D gestures (experimental)

 Lip-reading (experimental)

 Gaze tracking (experimental)

 Keyboard

 Mouse

 Facial expressions (experimental)

[Table 1. Supported Applications: each modality is graded for data-entry and command tasks as strongly supported, supported, or not precluded (?).]


4 - External Interfaces

 The user defines their grammar using six probabilistically weighted node types; a data-structure sketch follows the list below:

A Toplevel represents an entire input model and contains one or more sequences, each of which contains exactly one AFrame;

An AFrame represents an action frame and contains one or more sequences, each of which consists of one or more PSlots;

A PSlot represents a parameter slot and contains one or more UnimodalNodes (at most one for each input modality);

A UnimodalNode specifies a sub-grammar for a single input modality and has the same structure as a NonTerm, with the addition of a label specifying the modality;

A NonTerm is a non-terminal node consisting of one or more sequences, each of which contains zero or more NonTerms or Literals;

A Literal is a terminal node containing a text string representing one or more input tokens.
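A minimal data-structure sketch of these six node types, assuming plain Java classes (the shapes follow the text above; field names are illustrative and the toolkit's real classes may differ):

```java
// Sketch of the six grammar node types described above; shapes follow
// the text, names and field layout are illustrative.
import java.util.List;

interface Symbol {}                    // either a NonTerm or a Literal

class Literal implements Symbol {      // terminal node
    String tokens;                     // one or more input tokens
}

class NonTerm implements Symbol {      // non-terminal node
    double weight;                     // probabilistic weighting
    List<List<Symbol>> sequences;      // each holds zero or more Symbols
}

class UnimodalNode extends NonTerm {   // sub-grammar for one modality
    String modality;                   // e.g. "speech", "pen", "gesture"
}

class PSlot {                          // parameter slot
    List<UnimodalNode> perModality;    // at most one per input modality
}

class AFrame {                         // action frame
    List<List<PSlot>> sequences;       // each holds one or more PSlots
}

class Toplevel {                       // entire input model
    List<AFrame> sequences;            // exactly one AFrame per sequence
}
```

For the distance-query example shown below, the AFrame could hold Src and Dst PSlots, each pairing a speech UnimodalNode ("from here" / "to there") with a pen UnimodalNode (arrow_start / arrow_end).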


 The Multimodal Server sends a series of points to the pen and gesture recognizers.

 The audio is sent to the speech recognizer.

 The pen, gesture, and speech recognizers return their hypotheses to the multimodal toolkit, which is responsible for integrating the results in an optimizing search, as shown below [Minh Tue Vo, Ph.D. dissertation, CMU, 1998].

[Figure: "Output Path Over Multidimensional Inputs". The speech hypothesis "how far is it from here to there" is aligned with the pen hypothesis (arrow_start, arrow_end) to fill the Src and Dst slots of a Query/Distance frame.]
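The toolkit's actual integration is the optimizing search of Vo's dissertation; as a much-simplified illustration of the idea only, the sketch below pairs deictic speech words with pen events in temporal order to fill the Src and Dst slots of the figure's query:

```java
// Greatly simplified stand-in for multimodal integration: align deictic
// speech words with pen events by temporal order. The real toolkit
// scores and searches over full hypothesis sets (Vo, 1998).
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

public class Integrate {
    public static void main(String[] args) {
        String[] speech = "how far is it from here to there".split(" ");

        Deque<String> penEvents = new ArrayDeque<>();
        penEvents.add("arrow_start");          // pen hypothesis, time order
        penEvents.add("arrow_end");

        Map<String, String> slots = new LinkedHashMap<>();
        for (String word : speech) {
            if (penEvents.isEmpty()) break;
            if (word.equals("here"))  slots.put("Src", penEvents.poll());
            if (word.equals("there")) slots.put("Dst", penEvents.poll());
        }
        System.out.println(slots);  // {Src=arrow_start, Dst=arrow_end}
    }
}
```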


5 - Existing Software “Bridges”

 The multimodal toolkit provides a Java API that allows applets or applications to incorporate multimodal functionality.


6 - Information Flow

 Part 1 - Specify how other CPOF components can send and receive data to your system - Please be explicit

 Components may directly interface with the multimodal server

 Part 2 - What are the inputs to your system - Please specify formats and protocol - provide details

 Multimodal grammar

 Part 3 - What are the outputs of your system - Please specify format and protocol - provide details

 Hypotheses according to the multimodal grammar
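No wire format is given here, so purely as an illustration of this grammar-in / hypotheses-out contract (interface and method names invented; Toplevel is the Section 4 sketch), a component-facing interface might look like:

```java
// Illustrative contract only: grammar in, hypotheses out. Names are
// invented; no claim is made about the real protocol or API.
import java.util.List;

interface MultimodalService {
    void setGrammar(Toplevel inputModel);  // input: the multimodal grammar
    List<String> nextHypotheses();         // output: hypotheses per grammar
}
```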


7 - Plug-n-play

 Part 1 - We have not currently identified how our components interact with other CPOF components.

 Please present a diagram that shows this interaction. TBD

 Part 2 - Are there components in your system that are functionally “similar” to another CPOF component? TBD

 Part 3 - Do any of your components complement other CPOF components? (e.g., ZUI and Sage/Visage) TBD


8 - Operating Environments and COTS

Component Name       Required Hardware   Operating System              Language    Required COTS
Multimodal Server    PC or Sun           Independent                   Java        JDK 1.1.*
Janus                Sun Ultra 60        Solaris 2.5.1                 Tcl/Tk, C   Tcl/Tk
NPen++               Sun or PC           Solaris 2.5.1 or Windows NT   C++         None
Gesture Recognizer   Sun or PC           Solaris 2.5.1 or Windows NT   C++         None


9 - Hardware Platform Requirement

 Specify the hardware required to support your system:

MMI can run on a PC with a minimum of 32 MB of RAM and a 200 MHz processor.

The speech recognizer requires a dual-processor Sun Ultra 60 with a minimum of 500 MB of RAM. (The recognizer currently under development will require a 500 MHz Pentium III with 128 MB of RAM minimum, 256 MB preferred.)

Video capture cards, SoundBlaster-compatible sound cards, tabletop and lapel microphones, and pan-tilt and stationary cameras are required.
