
Dagstuhl 2000

Pervasive Speech and Language Technology

Wolfgang Wahlster

German Research Center for Artificial Intelligence, DFKI GmbH

Stuhlsatzenhausweg 3

66123 Saarbruecken, Germany

phone: (+49 681) 302-5252/4162

fax: (+49 681) 302-5341

e-mail: wahlster@dfki.de

WWW: http://www.dfki.de/~wahlster

Pervasive Speech and Language Technology

A cappuccino in 10 minutes, please!

Speech-controlled coffee machine

Let's go to Baker Street in Berkeley!

Speech-based car navigation

I would like to hear Mozart's piano concerto No. 3!

Send the following email to Mark Maybury: Hi Mark, please forward the following agenda to your project partners!

© Wolfgang Wahlster, DFKI

Speech-enabled music selection

Dictation


Pervasive Speech and Language Technology

Show me all CNN news of the last 3 months that feature Bill Clinton discussing health care!

Information on demand

What has Jim Hendler said about DAML during our recent Dagstuhl seminar?

Audio Mining

I would like to make an appointment with Dr. Kurematsu in Kyoto next week!


Speech-to-Speech Translation


Three Levels of Language Processing

Speech Input

Acoustic Language Models

Speech Recognition

Word Lists

What has the speaker said?

100 Alternatives

Speech Analysis

Grammar

Lexical Meaning

Speech Understanding

What has the speaker meant?

10 Alternatives

Discourse Context

Knowledge about Domain of Discourse

What does the speaker want?

Unambiguous Understanding in the Dialog Context


Challenges for Language Engineering

Input Conditions: Close-Speaking Microphone/Headset, Push-to-talk; Telephone, Pause-based Segmentation; Open Microphone, GSM Quality

Naturalness: Isolated Words; Read Continuous Speech; Spontaneous Speech

Adaptability: Speaker Dependent; Speaker Independent; Speaker Adaptive

Dialog Capabilities: Monolog Dictation; Information-seeking Dialog; Multiparty Negotiation

Verbmobil


Context-Sensitive Speech-to-Speech Translation

German: Wann fährt der nächste Zug nach Hamburg ab?

English: When does the next train to Hamburg depart?

German: Wo befindet sich das nächste Hotel?

English: Where is the nearest hotel?

Verbmobil Server


Mobile Speech-to-Speech Translation of Spontaneous Dialogs

As the name Verbmobil suggests, the system supports verbal communication with foreign dialog partners in mobile situations.

1 face-to-face conversations

2 telecommunication


Mobile Speech-to-Speech Translation of Spontaneous Dialogs

Verbmobil Speech

Translation Server

Solution: Conference Call: The Verbmobil Speech Translation Server is accessed by GSM mobile phones.



Speech-to-Speech Translation


The Control Panel of Verbmobil


General Speech Recognition Task

Audio Signal Recognizers

German

English

Japanese

Word Hypotheses Graph


Word Hypotheses Graphs (WHGs)

WHGs realize the interface between acoustic and linguistic processing

Edge = Word

Best Hypothesis

Acoustic Score
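A minimal sketch of how a best hypothesis could be read off a word hypotheses graph. It assumes that nodes are integer time points, that every edge runs forward in time, and that acoustic scores are costs (lower is better). The function name `best_hypothesis`, the example words, and the scores are illustrative and not taken from Verbmobil.

```python
from collections import defaultdict

# Each edge is (start_node, end_node, word, acoustic_cost).
# Illustrative data only; integer costs keep the arithmetic exact.
EXAMPLE_EDGES = [
    (0, 1, "I", 10), (0, 1, "eye", 15),
    (1, 2, "need", 8), (1, 3, "needle", 25),
    (2, 3, "a", 4),
    (3, 4, "car", 9), (3, 4, "card", 12),
]


def best_hypothesis(edges, start, goal):
    """Return (total_cost, words) of the cheapest path, or None if goal is unreachable."""
    outgoing = defaultdict(list)
    for u, v, word, cost in edges:
        outgoing[u].append((v, word, cost))

    # Edges run forward in time, so ascending node order is a topological order.
    nodes = sorted({n for u, v, _, _ in edges for n in (u, v)})
    best = {start: (0, [])}
    for u in nodes:
        if u not in best:
            continue  # node not reachable from start
        cost_u, words_u = best[u]
        for v, word, cost in outgoing[u]:
            candidate = (cost_u + cost, words_u + [word])
            if v not in best or candidate[0] < best[v][0]:
                best[v] = candidate
    return best.get(goal)


print(best_hypothesis(EXAMPLE_EDGES, 0, 4))
```

Because each edge carries exactly one word, the cheapest path through the graph directly yields a word sequence that later modules can consume or revise.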


Massive Data Collection Efforts

Transliteration Variant 1

Transliteration Variant 2

Lexical Orthography

Canonical Pronunciation

Manual Phonological Segmentation

Automatic Phonological Segmentation

Word Segmentation

Prosodic Segmentation

Dialog Acts

Noises

Superimposed Speech

Syntactic Category

Word Category

Syntactic Function

Prosodic Boundaries

The so-called Partitur (German word for musical score) orchestrates fifteen strata of annotations


3,200 dialogs (182 hours) with 1,658 speakers

79,562 turns distributed on 56 CDs, 21.5 GB


Extracting Statistical Properties from Large Corpora

Transcribed Speech Data

Segmented Speech with Prosodic Labels

Annotated Dialogs with Dialog Acts

Treebanks & Predicate-Argument Structures

Aligned Bilingual Corpora

Machine Learning for the Integration of Statistical Properties into Symbolic Models for Speech Recognition, Parsing, Dialog Processing, Translation

Hidden Markov Models

Neural Nets, Multilayered Perceptrons

Probabilistic Automata

Probabilistic Grammars

Probabilistic Transfer Rules


From a Multi-Agent Architecture to a Multi-Blackboard Architecture



Multi-Agent Architecture

[Diagram: modules M1 to M6 connected directly to one another]

• Each module must know which module produces what data
• Direct communication between modules
• Each module has only one instance
• Heavy data traffic for moving copies around
• Multiparty and telecooperation applications are impossible
• Software: ICE and ICE Master
• Basic Platform: PVM



Multi-Blackboard Architecture

[Diagram: modules M1 to M6 connected via the blackboards BB 1, BB 2, BB 3]

• All modules can register for each blackboard dynamically
• No direct communication between modules
• Each module can have several instances
• No copies of representation structures (word lattice, VIT chart)
• Multiparty and telecooperation applications are possible
• Software: PCA and Module Manager
• Basic Platform: PVM
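The registration idea behind a multi-blackboard architecture can be sketched in a few lines. This is an illustrative toy, not the PCA/Module Manager software named on the slide: the `Blackboard` class, the callback-based registration, and the module labels are all assumptions made for the example.

```python
class Blackboard:
    """A named shared store. Modules register callbacks instead of calling each other."""

    def __init__(self, name):
        self.name = name
        self.entries = []      # one shared copy of every posted item
        self.subscribers = []  # callbacks of the registered module instances

    def register(self, callback):
        """Modules may register at any time (dynamic registration)."""
        self.subscribers.append(callback)

    def post(self, item):
        self.entries.append(item)
        for notify in list(self.subscribers):
            notify(item)


lattice_bb = Blackboard("word lattice")
vit_bb = Blackboard("VITs")


def make_parser(label):
    """Create one parser instance that reads lattices and writes partial results."""
    def on_lattice(lattice):
        vit_bb.post((label, "partial analysis of " + lattice))
    return on_lattice


# Two module instances listen on the same blackboard without knowing each other.
lattice_bb.register(make_parser("chunk parser"))
lattice_bb.register(make_parser("statistical parser"))

collected = []
vit_bb.register(collected.append)  # a downstream consumer of partial results

lattice_bb.post("lattice-1")
print(collected)
```

The producer of the lattice never names its consumers, which is what allows several instances of a module to be attached or removed without touching the other modules.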


A Multi-Blackboard Architecture for the Combination of Results from Deep and Shallow Processing Modules

Command Recognizer

Audio Data

Spontaneous Speech Recognizer

Channel/Speaker Adaptation

Prosodic Analysis

Statistical Parser

Chunk Parser

Dialog Act Recognition

Word Hypotheses Graph with Prosodic Labels

HPSG Parser

Semantic Construction

Robust Dialog Semantics

VITs: Underspecified Discourse Representations

Semantic Transfer

Generation


The Use of Prosodic Information at All Processing Stages

Speech Signal, Word Hypotheses Graph

Multilingual Prosody Module

Prosodic features: duration, pitch, energy, pause

Accented Words

Boundary Information

Boundary Information

Sentence Mood

Search Space Restriction: Parsing

Dialog Act Segmentation and Recognition: Dialog Understanding

Constraints for Transfer: Translation

Lexical Choice: Generation

Prosodic Feature Vector

Speaker Adaptation: Speech Synthesis


Competing Strategies for Robust Speech Translation

Concurrent processing modules combine deep semantic translation with shallow surface-oriented translation methods.

Expensive, but precise translation

• Principled and compositional syntactic and semantic analysis
• Semantic-based transfer of Verbmobil Interface Terms (VITs) as sets of underspecified DRSs

Results with Confidence Values

Word Lattice

time out?

Cheap, but approximate translation

• Case-based translation
• Dialog-act based translation
• Statistical translation

Selection of best result

Results with Confidence Values

Acceptable Translation Rate


Integrating Shallow and Deep Analysis Components in a Multi-Blackboard Architecture

Augmented Word Hypotheses Graph

Statistical Parser: partial VITs

Chunk Parser: partial VITs

HPSG Parser: partial VITs

Chart with a combination of partial VITs

Robust Dialog Semantics: combination and knowledge-based reconstruction of complete VITs

Complete and Spanning VITs


VHG: A Packed Chart Representation of Partial Semantic Representations

• Incremental chart construction and anytime processing
• Rule-based combination and transformation of partial UDRS coded as VITs
• Selection of a spanning analysis using a bigram model for VITs (trained on a treebank of 24k VITs)

• Chunk parser using cascaded finite-state transducers
• Statistical LR parser trained on a treebank
• Very fast HPSG parser


Semantic Construction
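The selection of a spanning analysis with a bigram model can be sketched as a dynamic program over chart edges. This is a simplified illustration under stated assumptions: edges are `(start, end, label, score)` tuples with `start < end`, higher scores are better, and `bigram(previous_label, label)` returns an additive score. The labels, scores, and the tiny bigram table are hypothetical and do not come from the Verbmobil treebank.

```python
def best_spanning_analysis(edges, n, bigram):
    """Return (score, labels) of the best edge sequence covering positions 0..n, or None."""
    # by_end[p] holds, for every edge ending at p, the best sequence ending with that edge.
    by_end = {}
    # Sorting by start position guarantees that all edges ending at a position
    # have been processed before any edge starting there.
    for start, end, label, score in sorted(edges):
        if start == 0:
            options = [(score + bigram("<s>", label), [label])]
        else:
            options = [
                (prev_score + score + bigram(prev_labels[-1], label), prev_labels + [label])
                for prev_score, prev_labels in by_end.get(start, [])
            ]
        if options:
            by_end.setdefault(end, []).append(max(options, key=lambda o: o[0]))
    finals = by_end.get(n, [])
    return max(finals, key=lambda o: o[0]) if finals else None


# Hypothetical partial analyses over a three-position input.
EDGES = [
    (0, 2, "np", 2), (0, 1, "det", 1), (1, 2, "n", 1),
    (2, 3, "vp", 2), (2, 3, "frag", 3),
]
BIGRAMS = {("np", "vp"): 2, ("np", "frag"): -2}


def bigram_score(prev, cur):
    return BIGRAMS.get((prev, cur), 0)


print(best_spanning_analysis(EDGES, 3, bigram_score))
```

In this toy chart the fragment edge has the higher local score, but the bigram preference for the "np" followed by "vp" sequence makes that combination the selected spanning analysis.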


The Understanding of Spontaneous Speech Repairs

Original Utterance: I need a car next Tuesday oops Monday

Reparandum: Tuesday

Editing Phase (Hesitation): oops

Repair Phase (Reparans): Monday

Recognition of Substitutions

Transformation of the Word Hypothesis Graph

I need a car next Monday

Verbmobil Technology: Understands Speech Repairs and extracts the intended meaning

Dictation systems such as ViaVoice, VoiceXpress, FreeSpeech, and NaturallySpeaking cannot deal with spontaneous speech; they transcribe the corrupted utterances as spoken.
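The substitution in the example above can be illustrated with a deliberately simple sketch. It works on a flat token list rather than on a word hypothesis graph, assumes a fixed set of editing terms, and assumes that the reparandum is exactly the one word before the editing term. None of this reflects how Verbmobil actually detects repairs; the names `EDITING_TERMS` and `repair` are hypothetical.

```python
# Assumed marker words for the editing phase (illustrative only).
EDITING_TERMS = {"oops", "uh", "äh"}


def repair(tokens):
    """Replace a one-word reparandum by the reparans that follows the editing term."""
    out = []
    i = 0
    while i < len(tokens):
        is_edit = tokens[i] in EDITING_TERMS
        if is_edit and out and i + 1 < len(tokens):
            out.pop()   # drop the reparandum (the word before the editing term)
            i += 1      # skip the editing term; the reparans is appended next
            continue
        out.append(tokens[i])
        i += 1
    return out


print(" ".join(repair("I need a car next Tuesday oops Monday".split())))
```

A repair whose reparandum spans several words would need alignment between reparandum and reparans, which this sketch does not attempt.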


Automatic Understanding and Correction of Speech Repairs in Spontaneous Telephone Dialogs

German: Wir treffen uns in Mannheim, äh, in Saarbrücken. (We are meeting in Mannheim, oops, in Saarbruecken.)

English: We are meeting in Saarbruecken.


Robust Dialog Semantics: Combining and Completing Partial Representations

Let us meet (in) the late afternoon to catch the train to Frankfurt

Let us meet the late afternoon to catch the train to Frankfurt

The preposition 'in' is missing in all paths through the word hypotheses graph.

A temporal NP is transformed into a temporal modifier using an underspecified temporal relation:

[temporal_np(V1)] → [typeraise_to_mod(V1, V2)] & V2

The modifier is applied to a proposition:

[type(V1, prop), type(V2, mod)] → [apply(V2, V1, V3)] & V3
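The two rules can be mimicked with plain functions over typed items. This is only a sketch of the idea: the `Item` class, the tuple encodings, and the placeholder contents `"meet(us)"` and `"late_afternoon"` are assumptions for illustration and not the VIT formalism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    kind: str        # "temporal_np", "mod", or "prop"
    content: object  # illustrative payload, not a real VIT


def typeraise_to_mod(np):
    """temporal_np(V1) -> modifier V2 carrying an underspecified temporal relation."""
    if np.kind != "temporal_np":
        raise ValueError("rule only applies to temporal NPs")
    return Item("mod", ("temporal_rel_underspecified", np.content))


def apply_mod(mod, prop):
    """apply(V2, V1, V3): attach modifier V2 to proposition V1, yielding V3."""
    if mod.kind != "mod" or prop.kind != "prop":
        raise ValueError("rule needs a modifier and a proposition")
    return Item("prop", ("modified", prop.content, mod.content))


# "Let us meet the late afternoon": the preposition is missing, so the bare
# temporal NP is raised to a modifier before it is attached to the proposition.
proposition = Item("prop", "meet(us)")
temporal_np = Item("temporal_np", "late_afternoon")
result = apply_mod(typeraise_to_mod(temporal_np), proposition)
print(result)
```

Keeping the temporal relation underspecified lets later processing stages decide between readings instead of forcing a choice at this point.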


Integrating Deep and Shallow Processing: Combining Results from Concurrent Translation Threads

Segment 1: If you prefer another hotel,

Segment 2: please let me know.

Concurrent translation threads: Statistical Translation, Case-Based Translation, Dialog-Act Based Translation, Semantic Transfer

Alternative Translations with Confidence Values

Selection Module

Segment 1: Translated by Semantic Transfer

Segment 2: Translated by Case-Based Translation
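A selection module of this kind can be sketched as a per-segment maximum over confidence values. The confidence numbers below are invented for illustration, and the assumption that confidences from different modules are directly comparable is a simplification made for the example.

```python
def select_best(candidates_per_segment):
    """Pick, for every segment, the candidate with the highest confidence."""
    return [
        max(candidates, key=lambda c: c["confidence"])
        for candidates in candidates_per_segment
    ]


# Invented confidence values; only the module names come from the slide.
candidates = [
    [  # Segment 1: "If you prefer another hotel,"
        {"module": "semantic transfer", "confidence": 0.9},
        {"module": "statistical translation", "confidence": 0.7},
        {"module": "case-based translation", "confidence": 0.4},
    ],
    [  # Segment 2: "please let me know."
        {"module": "semantic transfer", "confidence": 0.3},
        {"module": "case-based translation", "confidence": 0.8},
        {"module": "dialog-act based translation", "confidence": 0.6},
    ],
]

chosen = [c["module"] for c in select_best(candidates)]
print(chosen)
```

Selecting per segment rather than per utterance is what allows one turn to mix output from a deep and a shallow translation thread.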


Unit Selection Algorithm

Sentence to synthesize: I have time on monday.

[Figure: graph of candidate units for the words "I", "have", "time", "on", "monday"; an arrow marks the edge direction]
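The slide does not spell out the cost functions, so the following is a generic sketch of unit selection as a cheapest-path search, assuming the common formulation with a target cost per unit and a join cost between adjacent units. The unit encoding `(word, source_utterance, position)` and all example data are hypothetical.

```python
def select_units(candidates, target_cost, join_cost):
    """Return (cost, units) of the cheapest sequence with one unit per target position."""
    # best holds, for each candidate at the current position, the cheapest path ending there.
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for u in candidates[i]:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + target_cost(i, u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])


def join_cost(a, b):
    """Free if b directly followed a in the same recorded utterance, else 1."""
    return 0 if a[1] == b[1] and a[2] + 1 == b[2] else 1


def target_cost(position, unit):
    return 0  # assumption: every candidate fits its target position equally well


WORDS = ["I", "have", "time", "on", "monday"]
CANDIDATES = [
    [("I", "a", 0), ("I", "b", 0)],
    [("have", "a", 1), ("have", "c", 3)],
    [("time", "c", 4)],
    [("on", "c", 5), ("on", "d", 2)],
    [("monday", "d", 3)],
]

cost, units = select_units(CANDIDATES, target_cost, join_cost)
print(cost, [u[0] for u in units])
```

With these costs the search favors long stretches taken from a single recording, since every switch between source utterances adds a join penalty.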


Linguatronic: Spoken Dialogs with Mercedes-Benz

Please call Doris Wahlster.

Microphone

Push-to-talk

Switch

Open the left window in the back.

I want to hear the weather channel.

When will I reach the next gas station?

Where is the next parking lot?

• Speech control of: cellular phone, radio, windows / AC, route guidance system
• Option for S-, C-, and E-Class of Mercedes and BMW
• Speaker-independent; garbage models for non-speech sounds (blinker, AC, wheels)


International Research Trends in Multilingual Systems

Multilingual Language Technology

Speech Recognition, Language Understanding, Language Generation, and Speech Synthesis

Dialog Translation (Verbmobil)
• Call Centers
• E-Commerce
• Mobile Travel Assistance
• Telephone Translations

Multilingual Indexing and Annotation of Videos
• Video Archives
• News Archives

Multilingual Audio Retrieval and Audio Mining
• Discussions
• Lecture Notes
• Organizers

Speech-based Web Access to Multilingual Web Pages
• WAP Phones
• WebTV

Multilingual and Mobile Communication Assistants (SmartKom)
• Multimodal Interfaces

Spontaneous Speech, Robust Processing and Translation, Semantic and Pragmatic Understanding


Conclusion I

• Real-world problems in language technology, such as the understanding of spoken dialogs, speech-to-speech translation, and multimodal dialog systems, can only be cracked by the combined muscle of deep and shallow processing approaches.

• In a multi-blackboard architecture based on packed representations on all processing levels (speech recognition, parsing, semantic processing, translation, generation) and using charts with underspecified representations (e.g. UDRS), the results of concurrent processing threads can be combined in an incremental fashion.


Conclusion II

• All results of concurrent processing modules should come with a confidence value, so that a selection module can choose the most promising result at each processing stage.

• Packed representations together with formalisms for underspecification capture the uncertainties in each processing phase, so that the uncertainties can be reduced by linguistic, discourse and domain constraints as soon as they become applicable.


Conclusion III

• Deep processing can be used for merging, completing and repairing the results of shallow processing strategies.

• Shallow methods can be used to guide the search in deep processing.

• Statistical methods must be augmented by symbolic models (e.g. class-based language modelling, word order normalization as part of statistical translation).

• Statistical methods can be used to learn operators or selection strategies for symbolic processes.

It is much more than a balancing act...

(see Klavans and Resnik 1996)


Open Problems for the Next Decade

• Problems with current machine learning approaches
  - Expensive data collection
  - Cognitively unrealistic training data
  - Data sparseness

• Problems with current hand-crafted knowledge sources
  - Brittleness
  - Domain dependence
  - Limited scalability


A Speculative Conclusion (+50 years)

Oral Society → Textual Society → Oral Society

-500 years: Oral Society
• News and knowledge are passed orally
• No mass storage
• No automatic processing
• No automatic retrieval

TODAY: Textual Society
• News and knowledge are passed textually
• Mass storage of texts
• Text processing
• Text retrieval

+50 years: Oral Society
• News and knowledge are passed orally
• Mass storage of speech
• Speech processing
• Audio retrieval

