Intonation and Multi-Language Scenarios
Andrew Rosenberg
Candidacy Exam Presentation
October 4, 2006
Talk Overview
 Use and Meaning of Intonation
 Automatic Analysis of Intonation
 “Multi-Language Scenarios”
 Second Language Learning Systems
 Speech-to-Speech Translation
Use and Meaning of Intonation
 Why do multi-language scenarios need intonation?
  Intonation indicates focus and contrast
  Intonation disambiguates meaning
  Intonation indicates how language is being used
   Discourse structure, speech acts, paralinguistics
Examples of Intonational Features
ToBI Examples
 Categorical Features
  Pitch Accent
   H* - Mariana(H*) won it
   L+H* - Mariana(L+H*) won it
   L* - Will you have marmalade(L*) or jam(L*)
  Phrase Boundaries
   Intermediate Phrase Boundary (3)
   Intonational Phrase Boundary (4)
   Oh I don't know (4) it's got oregano (3) and marjoram (3) and some fresh basil (4)
 Continuous Features
  Pitch
  Intensity
  Duration
Use and Meaning of Intonation
Paper List
 Emphasis
  Accent is Predictable (If You're a Mind Reader)
   Bolinger, 1972
  Prosodic Analysis and the Given/New Distinction
   Brown, 1983
  The Prosody of Questions in Natural Discourse
   Hedberg and Sosa, 2002
 Syntax
  The Use of Prosody in Syntactic Disambiguation
   Price, et al., 1991
 Discourse Structure
  Prosodic Analysis of Discourse Segments in Direction-Giving Monologues
   Nakatani and Hirschberg, 1996
 Paralinguistics
  Acoustic Correlates of Emotion Dimensions in View of Speech Synthesis
   Schröder, et al., 2001
Accent is Predictable (If You're a Mind Reader)
Dwight Bolinger, 1972
Harvard University
 Nuclear Stress Rule
  Stress is assigned to the rightmost stress-able vowel in a major constituent (Chomsky and Halle 1968)
  "Once the speaker has selected a sentence with a particular syntactic structure and certain lexical items...the choice of stress contour is not a matter subject to further independent decision"
 Selected Counterexamples to NSR
  Coordinated infinitives can be accented or not
   I have a clock to clean and oil v. I still have most of the garden to weed and fertilize
  Terminal prepositions are rarely accented
   I need a light to read by
  Predictable or less semantically rich items are less likely to be accented
   I have a point to make v. I have a point to emphasize
   I've got to go see a guy v. I've got to go see a friend [semi-pronouns?]
  Focus v. Topic v. Comment
   Why are you coming indoors? -- I'm coming indoors because the sun is shining
Prosodic Analysis and the Given/New Distinction
Gillian Brown, 1983
 Information Structure of Discourse Entities (Prince 1981)
  Given (or Evoked) Information: "recoverable either anaphorically or situationally" (Halliday, 1967)
  New Information: "non recoverable..."
  Inferable Information: e.g. driver is inferable given bus
 Experiment
  One subject was asked to describe a diagram to another who would reproduce it.
  Entities are marked as new/brand-new, new/inferred, evoked/context (pen, paper, etc.), evoked/current (most recent mention) or evoked/displaced (previously mentioned)
 Prominence realizations
  87% of new/brand-new and 79% of new/inferred entities were realized prominently
  2% of evoked/context, 0% of evoked/current, 4% of evoked/displaced
The Prosody of Questions in Natural Discourse
Nancy Hedberg and Juan Sosa, 2002
Simon Fraser University
 Accenting behavior in question types
  Wh-Questions (whq) vs. Yes/No Questions (ynq) in spontaneous speech from "McLaughlin Group" and "Washington Week"
 The "locus of interrogation": either the wh-word or the fronted auxiliary verb
  Where are you? / Do you like pie?
  Wh-words are often accented with L+H* and rarely deaccented
  Ynqs show no consistent accenting behavior
   70.5% of positive ynqs deaccent
   88% of negative ynqs use L+H*
 Topic Pitch Accent
  The topics of both whqs and ynqs are less often accented with L+H* than the locus of interrogation
 Nuclear Tune
  Whqs are produced with falling intonation 80% of the time
  Only 34% of ynqs are produced with rising intonation
The Use of Prosody in Syntactic Disambiguation
Patti Price (1), Mari Ostendorf (2), Stefanie Shattuck-Hufnagel (3), Cynthia Fong (2), 1991
(1) SRI, (2) Boston University, (3) MIT
 Relationship between syntax and intonation
 Methodology
  7 types of syntactically ambiguous sentences spoken by 4 professional radio announcers
  Ambiguous sentences were produced within disambiguating paragraphs.
  The speakers were not informed of the sentence of interest and only produced one version per session.
  Subjects selected the more appropriate surrounding context.
  Subjects only rated one version per session.
 Analysis
  Manual labelling of phrase breaks and accents [not ToBI]
  Phrase breaks and their relative size differentiate the two versions.
   Characterized by lengthening, pauses and boundary tone
Example Syntactic Ambiguities
 Parentheticals v. non-parenthetical clause
  [Mary knows many languages,][you know]
  [Mary knows many languages (that) you (also) know]
 Apposition v. attached NP
  [Only one remembered,][the lady in red]
  [Only one remembered the lady in red]
 Main clauses w/ coordinating conjunction v. main and subordinate clause
  [Jane rides in the van][and Ella runs]
  [Jane rides in the van Ann Della runs]
 Tag question v. attached NP
  [Mary and I don't believe][do we?]
  [Mary and I don't believe Dewey.]
Example Syntactic Ambiguities
 Far v. near attachment of final phrase
  [Raoul murdered the man][with the gun] (Raoul has a gun)
  [Raoul murdered [the man with the gun]] (the man has a gun)
 Left v. right attachment of middle phrase
  [When you learn gradually][you worry more]
  [When you learn][gradually you worry more]
 Particles v. prepositions
  [They may wear down the road] (the treads hurt the road)
  [They may wear][down the road] (the treads erode)
Prosodic Analysis of Discourse Segments in Direction-Giving Monologues
Julia Hirschberg (AT&T Labs) and Christine Nakatani (Harvard University), 1996
 Intonation is used to indicate discourse structure
  Is a speaker beginning a topic? Ending one?
  Entails a broader information (linguistic, attentional, intentional) structure than given/new entity status.
 Boston Directions Corpus
  Manual ToBI annotation and discourse segmentation
 Acoustic-prosodic correlates of discourse segment initial, medial and final phrases
  Segment initial v. non-initial
   Higher max and mean F0 and energy
   Longer preceding and shorter following pauses
   Increases in F0 and energy from previous phrase
  Segment medial v. final
   Medial has a slower speaking rate and shorter subsequent pause
   Relative increase in F0 and energy from previous phrase
BDC Discourse Structure Example
 [describe green line portion of journey]
   and get on the Green Line
  [describe direction to take on green line]
   we will take the Green Line
   south
   toward Park Street
  [describe which green line to take (any)]
   we can get on any of the Green Lines
   at Government Center
   and take them south
   to Park Street
  [describe getting off the green line]
   once we are at Park Street we will get off
 [describe red line portion of journey]
   and get on the red line
   of the T
Acoustic Correlates of Emotion Dimensions in View of Speech Synthesis
Marc Schröder (1), Roddy Cowie (2), Ellen Douglas-Cowie (2), Machiel Westerdijk (3), Stan Gielen (3), 2001
(1) University of Saarland, (2) Queen's University, (3) University of Nijmegen
 Paralinguistic Information
  Information that is transmitted via language but is not strictly "linguistic"
  E.g. emotion, humor, charisma, deception
 Emotional Dimensions
  Activation - degree of readiness to act
  Evaluation - positive v. negative
  Power - dominance v. submission
  For example,
   Happiness - High Activation, High Evaluation, High Power
   Anger - High Activation, Low Evaluation, Low Power
   Sadness - Low Activation, Low Evaluation, Very Low Power
Acoustic Correlates of Emotion Dimensions, ctd.
 Manual annotation of emotional content of spontaneous speech from 100 speakers in activation-evaluation-power space
 High Activation strongly correlates with
  High F0 mean and range, longer phrases, shorter pauses, large and fast F0 rises and falls, increased intensity, flat spectral slope
 Negative Evaluation correlates with
  Fast F0 falls, long pauses, increased intensity, more pronounced intensity maxima
 High Power correlates with
  Low F0 mean, (female) shallow F0 rises and falls, reduced intensity; (male) increased intensity
Use and Meaning of Intonation
Summary
 Intonation can provide information about:
  Focus
  Contrast
   I want the red pen (...not the blue one)
  Information Status (given/new)
  Speech Acts
  Discourse Structure
  Syntax
  Paralinguistics
Automatic Analysis of Intonation
 How can the information transmitted via
intonation be understood computationally?
 What computational techniques are available?
 How much human annotation is needed?
Automatic Analysis of Intonation
Paper List (1/2)
 Supervised Methods
  Automatic Recognition of Intonational Features
   Wightman and Ostendorf, 1992
  An Automatic Prosody Recognizer Using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model
   Ananthakrishnan and Narayanan, 2005
  Perceptually-related Acoustic-Prosodic Features of Phrase Finals in Spontaneous Speech
   Ishi, et al., 2003
 Alternate Ways of Representing Intonation
  Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing
   Shriberg and Stolcke, 2004
  The Tilt Intonation Model
   Taylor, 1998
Automatic Analysis of Intonation
Paper List (2/2)
 Unsupervised Methods
  Unsupervised and Semi-supervised Learning of Tone and Pitch Accent
   Levow, 2006
  Reliable Prominence Identification in English Spontaneous Speech
   Tamburini, 2006
 Feature Analysis
  Spectral Emphasis as an Additional Source of Information in Accent Detection
   Heldner, 2001
  Duration Features in Prosodic Classification: Why Normalization Comes Second, and What They Really Encode
   Batliner, et al., 2001
Supervised Methods
 Require annotated data
 Pitch Accent and Phrase Boundaries are the
two main prosodic events that are detected
and classified
Automatic Recognition of Intonational Features
Colin Wightman and Mari Ostendorf, 1992
Boston University
 Detection of boundary tones and pitch accents on syllables
 Decision tree-based acoustic quantization for use with an HMM
  Four-way classification {Pitch Accent, Boundary Tone, Both, Neither}
 Features
  Is the syllable lexically stressed?
  F0 contour representation
  Max, min F0 context normalization
  Duration
  Pause information
  Mean energy
 Results
  Prominence: Correct 86%, False alarm 14%
  Boundary Tone: Correct 77%, False alarm 3%
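A minimal sketch of the approach described above: a decision tree quantizes continuous prosodic features into discrete symbols (one per leaf), and an HMM decodes event sequences over those symbols. The feature layout, toy data, and the hand-rolled Viterbi are illustrative assumptions, not Wightman and Ostendorf's actual system.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy per-syllable features: [lexical stress, norm. max F0, duration, pause, energy]
X = np.random.rand(200, 5)
y = np.random.randint(0, 4, 200)          # {neither, accent, boundary, both}

# 1) Quantize the continuous feature vector: each tree leaf is one discrete symbol.
tree = DecisionTreeClassifier(max_leaf_nodes=32).fit(X, y)
symbols = tree.apply(X)                   # leaf index per syllable
leaf_ids = np.unique(symbols)
sym2idx = {s: i for i, s in enumerate(leaf_ids)}

# 2) Estimate HMM parameters (add-one smoothing) from the labeled sequence.
n_states, n_syms = 4, len(leaf_ids)
A = np.ones((n_states, n_states))         # state transition counts
B = np.ones((n_states, n_syms))           # emission counts
for t in range(len(y) - 1):
    A[y[t], y[t + 1]] += 1
for t in range(len(y)):
    B[y[t], sym2idx[symbols[t]]] += 1
A /= A.sum(1, keepdims=True)
B /= B.sum(1, keepdims=True)

def viterbi(obs):
    """Most likely event sequence for a list of leaf symbols."""
    V = np.log(np.full(n_states, 1.0 / n_states)) + np.log(B[:, sym2idx[obs[0]]])
    back = []
    for o in obs[1:]:
        scores = V[:, None] + np.log(A)   # scores[i, j]: prev state i -> state j
        back.append(scores.argmax(0))
        V = scores.max(0) + np.log(B[:, sym2idx[o]])
    path = [int(V.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))    # follow backpointers to time 0
    return path[::-1]
```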
An Automatic Prosody Recognizer Using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model
Shankar Ananthakrishnan and Shrikanth Narayanan, 2005
University of Southern California
 There are three asynchronous information streams that contribute to intonation
  Pitch - duration and distance from mean of piecewise linear fit of f0
  Energy - frame-level intensity normalized w.r.t. utterance
  Duration - normalized vowel duration of the current syllable and following pause duration
 Coupled HMM trained on 1 hour of radio news speaker data with ASR hypotheses and POS tags
  Tags each syllable as long/short, stressed/unstressed, boundary/non-boundary
  Includes a language model relating POS and prosodic events
 Syntax alone provides the best results for boundary tone detection
  Correct 82.1%, False Alarm 12.93%
 Stress detection false alarm rate is nearly halved by inclusion of acoustic information
  Syntax alone: 79.7% / 22.25%
  Syntax + acoustics: 79.5% / 13.21%
Perceptually-related Acoustic-Prosodic Features of Phrase Finals in Spontaneous Speech
Carlos Toshinori Ishi, Parham Mokhtari, Nick Campbell, 2003
ATR/Human Information Science Labs
 Phrase-final behavior can indicate speech act, certainty, discourse/topic structure, etc.
 Classification of phrase-final behavior in Japanese
 Pitch features
  Mean F0 of first and second half of phrase final
  Pitch target of first and second half
  Min, max, (pseudo-)slope, reset of phrase final
 Using a classification tree, 11 tone classes could be classified with 75.9% accuracy
  Majority class baseline: 19.6%
Perceptually-related Acoustic-Prosodic Features of Phrase Finals, ctd.
Carlos Toshinori Ishi, Parham Mokhtari, Nick Campbell, 2003
ATR/Human Information Science Labs

Tone Type | Perceptual Properties (Hattori 2002) | X-JToBI BPM
1a        | Low                                  | L%
1b        | Low+Falling tone                     | L%
1bE       | Low+Falling+Extended                 | L%
2a        | High                                 | L%+H%
2aA       | High+Aspirated                       | L%+H%
2b        | High+Lengthened                      | L%+H%>
2c        | Low+Rising tone                      | L%+LH%
2cE       | Low+Rising+Extended                  | L%+LH%
2cS       | Low+Rising+Short                     | L%+LH%
3         | High+Falling tone                    | L%+HL%
5         | High+Fall-Rise tone                  | L%+HLH%
Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing
Elizabeth Shriberg and Andreas Stolcke, 2004
SRI, ICSI
 Do we need to explicitly model prosodic features?
  Why not provide acoustic/prosodic information directly to other statistical models?
 Task-based integration of features and models
 Event Language Model
  Augment a typical n-gram language model with prosodic event classes
 Event Prosodic Model
  Grow decision trees or use GMMs to generate P(Event | Signal)
 Continuous Prosodic Features
  Duration from ASR
  Pitch, energy, voicing normalizations and stylizations
  Task-specific features: e.g. number of repeat attempts
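A minimal sketch of fusing the two model types described above: an event language model's score is combined with the event prosodic model's posterior P(event | signal) to decide whether an inter-word position is a sentence boundary. The toy scores and the single interpolation weight are illustrative assumptions.

```python
import math

def score_boundary(lm_logp_with_event, lm_logp_without_event,
                   prosodic_posterior, lam=0.5):
    """Log-linear fusion of LM and prosodic-model evidence for one position."""
    eps = 1e-12
    with_event = lam * lm_logp_with_event + (1 - lam) * math.log(prosodic_posterior + eps)
    without_event = lam * lm_logp_without_event + (1 - lam) * math.log(1 - prosodic_posterior + eps)
    return with_event > without_event

# E.g., the LM slightly prefers no boundary, but prosody (pause, F0 reset) is confident:
print(score_boundary(-7.2, -6.9, prosodic_posterior=0.92))  # -> True
```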
Direct Modeling of Prosody
Tasks
 Structural Tagging
  Sentence/topic boundary and disfluency (interruption point) detection
  Uses Language Model + Event Prosodic Model
  Sentence boundary results
   Telephone: accuracy improved 7%
   BN: 19% error reduction
 Pragmatics/Paralinguistics
  Dialog act classification and frustration detection
  Uses Language Model and Dialog "grammar" + Event Prosodic Model
  Results:
   Statement v. Question: 16% error reduction
   Agreement v. Backchannel: 16% error reduction
   Frustration: 27% error reduction (using "repeated attempt" feature)
Direct Modeling of Prosody
Tasks, ctd.
 Speaker Recognition
  Typical approaches use spectral information
  Use Continuous Prosodic Features
  Including phone duration features can reduce error by 50%
 Word Recognition
  Words can be recognized simultaneously with prosodic events (Event Language Model)
  Spectral and prosodic information can be used to model word hypotheses
  Phone duration and pause information, along with sentence and disfluency detection, reduce error by 3.1%
The Tilt Intonation Model
Paul Taylor, 1998
University of Edinburgh
 The Tilt Model describes accent and boundary tones as "intonational events" characterized by pitch movement
 Events (accent, boundary, neither, silence) are automatically detected using an HMM with pitch, energy, and first and second order differences of both
  Accuracy ranged from 35%-47%, with correct identification of events between 60.7% and 72.7%
 Tilt parameters were then extracted from the HMM hypotheses
 F0 synthesis with machine- and human-derived Tilt parameters differed by < 1 Hz RMSE on the DCIEM test set
Tilt Parameter

tilt = (|A_rise| - |A_fall|) / (2 (|A_rise| + |A_fall|)) + (D_rise - D_fall) / (2 (D_rise + D_fall))

where A and D are the amplitude and duration of the event's rise and fall components.
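A small helper computing the Tilt parameter from the formula above; tilt = +1 is a pure rise, -1 a pure fall, 0 a symmetric rise-fall. The example values are illustrative.

```python
def tilt(a_rise, a_fall, d_rise, d_fall):
    """Taylor's tilt: mean of amplitude tilt and duration tilt, in [-1, 1]."""
    amp = (abs(a_rise) - abs(a_fall)) / (2 * (abs(a_rise) + abs(a_fall)))
    dur = (d_rise - d_fall) / (2 * (d_rise + d_fall))
    return amp + dur

print(tilt(30.0, 0.0, 0.12, 0.0))   # pure rise -> 1.0
print(tilt(15.0, 15.0, 0.1, 0.1))   # symmetric rise-fall -> 0.0
```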
Unsupervised Models of Intonation
 Annotating intonation is expensive
  100x real time for full ToBI labeling
 Human annotations are errorful
  Human agreement ranges from 80-90%
 Unsupervised methods are
  1. Inexpensive
   Data doesn't require manual annotation
  2. Consistent
   Performance is not reliant on human consistency
Unsupervised and Semi-supervised Learning of Tone and Pitch Accent
Gina-Anne Levow, 2006
University of Chicago
 What can we do without gold-standard data?
 Also, does lexical tone in Mandarin Chinese vary in similar dimensions as pitch accent?
 Semi- and unsupervised speaker-dependent clustering into 4 accent classes (unaccented, high, low, downstepped)
 Forced alignment-based syllable features:
  Speaker-normalized f0, f0 slope and intensity
  Context: previous and following syllables' values and first order differences
 Semi-supervised: Laplacian Support Vector Machines
  Tone (clean speech): 94% accuracy (99% supervised)
  Pitch Accent (2-way): 81.5% accuracy (84% supervised)
 Unsupervised: k-means clustering, asymmetric k-lines clustering
  Tone (clean speech): 77% accuracy
  Pitch Accent (4-way): 78.4% accuracy (80.1% supervised)
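A minimal sketch of the unsupervised condition: speaker-normalized syllable features clustered with k-means into four accent classes. The feature layout and toy data are assumptions; Levow also used asymmetric k-lines (spectral-style) clustering, not shown here.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 6))   # per syllable: f0, f0 slope, intensity + context deltas
feats = zscore(feats, axis=0)       # stand-in for per-speaker normalization

accent_cluster = KMeans(n_clusters=4, n_init=10).fit_predict(feats)
# Clusters must then be mapped to {unaccented, high, low, downstepped},
# e.g., by inspecting each cluster's mean f0 features.
```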
Reliable Prominence Identification in English Spontaneous Speech
Fabio Tamburini, 2005
University of Bologna
 Unsupervised metric to describe the prominence of a syllable
  Calculated over the nucleus
 Prom = en_500-4000 + dur + en_ov * (A_event + D_event)
  High-band (500-4000 Hz) spectral energy
  Duration
  Full-spectrum energy
  Tilt parameters (f0 amplitude and duration)
 By tuning a threshold, 18.64% syllable error rate on TIMIT
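A hedged sketch of the prominence score above: the exact combination of terms is paraphrased from the slide, so treat the arithmetic and the threshold value as assumptions.

```python
def prominence(en_mid, dur, en_full, a_event, d_event):
    """en_mid: 500-4000 Hz band energy of the nucleus; dur: nucleus duration;
    en_full: full-spectrum energy; a_event, d_event: Tilt-style f0 amplitude
    and duration of the pitch event."""
    return en_mid + dur + en_full * (a_event + d_event)

def is_prominent(syllable_scores, threshold=1.0):
    # threshold is tuned on held-out data; 1.0 is a placeholder
    return [s > threshold for s in syllable_scores]
```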
Feature Analysis
 Intonation is generally assumed to be realized as a modification of
  Pitch
  Energy
  Duration
 How does each of these contribute to the realization of specific prosodic events?
Spectral Emphasis as an Additional Source of Information in Accent Detection
Mattias Heldner, 2001
Umeå University
 Close inspection of spectral emphasis as discriminating accented and non-accented syllables in read Swedish
 Spectral emphasis: difference (in dB) between energy in the first formant and the full spectrum
  First formant energy was extracted using a dynamic low-pass filter with a cutoff that followed f0
 Classifier: the word in a phrase with the highest spectral emphasis/intensity/pitch is "focally accented"
 Results:
  Spectral emphasis: 75% correct
  Overall intensity: 69% correct
  Pitch peak: 67% correct
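A rough sketch of frame-wise spectral emphasis: energy below a cutoff that tracks f0, versus full-spectrum energy, in dB. The FFT-based band split and the cutoff margin are assumptions; Heldner used a dynamic low-pass filter rather than a spectral mask.

```python
import numpy as np

def spectral_emphasis(frame, f0, sr, margin=1.5):
    """dB difference between total energy and energy near the first harmonic."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    low = spec[freqs <= margin * f0].sum()   # energy below the f0-tracking cutoff
    total = spec.sum()
    return 10 * np.log10(total / low)        # larger -> more high-band emphasis
```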
Duration Features in Prosodic Classification: Why Normalization Comes Second, and What They Really Encode
Anton Batliner, Elmar Nöth, Jan Buckow, Richard Huber, Volker Warnke, Heinrich Niemann, 2001
University of Erlangen-Nuremberg
 When vowels are stressed, accented or phrase-final they tend to be lengthened. What's the best way to measure the duration of a word?
 Duration is normalized in three ways
  DURNORM - normalized w.r.t. 'expected' duration
   Expected duration calculated from the mean and std. dev. of a vowel, scaled by a rate-of-speech approximation
  DURSYLL - normalized w.r.t. number of syllables
  DURABS - raw duration
 In both German and English, on boundary and accent tasks, DURABS classified best, followed by DURSYLL, followed by DURNORM
 Duration inadvertently encodes semantic information
  Complex words tend to have more syllables and tend to be accented more frequently; common words (particles, backchannels) tend to be shorter
  DURNORM and DURSYLL are able to classify well (if worse) despite obfuscating this information
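A small sketch of the three duration encodings compared above; the rate-of-speech scaling is an assumption about how the expected-duration statistics were applied.

```python
def dur_abs(word_dur):
    return word_dur                                  # DURABS: raw duration

def dur_syll(word_dur, n_syllables):
    return word_dur / max(n_syllables, 1)            # DURSYLL: per-syllable duration

def dur_norm(word_dur, exp_mean, exp_std, ros=1.0):
    expected = exp_mean * ros                        # DURNORM: deviation from
    return (word_dur - expected) / (exp_std * ros)   # 'expected' duration
```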
Automatic Analysis of Intonation
Summary
 Various models can be used to analyze both pitch accents and phrase boundaries:
  Supervised learning
  Direct discriminative modelling
  Semi- and unsupervised learning
 Research has also examined how accents and phrase breaks are realized in constrained acoustic dimensions
Second Language Learning Systems
 Automated systems can be used to improve the pronunciation and intonation of second language learners.
 Native intonation is rarely emphasized in classrooms and is often the last thing non-native speakers learn.
 Focus will be more on computational approaches (diagnosis, evaluation) than on pedagogical concerns
Second Language Learning Systems
Paper List (1/2)
 Pronunciation Evaluation
  The SRI EduSpeak(TM) System: Recognition and Pronunciation Scoring for Language Learning
   Franco, et al., 2000
  Automatic Localization and Diagnosis of Pronunciation Errors for Second-Language Learners of English
   Herron, et al., 1999
  Automatic Syllable Stress Detection Using Prosodic Features for Pronunciation Evaluation of Language Learners
   Tepperman and Narayanan, 2005
Second Language Learning Systems
Paper List (2/2)
 Fluency, Nativeness and Intonation Evaluation
  A Visual Display for the Teaching of Intonation
   Spaai and Hermes, 1993
  Quantitative Assessment of Second Language Learners' Fluency: An Automatic Approach
   Cucchiarini, et al., 2002
  Prosodic Features for Automatic Text-Independent Evaluation of Degree of Nativeness for Language Learners
   Teixeira, et al., 2000
  Modeling and Automatic Detection of English Sentence Stress for Computer-Assisted English Prosody Learning System
   Imoto, et al., 2002
  A Study of Sentence Stress Production in Mandarin Speakers of American English
   Chen, et al., 2001
Pronunciation Evaluation
 The segmental context and lexical stress of a production determine whether it is pronounced correctly or not.
The SRI EduSpeak(TM) System: Recognition and Pronunciation Scoring for Language Learning
Horacio Franco, Victor Abrash, Kristin Precoda, Harry Bratt, Ramana Rao, John Butzberger, Romain Rossier, Federico Cesari, 2000
SRI
 Recognition
  Non-native speech recognition is errorful.
  A native HMM recognizer was adapted to non-native speech.
   Non-native WER was reduced by half without affecting native performance
 Pronunciation Evaluation
  Combine scores using a regression tree to generate scores that correlate with scores from human raters
   Spectral match: compare the spectrum of a candidate phone to a native, context-independent phone model
    Also used for mispronunciation detection
   Phone duration: compare the candidate duration to a model of native duration, normalized by rate of speech
   Speaking rate: phones/sentence
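A schematic of the score-combination step above: a regression tree maps per-feature scores to an overall pronunciation score trained against human ratings. The features, the rating scale, and the toy data are assumptions, not EduSpeak's actual setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Per-utterance features: [spectral match log-posterior, duration score, speaking rate]
X = np.random.rand(300, 3)
human_scores = np.random.uniform(1, 5, 300)   # hypothetical rater scores on a 1-5 scale

scorer = DecisionTreeRegressor(max_depth=4).fit(X, human_scores)
print(scorer.predict([[0.6, 0.8, 0.5]]))      # machine pronunciation score
```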
Automatic Localization and Diagnosis of Pronunciation Errors
Daniel Herron (1), Wolfgang Menzel (1), Eric Atwell (2), Roberto Bisiani (6), Fabio Daneluzzi (4), Rachel Morton (5), Juergen Schmidt (3), 1999
(1) U. of Hamburg, (2) U. of Leeds, (3) Ernst Klett Verlag, (4) Dida*El S.r.l., (5) Entropic, (6) U. of Milan-Bicocca
 Locating and describing errors is critical for instruction
 Identifying segmental errors
  In response to a read prompt, lax recognition followed by strict recognition
  Some errors are predictable based on L1.
   Vowel, pre-vocalic consonant, and word-final devoicing errors are modelled explicitly, and tested on artificial data.
    Vowel - /ih/ -> /ey/: "it" -> "eet"
    PV consonant - /w/ -> /v/: "was" -> "vas"
    WF devoicing - /g/ -> /k/: "thinking" -> "thinkink"
 Using a word-internal tri-phone model, vowel and PV consonant errors can be diagnosed with >80% accuracy at a false alarm rate of less than 5%. WF devoicing can only be diagnosed with ~40% accuracy.
 Stress errors are detected by deviation from trained models of stressed and unstressed syllables
Automatic Syllable Stress Detection Using Prosodic Features for Pronunciation Evaluation of Language Learners
Joseph Tepperman and Shrikanth Narayanan, 2005
University of Southern California
 Lexical stress can change POS and meaning
  "INsult" v. "inSULT" or "CONtent" v. "conTENT"
 Detecting stress on read speech with content determined a priori
  Use forced alignment to identify syllable nuclei (vowels)
  Extract f0 and energy features. Duration features are manually normalized by context.
  Classified using a supervised Gaussian Mixture Model
  Post-processed to guarantee exactly 1 stressed classification per word
 Mean f0, energy and duration discriminate with >80% accuracy on English spoken by Italian and German speakers
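A minimal sketch of the classification and post-processing steps above: one GMM per class scores each syllable nucleus, and the word's most stress-like nucleus is kept as the single stressed syllable. The toy features and component counts are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Fit one GMM per class on (toy) labeled f0/energy/duration vectors.
stressed = GaussianMixture(n_components=2).fit(rng.normal(1, 1, (100, 3)))
unstressed = GaussianMixture(n_components=2).fit(rng.normal(-1, 1, (100, 3)))

def label_word(syllable_feats):
    """syllable_feats: (n_syllables, 3) f0/energy/duration per nucleus."""
    margin = (stressed.score_samples(syllable_feats)
              - unstressed.score_samples(syllable_feats))
    labels = np.zeros(len(syllable_feats), dtype=int)
    labels[margin.argmax()] = 1     # exactly one stressed syllable per word
    return labels
```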
Nativeness, Fluency and Intonation Evaluation
 Intonational information can influence the proficiency and understandability of a second-language speaker
 Proficient second-language speakers often have difficulty producing native-like intonation
A Visual Display for the Teaching of Intonation
Gerard Spaai and Dik Hermes, 1993
Institute for Perception Research
 Tools for guided instruction of intonation
 Intonation is difficult to learn
  It is acquired early, so it is resistant to change
  Native language intonation expectations may impair the perception of foreign intonation
 Intonation Meter
  Displays the pitch contour
  Interpolates non-voiced regions
  Marks vowel onsets
Quantitative Assessment of Second Language Learners' Fluency: An Automatic Approach
Catia Cucchiarini, Helmer Strik and Lou Boves, 2002
University of Nijmegen
 Does 'fluency' always mean the same thing?
  Linguistic knowledge, segmental pronunciation, native-like intonation.
 Groups of raters, 3 phoneticians and 6 speech therapists, assessed fluency of read speech on a scale from 1-10.
  With 1 exception, the raters agreed with correlation > .9
  Native speakers are consistently rated as more fluent than non-native
 Time/lexical correlates of fluency
  High rate of speech (segments/duration)
  High phonation/time (+/- pauses) ratio
  High mean length of runs
  Low number and duration of pauses
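The temporal correlates listed above transcribe directly into code; the pause segmentation itself (e.g., a minimum pause length) is assumed to have happened upstream.

```python
def fluency_measures(n_segments, total_dur, pauses, run_lengths):
    """pauses: list of pause durations (s); run_lengths: words per pause-free run."""
    speech_dur = total_dur - sum(pauses)
    return {
        "rate_of_speech": n_segments / total_dur,        # segments per second
        "phonation_time_ratio": speech_dur / total_dur,  # speaking time share
        "mean_length_of_runs": sum(run_lengths) / max(len(run_lengths), 1),
        "n_pauses": len(pauses),
        "total_pause_dur": sum(pauses),
    }
```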
Prosodic Features for Automatic Text-Independent Evaluation of Degree of Nativeness for Language Learners
Carlos Teixeira (1,2), Horacio Franco (2), Elizabeth Shriberg (2), Kristin Precoda (2), Kemal Sönmez (2), 2000
(1) IST-UTL/INESC, (2) SRI
 Can a model be trained to assess speakers' nativeness similarly to humans, without text information?
 Construct feature-specific decision trees
  Word stress (duration of longest vowel, duration of lexically stressed vowel, duration of vowel with max f0)
  Speaking rate approximations (durations between vowels)
  Pitch (max, slope "bigram" modeling)
  Forced alignment + pitch (duration from max f0 to longest vowel nucleus, location of max f0)
  Unique events (durations of longest pauses, longest words)
 Combination (max or expectation) of "posterior probabilities" from decision trees
 Results
  Pitch-based features do not generate human-like scores
   Only weak correlation (<.434) between machine and human scores
  Inclusion of posterior recognition scores and rate of speech helps considerably
   Correlation = ~.7
Modeling and Automatic Detection of English Sentence Stress for Computer-Assisted English Prosody Learning System
Kazunori Imoto, Yasushi Tsubota, Antoine Raux, Tatsuya Kawahara and Masatake Dantsuji, 2002
Kyoto University
 L1-specific errors need to be accounted for
  Japanese speakers tend not to use energy and duration to indicate stress
  Syllable structure: "strike" -> /s-u-t-o-r-ay-k-u/
  Incorrect phrasing
 Classification of stress levels
  Syllable alignment was performed with a recognizer trained on native Japanese speakers' English speech (including segmental errors)
  Supervised HMM training using pitch, power, 4th-order MFCCs, and first and second order differences
  Using distinct models for each stress type/syllable structure/position combination (144 HMMs), 93.7%/79.3% native/non-native accuracies were achieved
  Two-stage recognition increased accuracy to 95.1%/84.1%
   1. Primary + secondary stress v. non-stressed
   2. Primary v. secondary stress
A Study of Sentence Stress Production in Mandarin Speakers of American English
Yang Chen (1), Michael Robb (2), Harvey Gilbert (2) and Jay Lerman (2), 2001
(1) University of Wyoming, (2) University of Connecticut
 Do native Mandarin speakers produce American English pitch accents "natively"?
 Experiment
  Compare native Mandarin English and native American English productions of "I bought a cat there" with varied location of pitch accent.
  Pitch, energy and duration of vowels were calculated and compared across language group and gender
   Vowel onset/offset were determined manually.
 Results
  Mandarin speakers produced stressed words with shorter duration than American speakers.
  Female Mandarin speakers produced stressed words with a greater rise in f0
Second Language Learning Systems
Summary
 Performance assessment
 Pronunciation
 Intonation
 Error diagnosis and (Instruction)
 Influence of L1 on L2 instruction and
evaluation
Speech-to-Speech Translation
 ASR, MT and TTS components all exist
independently
 Challenges specific to translation of speech
 Can speech information be used to reduce the
impact of ASR errors on MT?
 Can information conveyed by intonation be
translated via this framework?
Speech-to-Speech Translation
Paper List
 Cascaded Approaches
  Janus-III: Speech-to-Speech Translation in Multiple Languages
   Lavie, et al., 1997
  A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation
   Zhang, et al., 2004
 Explicit Use of Prosodic Information
  On the Use of Prosody in a Speech-to-Speech Translator
   Strom, et al., 1997
  A Japanese-to-English Speech Translation System: ATR-MATRIX
   Takezawa, et al., 1998
 Integrated Approaches
  Finite-State Speech-to-Speech Translation
   Vidal, 1997
  On the Integration of Speech Recognition and Statistical Machine Translation
   Matusov, 2005
  Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation
   Gao, 2003
Cascaded Approach to Speech-to-Speech Translation

ASR -> MT -> TTS
Janus-III: Speech-to-Speech Translation in Multiple Languages
Alon Lavie, Alex Waibel, Lori Levin, Michael Finke, Donna Gates, Marsal Gavaldà, Torsten Zeppenfeld, Puming Zhan, 1998
Carnegie Mellon University, University of Karlsruhe
 Interlingua and frame-slot based Spanish-English translation
  Limited domain (conference registration), spontaneous speech
 Two semantic parse techniques
  GLR* interlingua parsing (transcript 82.9%; ASR 54%)
   Manually constructed, robust grammar to parse input into interlingua
   Search for the maximal subset covered by the grammar
  Phoenix (transcript 76.3%; ASR 48.6%)
   Identifies key concepts and their structure
   Parsing grammar contains specific patterns which represent domain concepts and a generation structure
  Phoenix is used as a back-off when GLR* fails
   Transcript: 83.3%; ASR: 63.6%
 Late-stage disambiguation
  Multiple translations are processed through the whole system.
  Translation hypothesis selection occurs just before generation, using scores from recognition, parsing and discourse processing.
A Unified Approach in Speech-to-Speech Translation: Integrating Features of Speech Recognition and Machine Translation
Ruiqiang Zhang, Genichiro Kikui, Hirofumi Yamamoto, Taro Watanabe, Frank Soong, Wai Kit Lo, 2004
ATR
 Process many hypotheses, then select one.
 In a cascaded architecture:
  HMM-based ASR produces N-best recognition hypotheses
  IBM Model 4 MT (a noisy channel model) processes all N.
 Rescore MT hypotheses based on a weighted log-linear combination of ASR and MT model scores
  Construct the feature weight model by optimizing a translation distance metric (mWER, mPER, BLEU, NIST) using Powell's search algorithm
 Experiment results
  Corpus: 162k/510/508 Japanese-English parallel sentences
  Baseline: no optimization of MT features
  Significant improvement was obtained by optimizing MT feature weights based on the distance metric
  Additional improvement is achieved by including ASR features
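A minimal sketch of the rescoring step above: each hypothesis carries ASR and MT log-scores, combined log-linearly with weights that would be tuned against a translation metric (e.g., BLEU) via Powell's method. The feature names and toy values are assumptions.

```python
import numpy as np

def rescore(hypotheses, weights):
    """hypotheses: list of (translation, feature_vector), where features are
    log-scores such as [asr_acoustic, asr_lm, tm_logprob, target_lm]."""
    best = max(hypotheses, key=lambda h: float(np.dot(weights, h[1])))
    return best[0]

nbest = [("where is the station", [-120.3, -14.2, -9.1, -11.0]),
         ("where is this station", [-119.8, -15.0, -12.4, -11.9])]
print(rescore(nbest, np.array([1.0, 0.8, 1.2, 0.9])))  # -> "where is the station"
```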
Explicit Use of Prosodic Information
 How can prosodic information improve
translation?
 How can prosodic information be translated?
On the Use of Prosody in a Speech-to-Speech Translator
Volker Strom (1), Anja Elsner (1), Wolfgang Hess (1), Walter Kasper (4), Alexandra Klein (2), Hans Ulrich Krieger (4), Jörg Spilker (3), Hans Weber (3) and Günther Görz (3), 1997
(1) University of Bonn, (2) University of Vienna, (3) University of Erlangen-Nürnberg, (4) DFKI GmbH
 INTARC - German-English translator produced for the VERBMOBIL project
  Spontaneous, limited domain (appointment scheduling)
  80 minutes of prosodically labeled speech
 Phrase Boundary (PB) Detector
  Gaussian classifier based on F0, energy and time features with a 4-syllable window (acc. 80.76%)
 Focus Detector
  Rule-based approach: identifies the location of the steepest F0 decline (acc. 78.5%)
 Syntactic parsing search space is reduced by 65%
  Baseline syntactic parsing uses
   Decoder factor: product of acoustic and bi-gram scores
   Grammar factor: grammar model probability of a parse using the hypothesized word
  Prosody factor: 4-gram model of words and phrase boundaries
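Rough sketches of the two detectors above. The feature dimensionality and toy data are assumptions; INTARC's actual features spanned a 4-syllable window.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Phrase-boundary detector: a Gaussian classifier over F0/energy/time features.
X = np.random.rand(400, 12)                  # per-window feature vectors (toy)
y = np.random.randint(0, 2, 400)             # boundary / no boundary
pb_detector = QuadraticDiscriminantAnalysis().fit(X, y)   # one Gaussian per class

# Focus detector: rule-based, the location of the steepest F0 decline.
def focus_location(f0_contour):
    """Index of the most negative first difference in a voiced, interpolated contour."""
    return int(np.argmin(np.diff(f0_contour)))
```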
On the Use of Prosody in a Speech-to-Speech Translator, ctd.
 Semantic parsing search space is reduced by 24.7%
  The semantic grammar was augmented, labeling rules as "segment-connecting" (SC) and "segment-internal" (SI)
   SC rules are applied when there is a PB between segments; SI rules are applied when there is not.
  Ideal phrase boundaries reduced the number of hypotheses by 65.4% (analysis trees by 41.9%)
  Automatically hypothesized PBs required a backoff mechanism to handle errors and PBs that are not aligned with grammatical phrase boundaries.
 Prosodically driven translation is used when deep transfer (translation) fails
  A focused word determines (probabilistically) a dialog act, which is translated based on available information from the word chain.
  Correct: 50%, Incomplete: 45%, Incorrect: 5%
A Japanese-to-English Speech Translation System: ATR-MATRIX
Toshiyuki Takezawa, Tsuyoshi Morimoto, Yoshinori Sagisaka, Nick Campbell, Hitoshi Iida, Fumiaki Sugaya, Aiko Yokoo and Seiichi Yamamoto, 1998
ATR
 Limited domain translation system (hotel reservations)
 Cascaded approach
  ASR: sequential model, ~2k word vocabulary
  MT: syntactically driven, ~12k word vocabulary
  TTS: CHATR (concatenative synthesis)
 Early example of "interactive" speech-to-speech translation
 Speech information is used in three ways in ATR-MATRIX
  Voice selection
   Based on the source voice, either a male or female voice is used for synthesis
  Hypothesized phrase boundaries
   Using pause information along with POS N-gram information, the source utterance is divided into "meaningful chunks" for translation.
  Phrase-final behavior
   If a phrase-final rise is detected, it is passed to the MT module as a "lexical" item potentially indicating a question.
Integrated Approach to Speech-to-Speech Translation

ASR+MT -> TTS
Finite-State Speech-to-Speech Translation
Enrique Vidal, 1997
Universidad Politécnica de Valencia
 FSTs can naturally be applied to translation.
  FSTs for statistical MT can be learned from parallel corpora (OSTIA)
 Speech input is handled in two ways:
  Baseline cascaded approach
  Integrated approach
   1. Create a translation FST from parallel text
   2. Replace each edge with an acoustic model of the source lexical item
 A major drawback of this approach is its large training data requirement. Two mitigations:
  Align the source and target utterances, reducing their "asynchronicity"
  Cluster lexical items, reducing the vocabulary size
Finite-State Speech-to-Speech Translation
Experiments
 Proof-of-concept experiment
  Text: ~30 lexical items used in 16k paired sentences (Spanish-English)
   Greater than 99% translation accuracy is achieved
  Speech: 50k/400 (training/testing) paired utterances, spoken by 4 speakers
   Best performance: 97.2% translation accuracy, 97.4% recognition accuracy
   Requires inclusion of source and target 4-gram LMs in FST training.
 Travel domain experiment
  Text: ~600 lexical items in 169k/2k paired sentences
   0.7% translation WER w/ categorization; 13.3% WER w/o
  Speech: 336 test utterances (~3k words) spoken by 4 speakers
   Text transducer was used, with edges replaced by concatenations of "phonetic elements" modeled by a continuous HMM.
   1.9% translation WER and 2.2% recognition WER were obtained.
On the Integration of Speech Recognition and Statistical Machine Translation
E. Matusov, S. Kanthak and H. Ney, 2005
 Use word lattices weighted by HMM ASR scores as input to a weighted FST for translation
 Noisy channel model from source signal to target text:
  Text_Target = argmax over Text_Target of Pr(Text_Source, Text_Target | Alignment) * Pr(Signal | Text_Source)
 Material: 4 parallel corpora
  Spontaneous speech in the travel domain
  3k - 66k paired sentences in Italian-English, Spanish-English and Spanish-Catalan
  Vocabulary sizes of 1.7k - 15k words
 Results
  On all metrics (mWER, mPER, BLEU, NIST), the translation results rank as follows (best first):
   1. Correct text
   2. Word lattice w/ acoustic scores
   3. Fully integrated ASR and MT (FUB Italian-English only)
   4. Word lattice w/o acoustic scores
   5. Single best ASR hypothesis (lower mPER than lattice w/o scores on FUB I-E)
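A sketch of translating from a word lattice as described above: each edge carries an acoustic log-score that is combined with a per-word translation-model log-score during a best-path search. The lattice encoding, the word-level factorization, and the toy scores are simplifying assumptions.

```python
from collections import defaultdict

def best_path(edges, n_nodes, tm_logp):
    """edges: (src, dst, word, acoustic_logp); nodes 0..n_nodes-1 in topological order."""
    score = defaultdict(lambda: float("-inf"), {0: 0.0})
    back = {}
    for s, d, w, ac in sorted(edges):        # process edges in source-node order
        cand = score[s] + ac + tm_logp(w)    # acoustic + translation evidence
        if cand > score[d]:
            score[d], back[d] = cand, (s, w)
    node, words = n_nodes - 1, []
    while node != 0:                         # backtrace from the final node
        prev, w = back[node]
        words.append(w)
        node = prev
    return list(reversed(words))
```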
Coupling vs. Unifying: Modeling Techniques for Speech-to-Speech Translation
Yuqing Gao, 2003
 Application of discriminative modeling to ASR, with the goal of recognizing interlingua text for MT
 Composing models (e.g., noisy channel models) can lead to local or sub-optimal solutions
 Discriminative modeling tries to avoid these by creating a single maximum entropy model
  p(text | acoustics, ...)
  Includes other non-independent observations as features.
 Major considerations:
  To simplify computational complexity, acoustic features are quantized.
  Since the feature vector can get very large, reliable feature selection is necessary.
   In preliminary experiments, 150M features were reduced to 500K via feature selection
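A schematic of the maximum-entropy direct model p(text | acoustics, ...): multinomial logistic regression over a large sparse feature vector, preceded by a feature-selection step. The feature counts, label set, and toy data here are assumptions standing in for the quantized acoustic and context features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X = np.random.rand(1000, 5000)               # quantized acoustic + context features (toy)
y = np.random.randint(0, 20, 1000)           # e.g., interlingua concept labels

selected = SelectKBest(chi2, k=500).fit_transform(X, y)   # 5000 -> 500 features
maxent = LogisticRegression(max_iter=1000).fit(selected, y)  # multinomial maxent
```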
Speech-to-Speech Translation
Summary
 Existing systems can be used to construct speech-to-speech translation systems
 However, two significant problems are encountered
  Intonational information is generally ignored
   Prosodic boundaries, pitch accent, affect, etc. are important information carriers which ASR transcripts do not encode
  Local minima
   The best recognized string may not generate the best translated string
Intonation and Multi-Language
Scenarios
 Use and Meaning of Intonation
 What information can intonation provide?
 Automatic Analysis of Intonation
 How can this information be represented
computationally?
 Multi-Language Scenarios
 Second Language Learning Systems
 How can computers help teach a second language?
 Speech-to-Speech Translation
 How can machines translate speech?
Thank you
Questions.
Automatic Analysis of Intonation

Paper           | Supervised? | Detected Events         | Algorithm
Wightman        | Yes         | Accent, Boundary        | DTree -> HMM
Ananthakrishnan | Yes         | Accent, Boundary        | CHMM
Ishi            | Yes         | 11 phrase final types   | DTree
Shriberg        | Yes*        | Accent, Boundary, other | Many
Taylor          | Yes*        | "Intonational Events"   | HMM
Levow           | No          | Accent                  | Spectral Clustering / Laplacian SVM
Tamburini       | No          | Accent, Lex. Stress     | Threshold tuning
Heldner         | Yes         | Pitch Accent            | Manual Rule
Batliner        | Yes         | Accent, Boundary        | Decision Tree
Second Language Learning

Paper       | Human corr?       | L1 Infl.         | Seg. | Stress   | Timing / Duration | Supra.
Franco      | Yes               | No               | Yes  | No       | Yes               | No
Herron      | Artificial Errors | German / Italian | Yes  | Yes      | No                | No
Tepperman   | No                | No               | No   | Yes      | No                | No
Spaai       | No                | No               | No   | Implicit | Implicit          | Yes
Cucchiarini | Yes               | No               | No   | No       | Yes               | No
Teixeira    | Yes               | No               | Yes  | No       | Yes               | Yes
Imoto       | No                | Japanese         | No   | Yes      | No                | Yes
Chen        | No                | Mandarin         | No   | Yes      | Yes               | No
Speech-to-Speech Translation

Paper    | MT Approach            | Cascaded / Integrated | Languages                 | Domain
Lavie    | Interlingua            | Cascaded              | Japanese, German, Spanish | Meeting Scheduling
Zhang    | SMT                    | Cascaded              | Japanese                  | Travel
Strom    | Interlingua            | Integrated            | German                    | Meeting Scheduling
Takezawa | SMT                    | Cascaded              | Japanese                  | Hotel Desk
Vidal    | SMT                    | Integrated            | Spanish, German, Italian  | Travel
Matusov  | SMT                    | Integrated            | Italian, Spanish, Catalan | Travel / Scheduling / Hotel Desk
Gao      | Interlingua Generation | Integrated            | NA                        | NA