Human Language Technologies – Text-to-Speech
Automatic Exploration of Corpus-Specific
Properties for Expressive Text-to-Speech.
(A Case Study in Emphasis.)
Raul Fernandez and Bhuvana Ramabhadran
I.B.M. T.J. Watson Research Center
Sixth Speech Synthesis Workshop, Bonn, Germany.
August 22-24, 2007
© 2007 IBM Corporation
Outline
• Motivation.
• Review of Expressive TTS Architecture.
• Expression Mining: Emphasis.
• Evaluation.
Expressive TTS
We have shown that corpus-based approaches to expressive concatenative TTS (CTTS) manage to convey
expressiveness if the corpus is well designed to contain the desired expression(s).
There are, however, shortcomings to this approach:
• Adding new expressions, or increasing the size of the repository for an existing one, is expensive and time-consuming.
• The footprint of the system increases as we add new expressions.
Without abandoning this framework, we propose to partially address these limitations by
an approach that exploits the properties of the existing databases to maximize the
expressive range of the TTS system.
Some observations about data and listeners…
• Production variability:
  - Speakers produce subtle expressive variations, even when they’re asked to speak in a mostly neutral style.
• Perceptual confusability/redundancy:
  - Several studies have shown that there’s an overlap in the way listeners interpret the prosodic-acoustic realizations of different expressions.
[Figure labels: Neutral, Anger, Fear, Sad]
Expression Mining
• Goals:
  - Exploit the variability present in a given dataset to increase the expressive range of the TTS engine.
  - Augment the corpus-based approach with an expression-mining approach for expressive synthesis.
• Challenge:
  - Automatic annotation of instances in the corpus where an expression of interest occurs.
  - (The approach may still require collecting a smaller expression-specific corpus to bootstrap data-driven learning algorithms.)
• Case study: Emphasis.
The Expressive Framework of the IBM TTS System
• The IBM Expressive Text-to-Speech system consists of:
  - a rule-based front-end for text analysis
  - acoustic models (DTs) for generating synthesis candidate units
  - prosody models (DTs) for generating pitch and duration targets
  - a module to carry out a Viterbi search
  - a waveform generation module to concatenate the selected units
• Expressiveness is achieved in this framework by associating symbolic attribute vectors with the synthesis units. These attribute values are able to influence:
  - target prosody generation
  - unit-search selection
Attributes
• Style: … (Default Attribute)
• Emphasis: 1, 0
• ? (e.g., voice quality={breathy,…}, etc.)
How do attributes influence the search?
- Corpus is tagged a priori.
- At run time: Input is tagged at the word level (e.g., via user-provided
mark-up) with annotations indicating the desired attribute. Annotations
are propagated down to the unit level.
- A component of the target cost function penalizes label substitutions:

              Neutral   Good news   Bad news
  Neutral     0         0.5         0.6
  Good news   0.3       0           1.0
  Bad news    0.5       1.0         0
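The tagging and penalty steps above can be sketched roughly as follows. This is a minimal illustration, not the IBM implementation: the function names, the `weight` parameter, and the toy phone sequences are hypothetical, and the slide does not say whether rows of the table index the requested label or the candidate's label, so the row = requested, column = candidate reading is an assumption.

```python
# Label-substitution penalties copied from the slide's table.
# ASSUMPTION: outer key = label requested in the input mark-up,
# inner key = label carried by the candidate unit in the corpus.
SUBSTITUTION_COST = {
    "neutral":   {"neutral": 0.0, "good_news": 0.5, "bad_news": 0.6},
    "good_news": {"neutral": 0.3, "good_news": 0.0, "bad_news": 1.0},
    "bad_news":  {"neutral": 0.5, "good_news": 1.0, "bad_news": 0.0},
}


def propagate_to_units(tagged_words):
    """Copy each word-level label down to every unit of that word."""
    return [(unit, label) for label, units in tagged_words for unit in units]


def style_cost(requested, candidate, weight=1.0):
    """Style component of the target cost (hypothetical weighting)."""
    return weight * SUBSTITUTION_COST[requested][candidate]


# Hypothetical word-level mark-up: (label, units of the word).
tagged_words = [("good_news", ["g", "r", "ey", "t"]), ("neutral", ["n", "uw", "z"])]
unit_targets = propagate_to_units(tagged_words)

# Rank candidate labels for a unit whose requested label is "good_news".
ranking = sorted(SUBSTITUTION_COST, key=lambda c: style_cost("good_news", c))
print(unit_targets[0], ranking)
```

Under this reading, a unit requested as good news prefers good-news candidates, falls back to neutral ones at a cost of 0.3, and is penalized most for bad-news candidates.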
- Additionally, the style attribute has style-specific prosody models (for
pitch and duration) associated with it. Therefore, prosody targets are
produced according to the style requested.
[Diagram: Normalized Text feeds the prosody models for Style 1, Style 2, and Style 3; Model Output Generation uses the Target Style to produce the Prosody Targets.]
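A minimal sketch of style-dependent target generation, assuming one prosody model per style and a simple lookup by the requested style. The model internals, the pitch and duration numbers, and all names here are invented for illustration; the slides only state that such style-specific models exist.

```python
from typing import Callable, Dict, List, Tuple

ProsodyTarget = Tuple[float, float]  # (pitch in Hz, duration in ms)
ProsodyModel = Callable[[List[str]], List[ProsodyTarget]]


def make_toy_model(pitch_hz: float, duration_ms: float) -> ProsodyModel:
    """Stand-in for a trained prosody model: constant targets per unit."""
    def model(units: List[str]) -> List[ProsodyTarget]:
        return [(pitch_hz, duration_ms) for _ in units]
    return model


# Hypothetical style-specific models; the values are placeholders.
PROSODY_MODELS: Dict[str, ProsodyModel] = {
    "neutral": make_toy_model(110.0, 80.0),
    "good_news": make_toy_model(130.0, 75.0),
    "bad_news": make_toy_model(100.0, 90.0),
}


def generate_targets(units: List[str], target_style: str) -> List[ProsodyTarget]:
    """Produce pitch/duration targets from the model of the requested style."""
    return PROSODY_MODELS[target_style](units)


targets = generate_targets(["g", "r", "ey", "t"], "good_news")
print(targets)
```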
Mining Emphasis
[Diagram: Emphasis Corpus (~1K sents.) → Statistical Learner → Trained Emphasis Classifier; Baseline Corpus (~10K sents.) → Trained Emphasis Classifier → Baseline Corpus w. Emphasis Labels → Build TTS System w. Emphasis]
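The data flow in this diagram can be sketched as follows. The learner here is a deliberately trivial stand-in (a threshold halfway between the class means of a single prosodic score); the single-feature representation, the function names, and the toy values are assumptions for illustration only.

```python
def train_threshold_classifier(emphasis_corpus):
    """Toy stand-in for the statistical learner.

    emphasis_corpus: list of (feature, label) pairs, label 1 = emphasized.
    Returns a function mapping a feature value to a 0/1 emphasis label.
    """
    emphasized = [f for f, label in emphasis_corpus if label == 1]
    plain = [f for f, label in emphasis_corpus if label == 0]
    threshold = (sum(emphasized) / len(emphasized) + sum(plain) / len(plain)) / 2
    return lambda feature: int(feature > threshold)


def mine_emphasis(classifier, baseline_corpus):
    """Attach an automatic emphasis label to every word of the baseline corpus."""
    return [(word, classifier(feature)) for word, feature in baseline_corpus]


# Small hand-labelled emphasis corpus (hypothetical feature values).
emphasis_corpus = [(1.8, 1), (2.1, 1), (0.1, 0), (-0.3, 0), (0.4, 0)]
# Unlabelled baseline corpus: (word, feature) pairs.
baseline_corpus = [("please", -0.2), ("full", 1.9), ("name", 0.3)]

classifier = train_threshold_classifier(emphasis_corpus)
labeled_baseline = mine_emphasis(classifier, baseline_corpus)
print(labeled_baseline)
```

The labeled baseline corpus is what would then be used to build the TTS system with emphasis.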
Training Materials
• Two sets of recordings, one from a female and one from a male speaker of US English.
• Approximately 1K sentences in the script.
• Approximately 20% of the words in the script carry emphasis.
• Recordings are single-channel, 22.05 kHz.
Examples:
To hear DIRECTIONS to this destination say YES.
I'd LOVE to hear how it SOUNDS.
It is BASED on the information that the company gathers, but not DEPENDENT on it.
Modeling Emphasis – Classification Scheme
- Modeled at the word level.
- Feature set: prosodic features derived from (i) pitch (absolute; speaker-normalized), (ii) duration, and (iii) energy measures.
- Individual classifiers are trained, and their results are stacked (this marginally
improves the generalization performance estimated through 10-fold CV).
[Diagram: Prosodic Features → K-Nearest Neighbor, Naïve Bayes, SVM → Interm. Output Probs. → Final Output Probs.]
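A minimal, standard-library sketch of stacking as described above: base classifiers emit intermediate probabilities, and a second-stage model combines them into a final probability. Several things here are assumptions rather than details from the slides: only two of the three base learners (k-NN and Gaussian naive Bayes) are included and the SVM is omitted for brevity, the combiner is a logistic regression, the intermediate probabilities are produced out-of-fold, and the two-dimensional "prosodic features" are synthetic.

```python
import math
import random


def knn_prob(train, x, k=5):
    """P(emphasis): share of the k nearest training words labelled 1."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return sum(label for _, label in nearest) / len(nearest)


def gnb_prob(train, x):
    """P(emphasis) from Gaussian naive Bayes fitted on `train`."""
    log_post = []
    for c in (0, 1):
        rows = [feats for feats, label in train if label == c]
        lp = math.log(len(rows) / len(train))
        for j, xj in enumerate(x):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-6
            lp += -0.5 * math.log(2 * math.pi * var) - (xj - mu) ** 2 / (2 * var)
        log_post.append(lp)
    top = max(log_post)
    e0, e1 = (math.exp(lp - top) for lp in log_post)
    return e1 / (e0 + e1)


BASE_LEARNERS = (knn_prob, gnb_prob)


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def fit_meta(Z, y, epochs=300, lr=0.1):
    """Logistic regression on intermediate probabilities (plain SGD)."""
    w = [0.0] * (len(Z[0]) + 1)  # w[0] is the bias
    for _ in range(epochs):
        for z, t in zip(Z, y):
            err = sigmoid(w[0] + sum(wi * zi for wi, zi in zip(w[1:], z))) - t
            w[0] -= lr * err
            for i, zi in enumerate(z):
                w[i + 1] -= lr * err * zi
    return w


def fit_stack(data, folds=10):
    """Collect out-of-fold base-learner outputs, then train the combiner."""
    Z, y = [], []
    for i in range(folds):
        held_out = data[i::folds]
        rest = [d for j, d in enumerate(data) if j % folds != i]
        for x, t in held_out:
            Z.append([learner(rest, x) for learner in BASE_LEARNERS])
            y.append(t)
    return fit_meta(Z, y)


def predict_stack(w, data, x):
    """Final output probability for one word."""
    z = [learner(data, x) for learner in BASE_LEARNERS]
    return sigmoid(w[0] + sum(wi * zi for wi, zi in zip(w[1:], z)))


# Synthetic word-level features, roughly 20% emphasized words.
rng = random.Random(0)


def toy_word(label):
    centre = 1.5 if label else 0.0
    return ((rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)), label)


data = [toy_word(1) for _ in range(20)] + [toy_word(0) for _ in range(80)]
rng.shuffle(data)
weights = fit_stack(data)
p_emphatic = predict_stack(weights, data, (2.0, 2.0))
p_plain = predict_stack(weights, data, (-0.5, -0.5))
print(p_emphatic > 0.5, p_plain < 0.5)
```

Training the combiner on out-of-fold outputs keeps it from simply trusting base learners that have memorized their own training words.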
Modeling Emphasis – Classification Results

FEMALE
  Class         TP Rate   FP Rate   Prec.   F-Meas.
  emphasis      0.82      0.06      0.78    0.80
  notemphasis   0.94      0.18      0.95    0.94
  Correctly Classified Instances: 91.2 %

MALE
  Class         TP Rate   FP Rate   Prec.   F-Meas.
  emphasis      0.80      0.06      0.75    0.77
  notemphasis   0.93      0.18      0.94    0.94
  Correctly Classified Instances: 89.9 %
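As a quick consistency check on the first results table (TP rate 0.82 and precision 0.78 for emphasis; 0.94 and 0.95 for notemphasis), the F-measure can be recomputed as the harmonic mean of precision and recall, with the TP rate playing the role of recall. The accuracy cross-check additionally assumes that about 20% of the evaluated words are emphasized, which is taken from the training-material description and is only approximate; the inputs are the rounded values from the table, so small discrepancies are expected.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


# (precision, TP rate) per class, rounded values from the first table.
first_table = {"emphasis": (0.78, 0.82), "notemphasis": (0.95, 0.94)}
f_scores = {name: f_measure(p, r) for name, (p, r) in first_table.items()}
for name, score in f_scores.items():
    print(f"{name}: F = {score:.4f}")

# ASSUMPTION: roughly 20% of words are emphasized.
prior_emphasis = 0.20
approx_accuracy = prior_emphasis * 0.82 + (1 - prior_emphasis) * 0.94
print(f"approximate accuracy = {approx_accuracy:.3f}")
```

The recomputed F-measures come out close to the reported 0.80 and 0.94, and the approximate accuracy lands within half a point of the reported 91.2 %.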
What does it find in the corpus?
I think they will diverge from bonds, and they may even go up.
Please say the full name of the person you want to call.
There's a long fly ball to deep center field. Going, going. It's gone, a home run.
Listening Tests – Stimuli and Conditions
[Table: target sentence types A (emphasis in text: N), B (emphasis in text: Y), and C (emphasis in text: Y), each marked against the synthesis sources Baseline Neutral Units, Baseline Corpus w/ Mined Emphasis, and Training Corpus w/ Explicit Emphasis; the check-mark positions did not survive extraction.]
Condition 1 Pair: 1 Type-A sentence vs. 1 Type-B sentence (in random order).
Condition 2 Pair: 1 Type-A sentence vs. 1 Type-C sentence (in random order).
Listening Tests – Setup
Condition 1 (12 pairs): B1 vs A1, A2 vs B2, A3 vs B3, …, B12 vs A12
Condition 2 (12 pairs): A1 vs C1, A2 vs C2, C3 vs A3, …, C12 vs A12
Shuffle: the pairs from both conditions are combined and shuffled.
List 1: A2 vs C2, B1 vs A1, B12 vs A12, …, A3 vs B3
List 2 (reverse-order pairs): C2 vs A2, A1 vs B1, A12 vs B12, …, B3 vs A3
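One way to generate playlists with this structure is sketched below: 12 pairs per condition with random order inside each pair, both conditions shuffled together into List 1, and List 2 obtained by reversing the order inside every pair. The function name, the seed, and the stimulus naming (`A1`, `B1`, ...) are illustrative assumptions; the slides do not describe how the lists were actually produced.

```python
import random


def build_playlists(n_sentences=12, seed=0):
    """Return (list1, list2), each a list of (first, second) stimulus pairs."""
    rng = random.Random(seed)
    pairs = []
    for emphatic_type in ("B", "C"):  # Condition 1 uses B, Condition 2 uses C
        for i in range(1, n_sentences + 1):
            pair = [f"A{i}", f"{emphatic_type}{i}"]
            rng.shuffle(pair)  # random presentation order within the pair
            pairs.append(tuple(pair))
    rng.shuffle(pairs)  # interleave the two conditions
    list1 = pairs
    # List 2 keeps the same sequence but swaps the order inside each pair.
    list2 = [(second, first) for first, second in list1]
    return list1, list2


list1, list2 = build_playlists()
print(list1[:3])
print(list2[:3])
```

Reversing the within-pair order across the two lists balances any bias toward the first or second stimulus heard.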
Listening Tests – Task Description
• A total of 31 participants listen to a playlist (16 to List 1; 15 to List 2).
• For each pair of stimuli, the listeners are asked to select which member of the pair contains emphasis-bearing words.
• No information is given about which words may be emphasized.
• Listeners may opt to listen to a pair repeatedly.
Listening Tests – Results

  Condition   Neutral (A)   Emphatic (B/C)
  1           61.6%         38.4%
  2           48.7%         51.3%
Conclusions
• When only the limited expressive corpus is considered, listeners actually prefer the neutral baseline. A possible explanation is that biasing the search heavily toward a small corpus introduces artifacts that interfere with the perception of emphasis.
• However, when the small expressive corpus is augmented with automatic annotations, the perception of intended emphasis increases significantly, by 13 percentage points (p<0.001).
• Although further work is needed to reliably convey emphasis, we have demonstrated the advantages of automatically mining the dataset to augment the search space of expressive synthesis units.
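The slides do not say which statistical test produced the p-value. As a rough plausibility check only, the sketch below runs a pooled two-proportion z-test on the share of "emphatic" choices in the two conditions (38.4% vs. 51.3%). It assumes 31 listeners × 12 pairs = 372 judgments per condition, recovers counts by rounding the reported percentages, and treats judgments as independent, which ignores the fact that each listener contributes many responses.

```python
import math


def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# ASSUMPTION: every listener judged all 12 pairs of each condition.
n_judgments = 31 * 12
emphatic_cond1 = round(0.384 * n_judgments)  # count inferred from 38.4%
emphatic_cond2 = round(0.513 * n_judgments)  # count inferred from 51.3%

z, p_value = two_proportion_z(emphatic_cond1, n_judgments,
                              emphatic_cond2, n_judgments)
print(f"z = {z:.2f}, p = {p_value:.5f}")
```

Under these assumptions the difference is significant at the 0.001 level, in line with the reported result.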
Future Work
• Explore alternative feature sets to improve automatic emphasis classification.
• Extend the proposed framework to automatically detect more complex expressions in a “neutral” database and augment the search space for our expressive systems (e.g., good news; apologies; uncertainty).
[Figure labels: A, N, GN, U]
• Explore how the perceptual confusion between different labels can be exploited to increase the range of expressiveness of the TTS system.