Prosodic Modeling for Detecting
Edit Disfluencies in Transcribing
Spontaneous Mandarin Speech
Che-Kuang Lin, Shu-Chuan Tseng* & Lin-Shan Lee
College of EECS, National Taiwan University,
Institute of Linguistics, Academia Sinica*,
Taipei, Taiwan
Outline
• Introduction
• Prosodic features
• IP detection models
• Latent prosodic modeling (LPM)
• LPM-based detection models
• Experiment results & further analysis
• Conclusion
Examples of disfluency considered in this paper (1/2)
The disfluency interruption point (IP) (*)
Overt repair
– 是(shi4) "is" | 進口(jin4kou3) "import" | 嗯(EN) [discourse particle] | 出口(chu1kou3) "export" | 嗎(ma1) [interrogative particle]
– Do you import * uhn export products?
– Structure: reparandum, *, optional editing term, resumption
Abandoned utterances
– 它(ta1) "it" | 有(you3) "has" | 一個(yi2ge5) "one" | 呃(E) [discourse particle] | 有個(you3ge5) "has a" | 度假村(du4jian4cun1) "resort" | 那邊(ne4bian1) "there" | 嘛(MA) [discourse particle]
– It has a * eh there is a resort there.
– Structure: reparandum, *, optional editing term, resumption
Examples of disfluency considered in this paper (2/2)
The disfluency interruption point (IP) (*)
Direct repetition
– 因為(yin1wei4) "because" | 因為(yin1wei4) "because" | 它(ta1) "it" | 有(you3) "has" | 健身(jian4shen1) "fitness" | 中心(zhong1xin1) "center"
– Because * because it has a fitness center.
– Structure: reparandum, *, resumption
Partial repetition
– 看(kan4) "watch" | 電(dian4) "electricity" | 看(kan4) "watch" | 電視(dian4shi4) "television" | 最近(zui4jin4) "recently" | 有(you3) "has" | 新(xin1) "new" | 電影(dian4ying3) "movie"
– On the tele- * on the television, there is a new film recently.
– Structure: reparandum, *, resumption
Introduction
• One of the primary problems in spontaneous speech recognition is the presence of disfluencies
• Accurate identification of various types of disfluencies
– helps the recognition process
– provides structural information about the utterances
• Purpose of this study
– To identify useful and important features for interruption point (IP) detection
– To analyze how these features are helpful in spontaneous Mandarin speech
Define a whole set of prosodic features for spontaneous Mandarin speech (1/2)
• A set of features has been proposed (Shriberg, 2000) for English
• Spontaneous Mandarin is quite different from Western languages
– Tonal language nature
• The same syllable with different tones represents different characters
→ PCA-based pitch contour smoothing and pitch-related features
– Mono-syllabic structure
• Every character has its own meaning and is pronounced as a monosyllable
• A word is composed of one to several characters (or syllables)
[Figure: an example utterance marked with word boundaries and syllable boundaries, showing a bi-character (bi-syllabic) word and a mono-character (mono-syllabic) word]
Define a whole set of prosodic features for spontaneous Mandarin speech (2/2)
• Every syllable boundary (rather than word boundary) is considered as a candidate for IP
• Define a whole set of prosodic features for each syllable boundary and use them to detect the IPs
[Figure: word boundaries vs. syllable boundaries, with bi-character (bi-syllabic) and mono-character (mono-syllabic) words]
Syllable-wise pitch contour smoothing
• We first proposed to use Principal Component Analysis (PCA) for efficient pitch contour smoothing
[Figure: pitch contours (fundamental frequency in Hz vs. frame number) of Tone 2, Tone 3 and Tone 4 syllables before and after smoothing. Each syllable contour is converted to a vector (v = c1 x + c2 y), PCA is applied, and the smoothed contour is obtained by projection onto the principal component (v' = w1 PC1).]
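The PCA-based smoothing idea can be sketched as follows. This is only a minimal illustration under assumptions not stated on the slide: every syllable contour is already resampled to the same number of frames, the contours are mean-centered, and only the first principal component (PC1) is kept. The function names and the toy contours are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leading_component(centered, iters=100):
    """Top principal direction via power iteration on X^T X."""
    # Start from the centered row with the largest norm, so the start
    # vector lies in the span of the data.
    pc = max(centered, key=lambda row: dot(row, row))
    norm = math.sqrt(dot(pc, pc))
    if norm == 0.0:
        return [0.0] * len(pc)
    pc = [v / norm for v in pc]
    for _ in range(iters):
        scores = [dot(row, pc) for row in centered]           # X pc
        nxt = [sum(s * row[j] for s, row in zip(scores, centered))
               for j in range(len(pc))]                       # X^T (X pc)
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            break
        pc = [v / norm for v in nxt]
    return pc

def smooth_contours(contours):
    """Replace each contour by mean + w1 * PC1 (w1 = projection on PC1)."""
    n = len(contours)
    mean = [sum(col) / n for col in zip(*contours)]
    centered = [[v - m for v, m in zip(row, mean)] for row in contours]
    pc1 = leading_component(centered)
    smoothed = []
    for row in centered:
        w1 = dot(row, pc1)
        smoothed.append([m + w1 * p for m, p in zip(mean, pc1)])
    return smoothed

# Toy F0 contours in Hz (5 frames each); the third has a one-frame glitch.
contours = [
    [200.0, 210.0, 220.0, 230.0, 240.0],
    [240.0, 230.0, 220.0, 210.0, 200.0],
    [200.0, 210.0, 260.0, 230.0, 240.0],
    [230.0, 225.0, 220.0, 215.0, 210.0],
]
smoothed = smooth_contours(contours)
```

Keeping more than one component would preserve more contour detail at the cost of less smoothing; the slide itself only shows a projection onto PC1.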
Feature Definitions –
Pitch-related Prosodic Features (1/2)
• The average pitch value within the syllable
• The maximum difference of pitch value within the syllable
• The average of absolute values of pitch variations within the syllable
• The magnitude of pitch reset for boundaries
• The difference of such feature values of adjacent syllable boundaries (P1-P2, d1-d2, etc.)
[Figure: smoothed pitch contours (fundamental frequency in Hz vs. frame number) of Tone 2, Tone 4 and Tone 3 syllables, annotated with the values P1, P2, d1 and d2]
• A total of 54 pitch-related features were obtained
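A minimal sketch of how the listed pitch features could be computed from per-frame F0 values. The slide does not give exact formulas, so the definitions below are assumptions; in particular, pitch reset is taken here as the absolute jump between the last frame before the boundary and the first frame after it. All function names are hypothetical.

```python
def average_pitch(f0):
    """Average pitch value within the syllable."""
    return sum(f0) / len(f0)

def max_pitch_difference(f0):
    """Maximum difference of pitch value within the syllable."""
    return max(f0) - min(f0)

def average_abs_variation(f0):
    """Average of absolute frame-to-frame pitch variations."""
    steps = [abs(b - a) for a, b in zip(f0, f0[1:])]
    return sum(steps) / len(steps) if steps else 0.0

def pitch_reset(left_f0, right_f0):
    """Assumed definition: absolute F0 jump across the syllable boundary."""
    return abs(right_f0[0] - left_f0[-1])

def adjacent_difference(value_1, value_2):
    """Difference of a feature value at adjacent boundaries (e.g. P1 - P2)."""
    return value_1 - value_2

# Toy smoothed F0 values (Hz) for the syllables left and right of a boundary.
left = [200.0, 210.0, 205.0]
right = [180.0, 190.0]
features = {
    "avg_left": average_pitch(left),
    "max_diff_left": max_pitch_difference(left),
    "avg_abs_var_left": average_abs_variation(left),
    "reset": pitch_reset(left, right),
}
```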
Feature Definitions –
Duration-related Prosodic Features (2/2)
• Deviation from normal speaking rhythmic structure is important
[Figure: pitch track (fundamental frequency in Hz vs. frame number) from the beginning to the end of an utterance, with syllables A, B, C, D, E, pauses a and b, and the syllable boundaries marked]
• Pause duration: b
• Average syllable duration: (B+C+D+E)/4 or ((D+E)/2 + C)/2
• Syllable duration ratio: (D+E)/(B+C) or ((D+E)/2)/C
• Combination of pause & syllable features (ratio or product): C*b, D*b, C/b, D/b
• Lengthening: C / ((A+B)/2)
• Standard deviation of feature values
• A total of 38 duration-related features were obtained
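The duration formulas above translate directly into code. The sketch below assumes the candidate boundary lies between syllables C and D with pause b at that boundary, which is how the formulas read but is not stated explicitly; the function name and the toy durations (in frames) are hypothetical.

```python
def duration_features(A, B, C, D, E, b):
    """Duration features for the boundary assumed to lie between C and D.

    A..E are syllable durations and b is the pause duration at the
    boundary, all in the same unit (frames here).
    """
    feats = {
        "pause": b,
        "avg_syllable": (B + C + D + E) / 4,
        "avg_syllable_alt": ((D + E) / 2 + C) / 2,
        "ratio": (D + E) / (B + C),
        "ratio_alt": ((D + E) / 2) / C,
        "lengthening": C / ((A + B) / 2),
        "C_times_b": C * b,
        "D_times_b": D * b,
    }
    # The ratio features are undefined when there is no pause (b == 0).
    feats["C_over_b"] = C / b if b > 0 else None
    feats["D_over_b"] = D / b if b > 0 else None
    return feats

feats = duration_features(A=20, B=20, C=40, D=10, E=30, b=50)
```

In this toy case the syllable before the boundary is twice as long as the average of the two preceding syllables, so `lengthening` is 2.0.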
Detection model
Decision Tree
• Trees are grown on training data based on the maximum entropy reduction criterion
[Figure: an illustrative decision tree for IP detection. Nodes ask questions such as "pitch offset < 12.99?", "have pause?" and "syl_dur_ratio < 4.5?", and carry [IP, nonIP] probabilities such as (0.2, 0.8), (0.6, 0.4), (0.8, 0.2) and (0.3, 0.7).]
• Probability of IP is found by traversing across the trees down to a certain leaf node
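Tree traversal to a leaf can be sketched as below. The questions and probabilities are taken from the illustrative figure, but the way they are wired together (which question sits under which branch, and which leaf belongs to which answer) is an assumption made only for this example.

```python
# Internal node: (question, yes_branch, no_branch); leaf: dict of probabilities.
# The arrangement of questions and leaves is hypothetical.
TREE = (
    lambda s: s["pitch_offset"] < 12.99,
    {"IP": 0.2, "nonIP": 0.8},
    (
        lambda s: s["has_pause"],
        (
            lambda s: s["syl_dur_ratio"] < 4.5,
            {"IP": 0.8, "nonIP": 0.2},
            {"IP": 0.3, "nonIP": 0.7},
        ),
        {"IP": 0.6, "nonIP": 0.4},
    ),
)

def ip_probability(node, sample):
    """Walk from the root to a leaf and return the leaf's IP probability."""
    while not isinstance(node, dict):
        question, yes_branch, no_branch = node
        node = yes_branch if question(sample) else no_branch
    return node["IP"]

sample = {"pitch_offset": 20.0, "has_pause": True, "syl_dur_ratio": 3.0}
p_ip = ip_probability(TREE, sample)
```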
Maximum Entropy Model (1/2)
• Various problem-specific knowledge can be incorporated into the model through many properly designed feature functions
• Feature functions $f_i$ (can be binary or real valued)
– Take binary feature function for example
$$f_i(x, y) = \begin{cases} 1 & \text{if some condition on } x \text{ and } y \text{ is satisfied} \\ 0 & \text{otherwise} \end{cases}$$
– x: prosodic features at the syllable boundary
– i: a set of features
– y: IP or non-IP
– Example condition: we have pause at the boundary (x) and the boundary is an IP (y)
Maximum Entropy Model (2/2)
• Known statistics: the expectation of each feature function $f_i$ obtained from the training data
• Model: the expectation of each feature function $f_i$ with respect to the desired model
• Each constraint requires these two expectations to match
• Among all the distributions that satisfy the set of constraints, choose the one with the highest entropy
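The two expectations compared by the maximum entropy constraints can be illustrated with a small sketch. It uses the pause-and-IP example as the binary feature function and the standard log-linear form of a conditional maximum entropy model; the sample data, the weights and all function names are hypothetical.

```python
import math

LABELS = ("IP", "nonIP")

def f_pause_and_ip(x, y):
    """Binary feature: pause at the boundary (x) and the boundary is an IP (y)."""
    return 1.0 if x["has_pause"] and y == "IP" else 0.0

def empirical_expectation(f, samples):
    """Expectation of f under the training data (the known statistics)."""
    return sum(f(x, y) for x, y in samples) / len(samples)

def model_prob(y, x, features, weights):
    """Log-linear model: p(y | x) proportional to exp(sum_i w_i * f_i(x, y))."""
    scores = {lab: math.exp(sum(w * f(x, lab) for f, w in zip(features, weights)))
              for lab in LABELS}
    return scores[y] / sum(scores.values())

def model_expectation(f, samples, features, weights):
    """Expectation of f under the model, using the observed x values."""
    total = 0.0
    for x, _ in samples:
        for lab in LABELS:
            total += model_prob(lab, x, features, weights) * f(x, lab)
    return total / len(samples)

samples = [
    ({"has_pause": True}, "IP"),
    ({"has_pause": True}, "IP"),
    ({"has_pause": True}, "nonIP"),
    ({"has_pause": False}, "nonIP"),
]
features = [f_pause_and_ip]
observed = empirical_expectation(f_pause_and_ip, samples)
untrained = model_expectation(f_pause_and_ip, samples, features, [0.0])
```

With all weights at zero the model is uniform over the two labels, so the two expectations differ; training adjusts the weights until they agree.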
Integrating DT & Maxent (DT-ME)
• We use decision trees built with the training data to derive the feature functions for maximum entropy model
• First, grow deep and bushy trees from training data
• Then, for each sample (training or testing)
– Traverse across the trees down to certain leaves
– Each leaf serves as a single (binary) feature function
• i.e. whether the sample falls in this leaf (1) or not (0)
[Figure: the resulting binary feature vector, with one 0/1 entry per leaf]
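The leaf-to-feature mapping can be sketched as follows, assuming each tree has already been traversed and has reported the index of the leaf the sample reached. The function name and the leaf numbering are hypothetical.

```python
def leaf_indicator_features(leaf_hits, leaves_per_tree):
    """Concatenate one-hot leaf indicators over all trees.

    leaf_hits[t] is the index of the leaf reached in tree t and
    leaves_per_tree[t] is the number of leaves of tree t.
    """
    vector = []
    for hit, n_leaves in zip(leaf_hits, leaves_per_tree):
        if not 0 <= hit < n_leaves:
            raise ValueError("leaf index out of range")
        vector.extend(1 if leaf == hit else 0 for leaf in range(n_leaves))
    return vector

# Sample reaches leaf 1 of a 3-leaf tree and leaf 0 of a 2-leaf tree.
vector = leaf_indicator_features([1, 0], [3, 2])
```

Each entry of the vector then acts as one binary feature function for the maximum entropy model.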
Latent Prosodic Modeling
Latent prosodic modeling (LPM)
• Model the probabilistic behavior of prosodic features in terms of latent factors
• Prosodic characters and terms: derived from prosodic features
[Figure: processing pipeline. Speech passes through 1st pass recognition to give recognized syllables; the prosodic features x are vector quantized (VQ) into prosodic characters (23, 5, 0, 14, 31, …), and N-grams of prosodic characters form the prosodic terms ((23,5), (5,0), (23,5,0), …).]
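A minimal sketch of the two steps, assuming a nearest-codeword quantizer with squared Euclidean distance and taking bigrams and trigrams as the N-grams (the orders that appear in the example terms). The codebook values and function names are hypothetical.

```python
def quantize(vector, codebook):
    """Return the index of the nearest codeword (the prosodic character)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def prosodic_terms(characters, n_values=(2, 3)):
    """Form N-grams of prosodic characters (the prosodic terms)."""
    terms = []
    for n in n_values:
        for start in range(len(characters) - n + 1):
            terms.append(tuple(characters[start:start + n]))
    return terms

codebook = [[0.0, 0.0], [10.0, 10.0]]    # toy 2-entry codebook
character = quantize([9.0, 8.0], codebook)
terms = prosodic_terms([23, 5, 0])
```

For the character sequence 23, 5, 0 this yields the terms (23,5), (5,0) and (23,5,0), matching the example on the slide.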
Latent prosodic modeling (LPM)
• Prosodic documents of three levels (segment, utterance & speaker): collections of prosodic terms
– prosodic characters: 23, 5, 0, 14, 31, …
– prosodic terms: (23,5), (5,0), (23,5,0)…
– prosodic documents: $d_{seg,i}$, $d_{utt,i}$, $d_{spk,i}$
– The segments are obtained from the best fitting piecewise linear function for the pitch contour
[Figure: a sequence of prosodic characters grouped into prosodic documents on segment level ($d_{seg,1}$, $d_{seg,2}$, $d_{seg,3}$, …), on utterance level ($d_{utt,1}$, $d_{utt,2}$, …) and on speaker level ($d_{spk,1}$, $d_{spk,2}$, …)]
Latent prosodic modeling (LPM)
• The relationship between the prosodic terms and prosodic documents is modeled via latent prosodic states in the probabilistic framework of Probabilistic Latent Semantic Analysis (PLSA)
[Figure: prosodic documents $d_1, \dots, d_N$ are linked to latent prosodic states $z_1, \dots, z_L$ with probabilities $P(z_l \mid d_i)$, and the states are linked to prosodic terms $t_1, \dots, t_{N'}$ with probabilities $P(t_k \mid z_l)$]
$$P(t_k \mid d_i) = \sum_{l=1}^{L} P(t_k \mid z_l)\, P(z_l \mid d_i), \quad \forall i, k$$
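The PLSA decomposition of P(t_k | d_i) is a mixture over the latent prosodic states, which can be checked numerically with a small sketch. The probability tables below are toy values and the function name is hypothetical.

```python
def term_given_doc(k, p_t_given_z, p_z_given_d):
    """P(t_k | d_i) = sum over l of P(t_k | z_l) * P(z_l | d_i)."""
    return sum(p_t_given_z[l][k] * p_z_given_d[l]
               for l in range(len(p_z_given_d)))

# Two latent states, two terms.
p_t_given_z = [[0.5, 0.5],     # P(t | z_1)
               [0.1, 0.9]]     # P(t | z_2)
p_z_given_d = [0.25, 0.75]     # P(z | d_i)

p_terms = [term_given_doc(k, p_t_given_z, p_z_given_d) for k in range(2)]
```

Because each P(t | z_l) and P(z | d_i) sums to one, the resulting P(t | d_i) is again a distribution over terms.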
Latent prosodic modeling (LPM)
• The probabilities were trained with EM algorithm by maximizing the total likelihood function:
$$L_T = \sum_{i=1}^{N} \sum_{k=1}^{N'} n(t_k, d_i) \log P(t_k \mid d_i)$$
• The complicated behavior of the prosodic features can then be analyzed based on these probabilities
– For instance: similarity measures
$$Sim_{LPM}(d_i, d_j) = \frac{\sum_{l} P(z_l \mid d_i)\, P(z_l \mid d_j)}{\sqrt{\sum_{l} [P(z_l \mid d_i)]^2}\, \sqrt{\sum_{l} [P(z_l \mid d_j)]^2}}$$
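A compact sketch of EM training for this kind of model and of a cosine-style similarity between documents in the latent space. It uses the textbook PLSA updates with random initialization, assumes n(t_k, d_i) is the count of term t_k in document d_i and that every document is non-empty, and makes no claim to match the authors' implementation. The count matrix and function names are hypothetical.

```python
import math
import random

def normalize(row):
    total = sum(row)
    return [v / total for v in row]

def plsa_em(counts, num_states, iters=30, seed=0):
    """counts[i][k] = n(t_k, d_i). Returns P(t|z), P(z|d) and the L_T history."""
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])
    p_t_z = [normalize([rng.random() + 0.1 for _ in range(n_terms)])
             for _ in range(num_states)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_states)])
             for _ in range(n_docs)]
    history = []
    for _ in range(iters):
        new_tz = [[0.0] * n_terms for _ in range(num_states)]
        new_zd = [[0.0] * num_states for _ in range(n_docs)]
        loglik = 0.0
        for i in range(n_docs):
            for k in range(n_terms):
                n = counts[i][k]
                if n == 0:
                    continue
                joint = [p_t_z[l][k] * p_z_d[i][l] for l in range(num_states)]
                p_td = sum(joint)                  # P(t_k | d_i)
                loglik += n * math.log(p_td)       # contribution to L_T
                for l in range(num_states):
                    post = joint[l] / p_td         # E-step: P(z_l | d_i, t_k)
                    new_tz[l][k] += n * post
                    new_zd[i][l] += n * post
        p_t_z = [normalize(row) for row in new_tz]   # M-step
        p_z_d = [normalize(row) for row in new_zd]
        history.append(loglik)
    return p_t_z, p_z_d, history

def sim_lpm(p_zd_i, p_zd_j):
    """Cosine similarity between two documents' P(z | d) vectors."""
    num = sum(a * b for a, b in zip(p_zd_i, p_zd_j))
    den = (math.sqrt(sum(a * a for a in p_zd_i))
           * math.sqrt(sum(b * b for b in p_zd_j)))
    return num / den if den > 0 else 0.0

counts = [[4, 0, 1], [3, 1, 0], [0, 5, 4], [1, 4, 5]]   # toy term counts
p_t_z, p_z_d, history = plsa_em(counts, num_states=2)
similarity = sim_lpm(p_z_d[0], p_z_d[1])
```

The log-likelihood recorded in `history` is evaluated before each update, so it should not decrease from one iteration to the next.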
LPM-based Detection model
LPM for IP detection
• LPM-based model adaptation
– LPM model is trained on the raw data and used for actively selecting relevant training data for a specific testing condition
[Figure: LPM-based detection model adaptation. The corpus of training prosodic documents is used for LPM training and latent prosodic space construction. Training and testing prosodic documents are projected into the latent prosodic space (axes P(z1|di), P(z2|di), P(z3|di)), compared, and selected by HAC-based or KNN-based selection. The selected prosodic documents are used for detection model training, giving the LPM-adapted detection models.]
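The KNN-based selection step can be sketched as picking the training documents whose latent vectors P(z | d) are most similar to that of a testing document. Cosine similarity is assumed as the comparison measure here, and the HAC-based variant is not covered. The vectors and function names are hypothetical.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den > 0 else 0.0

def select_training_docs(test_vec, train_vecs, k):
    """Indices of the k training documents closest to the test document."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(test_vec, train_vecs[i]),
                    reverse=True)
    return ranked[:k]

# P(z | d) vectors of three training documents and one testing document.
train_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
test_vec = [1.0, 0.0]
selected = select_training_docs(test_vec, train_vecs, k=2)
```

A detection model would then be retrained on the data behind the selected documents only.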
LPM for IP detection
• Anchor model training
– Merging the associated prosodic documents for different classes of disfluency IPs into super-documents to be used in training Anchor models
– The prosody of IP candidates was then compared against these Anchors
– Can be used with training data selection mentioned above
[Figure: anchor model training and classification. Anchors for class 1 through class c are placed in the latent prosodic space with axes P(z1|di), P(z2|di), P(z3|di), and IP candidates are classified against them.]
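A rough sketch of the two anchor-model steps: merging term counts per class into super-documents, and comparing a candidate against the anchors. It assumes the latent vectors P(z | anchor) of the super-documents have already been obtained from the LPM, and uses cosine similarity for the comparison; both are assumptions, and all names and numbers are hypothetical.

```python
import math

def merge_super_documents(doc_counts, labels):
    """Sum the term counts of all documents that share a class label."""
    merged = {}
    for counts, label in zip(doc_counts, labels):
        if label not in merged:
            merged[label] = [0] * len(counts)
        merged[label] = [a + b for a, b in zip(merged[label], counts)]
    return merged

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den > 0 else 0.0

def classify(candidate_vec, anchor_vecs):
    """Return the class whose anchor is most similar to the candidate."""
    return max(anchor_vecs, key=lambda c: cosine(candidate_vec, anchor_vecs[c]))

doc_counts = [[1, 0, 2], [0, 1, 1], [3, 0, 0]]
labels = ["overt repair", "direct repetition", "overt repair"]
super_docs = merge_super_documents(doc_counts, labels)

# Latent vectors of the anchors, assumed to come from the LPM.
anchor_vecs = {"overt repair": [0.8, 0.2], "direct repetition": [0.2, 0.8]}
predicted = classify([0.9, 0.1], anchor_vecs)
```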
LPM for IP detection
• Integration of LPM-adapted classification models with SVM
– Two classification models: DT-ME or Anchor model
– Adapted at segment, utterance, or speaker level
[Figure: latent prosodic modeling yields segment-type, utterance-type and speaker-type models; an SVM combines their scores with those of the models without LPM to produce the decision]
LPM for IP detection
• LPM-based feature expansion for DT-ME
• Two sets of features can be used
– The probabilities of each prosodic state related to the prosodic document:
(F1) $P(z_l \mid d_i), \quad \forall l$
– The likelihood of the prosodic terms given the prosodic document:
(F2) $\prod_{t_k \in d_i} P(t_k \mid d_i) = \prod_{t_k \in d_i} \sum_{l=1}^{L} P(t_k \mid z_l)\, P(z_l \mid d_i)$
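Both feature sets can be computed from the LPM probability tables, as in the sketch below. Two choices here are assumptions: the product in (F2) is taken over every term occurrence in the document, and it is returned as a log-likelihood to avoid numerical underflow on long documents. The tables and function names are hypothetical.

```python
import math

def f1_features(p_z_given_d):
    """(F1): the state probabilities P(z_l | d_i) for all l."""
    return list(p_z_given_d)

def f2_log_likelihood(term_ids, p_t_given_z, p_z_given_d):
    """(F2) in the log domain: sum over terms of log P(t_k | d_i)."""
    total = 0.0
    for k in term_ids:
        p_td = sum(p_t_given_z[l][k] * p_z_given_d[l]
                   for l in range(len(p_z_given_d)))
        total += math.log(p_td)
    return total

p_t_given_z = [[0.5, 0.5], [0.1, 0.9]]
p_z_given_d = [0.25, 0.75]
doc_terms = [0, 1, 1]            # term indices occurring in the document

f1 = f1_features(p_z_given_d)
f2 = f2_log_likelihood(doc_terms, p_t_given_z, p_z_given_d)
```

Here P(t_0 | d) = 0.2 and P(t_1 | d) = 0.8, so the product is 0.2 × 0.8 × 0.8 = 0.128 and `f2` is its logarithm.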
Experiment Results
Corpus Used in the Research
• Mandarin Conversational Dialogue Corpus (MCDC)
• 30 conversational dialogues (27 hours in total)
• 8 dialogues out of the 30 were annotated with disfluencies
– (8.2 hours in total, 9 female & 7 male speakers)
• The summary of experiment data

                      train    test
Data length           7.1hr    1.1hr
Number of non-IPs     92189    14231
Number of IPs         3569     536
Chance of non-IPs     96.3%    96.4%
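The "Chance of non-IPs" row follows from the two count rows: it is the share of syllable boundaries that are not IPs, which is also the accuracy a detector would reach by always answering non-IP. A short check of the arithmetic (the function name is hypothetical):

```python
def chance_of_non_ip(num_non_ip, num_ip):
    """Percentage of boundaries that are non-IPs, rounded to one decimal."""
    return round(100.0 * num_non_ip / (num_non_ip + num_ip), 1)

train_chance = chance_of_non_ip(92189, 3569)   # 92189 / 95758
test_chance = chance_of_non_ip(14231, 536)     # 14231 / 14767
```

This strong class imbalance is why accuracy alone is hard to interpret and why recall and precision on IPs are reported as well.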
IP detection results

                      Recall (%)   Precision (%)
Decision Tree         73.15        73.03
Integrated approach   56.38        81.95

• The decision tree achieved moderate and balanced recall and precision rates
• The integrated approach trades degraded recall for significantly better precision
• For the purpose of speech recognition, the integrated approach is more appropriate
– An incorrectly detected IP may cause recognition errors
– A missed IP can be processed as usual
Analysis of Importance of Roles
for Different Features
Identify important features for IP detection
• Exclude each single feature from the full set and then perform the complete IP detection process
• The detection performance degradation caused by removing each single feature is obtained
[Figure: original performance with the full feature set vs. degraded performance with one feature removed]
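The leave-one-feature-out procedure can be sketched as a simple loop. The `evaluate` callback stands in for the complete IP detection process (training plus scoring) and is hypothetical, as are the feature names and the toy scoring function used to exercise the loop.

```python
def feature_ablation(features, evaluate):
    """Degradation per feature: score without it minus score with the full set.

    Returns (feature, degradation) pairs, most harmful removal first.
    """
    baseline = evaluate(list(features))
    degradation = {}
    for name in features:
        reduced = [f for f in features if f != name]
        degradation[name] = evaluate(reduced) - baseline
    return sorted(degradation.items(), key=lambda item: item[1])

# Toy stand-in for the detection pipeline: each feature adds a fixed amount.
TOY_WEIGHTS = {"pause_duration": 10.0, "pitch_reset": 5.0, "average_pitch": 1.0}

def toy_evaluate(feature_subset):
    return sum(TOY_WEIGHTS[f] for f in feature_subset)

ranking = feature_ablation(list(TOY_WEIGHTS), toy_evaluate)
```

In a real run each call to `evaluate` retrains and rescores the detector, so the cost grows linearly with the number of features.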
Investigation on how the two feature categories are related to the IP detection
• The most serious performance degradation caused by removing one single feature from the two categories is shown
[Figure: bar chart of performance degradation (axis from 0.00 to -30.00) for abandoned utterances, overt repair, direct repetition and partial repetition, comparing pitch-related features and duration-related features]
• For overt repair and partial repetition, pitch-related features play a relatively more important role for IP detection
• For direct repetition IP detection, the duration-related features are more important
• For abandoned utterances IP detection, both features have equally important impact
Importance of each individual pitch-related feature
• The most serious performance degradation caused by removing one single pitch-related feature

Disfluency Types        Most Important Feature    Second Important Feature
                        (recall degradation)      (recall degradation)
1. abandoned utts       (a) (-17.25)              (b) (-14.97)
2. overt repairs        (c) (-26.67)              (a) (-20.00)
3. direct repetition    (d) (-5.40)               (e) (-5.40)
4. partial repetition   (b) (-18.21)              (f) (-18.21)

(a)(b)(c)(d)(e)(f): some specific features
• Average pitch value within a syllable: (b), (d) → disfluency types 1, 3, 4
• Maximum difference of pitch values within a syllable: (e), (f) → disfluency types 3, 4
• Magnitude of pitch reset for boundaries: (a) → disfluency types 1, 2
Importance of each individual duration-related feature
• The most serious performance degradation caused by removing one single duration-related feature

Disfluency Types        Most Important Feature    Second Important Feature
                        (recall degradation)      (recall degradation)
1. abandoned utts       (g) (-17.25)              (h) (-14.97)
2. overt repairs        (i) (-13.33)              (j) (-13.33)
3. direct repetition    (k) (-8.10)               (l) (-8.10)
4. partial repetition   (h) (-16.33)              (m) (-16.33)

(g)(h)(i)(j)(k)(l)(m): some specific features
• Jointly considering both the syllable & pause duration is useful
– The ratio of syllable duration to pause duration: (g), (h), (k) → disfluency types 1, 3, 4
– The product of them: (i), (j), (m) → disfluency types 2, 4
• The character duration ratio across the boundary: (l) → disfluency type 3
• Standard deviation of the product of syllable with pause duration: (m) → disfluency type 4
Results for IP detection
IP detection experiment
• Three different feature sets tested (using Decision tree)
– [feature set 1] The same as used in previous work
– [feature set 2] The same as the above but extracted for syllable boundaries
– [proposed feature set] Proposed features extracted for syllable boundaries
[Figure: detection accuracy (%), axis from 40 to 80, for feature set 1, feature set 2 and the proposed feature set under the ktr and rec conditions. ktr: known transcription; rec: recognition results with errors.]
IP detection experiment
• Comparison of different IP detection approaches
– Using the new feature set proposed here
[Figure: detection accuracy (%), axis from 96.9 to 97.6, for DT, Maxent and DT-ME (DT-Maxent) under the rec condition]
LPM for IP detection
• IP detection accuracy using LPM-based DT-ME or anchor models
[Figure: IP detection Acc(%), axis from 78 to 85, for DT-ME, maxent and anchor models. Panel (a), HAC: plain, seg, utt, spk and all (segment-, utterance- & speaker-type models). Panel (b), kNN: plain and utterance-type model.]
LPM for IP detection
• (e) yielded the best result, when the final enhanced DT-ME model was combined with the anchor model by SVM
– (a): maxent with SVM combiner
– (b): (a) plus P(t|d) features (F2)
– (c): (a) plus P(z|d) features (F1)
– (d): (a) plus P(t|d) and P(z|d) features (F1)+(F2)
– (e): (d) plus anchor
[Figure: IP detection Acc(%), axis from 82 to 85, for the various combined methods (a) to (e)]
Conclusion
• A whole set of features for disfluency IP
detection is developed, tested and analyzed.
• The most important features for each
disfluency type were identified and discussed
• A new disfluency IP detection model that
incorporates decision trees into a maximum
entropy model was developed
• Latent prosodic modeling for analyzing speech
prosody using a probabilistic framework of
latent prosodic states is proposed to adapt IP
detection models
Thanks for your attention