Detection of Target Speakers in Audio Databases

Ivan Magrin-Chagnolleau*, Aaron E. Rosenberg**, and S. Parthasarathy**
* Rice University, Houston, Texas
** AT&T Labs Research, Florham Park, New Jersey
ivan@ieee.org - aer@research.att.com - sps@research.att.com
Database:
 One-target-speaker detection:
• subset aABC_NLI of the HUB4 database (ABC Nightline)
• target speaker: Ted Koppel
• 3 broadcasts for training the target model
• 12 broadcasts for testing (26 to 35 minutes)
 Two-target-speaker detection:
• subset bABC_WNN of the HUB4 database (ABC World News Now)
• target speakers: Mark Mullen (T1) and Thalia Assuras (T2)
• 3 broadcasts for training the target models
• 16 broadcasts for testing (29 to 31 minutes)
 Quality categories (used in the one-target-speaker tests):
• high-fidelity: high-fidelity speech with no background
• clean: all quality categories with no background
• allspeech: all quality categories with or without background
• alldata: the previous category plus all the untranscribed portions
Problem and Definitions:
 Data - broadcast-band audio data from television news programs, containing speech segments from a variety of speakers plus segments of mixed speech and music (typically commercials) and of music only. Speech segments may vary in quality and may be contaminated by music, speech, and/or noise backgrounds.
 Speaker detection task - locate and label segments of designated speakers (target speakers) in the data.
 Overall goal - aid information retrieval from large multimedia databases.
 Assumption - segmented and labeled training data exist for target speakers, other speakers, and other audio material.
Detection algorithm:
 log-likelihood ratio:
log R(x_t | λ_T ; λ_B) = log L(x_t | λ_T) − log L(x_t | λ_B)
 smoothed log-likelihood ratio, computed every d vectors:
v = log R(x_{t0−Δt}, …, x_{t0+Δt} | λ_T ; λ_B)
with Δ = 2Δt + 1 = 100 vectors (1s) and d = 20 vectors (0.2s)
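The two steps above are straightforward to implement. Below is a minimal Python sketch, assuming the target and background models λ_T and λ_B are scikit-learn GaussianMixture objects (the poster's models are 64-mixture diagonal-covariance GMMs; see Modeling) and a 10 ms frame step, so that 100 vectors span about 1 s. The function name smoothed_llr is illustrative, not from the poster.

```python
import numpy as np

def smoothed_llr(frames, gmm_target, gmm_background, width=100, step=20):
    """Smoothed log-likelihood ratio v, one score every `step` frames.

    frames: (T, D) array of feature vectors x_t.
    width:  smoothing window Delta = 2*Delta_t + 1 = 100 vectors (~1 s).
    step:   decimation d = 20 vectors (~0.2 s).
    """
    # Per-frame log-likelihoods: score_samples returns log L(x_t | lambda).
    llr = gmm_target.score_samples(frames) - gmm_background.score_samples(frames)
    half = width // 2
    centers = np.arange(half, len(frames) - half, step)
    # Assuming frame independence, the window score is the sum of the
    # per-frame log-ratios over x_{t0-Delta_t}, ..., x_{t0+Delta_t}.
    scores = np.array([llr[t - half:t + half + 1].sum() for t in centers])
    return centers, scores
```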
Results:

Quality       | Total duration | # target segments | Duration | Background | # estimated segments | FMIR | FFAR  | SMIR  | SFAR
------------- | -------------- | ----------------- | -------- | ---------- | -------------------- | ---- | ----- | ----- | ----------
high-fidelity | 141 min        | 137               | 61 min   | B1         | 129                  | 3.9% | 10.8% | 7.3%  | 5.5 / hour
high-fidelity | 141 min        | 137               | 61 min   | B1,B2      | 129                  | 4.0% | 10.5% | 7.3%  | 5.5 / hour
clean         | 194 min        | 257               | 69 min   | B1         | 195                  | 4.9% | 9.2%  | 19.1% | 8.4 / hour
clean         | 194 min        | 257               | 69 min   | B1,B2      | 195                  | 5.1% | 8.8%  | 19.5% | 7.4 / hour
allspeech     | 242 min        | 318               | 78 min   | B1         | 238                  | 6.7% | 7.2%  | 25.2% | 6.0 / hour
allspeech     | 242 min        | 318               | 78 min   | B1,B2      | 238                  | 7.2% | 7.0%  | 26.4% | 5.2 / hour
alldata       | 359 min        | 354               | 78 min   | B1         | 256                  | 8.9% | 5.7%  | 27.4% | 4.4 / hour
alldata       | 359 min        | 354               | 78 min   | B1,B2      | 256                  | 9.6% | 5.4%  | 29.7% | 4.2 / hour

Results of the one-target-speaker detection experiments
 segmentation algorithm: target segments are estimated from the smoothed log-likelihood ratio scores using a simple sequential decision technique (see the sketch below).
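The poster does not spell out the sequential decision rule, so the sketch below is only an illustrative stand-in: it thresholds the smoothed scores and merges consecutive above-threshold windows into segments, discarding segments shorter than a minimum duration. The threshold, min_len, and frame_rate parameters are assumptions, not values from the poster.

```python
def segments_from_scores(centers, scores, threshold=0.0, min_len=2.0,
                         frame_rate=100.0):
    """Group above-threshold windows into (start_s, end_s) target segments.

    centers: frame indices of the smoothed scores (one every d frames).
    threshold, min_len, frame_rate: illustrative parameters, not from the poster.
    """
    segments, start = [], None
    for t, s in zip(centers, scores):
        if s > threshold and start is None:
            start = t                      # segment opens
        elif s <= threshold and start is not None:
            segments.append((start, t))    # segment closes
            start = None
    if start is not None:
        segments.append((start, centers[-1]))
    # Convert frame indices to seconds and enforce a minimum duration.
    return [(a / frame_rate, b / frame_rate) for a, b in segments
            if (b - a) / frame_rate >= min_len]
```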
Modeling:
 Feature vectors:
20 cepstral coefficients + 20 Δ-cepstral coefficients.
 Gaussian mixture models:
64 mixture components with diagonal covariance matrices.
 Target speaker models:
• Three 90s segments of high-fidelity speech, extracted from 3 broadcasts, concatenated together.
 First background model (B1):
• Eight 60s segments of high-fidelity speech (4 females, 4 males), concatenated together (from aABC_NLI).
 Second background model (B2):
• Three 90s segments of non-speech data (music only 10%, noise only 10%, commercials 80%), extracted from 3 broadcasts, concatenated together (from aABC_NLI).
 Third background model (B3):
• 29 segments (293.5s) of high-fidelity speech (10 females, 10 males), concatenated together (from cABC_WNT).
 Fourth background model (B4):
• 23 segments (561.2s) of non-speech data (commercials + theme music), extracted from 2 broadcasts, concatenated together (from bABC_WNN).
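A sketch of how models of this kind can be trained, assuming scikit-learn and the figures stated above (64-mixture diagonal-covariance GMMs, 20 cepstral + 20 Δ-cepstral coefficients). The delta estimate shown (a first-order frame difference via np.gradient) is one common choice; the poster does not specify its exact delta scheme, and add_deltas / train_gmm are illustrative names.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def add_deltas(cep):
    """Append Delta-cepstra to (T, 20) cepstral frames, giving (T, 40) vectors.
    Simple first-order frame difference; the poster's exact scheme is not given."""
    return np.hstack([cep, np.gradient(cep, axis=0)])

def train_gmm(segments, n_components=64):
    """Fit a diagonal-covariance GMM on training segments concatenated
    together, as the poster does for its target and background models.
    segments: list of (T_i, 40) feature arrays."""
    X = np.vstack(segments)
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          max_iter=200, random_state=0)
    return gmm.fit(X)
```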
Target speakers | Segt. duration | # target segments | Duration | FMIR [FCOR]  | FFAR | SMIR  | SCOR        | SFAR
--------------- | -------------- | ----------------- | -------- | ------------ | ---- | ----- | ----------- | -----------
T1              | > 4s           | 111               | 33 min   | 17.6%        | 2.0% | 18.0% | -           | 9.9 / hour
T1              | > 2s           | 153               | 35 min   | 20.0%        | 1.7% | 30.7% | -           | 6.9 / hour
T1              | All            | 225               | 37 min   | 21.9%        | 1.7% | 47.6% | -           | 5.2 / hour
T1,T2           | > 4s           | 225               | 74 min   | 13.2% [1.3%] | 3.4% | 17.3% | 0.4 / hour  | 13.0 / hour
T1,T2           | > 2s           | 308               | 78 min   | 14.8% [2.2%] | 2.9% | 26.0% | 4.5 / hour  | 8.8 / hour
T1,T2           | All            | 462               | 80 min   | 15.7% [3.3%] | 2.8% | 31.6% | 19.3 / hour | 8.0 / hour

Results of the two-target-speaker detection experiments for the alldata category, using B3,B4 for the background models
Evaluation:
 Frame-level Miss Rate (FMIR):
(# labeled target frames not estimated as target frames) / (total # labeled target frames)
 Frame-level False Alarm Rate (FFAR):
(# estimated target frames labeled as non-target frames) / (total # labeled non-target frames)
 Frame-level COnfusion Rate (FCOR):
(# labeled target frames estimated as target frames of another speaker) / (total # labeled target frames)
(FCOR is a component of FMIR)
 Segment-level Miss Rate (SMIR):
(# missed segments) / (total # target segments)
 Segment-level False Alarm Rate (SFAR):
(# false alarm segments) / (total duration of the broadcast)
 Segment-level COnfusion Rate (SCOR):
(# confusion segments) / (total duration of the broadcast)
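At the frame level these rates reduce to simple counts over aligned label and decision sequences. A minimal sketch, assuming per-frame integer labels where 0 means non-target and each target speaker has a positive id; the array names are illustrative.

```python
import numpy as np

def frame_rates(ref, hyp, target):
    """FMIR, FFAR and FCOR for one target speaker.

    ref: (T,) true per-frame labels (0 = non-target, >0 = speaker id).
    hyp: (T,) estimated per-frame labels, same convention.
    """
    is_tgt = ref == target
    # FMIR: labeled target frames not estimated as this target.
    fmir = np.mean(hyp[is_tgt] != target)
    # FFAR: labeled non-target frames estimated as this target.
    ffar = np.mean(hyp[~is_tgt] == target)
    # FCOR: labeled target frames estimated as another (positive-id) speaker;
    # by construction a component of FMIR.
    fcor = np.mean((hyp[is_tgt] != target) & (hyp[is_tgt] > 0))
    return fmir, ffar, fcor
```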
Conclusion:
 A method for estimating target speaker segments in multi-speaker audio data using a simple sequential decision technique has been developed. The method requires neither segregating speech from other audio material in advance nor explicitly modeling the other speakers in the data.
 The method works best for uniform-quality speaker segments longer than 2 seconds.
 Approximately 70% of target speaker segments with duration 2 seconds or greater are detected correctly, with approximately 5 false alarm segments per hour.
Future directions:
 use more than one model for each target speaker.
 use more background models.
 study performance as a function of the smoothing parameters and the segmentation algorithm parameters.
 use a new post-processor to find the best path through a speaker lattice.
Note 1: This work was done while the first author was with AT&T Labs Research.
Note 2: The first author would like to thank Rice University for financing his conference participation.