Detection of Target Speakers in Audio Databases

Ivan Magrin-Chagnolleau*, Aaron E. Rosenberg**, and S. Parthasarathy**
* Rice University, Houston, Texas - ** AT&T Labs Research, Florham Park, New Jersey
ivan@ieee.org - aer@research.att.com - sps@research.att.com

Problem and Definitions:
Data - broadcast-band audio data from television news programs containing speech segments from a variety of speakers, plus segments containing mixed speech and music (typically commercials), and music only. Speech segments may be of variable quality and may be contaminated by music, speech, and/or noise backgrounds.
Speaker detection task - locate and label segments of designated speakers (target speakers) in the data.
Overall goal - aid information retrieval from large multimedia databases.
Assumption - segmented and labeled training data exist for target speakers, other speakers, and other audio material.

Database:
One-target-speaker detection:
• subset aABC_NLI of the HUB4 database (ABC Nightline)
• target speaker: Ted Koppel
• 3 broadcasts for training the target model
• 12 broadcasts for testing (26 to 35 minutes each)
Two-target-speaker detection:
• subset bABC_WNN of the HUB4 database (ABC World News Now)
• target speakers: Mark Mullen (T1) and Thalia Assuras (T2)
• 3 broadcasts for training the target models
• 16 broadcasts for testing (29 to 31 minutes each)
Quality categories:
• high-fidelity: high-fidelity speech with no background
• clean: all quality categories with no background
• allspeech: all quality categories with or without background
• alldata: previous category plus all the untranscribed portions
Modeling:
Feature vectors: 20 cepstral coefficients + 20 Δ-cepstral coefficients.
Gaussian mixture models: 64 mixtures, diagonal covariance matrices.
Target speaker models:
• Three 90 s segments of high-fidelity speech, extracted from 3 broadcasts, concatenated together.
First background model (B1):
• Eight 60 s segments of high-fidelity speech (4 females, 4 males) concatenated together (from aABC_NLI).
Second background model (B2):
• Three 90 s segments of non-speech data (music only 10%, noise only 10%, commercials 80%), extracted from 3 broadcasts, concatenated together (from aABC_NLI).
Third background model (B3):
• 29 segments (293.5 s) of high-fidelity speech (10 females, 10 males) concatenated together (from cABC_WNT).
Fourth background model (B4):
• 23 segments (561.2 s) of non-speech data (commercials + theme music), extracted from 2 broadcasts, concatenated together (from bABC_WNN).

Detection algorithm:
Frame-level log-likelihood ratio:
log R(x_t | λ_T ; λ_B) = log L(x_t | λ_T) - log L(x_t | λ_B)
Smoothed log-likelihood ratio, computed every d vectors:
v_{t0} = log R(x_{t0-t}, ..., x_{t0+t} | λ_T ; λ_B), with 2t+1 = 100 frames (1 s) and d = 20 frames (0.2 s)
Segmentation algorithm: target segments are estimated from the smoothed scores by a simple sequential decision technique.

Results:
Results of the one-target-speaker detection experiments:

Quality       | Total duration | # target segments | Target duration | Background | # estimated segments | FMIR | FFAR  | SMIR  | SFAR
high-fidelity | 141 min        | 137               | 61 min          | B1         | 129                  | 3.9% | 10.8% | 7.3%  | 5.5 / hour
              |                |                   |                 | B1,B2      | 129                  | 4.0% | 10.5% | 7.3%  | 5.5 / hour
clean         | 194 min        | 257               | 69 min          | B1         | 195                  | 4.9% | 9.2%  | 19.1% | 8.4 / hour
              |                |                   |                 | B1,B2      | 195                  | 5.1% | 8.8%  | 19.5% | 7.4 / hour
allspeech     | 242 min        | 318               | 78 min          | B1         | 238                  | 6.7% | 7.2%  | 25.2% | 6.0 / hour
              |                |                   |                 | B1,B2      | 238                  | 7.2% | 7.0%  | 26.4% | 5.2 / hour
alldata       | 359 min        | 354               | 78 min          | B1         | 256                  | 8.9% | 5.7%  | 27.4% | 4.4 / hour
              |                |                   |                 | B1,B2      | 256                  | 9.6% | 5.4%  | 29.7% | 4.2 / hour
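The detection computation described above (per-frame GMM log-likelihood ratio, smoothed over a 1 s window and evaluated every 0.2 s) can be sketched in Python. This is an illustrative sketch, not the authors' implementation: the diagonal-GMM scorer and the function names are assumptions, and the window mean stands in for the smoothed ratio.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Per-frame log-likelihood log L(x_t | lambda) under a
    diagonal-covariance Gaussian mixture model.
    x: (T, D) frames; weights: (M,); means, variances: (M, D)."""
    diff2 = (x[:, None, :] - means[None, :, :]) ** 2                 # (T, M, D)
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)      # (M,)
    log_comp = log_norm[None, :] - 0.5 * (diff2 / variances[None, :, :]).sum(axis=2)
    a = np.log(weights)[None, :] + log_comp                          # (T, M)
    amax = a.max(axis=1, keepdims=True)                              # log-sum-exp
    return amax[:, 0] + np.log(np.exp(a - amax).sum(axis=1))

def smoothed_llr(ll_target, ll_background, win=100, step=20):
    """Smoothed log-likelihood ratio v_{t0}: the frame-level scores
    log L(x_t | lambda_T) - log L(x_t | lambda_B), averaged over a
    window of 2t+1 = 100 frames (1 s), evaluated every d = 20 frames
    (0.2 s), matching the parameters in the text."""
    llr = ll_target - ll_background
    return np.array([llr[t0:t0 + win].mean()
                     for t0 in range(0, len(llr) - win + 1, step)])
```

In the experiments, λ_T and λ_B are 64-mixture models over 40-dimensional cepstral + Δ-cepstral feature vectors, and the smoothed scores are thresholded to decide target versus non-target regions.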
Results of the two-target-speaker detection experiments for the alldata category, using B3,B4 as the background models:

Target speakers | Segment duration | # target segments | Duration | FMIR [FCOR]  | FFAR | SMIR  | SCOR        | SFAR
T1              | > 4 s            | 111               | 33 min   | 17.6%        | 2.0% | 18.0% | -           | 9.9 / hour
T1              | > 2 s            | 153               | 35 min   | 20.0%        | 1.7% | 30.7% | -           | 6.9 / hour
T1              | all              | 225               | 37 min   | 21.9%        | 1.7% | 47.6% | -           | 5.2 / hour
T1,T2           | > 4 s            | 225               | 74 min   | 13.2% [1.3%] | 3.4% | 17.3% | 0.4 / hour  | 13.0 / hour
T1,T2           | > 2 s            | 308               | 78 min   | 14.8% [2.2%] | 2.9% | 26.0% | 4.5 / hour  | 8.8 / hour
T1,T2           | all              | 462               | 80 min   | 15.7% [3.3%] | 2.8% | 31.6% | 19.3 / hour | 8.0 / hour

Evaluation:
Frame-level Miss Rate (FMIR): # labeled target frames not estimated as target frames / total # labeled target frames.
Frame-level False Alarm Rate (FFAR): # estimated target frames labeled as non-target frames / total # labeled non-target frames.
Frame-level Confusion Rate (FCOR): # labeled target frames estimated as target frames of another speaker / total # labeled target frames (FCOR is a component of FMIR).
Segment-level Miss Rate (SMIR): # missed segments / total # target segments.
Segment-level False Alarm Rate (SFAR): # false-alarm segments / total duration of the broadcast.
Segment-level Confusion Rate (SCOR): # confusion segments / total duration of the broadcast.

Note 1: This work was done while the first author was with AT&T Labs Research.

Conclusion:
A method for estimating target speaker segments in multi-speaker audio data using a simple sequential decision technique has been developed. The method does not require segregating speech from other audio data, and does not require other speakers in the data to be modeled explicitly. It works best for uniform-quality speaker segments longer than 2 seconds: approximately 70% of target speaker segments of duration 2 seconds or greater are detected correctly, with approximately 5 false-alarm segments per hour.

Future directions:
• use more than one model for each target speaker;
• use more background models;
• study performance as a function of the smoothing parameters and the segmentation algorithm parameters;
• use a new post-processor to find the best path through a speaker lattice.

Note 2: The first author thanks Rice University for funding his conference participation.
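The frame-level rates defined in the Evaluation section follow directly from aligned per-frame reference labels and detector decisions. A minimal sketch (the function name and the truthy-means-target convention are illustrative, not from the poster):

```python
def frame_rates(labels, estimates):
    """Frame-level miss and false-alarm rates (FMIR, FFAR) from
    aligned per-frame reference labels and detector decisions,
    where a truthy value marks a target frame."""
    n_target = sum(1 for ref in labels if ref)
    n_nontarget = len(labels) - n_target
    misses = sum(1 for ref, hyp in zip(labels, estimates) if ref and not hyp)
    false_alarms = sum(1 for ref, hyp in zip(labels, estimates) if hyp and not ref)
    return misses / n_target, false_alarms / n_nontarget

fmir, ffar = frame_rates([1, 1, 1, 1, 0, 0, 0, 0],
                         [1, 1, 1, 0, 0, 1, 0, 0])
# fmir = 0.25 (1 of 4 target frames missed)
# ffar = 0.25 (1 of 4 non-target frames estimated as target)
```

The segment-level rates differ only in their denominators: SMIR is normalized by the number of labeled target segments, while SFAR and SCOR are normalized by the total broadcast duration, which is why they are reported per hour.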