Detection of Target Speakers in Audio Databases

Ivan Magrin-Chagnolleau*, Aaron E. Rosenberg**, and S. Parthasarathy**
* Rice University, Houston, Texas - ** AT&T Labs Research, Florham Park, New Jersey
ivan@ieee.org - aer@research.att.com - sps@research.att.com

Problem and Definitions:
Data - broadcast-band audio data from television news programs containing speech segments from a variety of speakers, plus segments containing mixed speech and music (typically commercials), and music only. Speech segments may be of variable quality and may be contaminated by music, speech, and/or noise backgrounds.
Speaker detection task - locate and label segments of designated speakers (target speakers) in the data.
Overall goal - aid information retrieval from large multimedia databases.
Assumption - segmented and labeled training data exist for target speakers, other speakers, and other audio material.

Database:
One-target-speaker detection:
• subset aABC_NLI of the HUB4 database (ABC Nightline)
• target speaker: Ted Koppel
• 3 broadcasts for training the target model
• 12 broadcasts for testing (26 to 35 minutes each)
Two-target-speaker detection:
• subset bABC_WNN of the HUB4 database (ABC World News Now)
• target speakers: Mark Mullen (T1) and Thalia Assuras (T2)
• 3 broadcasts for training the target models
• 16 broadcasts for testing (29 to 31 minutes each)
Quality categories:
• high-fidelity: high-fidelity speech with no background
• clean: all quality categories with no background
• allspeech: all quality categories with or without background
• alldata: previous category plus all the untranscribed portions
Modeling:
Feature vectors: 20 cepstral coefficients + 20 Δ-cepstral coefficients.
Gaussian mixture models: 64 mixtures, diagonal covariance matrices.
Target speaker models:
• Three 90 s segments of high-fidelity speech, extracted from 3 broadcasts, concatenated together.
First background model (B1):
• Eight 60 s segments of high-fidelity speech (4 females, 4 males) concatenated together (from aABC_NLI).
Second background model (B2):
• Three 90 s segments of non-speech data (music only 10%, noise only 10%, commercials 80%), extracted from 3 broadcasts, concatenated together (from aABC_NLI).
Third background model (B3):
• 29 segments (293.5 s) of high-fidelity speech (10 females, 10 males) concatenated together (from cABC_WNT).
Fourth background model (B4):
• 23 segments (561.2 s) of non-speech data (commercials + theme music), extracted from 2 broadcasts, concatenated together (from bABC_WNN).

Detection algorithm:
Frame-level log-likelihood ratio:
log R(x_t | λ_T ; λ_B) = log L(x_t | λ_T) - log L(x_t | λ_B)
Smoothed log-likelihood ratio, computed every d vectors:
v_{t0} = log R(x_{t0-t}, ..., x_{t0+t} | λ_T ; λ_B), with 2t+1 = 100 frames (1 s) and d = 20 frames (0.2 s)
Segmentation algorithm: target segments are estimated from the smoothed scores by a simple sequential decision technique.

Results:
Results of the one-target-speaker detection experiments:

Quality       | Total duration | # target segments | Target duration | Background | # estimated segments | FMIR | FFAR  | SMIR  | SFAR
high-fidelity | 141 min        | 137               | 61 min          | B1         | 129                  | 3.9% | 10.8% | 7.3%  | 5.5 / hour
              |                |                   |                 | B1,B2      | 129                  | 4.0% | 10.5% | 7.3%  | 5.5 / hour
clean         | 194 min        | 257               | 69 min          | B1         | 195                  | 4.9% | 9.2%  | 19.1% | 8.4 / hour
              |                |                   |                 | B1,B2      | 195                  | 5.1% | 8.8%  | 19.5% | 7.4 / hour
allspeech     | 242 min        | 318               | 78 min          | B1         | 238                  | 6.7% | 7.2%  | 25.2% | 6.0 / hour
              |                |                   |                 | B1,B2      | 238                  | 7.2% | 7.0%  | 26.4% | 5.2 / hour
alldata       | 359 min        | 354               | 78 min          | B1         | 256                  | 8.9% | 5.7%  | 27.4% | 4.4 / hour
              |                |                   |                 | B1,B2      | 256                  | 9.6% | 5.4%  | 29.7% | 4.2 / hour
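The detection computation described above (per-frame GMM log-likelihood ratio, smoothed over a 1 s window and evaluated every 0.2 s) can be sketched in Python. This is an illustrative sketch, not the authors' implementation: the diagonal-GMM scorer and the function names are assumptions, and the window mean stands in for the smoothed ratio.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Per-frame log-likelihood log L(x_t | lambda) under a
    diagonal-covariance Gaussian mixture model.
    x: (T, D) frames; weights: (M,); means, variances: (M, D)."""
    diff2 = (x[:, None, :] - means[None, :, :]) ** 2                 # (T, M, D)
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)      # (M,)
    log_comp = log_norm[None, :] - 0.5 * (diff2 / variances[None, :, :]).sum(axis=2)
    a = np.log(weights)[None, :] + log_comp                          # (T, M)
    amax = a.max(axis=1, keepdims=True)                              # log-sum-exp
    return amax[:, 0] + np.log(np.exp(a - amax).sum(axis=1))

def smoothed_llr(ll_target, ll_background, win=100, step=20):
    """Smoothed log-likelihood ratio v_{t0}: the frame-level scores
    log L(x_t | lambda_T) - log L(x_t | lambda_B), averaged over a
    window of 2t+1 = 100 frames (1 s), evaluated every d = 20 frames
    (0.2 s), matching the parameters in the text."""
    llr = ll_target - ll_background
    return np.array([llr[t0:t0 + win].mean()
                     for t0 in range(0, len(llr) - win + 1, step)])
```

In the experiments, λ_T and λ_B are 64-mixture models over 40-dimensional cepstral + Δ-cepstral feature vectors, and the smoothed scores are thresholded to decide target versus non-target regions.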
Results of the two-target-speaker detection experiments for the alldata category, using B3,B4 as the background models:

Target speakers | Segment duration | # target segments | Duration | FMIR [FCOR]  | FFAR | SMIR  | SCOR        | SFAR
T1              | > 4 s            | 111               | 33 min   | 17.6%        | 2.0% | 18.0% | -           | 9.9 / hour
T1              | > 2 s            | 153               | 35 min   | 20.0%        | 1.7% | 30.7% | -           | 6.9 / hour
T1              | all              | 225               | 37 min   | 21.9%        | 1.7% | 47.6% | -           | 5.2 / hour
T1,T2           | > 4 s            | 225               | 74 min   | 13.2% [1.3%] | 3.4% | 17.3% | 0.4 / hour  | 13.0 / hour
T1,T2           | > 2 s            | 308               | 78 min   | 14.8% [2.2%] | 2.9% | 26.0% | 4.5 / hour  | 8.8 / hour
T1,T2           | all              | 462               | 80 min   | 15.7% [3.3%] | 2.8% | 31.6% | 19.3 / hour | 8.0 / hour

Evaluation:
Frame-level Miss Rate (FMIR): # labeled target frames not estimated as target frames / total # labeled target frames.
Frame-level False Alarm Rate (FFAR): # estimated target frames labeled as non-target frames / total # labeled non-target frames.
Frame-level Confusion Rate (FCOR): # labeled target frames estimated as target frames of another speaker / total # labeled target frames (FCOR is a component of FMIR).
Segment-level Miss Rate (SMIR): # missed segments / total # target segments.
Segment-level False Alarm Rate (SFAR): # false-alarm segments / total duration of the broadcast.
Segment-level Confusion Rate (SCOR): # confusion segments / total duration of the broadcast.

Note 1: This work was done while the first author was with AT&T Labs Research.

Conclusion:
A method for estimating target speaker segments in multi-speaker audio data using a simple sequential decision technique has been developed. The method does not require segregating speech from other audio data, and does not require other speakers in the data to be modeled explicitly. It works best for uniform-quality speaker segments longer than 2 seconds: approximately 70% of target speaker segments of duration 2 seconds or greater are detected correctly, with approximately 5 false-alarm segments per hour.

Future directions:
• use more than one model for each target speaker;
• use more background models;
• study performance as a function of the smoothing parameters and the segmentation algorithm parameters;
• use a new post-processor to find the best path through a speaker lattice.

Note 2: The first author thanks Rice University for funding his conference participation.
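The frame-level rates defined in the Evaluation section follow directly from aligned per-frame reference labels and detector decisions. A minimal sketch (the function name and the truthy-means-target convention are illustrative, not from the poster):

```python
def frame_rates(labels, estimates):
    """Frame-level miss and false-alarm rates (FMIR, FFAR) from
    aligned per-frame reference labels and detector decisions,
    where a truthy value marks a target frame."""
    n_target = sum(1 for ref in labels if ref)
    n_nontarget = len(labels) - n_target
    misses = sum(1 for ref, hyp in zip(labels, estimates) if ref and not hyp)
    false_alarms = sum(1 for ref, hyp in zip(labels, estimates) if hyp and not ref)
    return misses / n_target, false_alarms / n_nontarget

fmir, ffar = frame_rates([1, 1, 1, 1, 0, 0, 0, 0],
                         [1, 1, 1, 0, 0, 1, 0, 0])
# fmir = 0.25 (1 of 4 target frames missed)
# ffar = 0.25 (1 of 4 non-target frames estimated as target)
```

The segment-level rates differ only in their denominators: SMIR is normalized by the number of labeled target segments, while SFAR and SCOR are normalized by the total broadcast duration, which is why they are reported per hour.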