Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation

Man-Wai MAK and Hon-Bill YU
The Hong Kong Polytechnic University
enmwmak@polyu.edu.hk
http://www.eie.polyu.edu.hk/~mwmak/

Outline
- Speaker Verification
  - Speaker Verification Process
  - Voice Activity Detection (VAD) in Speaker Verification
  - Effect of VAD on Acoustic Features
- Characteristics of Interview-Speech in NIST Speaker Recognition Evaluation
- VAD for NIST Speaker Recognition Evaluation
- Experiments on NIST SRE 2008
- Preliminary Results on NIST SRE 2010

Speaker Verification Process
To verify the identity of a claimant based on his/her own voice.
[Figure: a claimant says "I am Mary"; the system asks "Is this Mary's voice?"]

Speaker Verification Process
A two-class hypothesis problem:
- H0: MFCC sequence $X^{(c)}$ comes from the true speaker
- H1: MFCC sequence $X^{(c)}$ comes from an impostor
The verification score is a log-likelihood ratio:

$$\text{Score} = \log\frac{p(X^{(c)} \mid H_0)}{p(X^{(c)} \mid H_1)} = \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(ubm)})$$

[Figure: feature extraction produces $X^{(c)}$; the speaker model $\Lambda^{(s)}$ gives $\log p(X^{(c)} \mid \Lambda^{(s)})$ and the background model $\Lambda^{(ubm)}$ gives $\log p(X^{(c)} \mid \Lambda^{(ubm)})$; their difference is the score, which the decision stage uses to accept or reject the claim.]

Voice Activity Detection in Speaker Verification
Speech -> VAD -> speech segments -> Feature Extraction -> acoustic features (MFCC)
MFCC: Log|X(ω)| followed by DCT.

Effect of VAD on Acoustic Features
[Figure: two scatter plots of MFCC feature vectors (axes: dim1, dim2). Without VAD, feature extraction is applied to the whole signal and the plot includes a non-speech region. With VAD, feature extraction is applied to the detected speech only.]

Interview-Speech in NIST SRE
[Figure: interview room layout showing the interviewee, the desk, and the interviewer. Source: NIST SRE 2008 Workshop]

Interview-Speech in NIST SRE
- Far-field and desktop microphones were used for collecting interview speech.
- Some interview-speech files are very noisy, causing difficulty in differentiating speech segments from non-speech segments.
  [Figure: a typical interview-speech file in NIST SRE 2008, shown as a waveform (amplitude vs. time) and a spectrogram (frequency vs. time), with speech and non-speech regions marked.]
- Some files have very low SNR.
  [Figure: waveform of the whole file, spectrogram, and segmentation of a low-SNR file. S: speech; h#: non-speech.]
- Some files contain spiky signals, causing a wrong VAD decision threshold.
  [Figure: waveform (amplitude vs. time) with a spiky signal marked.]
- Some files contain low-energy speech superimposed on periodic background noise, so that non-speech is detected as speech.
  [Figure: waveform, spectrogram, and segmentation (amplitude and frequency vs. time).]

VAD for NIST Speaker Recognition Evaluation
Use speech enhancement as a pre-processing step.
Spectral-Subtraction VAD (SVAD, also written SS-VAD):
Noisy speech -> Denoising (spectral subtraction) -> Denoised speech -> Energy-based VAD -> speech segment info
The speech segment info drives the rest of the system:
Feature Extraction -> MFCC -> Scoring (speaker model, impostor model) -> Decision Making (decision threshold) -> Accept/Reject

VAD for NIST Speaker Recognition Evaluation
Use speech enhancement as a pre-processing step.

                    Signal    Frequency Spectrum
Clean speech        x(n,m)    X(ω,m)
Noisy speech        y(n,m)    Y(ω,m)
Background speech   b(n,m)    B(ω,m)

These values were set so as to remove as much noise as possible.
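The denoise-then-detect pipeline above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: it assumes magnitude spectral subtraction with an over-subtraction factor `alpha` and a spectral floor `beta`, a noise spectrum estimated from the first few frames, and an energy threshold set relative to the loudest frame. All function names, parameter names, and default values here are hypothetical; the slides do not give the actual settings.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping Hann-windowed frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return x[idx] * np.hanning(frame_len)

def spectral_subtract(frames, n_noise_frames=10, alpha=2.0, beta=0.01):
    """Magnitude spectral subtraction (assumed variant, not from the slides).

    The background magnitude spectrum is estimated from the first
    n_noise_frames frames, which are assumed to contain no speech.
    """
    mag = np.abs(np.fft.rfft(frames, axis=1))    # |Y(w, m)|
    noise = mag[:n_noise_frames].mean(axis=0)    # estimate of |B(w)|
    # Over-subtract the noise estimate, then floor to avoid negative magnitudes.
    return np.maximum(mag - alpha * noise, beta * mag)

def energy_vad(mag, rel_threshold_db=-20.0):
    """Mark a frame as speech if its energy is within rel_threshold_db of the peak."""
    energy = (mag ** 2).sum(axis=1)
    peak = max(energy.max(), 1e-12)
    energy_db = 10.0 * np.log10(energy / peak + 1e-12)
    return energy_db > rel_threshold_db

def ss_vad(x):
    """Denoise with spectral subtraction, then apply the energy-based VAD."""
    return energy_vad(spectral_subtract(frame_signal(x)))

if __name__ == "__main__":
    # Synthetic example: 2 s of noise at 8 kHz with a 0.5 s tone in the middle.
    rng = np.random.default_rng(0)
    fs = 8000
    x = 0.05 * rng.standard_normal(2 * fs)
    x[8000:12000] += 0.5 * np.sin(2 * np.pi * 440 * np.arange(4000) / fs)
    decisions = ss_vad(x)
    print("frames marked as speech:", np.flatnonzero(decisions))
```

A relative threshold is used here only to keep the sketch short; the deck's own threshold logic (ranked frame amplitudes and a background level from preset non-speech frames) is described on a later slide.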
VAD for NIST Speaker Recognition Evaluation
[Figure: waveforms (amplitude vs. time) of an interview-speech file without denoising and with denoising.]

VAD for NIST Speaker Recognition Evaluation
[Figure: segmentation without denoising. S: speech; h#: non-speech.]

VAD for NIST Speaker Recognition Evaluation
[Figure: segmentations produced by the VAD in the ETSI-AMR speech coder and by SS-VAD (with denoising). S: speech; h#: non-speech.]

VAD for NIST Speaker Recognition Evaluation
Speech-segment-length to speech-file-length ratio of 3 VADs
- The 6249 speech files (NIST'05-08) are each passed through three VADs: the energy-based VAD, the energy-based VAD with SS, and the VAD in the ETSI-AMR coder. Each VAD labels the file as speech / non-speech.
- Example: total duration 10 secs, total speech segments 3 secs, so the speech-segment-length to speech-file-length ratio = 3/10.

VAD for NIST Speaker Recognition Evaluation
Speech-segment-length to speech-file-length ratio of 3 VADs
[Figure: distributions of the ratio for the VAD in the ETSI-AMR coder, the spectral-subtraction VAD, and the ordinary energy-based VAD. Annotation: a high frequency of occurrence, suggesting that many non-speech segments are mistakenly detected as speech segments.]

Experiments on NIST SRE 2008
Dataset
- NIST'05 & NIST'06 (development)
- NIST'08 (performance evaluation)

Common Condition   Train/Test Condition                                                      No. of Targets   No. of Trials
1                  All interview speech                                                      622              14405
2                  Interview speech, same microphone type for training and test              125              731
3                  Interview speech, different microphone types for training and test        622              13674
4                  Interview speech for training, telephone speech for test                  622              5048

Speaker Modeling: GMM-SVM
Score Normalization: T-norm

Results on NIST 2008 SRE
- ETSI-AMR: VAD in AMR coder
- Baseline: energy-based VAD without SS (γ=0.99)
- SS-VAD: spectral subtraction VAD
[Figure annotation: 3.57 -> 1.12 (a 69% relative reduction)]

Results on NIST 2008 SRE
Common Condition 1
[Figure: comparison of the VAD in ETSI-AMR and SS-VAD.]

Preliminary Results on NIST 2010
Common Condition 2: All trials involving interview speech from different microphones

VAD                    EER (%)   Normalized minDCF
Energy-based VAD       11.72     0.99
SS-VAD                 4.45      0.58
SMB                    5.83      0.75
SS-SMB                 4.62      0.60
NIST ASR Transcripts   8.58      0.85
ETSI-AMR               8.05      0.85

SMB: Statistical-Model Based VAD. Sohn, et al., "A statistical model-based voice activity detection", IEEE Signal Processing Letters, 1999.

Conclusions
- Noise reduction is of primary importance for VAD under extremely low SNR.
- It is important to remove the sinusoidal background found in NIST SRE sound files, as this kind of background signal could lead to many false detections in energy-based VAD.
- Using noise reduction as a pre-processing step leads to a VAD that outperforms the VAD in ETSI-AMR (Option 2).
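The results above are reported as equal error rates (EER) and as a relative reduction between two systems. As a hedged illustration of how such numbers are typically obtained, the sketch below computes an EER from lists of target and impostor scores using the textbook definition (the operating point where the miss rate equals the false-alarm rate) and reproduces the relative-reduction arithmetic. It is not the NIST scoring tool, it does not compute minDCF, and the function names are hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: sweep thresholds and pick the point where the
    miss rate and the false-alarm rate are closest."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    # Miss: a target trial scoring below the threshold is rejected.
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False alarm: an impostor trial scoring at or above the threshold is accepted.
    false_alarm = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[i] + false_alarm[i])

def relative_reduction(before, after):
    """Fractional reduction of an error measure, e.g. (3.57 - 1.12) / 3.57."""
    return (before - after) / before

if __name__ == "__main__":
    print(f"{100 * relative_reduction(3.57, 1.12):.1f}%")
    print(equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5]))
```

Applied to the figures on the slides, `relative_reduction(3.57, 1.12)` is about 0.686, which rounds to the quoted 69%. The same arithmetic on the NIST 2010 table (energy-based VAD at 11.72% EER versus SS-VAD at 4.45%) gives roughly 62%.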
VAD for NIST Speaker Recognition Evaluation
Threshold Determination and VAD Decision Logic
[Figure: the sample-based signal (which may contain a spike) is windowed into frame-based amplitudes; the frame amplitudes are ranked (a_p1, ..., a_pL, over L frames); μ_b is associated with 500 preset non-speech frames.]

Results
To find the optimum weighting factor, γ
[Figure: results for different values of γ.]

Experiments on NIST SRE 2008
Training phase
- utt_bkg (NIST'05 & 06) -> Feature Extraction -> Model Creation -> UBM
- utt_spk (NIST'08) -> Feature Extraction -> MAP Adaptation -> GMM-supervectors of target speakers -> NAP
- 300 background speakers (NIST'06) -> MAP Adaptation -> GMM-supervectors of 300 impostors -> NAP
- Both sets of NAP-projected supervectors -> SVM Training -> spk GMM-SVM

Experiments on NIST SRE 2008
Verification phase
- MFCCs of a test utterance from claimant c, $X^{(c)}$ -> MAP and Mean Stacking (using the UBM) -> session-dependent supervector $m^{(c,h)}$
- $m^{(c,h)}$ -> NAP -> session-independent supervector $m^{(c)}$
- $m^{(c)}$ -> SVM Scoring (SVM of target speaker s) -> score $S(X^{(c)})$
- $S(X^{(c)})$ -> T-Norm (using the T-norm models) -> normalized score $\tilde{S}(X^{(c)})$
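The last step of the verification phase, T-norm, can be illustrated with a short sketch. It assumes the usual definition of test normalization: the test utterance is scored against a set of cohort (T-norm) models, and the raw score is shifted and scaled by the mean and standard deviation of those cohort scores. The function name and the choice of the sample standard deviation (`ddof=1`) are assumptions for illustration; the slides do not specify these details.

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """Test normalization: (S - mean) / std over the T-norm cohort scores.

    raw_score      score of the test utterance against the claimed speaker's model
    cohort_scores  scores of the same test utterance against the T-norm models
    """
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    mu = cohort_scores.mean()
    sigma = cohort_scores.std(ddof=1)  # sample standard deviation (an assumption)
    if sigma == 0.0:
        raise ValueError("cohort scores have zero variance; cannot normalize")
    return (raw_score - mu) / sigma

if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(t_norm(2.0, [-1.0, 0.0, 1.0]))
```

Because the cohort statistics are computed per test utterance, the normalization compensates for utterance-dependent shifts in the score distribution, which makes a single global decision threshold more usable.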