Presentation - EIE

Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation
Man-Wai MAK and Hon-Bill YU
The Hong Kong Polytechnic University
enmwmak@polyu.edu.hk
http://www.eie.polyu.edu.hk/~mwmak/
Outline
- Speaker Verification
    - Speaker Verification Process
    - Voice Activity Detection (VAD) in Speaker Verification
    - Effect of VAD on Acoustic Features
- Characteristics of Interview-Speech in NIST Speaker Recognition Evaluation
- VAD for NIST Speaker Recognition Evaluation
- Experiments on NIST SRE 2008
- Preliminary Results on NIST SRE 2010
Speaker Verification Process
- To verify the identity of a claimant based on his/her own voice
[Figure: a claimant says "I am Mary"; the system asks "Is this Mary's voice?"]
Speaker Verification Process
- A two-class hypothesis problem:
    H0: MFCC sequence X^(c) comes from the true speaker
    H1: MFCC sequence X^(c) comes from an impostor
- Verification score is a log-likelihood ratio:
\[
\text{Score} = \log \frac{p(X^{(c)} \mid H_0)}{p(X^{(c)} \mid H_1)} = \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(\text{ubm})})
\]
- Decision: Score > θ: accept; Score ≤ θ: reject
[Figure: X^(c) -> feature extraction -> log p(X^(c) | Λ^(s)) from the speaker model (+) and log p(X^(c) | Λ^(ubm)) from the background model (−) -> score -> decision.]
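The scoring step above can be sketched in a few lines. This is a minimal illustration, assuming diagonal-covariance GMMs for both the speaker model and the background model (UBM); the toy models, the per-frame averaging, and the threshold of 0 are assumptions made for the example, not values from the slides.

```python
import math

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
            comp.append(ll)
        # log-sum-exp over mixture components for numerical stability
        m = max(comp)
        total += m + math.log(sum(math.exp(c - m) for c in comp))
    return total / len(frames)

def llr_score(frames, speaker, ubm):
    """Log-likelihood ratio: speaker-model score minus background-model score."""
    return gmm_avg_loglik(frames, *speaker) - gmm_avg_loglik(frames, *ubm)

def decide(score, threshold=0.0):
    """Accept the claim when the score exceeds the (hypothetical) threshold."""
    return "accept" if score > threshold else "reject"

# Toy 2-D models as (weights, means, variances); purely illustrative.
speaker = ([1.0], [[1.0, 1.0]], [[1.0, 1.0]])
ubm = ([0.5, 0.5], [[-1.0, -1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
claimant_frames = [[0.9, 1.1], [1.2, 0.8]]
print(decide(llr_score(claimant_frames, speaker, ubm)))
```

Frames that lie close to the speaker model score higher under it than under the broader background model, so the ratio is positive; frames from a different region of feature space push it negative.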
Voice Activity Detection in Speaker Verification
[Figure: speech -> VAD -> speech segments -> feature extraction -> acoustic features (MFCC). MFCC computation: log|X(ω)| -> DCT.]
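The role of the VAD in this pipeline, passing only speech frames to feature extraction, can be illustrated with a very simple energy detector. This is a sketch under stated assumptions: the frame length, hop, and fixed threshold are hypothetical values chosen for the toy signal, and a fixed threshold is a simplification of any data-driven rule.

```python
def frame_energies(signal, frame_len, hop):
    """Mean squared amplitude of each frame."""
    return [sum(s * s for s in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, hop)]

def energy_vad(signal, frame_len=4, hop=4, threshold=0.1):
    """Flag a frame as speech when its energy exceeds the threshold."""
    return [e > threshold for e in frame_energies(signal, frame_len, hop)]

def speech_frames(signal, flags, frame_len=4, hop=4):
    """Keep only the frames flagged as speech; these go on to feature extraction."""
    return [signal[i * hop:i * hop + frame_len] for i, f in enumerate(flags) if f]

# Toy signal: quiet background, a louder burst, then quiet background again.
signal = [0.01] * 8 + [1.0, -1.0] * 4 + [0.01] * 4
flags = energy_vad(signal)
kept = speech_frames(signal, flags)
print(flags, len(kept))
```

Only the two frames covering the loud burst survive, so features would be computed from those frames alone.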
Effect of VAD on Acoustic Features
[Figure: two scatter plots of MFCC feature vectors (dim1 vs. dim2). One shows features extracted without VAD, with the non-speech region marked; the other shows features extracted from the speech segments selected by the VAD.]
Interview-Speech in NIST SRE
[Figure: interview room layout showing the interviewee, the desk, and the interviewer. Source: NIST SRE 2008 Workshop]
Interview-Speech in NIST SRE
- Far-field and desktop microphones were used for collecting interview speech.
- Some interview-speech files are very noisy, which makes it difficult to differentiate speech segments from non-speech segments.
[Figure: a typical interview-speech file in NIST SRE 2008, shown as amplitude vs. time and frequency vs. time, with speech and non-speech regions marked.]
Interview-Speech in NIST SRE
- Some files have very low SNR.
[Figure: amplitude vs. time for the whole file, with frequency vs. time and the segmentation shown alongside. S: speech; h#: non-speech.]
Interview-Speech in NIST SRE
- Some files contain spiky signals, which lead to a wrong VAD decision threshold.
[Figure: amplitude vs. time, with a spiky signal marked.]
Interview-Speech in NIST SRE
- Some files contain a low-energy speech signal superimposed on periodic background noise.
[Figure: amplitude vs. time, frequency vs. time, and segmentation; non-speech regions are detected as speech.]
VAD for NIST Speaker Recognition Evaluation
- Use speech enhancement as a pre-processing step
[Figure: Spectral-Subtraction VAD (SS-VAD): noisy speech -> denoising (spectral subtraction) -> denoised speech -> energy-based VAD -> speech segment info. The speech segment info feeds feature extraction (MFCC) -> scoring against the speaker model and the impostor model -> decision making with a decision threshold -> accept/reject.]
VAD for NIST Speaker Recognition Evaluation
- Use speech enhancement as a pre-processing step

                       Signal    Frequency spectrum
    Clean speech       x(n,m)    X(ω,m)
    Noisy speech       y(n,m)    Y(ω,m)
    Background speech  b(n,m)    B(ω,m)

- These values were set such that we remove as much noise as possible.
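A generic magnitude spectral-subtraction step, in the spirit of the denoising stage above, might look as follows. This is a sketch and not the authors' exact rule: the over-subtraction factor `alpha` and the spectral floor `beta` are hypothetical parameter names and values, the noise magnitude is assumed to come from a noise-only frame, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform (fine for short illustrative frames)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def spectral_subtract(noisy_frame, noise_mag, alpha=1.0, beta=0.01):
    """Subtract a scaled noise magnitude per bin, floor the result, keep the noisy phase."""
    out = []
    for y, b in zip(dft(noisy_frame), noise_mag):
        mag = max(abs(y) - alpha * b, beta * abs(y))
        out.append(cmath.rect(mag, cmath.phase(y)))
    return idft(out)

def energy(x):
    return sum(v * v for v in x)

# Toy frame: a "speech" tone at bin 5 plus a sinusoidal background at bin 2.
n = 16
speech = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
hum = [0.5 * math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
noisy = [s + h for s, h in zip(speech, hum)]
noise_mag = [abs(v) for v in dft(hum)]  # estimated from a noise-only frame
denoised = spectral_subtract(noisy, noise_mag)
print(energy(noisy), energy(denoised))
```

In this toy case the background tone is almost entirely removed while the speech tone is left untouched, so the frame energy falls back to roughly that of the clean tone. Real recordings need a noise estimate that tracks the background over time.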
VAD for NIST Speaker Recognition Evaluation
[Figure: amplitude vs. time, without denoising and with denoising.]
VAD for NIST Speaker Recognition Evaluation
[Figure: segmentation without denoising. S: speech; h#: non-speech.]
VAD for NIST Speaker Recognition Evaluation
[Figure: segmentations from the VAD in the ETSI-AMR speech coder and from SS-VAD (with denoising). S: speech; h#: non-speech.]
VAD for NIST Speaker Recognition Evaluation
- Speech-segment-length to speech-file-length ratio of 3 VADs
[Figure: 6249 speech files (NIST'05-08) are passed to three VADs (energy-based VAD, energy-based VAD with SS, and the ETSI-AMR coder), each producing speech/non-speech decisions.]
- Example: total duration: 10 secs; total speech segment: 3 secs; speech-segment-length to speech-file-length ratio = 3/10
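The ratio in the example above is straightforward to compute from a VAD's output. The sketch below assumes the VAD returns speech segments as (start, end) pairs in seconds; that representation and the segment boundaries are hypothetical.

```python
def speech_ratio(segments, file_duration):
    """Total speech-segment length divided by the length of the file."""
    speech = sum(end - start for start, end in segments)
    return speech / file_duration

# Two detected speech segments totalling 3 s in a 10 s file.
segments = [(1.0, 2.5), (6.0, 7.5)]
ratio = speech_ratio(segments, 10.0)
print(ratio)
```

Computing this ratio for every file and every VAD gives the per-VAD distributions that are compared across the file set.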
VAD for NIST Speaker Recognition Evaluation
- Speech-segment-length to speech-file-length ratio of 3 VADs
[Figure: distributions of the ratio for the VAD in the ETSI-AMR coder, the spectral-subtraction VAD, and the ordinary energy-based VAD.]
- Annotation: high frequency of occurrence, suggesting that many non-speech segments are mistakenly detected as speech segments.
Experiments on NIST SRE 2008
- Dataset
    - NIST'05 & NIST'06 (development)
    - NIST'08 (performance evaluations)

    Common Condition  Train/Test Condition                                                No. of Targets  No. of Trials
    1                 All interview speech                                                622             14405
    2                 Interview speech, same microphone type for training and test        125             731
    3                 Interview speech, different microphone types for training and test  622             13674
    4                 Interview speech for training, telephone speech for test            622             5048

- Speaker Modeling: GMM-SVM
- Score Normalization: T-norm
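T-norm rescales a raw score using the scores that the same test utterance obtains against a cohort of other models. A minimal sketch, assuming the cohort scores are already available and using the population standard deviation (a choice made here, not stated on the slide):

```python
import statistics

def t_norm(raw_score, cohort_scores):
    """Normalize a score by the mean and spread of the cohort (T-norm model) scores."""
    mu = statistics.mean(cohort_scores)
    sigma = statistics.pstdev(cohort_scores)
    return (raw_score - mu) / sigma

# Toy numbers: the test utterance scores 4.0 against the target model
# and 1.0 and 3.0 against two cohort models.
normalized = t_norm(4.0, [1.0, 3.0])
print(normalized)
```

Because the cohort statistics are computed per test utterance, the normalized scores of different utterances become comparable against a single decision threshold.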
Results on NIST 2008 SRE
- ETSI-AMR: VAD in AMR coder
- Baseline: energy-based VAD without SS (γ=0.99)
- SS-VAD: spectral subtraction VAD
- Annotated improvement: 3.57 to 1.12 (69% relative reduction)
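Reading the annotation as a drop from 3.57 to 1.12, the 69% figure is consistent with a relative reduction:

```latex
\[
\frac{3.57 - 1.12}{3.57} = \frac{2.45}{3.57} \approx 0.686 \approx 69\%
\]
```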
Results on NIST 2008 SRE
[Figure: Common Condition 1 results for the ETSI-AMR VAD and SS-VAD.]
Preliminary Results on NIST 2010
- Common Condition 2: All trials involving interview speech from different microphones

    VAD                   EER (%)  Normalized minDCF
    Energy-based VAD      11.72    0.99
    SS-VAD                 4.45    0.58
    SMB                    5.83    0.75
    SS-SMB                 4.62    0.60
    NIST ASR Transcripts   8.58    0.85
    ETSI-AMR               8.05    0.85

- SMB: Statistical-Model Based VAD. Sohn, et al., "A statistical model-based voice activity detection", IEEE Signal Processing Letters, 1999.
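For readers unfamiliar with the EER column, the sketch below shows one simple way to estimate an equal error rate from target and impostor scores. The scores are made-up toy values, not data from the evaluation, and the threshold sweep over observed scores is an assumption of this example rather than the official NIST scoring procedure.

```python
def error_rates(targets, impostors, threshold):
    """False-acceptance and false-rejection rates at a given threshold."""
    frr = sum(s < threshold for s in targets) / len(targets)
    far = sum(s >= threshold for s in impostors) / len(impostors)
    return far, frr

def equal_error_rate(targets, impostors):
    """Sweep the observed scores and return the error rate where FAR and FRR are closest."""
    best = None
    for thr in sorted(set(targets) | set(impostors)):
        far, frr = error_rates(targets, impostors, thr)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0)
    return best[1]

# Toy scores only.
target_scores = [2.0, 3.0, 4.0, 5.0]
impostor_scores = [0.0, 1.0, 1.5, 2.5]
eer = equal_error_rate(target_scores, impostor_scores)
print(100.0 * eer)
```

A lower EER means the target and impostor score distributions overlap less, which is the sense in which the VADs in the table are being compared.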
Conclusions
- Noise reduction is of primary importance for VAD under extremely low SNR.
- It is important to remove the sinusoidal background found in NIST SRE sound files, as this kind of background signal can lead to many false detections in energy-based VAD.
- Using noise reduction as a pre-processing step leads to a VAD that outperforms the VAD in ETSI-AMR (Option 2).
VAD for NIST Speaker Recognition Evaluation
- Threshold Determination and VAD Decision Logic
[Figure: sample-based amplitudes containing a spike are windowed into frame-based amplitudes; the frame amplitudes are ranked to give a_p1, ..., a_pL; μ_b is obtained from 500 preset non-speech frames.]
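The exact threshold rule is not recoverable from this slide, so the following is only one plausible reading: frame-level amplitudes are ranked, the L largest are averaged into a peak estimate, and the threshold is a convex combination of the background mean μ_b and that peak, weighted by γ. The combination formula, the use of the largest values, and all parameter values are assumptions of this sketch.

```python
def frame_amplitudes(signal, frame_len):
    """Mean absolute amplitude per non-overlapping frame; averaging damps isolated spikes."""
    return [sum(abs(s) for s in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def vad_threshold(amps, background_amps, top_l=3, gamma=0.99):
    """Assumed rule: gamma * background mean + (1 - gamma) * mean of the top-L amplitudes."""
    ranked = sorted(amps, reverse=True)
    top = ranked[:top_l]
    peak = sum(top) / len(top)
    mu_b = sum(background_amps) / len(background_amps)
    return gamma * mu_b + (1.0 - gamma) * peak

# Toy frame amplitudes and a toy set of preset non-speech frames.
amps = [0.1, 0.1, 2.0, 1.0, 3.0]
background = [0.1, 0.1]
threshold = vad_threshold(amps, background, top_l=2, gamma=0.5)
flags = [a > threshold for a in amps]
print(threshold, flags)
```

With γ close to 1 the threshold sits just above the background level, and with smaller γ it moves toward the peak amplitude, which is why the weighting factor needs tuning.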
Results
- To find the optimum weighting factor, γ
Experiments on NIST SRE 2008
- Training phase
[Figure: background utterances (NIST'05 & 06) -> feature extraction -> model creation -> UBM. Target-speaker utterances (NIST'08) -> feature extraction -> MAP adaptation -> GMM-supervectors of target speakers -> NAP. 300 background speakers (NIST'06) -> MAP adaptation -> GMM-supervectors of 300 impostors -> NAP. Both sets of supervectors -> SVM training -> GMM-SVM of speaker spk.]
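The NAP blocks in the training diagram refer to nuisance attribute projection, which removes session-related directions from a supervector. A minimal sketch of the projection itself, assuming an orthonormal nuisance basis is already available; how that basis is estimated is not shown on the slide, and the toy vectors here are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nap_project(supervector, nuisance_basis):
    """Subtract the component of the supervector along each orthonormal nuisance direction."""
    out = list(supervector)
    for u in nuisance_basis:
        coeff = dot(u, supervector)
        out = [o - coeff * ui for o, ui in zip(out, u)]
    return out

# Toy 3-D supervector with one nuisance direction along the first axis.
supervector = [3.0, 4.0, 5.0]
basis = [[1.0, 0.0, 0.0]]
projected = nap_project(supervector, basis)
print(projected)
```

After projection the supervector has no component along the nuisance direction, so the SVM sees only the part of the supervector that is assumed to carry speaker information.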
Experiments on NIST SRE 2008
- Verification phase
[Figure: MFCCs X^(c) of a test utterance from claimant c -> MAP adaptation of the UBM and mean stacking -> session-dependent supervector m^(c,h) -> NAP -> session-independent supervector m^(c) -> SVM scoring with the SVM of target speaker s -> score S(X^(c)) -> T-norm using the T-norm models -> normalized score S~(X^(c)).]