CS 224S / LINGUIST 285
Spoken Language Processing
Dan Jurafsky
Stanford University
Spring 2014
Lecture 15: Speaker Recognition
Lots of slides thanks to Douglas Reynolds
Why speaker recognition?
 Access Control
 physical facilities
 websites, computer networks
 Transaction Authentication
 telephone banking
 remote credit card purchase
 Law Enforcement
 forensics
 surveillance
 Speech Data Mining
 meeting summarization
 lecture transcription
slide text from Douglas Reynolds
Speaker Recognition Tasks
Three Speaker Recognition Tasks
 Identification: Whose voice is this?
 Verification/Authentication/Detection: Is this Bob's voice?
 Segmentation and Clustering (Diarization):
 Where are speaker changes? (Speaker A / Speaker B)
 Which segments are from the same speaker?
slide from Douglas Reynolds
Two kinds of speaker verification
 Text-dependent
 Users have to say something specific
 easier for system
 Text-independent
 Users can say whatever they want
 more flexible but harder
Phases of Speaker Detection System
Two distinct phases to any speaker verification system:
 Enrollment Phase
 Enrollment speech for each speaker (Bob, Sally) → Feature extraction → Model training → Model for each speaker
 Detection Phase
 Hypothesized identity (e.g., Sally) → Feature extraction → Detection decision → Detected!
slide from Douglas Reynolds, MIT Lincoln Laboratory
Detection: Likelihood Ratio
 Two-class hypothesis test:
H0: X is not from the hypothesized speaker
H1: X is from the hypothesized speaker
 Choose the most likely hypothesis
 Likelihood ratio test: accept H1 if p(X|H1) / p(X|H0) > θ, otherwise accept H0
slide from Douglas Reynolds
Speaker ID
Log-Likelihood Ratio Score
LLR = Λ = log p(X|H1) − log p(X|H0)
 Need two models
 Hypothesized speaker model for H1
 Alternative (background) model for H0
slide from Douglas Reynolds
How do we get the H0 (background) model?
 Pool speech from several speakers and train a single model:
 a universal background model (UBM)
 can train one UBM and use it as H0 for all speakers
 Should be trained using speech representing the expected impostor speech
 Same type of speech as speaker enrollment (modality, language, channel)
Slide adapted from Chu, Bimbot, Bonastre, Fredouille, Gravier, Magrin-Chagnolleau, Meignier, Merlin, Ortega-Garcia, Petrovska-Delacretaz, Reynolds
How to compute P(H|X)?
 Gaussian Mixture Models (GMM)
 The traditional best model for text-independent speaker recognition
 Support Vector Machines (SVM)
 More recent use of a discriminative model
Speaker Models: GMM/HMM
Form of the HMM depends on the application
 Fixed phrase ("Open sesame"): word/phrase models
 Prompted phrases/passwords: phoneme models (e.g., /t/ /e/ /n/)
 Text-independent (general speech): single-state HMM (a GMM)
slide from Douglas Reynolds
GMMs for speaker recognition
 A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions: the model λ gives p(x|λ)
 Each Gaussian component i has a
 Mean μi
 Covariance Σi
 Weight wi
[Figure: GMM components in a two-dimensional feature space (Dim 1 vs. Dim 2)]
slide from Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Gaussian Mixture Models
[Figure: a GMM p(x) is defined by its parameters (weights wi, means μi, covariances Σi) and its model components, shown in a two-dimensional feature space]
slide from Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
GMM training
 During training, the system learns about the data it uses to make decisions
 A set of features is collected from a speaker (or language or dialect) and a model p(x) is trained on it
[Figure: training features x1, x2, ... in feature space and the resulting GMM p(x)]
slide from Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems for Language, Dialect, or Speaker ID
 In LID, DID, and SID, we train a set of target models, one model p(x|C) per language, dialect, or speaker C
[Figure: Model 1, Model 2, Model 3 for different languages, dialects, or speakers, each with its own parameters and model components]
slide from Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Universal Background Model
 We also train a universal background model (UBM) representing all speech, giving p(x|C)
[Figure: the UBM's parameters and model components in feature space]
slide from Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Hypothesis Test
 Given a set of test observations X_test = {x1, x2, ..., xK}, we perform a hypothesis test to determine whether a certain class produced it
H0: X_test is from the hypothesized class
H1: X_test is not from the hypothesized class
[Figure: the test observations in a two-dimensional feature space]
slide from Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Hypothesis Test
 The same test observations X_test = {x1, x2, ..., xK} are scored against the hypothesized class model p(x|λ1) (H0?) and against the universal background model (H1?)
[Figure: X_test compared to the hypothesized class model and to the UBM in feature space]
slide from Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Hypothesis Test
 Concretely: score X_test = {x1, x2, ..., xK} against a speaker model p(x|λ1) ("Dan?") and against the UBM ("not Dan?")
[Figure: X_test compared to Dan's speaker model and to the UBM in feature space]
slide from Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
More details on GMMs
Adapted GMMs
 Instead of training the speaker model on only that speaker's data, adapt the UBM to the speaker
 takes advantage of all the data
 The basic idea is to start with a single background model that represents general speech
 Using target speaker training data, "tune" the general model to the specifics of the target speaker
 This "tuning" is done via unsupervised Bayesian adaptation (MAP adaptation)
 MAP adaptation: the new mean of each Gaussian is a weighted mix of the UBM mean and the speaker's speech
 Weigh the speaker data more if we have more data:
 new μi = α Ei(x) + (1−α) old μi,  with α = n/(n+16)
[Figure: target training data points, the UBM, and the adapted target model]
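A minimal sketch of the MAP mean update above, assuming a UBM trained with scikit-learn's GaussianMixture and a matrix of the target speaker's MFCC frames (the variable names are assumptions, not from the slides); weights and covariances are left unadapted, as is common in GMM-UBM systems.

```python
# MAP adaptation of GMM means only: new_mu = alpha*E_i(x) + (1-alpha)*old_mu
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, speaker_feats: np.ndarray, r: float = 16.0):
    """Return speaker-adapted means; speaker_feats is (n_frames, n_dims)."""
    post = ubm.predict_proba(speaker_feats)          # posterior of each component per frame
    n_i = post.sum(axis=0)                           # soft frame count per component
    # E_i(x): posterior-weighted mean of the speaker data for each component
    E_i = (post.T @ speaker_feats) / np.maximum(n_i[:, None], 1e-10)
    alpha = n_i / (n_i + r)                          # relevance factor r (16 on the slide)
    return alpha[:, None] * E_i + (1.0 - alpha)[:, None] * ubm.means_
```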
Gaussian mixture models
 Features are standard MFCCs
 can use more dimensions (20 + deltas)
 UBM background model: 512–2048 mixtures
 Speaker's GMM: 64–256 mixtures
 Often combined with other classifiers in a mixture-of-experts
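A hedged end-to-end sketch of GMM-UBM scoring with the sizes above, assuming arrays of MFCC frames with invented names (ubm_feats pooled from many speakers, spk_feats from the target speaker, test_feats from a trial); in practice the speaker model would be MAP-adapted from the UBM as in the earlier sketch rather than trained from scratch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

ubm = GaussianMixture(n_components=512, covariance_type="diag").fit(ubm_feats)
spk = GaussianMixture(n_components=64, covariance_type="diag").fit(spk_feats)

# LLR = log p(X|H1) - log p(X|H0), averaged over frames
# score() returns the mean per-frame log-likelihood
llr = spk.score(test_feats) - ubm.score(test_feats)
accept = llr > 0.0   # the threshold would be tuned on development data
```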
SVM
 Train a one-versus-all discriminative classifier
 Various kernels
 Combine with GMM
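One common way to combine the two is the GMM-supervector SVM; the sketch below (variable names utterance_feats, labels, and ubm are assumptions) stacks each utterance's MAP-adapted GMM means into a "supervector" and trains a one-versus-all linear SVM on them, reusing map_adapt_means from the earlier sketch.

```python
import numpy as np
from sklearn.svm import LinearSVC

def supervector(ubm, feats):
    # Concatenate the adapted component means into one long vector
    return map_adapt_means(ubm, feats).ravel()

X = np.array([supervector(ubm, f) for f in utterance_feats])  # one row per utterance
svm = LinearSVC(C=1.0).fit(X, labels)   # labels: +1 target speaker, -1 impostors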
Other features
 Prosody
 Phone sequences
 Language Model features
Speaker information in word bigrams
Doddington (2001)
A bigram is just the occurrence of two tokens in sequence
Word bigrams can be very informative about speaker identity
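A tiny illustration of word-bigram features of the kind Doddington (2001) exploited: count each speaker's bigrams and compare their relative frequencies (the example sentence is invented).

```python
from collections import Counter

def bigram_counts(words):
    # A bigram is just two adjacent tokens
    return Counter(zip(words, words[1:]))

speaker_a = "you know it was you know kind of".split()
print(bigram_counts(speaker_a).most_common(3))   # ('you', 'know') occurs twice
```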
Evaluation Metric
 Trial: are a pair of audio samples spoken by the same person?
 Two types of errors:
 False reject = Miss: incorrectly reject a true trial (Type I error)
 False accept: incorrectly accept a false trial (Type II error)
 Performance is a trade-off between these two errors
 Controlled by adjustment of the decision threshold
slide from Douglas Reynolds
ROC and DET curves
P(false reject) vs. P(false accept) shows system performance
slide from Douglas Reynolds
DET curve
Application operating point depends on relative costs of the two errors
slide from Douglas Reynolds
Evaluation Design: Data Selection Factors
 Performance numbers are only meaningful when the evaluation conditions are known; performance numbers depend on evaluation conditions
 Speech quality
 Channel and microphone characteristics
 Ambient noise level and type
 Variability between enrollment and verification speech
 Speech modality
 Fixed/prompted/user-selected phrases
 Free text
 Speech duration
 Duration and number of sessions of enrollment and verification speech
 Speaker population
 Size and composition
 Experience
The evaluation data and design should match the target application domain of interest
slide from Douglas Reynolds, MIT Lincoln Laboratory
Rough historical trends in performance
slide from Douglas Reynolds
Milestones in the NIST SRE Program
1992 – DARPA: limited speaker id evaluation
1996 – First SRE in current series
2000 – AHUMADA Spanish data, first non-English speech
2001 – Cellular data
2001 – ASR transcripts provided
2002 – FBI “forensic” database
2005 – Multiple languages with bilingual speakers
2005 – Room mic recordings, cross-channel trials
2008 – Interview data
2010 – New decision cost function: lower FA rate region
2010 – High and low vocal effort, aging
2011 – Broad range of conditions, including noise and reverb
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
Metrics
 Equal Error Rate
 Easy to understand
 Not the operating point of interest
 FA rate at a fixed miss rate
 E.g., 10%
 May be viewed as the cost of listening to false alarms
 Decision Cost Function
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
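A sketch of computing the Equal Error Rate from trial scores, assuming higher scores mean "same speaker"; the arrays target_scores and nontarget_scores are assumed inputs, not from the slides.

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])    # false rejects
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false accepts
    i = np.argmin(np.abs(miss - fa))   # threshold where the two error rates cross
    return (miss[i] + fa[i]) / 2.0
```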
Decision Cost Function CDet
 Weighted sum of miss and false alarm error probabilities:
CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)
 Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget:
              ’96-’08    2010
CMiss           10        1
CFalseAlarm      1        1
PTarget          0.01     0.001
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
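A one-function sketch of the cost function above with the two NIST parameter sets from the table; the example miss and false alarm rates are invented inputs that would normally come from counting errors at a chosen threshold.

```python
def c_det(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    # CDet = CMiss * PMiss|Target * PTarget + CFA * PFA|NonTarget * (1 - PTarget)
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

cost_96_08 = c_det(p_miss=0.10, p_fa=0.02)                               # '96-'08 parameters
cost_2010 = c_det(p_miss=0.10, p_fa=0.02, c_miss=1.0, p_target=0.001)   # 2010 parameters
```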
Accuracies
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
How good are humans?
Bruce E. Koenig. 1986. Spectrographic voice identification: A forensic survey. J. Acoust. Soc. Am. 79(6)
 Survey of 2000 voice IDs made by trained FBI employees
 select similarly pronounced words
 use spectrograms (comparing formants, pitch, timing)
 listen back and forth
 Evaluated based on "interviews and other evidence in the investigation" and legal conclusions
No decision: 65.2% (1304)
Non-match: 18.8% (378)
Match: 15.9% (318)
FR = 0.53% (2)
FA = 0.31% (1)
Speaker diarization
 Conversational telephone speech
 2 speakers
 Broadcast news
 Many speakers, although often in dialogue (interviews) or in sequence (broadcast segments)
 Meeting recordings
 Many speakers, lots of overlap and disfluencies
[Figure from Tranter and Reynolds 2006: example of audio diarization on broadcast news; annotated phenomena may include different structural regions such as commercials, different acoustic events such as music or noise, and different speakers]
Tranter and Reynolds 2006
Speaker diarization
Tranter and Reynolds 2006
Step 1: Speech Activity Detection
Meetings or broadcast:
 Use supervised GMMs
 two models: speech/non-speech
 or could have extra models for music, etc.
 Then do Viterbi segmentation, possibly with
 minimum length constraints or
 smoothing rules
Telephone
 Simple energy/spectrum speech activity detection
State of the art:
 Broadcast: 1% miss, 1-2% false alarm
 Meeting: 2% miss, 2-3% false alarm
Tranter and Reynolds 2006
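A minimal sketch of the "simple energy" speech activity detection mentioned for telephone speech: frame the signal, threshold log-energy relative to the peak, and smooth the decision. The frame sizes and threshold are illustrative assumptions, not values from the slides.

```python
import numpy as np

def energy_sad(signal, sr=8000, frame_ms=25, hop_ms=10, offset_db=30.0):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame) // hop
    energy_db = np.array([10 * np.log10(np.sum(signal[i*hop:i*hop+frame] ** 2) + 1e-10)
                          for i in range(n_frames)])
    speech = energy_db > (energy_db.max() - offset_db)   # threshold relative to peak energy
    # Simple smoothing: a frame is speech if a majority of its 5-frame neighborhood is
    return np.convolve(speech.astype(float), np.ones(5), mode="same") > 2.5
```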
Step 2: Change Detection
1. Look at adjacent windows of data
2. Calculate distance between them
3. Decide whether windows come from same source
 Two common methods:
 1. Look for change points within a window using a likelihood ratio test: is the window better modeled by one distribution or two?
 If two, insert a change point and start a new window there
 If one, expand the window and check again
 2. Represent each window by a Gaussian, compare neighboring windows with the KL distance, find peaks in the distance function, and threshold
Tranter and Reynolds 2006
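A sketch of the second change-detection method above: fit a diagonal Gaussian to each sliding window of frames, compute a symmetric KL distance between adjacent windows, and call peaks above a threshold speaker-change points. The window size and threshold are illustrative assumptions.

```python
import numpy as np

def sym_kl_diag(m1, v1, m2, v2):
    # Symmetric KL divergence between two diagonal Gaussians (log terms cancel)
    return 0.5 * np.sum((v1 / v2 + v2 / v1) + (m1 - m2) ** 2 * (1 / v1 + 1 / v2) - 2)

def change_points(feats, win=100, threshold=50.0):
    dists = []
    for t in range(win, len(feats) - win):
        left, right = feats[t - win:t], feats[t:t + win]
        dists.append(sym_kl_diag(left.mean(0), left.var(0) + 1e-6,
                                 right.mean(0), right.var(0) + 1e-6))
    dists = np.array(dists)
    # Local maxima above the threshold are hypothesized speaker changes
    return [t + win for t in range(1, len(dists) - 1)
            if dists[t] > threshold and dists[t] > dists[t - 1] and dists[t] > dists[t + 1]]
```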
Step 3: Gender Classification
 Supervised GMMs
 If doing Broadcast news, also do bandwidth
classification (studio wideband speech versus
narrowband telephone speech)
Tranter and Reynolds 2006
Step 4: Clustering
Hierarchical agglomerative clustering:
1. initialize leaf clusters of the tree with the speech segments;
2. compute pair-wise distances between each cluster;
3. merge the closest clusters;
4. update distances of the remaining clusters to the new cluster;
5. iterate steps 2-4 until a stopping criterion is met
Tranter and Reynolds 2006
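A hedged sketch of the agglomerative loop above, reusing sym_kl_diag from the change-detection sketch as the cluster distance and a distance threshold as the stopping criterion; real diarization systems typically use BIC or GLR distances instead.

```python
import numpy as np

def cluster_segments(segments, stop_dist=100.0):
    # Step 1: one leaf cluster per speech segment (each holds its frames)
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > 1:
        # Step 2: pair-wise distances between clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                d = sym_kl_diag(a.mean(0), a.var(0) + 1e-6, b.mean(0), b.var(0) + 1e-6)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > stop_dist:          # step 5: stopping criterion
            break
        # Steps 3-4: merge the closest pair; distances to the merged cluster
        # are recomputed from its pooled frames on the next iteration
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```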
Step 5: Resegmentation
 Use final clusters and non-speech models
 To resegment data via Viterbi decoding
 Goal:
 refine original segmentation
 fix short segments that may have been removed
Tranter and Reynolds 2006
TDOA features
 For meetings, with multiple-microphones
 Time-Delay-of-Arrival (TDOA) features
 correlate signals from mikes and figure out time shift
 used to sync up multiple microphones
 and as a feature for speaker localization
 assume the speaker doesn’t move, so they stay near the same microphone
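A sketch of time-delay estimation by cross-correlating two microphone channels with plain numpy; real meeting systems usually use GCC-PHAT, which whitens the spectrum first, and the max_lag value here is an illustrative assumption.

```python
import numpy as np

def tdoa_samples(mic1, mic2, max_lag=800):
    """Return the lag (in samples) at which the two channels align best."""
    corr = np.correlate(mic1, mic2, mode="full")        # full cross-correlation
    lags = np.arange(-len(mic2) + 1, len(mic1))
    keep = np.abs(lags) <= max_lag                      # limit to physically plausible delays
    return lags[keep][np.argmax(corr[keep])]
```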
Evaluation
 Systems give start-stop times of speech segments with speaker labels
 non-scoring "collar" of 250 ms on either side of reference boundaries
 DER (Diarization Error Rate) is the sum of:
 missed speech (% of speech in the ground truth but not in the hypothesis)
 false alarm speech (% of speech in the hypothesis but not in the ground truth)
 speaker error (% of speech assigned to the wrong speaker)
Recent mean DER for Multiple Distant Mics (MDM): 8-10%
Recent mean DER for a Single Distant Mic (SDM): 12-18%
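A simplified frame-level DER sketch: reference and hypothesis are arrays of per-frame speaker labels (None for non-speech), and hypothesis speakers are mapped to reference speakers with an optimal one-to-one assignment. Real scoring tools such as NIST's md-eval work on time segments and apply the 250 ms collar.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def frame_der(ref, hyp):
    ref, hyp = np.asarray(ref, dtype=object), np.asarray(hyp, dtype=object)
    ref_speech, hyp_speech = ref != None, hyp != None
    missed = np.sum(ref_speech & ~hyp_speech)
    false_alarm = np.sum(~ref_speech & hyp_speech)
    # Optimal mapping of hypothesis speakers to reference speakers on overlapping frames
    both = ref_speech & hyp_speech
    r_ids, h_ids = sorted(set(ref[both])), sorted(set(hyp[both]))
    overlap = np.array([[np.sum(both & (ref == r) & (hyp == h)) for h in h_ids] for r in r_ids])
    rows, cols = linear_sum_assignment(-overlap)   # maximize correctly attributed frames
    speaker_error = np.sum(both) - overlap[rows, cols].sum()
    return (missed + false_alarm + speaker_error) / max(np.sum(ref_speech), 1)
```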
Summary: Speaker Recognition Tasks
 Identification: Whose voice is this?
 Verification/Authentication/Detection: Is this Bob's voice?
 Segmentation and Clustering (Diarization):
 Where are speaker changes? (Speaker A / Speaker B)
 Which segments are from the same speaker?
slide from Douglas Reynolds