CS 224S / LINGUIST 285 Spoken Language Processing
Dan Jurafsky, Stanford University, Spring 2014
Lecture 15: Speaker Recognition
Lots of slides thanks to Douglas Reynolds

Why speaker recognition?
Access control: physical facilities; websites, computer networks
Transaction authentication: telephone banking; remote credit card purchase
Law enforcement: forensics; surveillance
Speech data mining: meeting summarization; lecture transcription
slide text from Douglas Reynolds

Three Speaker Recognition Tasks
Identification: Whose voice is this?
Verification/Authentication/Detection: Is this Bob's voice?
Segmentation and Clustering (Diarization): Where are speaker changes? Which segments are from the same speaker?
slide from Douglas Reynolds

Two kinds of speaker verification
Text-dependent: users have to say something specific; easier for the system
Text-independent: users can say whatever they want; more flexible but harder

Phases of a Speaker Detection System
Two distinct phases to any speaker verification system.
Enrollment phase: enrollment speech from each speaker (e.g., Bob, Sally) goes through feature extraction and model training, producing a model for each speaker.
Detection phase: test speech and a hypothesized identity (e.g., Sally) go through feature extraction and a detection decision ("Detected!").
slide from Douglas Reynolds, MIT Lincoln Laboratory

Detection: Likelihood Ratio
Two-class hypothesis test:
H0: X is not from the hypothesized speaker
H1: X is from the hypothesized speaker
Choose the most likely hypothesis.
Likelihood ratio test: accept H1 if p(X|H1)/p(X|H0) exceeds a decision threshold.
slide from Douglas Reynolds

Speaker ID Log-Likelihood Ratio Score
LLR = Λ = log p(X|H1) − log p(X|H0)
Need two models:
Hypothesized speaker model for H1
Alternative (background) model for H0
slide from Douglas Reynolds

How do we get H0, the alternative model?
Pool speech from several speakers and train a single model: a universal background model (UBM).
Can train one UBM and use it as the background model for all speakers.
Should be trained on speech representing the expected impostor speech, of the same type as the speaker enrollment speech (modality, language, channel).
Slide adapted from Chu, Bimbot, Bonastre, Fredouille, Gravier, Magrin-Chagnolleau, Meignier, Merlin, Ortega-Garcia, Petrovska-Delacretaz, Reynolds
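To make the likelihood-ratio test concrete, here is a minimal sketch in Python (not from the slides). It assumes two already-trained density models that expose per-frame log-likelihoods via score_samples(), as scikit-learn density estimators do; the function name, the per-frame averaging, and the threshold of 0 are illustrative choices, not part of the lecture.

```python
import numpy as np

def llr_decision(X, speaker_model, background_model, threshold=0.0):
    """Likelihood-ratio detection for one trial.

    X: per-frame feature vectors (e.g., MFCCs), shape (frames, dims).
    speaker_model:    hypothesized-speaker model (H1).
    background_model: alternative/background model (H0), e.g., a UBM.
    Both are assumed to expose score_samples(X) -> per-frame log-likelihoods.
    """
    log_p_h1 = np.mean(speaker_model.score_samples(X))     # (1/T) log p(X|H1)
    log_p_h0 = np.mean(background_model.score_samples(X))  # (1/T) log p(X|H0)
    llr = log_p_h1 - log_p_h0                              # Lambda, averaged per frame
    return llr >= threshold, llr   # accept the claimed identity if Lambda clears the threshold
```

Averaging per frame keeps scores comparable across test utterances of different lengths.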
How to compute P(H|X)?
Gaussian Mixture Models (GMM): the traditional best model for text-independent speaker recognition.
Support Vector Machines (SVM): more recent use of a discriminative model.

Speaker Models
GMM/HMM: the form of the HMM depends on the application.
Fixed phrase ("Open sesame"): word/phrase models
Prompted phrases/passwords: phoneme models (/e/, /t/, /n/)
Text-independent (general speech): single-state HMM (GMM)
slide from Douglas Reynolds

GMMs for speaker recognition
A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions:
p(x|λ) = Σ_i w_i N(x; μ_i, Σ_i)
Each Gaussian component i has a mean μ_i, a covariance Σ_i, and a weight w_i.
[figure: a GMM modeling data in a two-dimensional feature space (Dim 1 vs. Dim 2)]

GMM training
During training, the system learns about the data it uses to make decisions.
A set of features is collected from a speaker (or language or dialect), and the model p(x|λ) is fit to them.

Recognition systems for language, dialect, and speaker ID
In LID, DID, and SID, we train a set of target models λ_C, one for each language, dialect, or speaker, each defining p(x|λ_C).
We also train a universal background model representing all speech.

Hypothesis Test
Given a set of test observations X_test = {x_1, x_2, ..., x_K}, we perform a hypothesis test to determine whether a certain class produced it.
(Note: these slides label the hypotheses in the opposite way from the earlier likelihood-ratio slide.)
H0: X_test is from the hypothesized class
H1: X_test is not from the hypothesized class
The test compares the likelihood under the target model (e.g., "Dan?") with the likelihood under the UBM ("not Dan?").
slides from Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
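A rough sketch of this GMM/UBM setup, assuming scikit-learn's GaussianMixture (the slides do not prescribe a toolkit). Feature extraction is assumed to have been done already, producing (frames x dims) MFCC matrices; the function names are mine, and the mixture counts follow the typical ranges given later in the lecture.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_feature_list, n_components=512):
    """Pool frames from many background speakers and fit one GMM: the UBM."""
    pooled = np.vstack(background_feature_list)
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(pooled)
    return ubm

def train_speaker_gmm(enrollment_features, n_components=64):
    """Fit a smaller GMM on one speaker's enrollment frames.
    (GMM-UBM systems usually MAP-adapt the UBM instead of training
    from scratch; the direct fit keeps this sketch short.)"""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    gmm.fit(enrollment_features)
    return gmm

def verification_score(test_features, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio: target GMM vs. UBM."""
    return speaker_gmm.score(test_features) - ubm.score(test_features)
```

scikit-learn's score() already returns the average per-frame log-likelihood, so the difference is the per-frame LLR described earlier.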
More details on GMMs

Adapted GMMs
Instead of training the speaker model on only that speaker's data, adapt the UBM to the speaker.
The basic idea is to start with a single background model that represents general speech; the advantage is that it takes account of all the data.
Using the target speaker's training data, "tune" the general model to the specifics of the target speaker.
This "tuning" is done via unsupervised Bayesian (MAP) adaptation.
MAP adaptation: the new mean of each Gaussian is a weighted mix of the UBM mean and the speaker's speech, weighing the speaker more if we have more data:
μ̂_i = α E_i(x) + (1 − α) μ_i, with α = n/(n + 16)
where n is the count of target-speaker frames assigned to Gaussian i.
[figure: target training data pulls the UBM components toward the target model]

Gaussian mixture models
Features are normal MFCCs; can use more dimensions (20 + deltas).
UBM background model: 512-2048 mixtures.
Speaker's GMM: 64-256 mixtures.
Often combined with other classifiers in a mixture of experts.

SVM
Train a one-versus-all discriminative classifier.
Various kernels; combine with GMM.

Other features
Prosody, phone sequences, language model features.
Speaker information in word bigrams (Doddington 2001): a bigram is just the occurrence of two tokens in a sequence, and word bigrams can be very informative about speaker identity.

Evaluation Metric
Trial: is a pair of audio samples spoken by the same person?
Two types of errors:
False reject = miss: incorrectly reject a true trial (Type I error)
False accept: incorrectly accept a false trial (Type II error)
Performance is a trade-off between these two errors, controlled by adjusting the decision threshold.
slide from Douglas Reynolds

ROC and DET curves
Plotting P(false reject) vs. P(false accept) shows system performance.
The application operating point depends on the relative costs of the two errors.
slides from Douglas Reynolds

Evaluation Design: Data Selection Factors
Performance numbers are only meaningful when the evaluation conditions are known; performance numbers depend on evaluation conditions:
Speech quality: channel and microphone characteristics; ambient noise level and type; variability between enrollment and verification speech
Speech modality: fixed/prompted/user-selected phrases; free text
Speech duration: duration and number of sessions of enrollment and verification speech
Speaker population: size and composition; experience
The evaluation data and design should match the target application domain of interest.
slide from Douglas Reynolds, MIT Lincoln Laboratory

Rough historical trends in performance
slide from Douglas Reynolds

Milestones in the NIST SRE Program
1992 - DARPA: limited speaker id evaluation
1996 - First SRE in current series
2000 - AHUMADA Spanish data, first non-English speech
2001 - Cellular data
2001 - ASR transcripts provided
2002 - FBI "forensic" database
2005 - Multiple languages with bilingual speakers
2005 - Room mic recordings, cross-channel trials
2008 - Interview data
2010 - New decision cost function: lower FA rate region
2010 - High and low vocal effort, aging
2011 - Broad range of conditions, including noise and reverb
From Alvin Martin's 2012 talk on the NIST SR Evaluations

Metrics
Equal Error Rate: easy to understand, but not the operating point of interest.
FA rate at a fixed miss rate (e.g., 10%): may be viewed as the cost of listening to false alarms.
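As an illustration of these metrics, here is a small sketch of estimating the equal error rate from trial scores. It simply sweeps a threshold over the observed scores rather than using the official NIST scoring tools, and the score lists are assumed to come from a detector like the one sketched earlier.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER: the point where the miss rate equals the false-alarm rate.

    target_scores:   detection scores for true (same-speaker) trials.
    impostor_scores: detection scores for false (different-speaker) trials.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))

    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        p_miss = np.mean(target_scores < t)    # false rejects among true trials
        p_fa = np.mean(impostor_scores >= t)   # false accepts among impostor trials
        if abs(p_miss - p_fa) < best_gap:
            best_gap = abs(p_miss - p_fa)
            eer = (p_miss + p_fa) / 2.0
    return eer

# toy example: perfectly separated scores give EER = 0.0
print(equal_error_rate([2.1, 1.7, 2.5, 0.9], [-1.2, 0.3, -0.5, 0.1]))
```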
Decision Cost Function
C_Det: a weighted sum of the miss and false-alarm error probabilities:
C_Det = C_Miss × P_Miss|Target × P_Target + C_FalseAlarm × P_FalseAlarm|NonTarget × (1 − P_Target)
The parameters are the relative costs of the detection errors, C_Miss and C_FalseAlarm, and the a priori probability of the specified target speaker, P_Target:
Parameter        '96-'08   2010
C_Miss               10       1
C_FalseAlarm          1       1
P_Target           0.01   0.001
From Alvin Martin's 2012 talk on the NIST SR Evaluations

Accuracies
From Alvin Martin's 2012 talk on the NIST SR Evaluations

How good are humans?
Bruce E. Koenig. 1986. Spectrographic voice identification: A forensic survey. J. Acoust. Soc. Am. 79(6).
Survey of 2000 voice IDs made by trained FBI employees:
select similarly pronounced words
use spectrograms (comparing formants, pitch, timing)
listen back and forth
Evaluated based on "interviews and other evidence in the investigation" and legal conclusions:
No decision: 65.2% (1304)
Non-match: 18.8% (378), of which false rejects (FR) = 0.53% (2)
Match: 15.9% (318), of which false accepts (FA) = 0.31% (1)

Speaker diarization (Tranter and Reynolds 2006)
Conversational telephone speech: 2 speakers.
Broadcast news: many speakers, although often in dialogue (interviews) or in sequence (broadcast segments).
Meeting recordings: many speakers, lots of overlap and disfluencies.
[figure from Tranter and Reynolds 2006, Fig. 1: example of audio diarization on broadcast news; annotated phenomena may include different structural regions such as commercials, different acoustic events such as music or noise, and different speakers]

Step 1: Speech Activity Detection
Meetings or broadcast: use supervised GMMs, with two models (speech/non-speech) or extra models for music, etc.; then do Viterbi segmentation, possibly with minimum length constraints or smoothing rules.
Telephone: simple energy/spectrum speech activity detection.
State of the art: broadcast 1% miss, 1-2% false alarm; meeting 2% miss, 2-3% false alarm.
Tranter and Reynolds 2006

Step 2: Change Detection
1. Look at adjacent windows of data
2. Calculate the distance between them
3. Decide whether the windows come from the same source
Two common methods:
To look for change points within a window, use a likelihood ratio test to see whether the window is better modeled by one distribution or two; if two, insert a change and start a new window there; if one, expand the window and check again.
Represent each window by a Gaussian, compare neighboring windows with KL distance, find peaks in the distance function, and threshold.
Tranter and Reynolds 2006

Step 3: Gender Classification
Supervised GMMs.
For broadcast news, also do bandwidth classification (studio wideband speech versus narrowband telephone speech).
Tranter and Reynolds 2006

Step 4: Clustering
Hierarchical agglomerative clustering (see the sketch below):
1. initialize leaf clusters of the tree with speech segments
2. compute pair-wise distances between each cluster
3. merge the closest clusters
4. update distances of the remaining clusters to the new cluster
5. iterate steps 2-4 until a stopping criterion is met
Tranter and Reynolds 2006
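The sketch referenced above: a minimal Python version of this agglomerative loop. It assumes each segment is already a (frames x dims) feature matrix, models every cluster with a single diagonal Gaussian, and uses a symmetric KL divergence with an arbitrary stopping threshold; real diarization systems typically use BIC or GLR distances and carefully tuned stopping criteria.

```python
import numpy as np

def gauss(feats):
    """Diagonal-Gaussian summary of a cluster's pooled frames."""
    return feats.mean(axis=0), feats.var(axis=0) + 1e-6

def sym_kl(a, b):
    """Symmetric KL divergence between two diagonal Gaussians."""
    (m1, v1), (m2, v2) = a, b
    kl12 = 0.5 * np.sum(v1 / v2 + (m2 - m1) ** 2 / v2 - 1 + np.log(v2 / v1))
    kl21 = 0.5 * np.sum(v2 / v1 + (m1 - m2) ** 2 / v1 - 1 + np.log(v1 / v2))
    return kl12 + kl21

def agglomerative_cluster(segments, stop_distance):
    clusters = [[i] for i in range(len(segments))]   # step 1: one leaf per segment
    feats = [segments[i] for i in range(len(segments))]
    while len(clusters) > 1:
        models = [gauss(f) for f in feats]
        # step 2: pairwise distances between current clusters
        pairs = ((sym_kl(models[i], models[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        d, i, j = min(pairs)
        if d > stop_distance:                        # step 5: stopping criterion
            break
        # steps 3-4: merge the closest pair and update its pooled data
        clusters[i] += clusters.pop(j)
        feats[i] = np.vstack([feats[i], feats.pop(j)])
    return clusters   # lists of segment indices, one list per hypothesized speaker
```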
Step 5: Resegmentation
Use the final clusters and non-speech models to resegment the data via Viterbi decoding.
Goal: refine the original segmentation and fix short segments that may have been removed.
Tranter and Reynolds 2006

TDOA features
For meetings with multiple microphones, Time-Delay-of-Arrival (TDOA) features correlate the signals from the mikes and figure out the time shift between them.
Used to sync up multiple microphones, and as a feature for speaker localization: assume the speaker doesn't move, so they stay near the same microphone.

Evaluation
Systems give start-stop times of speech segments with speaker labels; a non-scoring "collar" of 250 ms is allowed on either side of reference boundaries.
DER (Diarization Error Rate) combines:
missed speech (% of speech in the ground truth but not in the hypothesis)
false alarm speech (% of speech in the hypothesis but not in the ground truth)
speaker error (% of speech assigned to the wrong speaker)
Recent mean DER for Multiple Distant Mikes (MDM): 8-10%
Recent mean DER for a Single Distant Mike (SDM): 12-18%

Summary: Speaker Recognition Tasks
Identification: Whose voice is this?
Verification/Authentication/Detection: Is this Bob's voice?
Segmentation and Clustering (Diarization): Where are speaker changes? Which segments are from the same speaker?
slide from Douglas Reynolds
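To make the DER definition above concrete, here is a very simplified frame-level sketch, not the official scoring tool. It assumes per-frame speaker labels and that hypothesis labels have already been mapped onto reference labels, and it ignores the 250 ms collar and overlapped speech, both of which a real scorer handles.

```python
def simple_der(ref, hyp):
    """Simplified frame-level Diarization Error Rate.

    ref, hyp: per-frame speaker labels, with "" marking non-speech.
    Assumes hypothesis speaker labels are already mapped to reference labels
    (a real scorer finds the optimal mapping and applies the no-score collar).
    """
    assert len(ref) == len(hyp)
    missed = false_alarm = spk_err = speech = 0
    for r, h in zip(ref, hyp):
        if r:                    # reference says speech
            speech += 1
            if not h:
                missed += 1      # missed speech
            elif h != r:
                spk_err += 1     # speaker error (confusion)
        elif h:
            false_alarm += 1     # false-alarm speech
    return (missed + false_alarm + spk_err) / max(speech, 1)

# toy example: speakers A and B, "" = non-speech, one label per 10 ms frame
ref = ["A", "A", "A", "", "B", "B", "B", "B"]
hyp = ["A", "A", "",  "", "B", "A", "B", "B"]
print(simple_der(ref, hyp))   # (1 miss + 0 FA + 1 confusion) / 7 speech frames ≈ 0.286
```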