An Intro to Speaker Recognition
Nikki Mirghafori
Acknowledgment: some slides borrowed from the Heck & Reynolds tutorial, and A. Stolcke.

Today’s class
• Interactive
• Measures of success for today:
  – You talk at least as much as I do
  – You learn and remember the basics
  – You feel you can do this stuff
  – We all have fun with the material!

Nikki Mirghafori EECS 225D -- Verification 2 4/23/12

A 10-minute “Project Design”
• You are experts with different backgrounds. Your previous startup companies were wildly successful. A large VC firm in the valley wants to fund YOUR next creation, as long as the project is in speaker recognition.
• The VC funding is yours if you come up with a coherent plan/list of issues:
  – What is your proposed application?
  – What will be the sources of error and variability, i.e., technology challenges?
  – What types of features will you use?
  – What sorts of statistical modeling tools/techniques?
  – What will be your data needs?
  – Any other issues you can think of along your path?

Extracting Information from Speech
• What’s noise? What’s signal?
• Orthogonal in many ways
• Use many of the same models and tools
• Goal: automatically extract information transmitted in the speech signal
  – Speech recognition → words (“How are you?”)
  – Language recognition → language name (English)
  – Speaker recognition → speaker name (James Wilson)

Speaker Recognition Applications
• Access control
  – Physical facilities
  – Data and data networks
• Transaction authentication
  – Telephone credit card purchases
  – Bank wire transfers
  – Fraud detection
• Monitoring
  – Remote time and attendance logging
  – Home parole verification
• Information retrieval
  – Customer information for call centers
  – Audio indexing (speech skimming device)
  – Personalization
• Forensics
  – Voice sample matching

Tasks
• Identification vs.
verification
• Closed set vs. open set identification
• Also: segmentation, clustering, tracking...

Identification
• Test speech is scored against a speaker model database: whose voice is it?
• Closed-set speaker identification: the answer must be one of the enrolled speakers
• Open-set speaker identification: “none of the above” is also a valid answer

Verification/Authentication/Detection
• Claimant says “It’s me!”; test speech is compared against the claimed speaker’s model
• Does the voice match? Output: Yes/No
• Verification requires a claimant ID

Speech Modalities
• Text-dependent recognition
  – Recognition system knows text spoken by person
  – Examples: fixed phrase, prompted phrase
  – Used for applications with strong control over user input
  – Knowledge of spoken text can improve system performance
• Text-independent recognition
  – Recognition system does not know text spoken by person
  – Examples: user-selected phrase, conversational speech
  – Used for applications with less control over user input
  – More flexible system, but also a more difficult problem
  – Speech recognition can provide knowledge of spoken text
  – Text-constrained recognition: exercise for the reader

Text-constrained Recognition
• Basic idea: build speaker models for words rich in speaker information
• Example: “What time did you say? that’s a good plan.”
• Keyword models rich in speaker information: um...
okay, I_think
• A text-dependent strategy in a text-independent context

Voice as a Biometric
• Biometric: a human-generated signal or attribute for authenticating a person’s identity
• Voice is a popular biometric:
  – natural signal to produce
  – does not require a specialized input device
  – ubiquitous: telephones and microphone-equipped PCs
• Voice biometric combines with other forms of security (strongest when combined):
  – Something you have – e.g., badge
  – Something you know – e.g., password
  – Something you are – e.g., voice

How to Build a System?
• Feature choices:
  – low level (MFCC, PLP, LPC, F0, ...) and high level (words, phones, prosody, ...)
• Types of models:
  – HMM, GMM, Support Vector Machines (SVM), DTW, Nearest Neighbor, Neural Nets
• Making decisions:
  – log-likelihood thresholds; threshold setting for the desired operating point
• Other issues: normalization (znorm, tnorm), optimal data selection to match expected conditions, channel variability, noise, etc.
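One simplified way to compute the low-level cepstral features mentioned above (frame the signal, window, take the log magnitude spectrum, then a cosine transform) can be sketched as follows. This is an illustrative sketch only: it omits the mel filterbank that real MFCC front ends use, and the frame parameters are hypothetical, not from the slides.

```python
import numpy as np

def cepstral_features(signal, frame_len=400, hop=160, n_ceps=13):
    """Simplified cepstral analysis: frame -> window -> |FFT| -> log -> DCT.
    (Illustrative only; real MFCCs insert a mel filterbank before the log.)"""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))      # magnitude spectrum
        log_mag = np.log(mag + 1e-10)         # log compression
        # DCT-II of the log spectrum -> cepstral coefficients
        n = len(log_mag)
        k = np.arange(n_ceps)[:, None]
        basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
        feats.append(basis @ log_mag)
    return np.array(feats)                    # shape: (n_frames, n_ceps)
```

At a typical 8 kHz telephone sampling rate, these hypothetical defaults correspond to 50 ms frames with a 20 ms hop; each utterance becomes a sequence of short feature vectors for the models discussed next.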
Verification Performance
• There are many factors to consider in the design of an evaluation of a speaker verification system:
• Speech quality
  – Channel and microphone characteristics
  – Noise level and type
  – Variability between enrollment and verification speech
• Speech modality
  – Fixed/prompted/user-selected phrases
  – Free text
• Speech duration
  – Duration and number of sessions of enrollment and verification speech
• Speaker population
  – Size and composition
• Most importantly: the evaluation data and design should match the target application domain of interest

Verification Performance
[DET plot: probability of false reject (%) vs. probability of false accept (%) for several conditions]
• Text-independent (read sentences), military radio data: multiple radios & microphones, moderate amount of training data
• Text-independent (conversational), telephone data: multiple microphones, moderate amount of training data
• Text-dependent (digit strings), telephone data: multiple microphones, small amount of training data
• Text-dependent (combinations), clean data: single microphone, large amount of train/test speech

Verification Performance
[Example DET performance curve; Equal Error Rate (EER) = 1%]
• The application operating point depends on the relative costs of the two error types
• High security (e.g., wire transfer): false acceptance is very costly; users may tolerate rejections for security
• Balance: Equal Error Rate (EER)
• High convenience (e.g., customization): false rejections alienate customers; any customization is beneficial

Human vs.
Machine
• Motivation for comparing human to machine
  – Evaluating speech coders and potential forensic applications
• Schmidt-Nielsen and Crystal used NIST evaluation data (DSP Journal, January 2000)
  – Same amount of training data; 3-sec conversational utterances from telephone speech
  – Matched handset-type tests: humans were 15% worse than the machine
  – Mismatched handset-type tests: humans were 44% better than the machine

Features
• Desirable attributes of features for an automatic system (Wolf ’72):
  – Practical: occur naturally and frequently in speech; easily measurable
  – Robust: not change over time or be affected by the speaker’s health; not affected by reasonable background noise nor dependent on specific transmission characteristics
  – Secure: not subject to mimicry
• No feature has all these attributes

Training & Test Phases
• Enrollment phase: training speech for each speaker → feature extraction → model training → a model for each speaker
• Recognition phase (e.g., verification): claimant says “It’s me!” → feature extraction → verification decision → Accepted or Rejected

Decision Making
• Verification decision approaches have roots in signal detection theory
• 2-class hypothesis test:
  – H0: the speaker is an impostor
  – H1: the speaker is indeed the claimed speaker.
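The two-hypothesis test above can be sketched minimally as follows, assuming the speaker and impostor (background) models are single diagonal-covariance Gaussians; real systems use GMMs with many components and score normalization (znorm/tnorm), so all names and parameters here are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=1)

def verify(frames, spk_mean, spk_var, ubm_mean, ubm_var, threshold=0.0):
    """Average log-likelihood ratio L over the utterance; accept H1 if L > threshold."""
    L = np.mean(log_gauss(frames, spk_mean, spk_var)
                - log_gauss(frames, ubm_mean, ubm_var))
    return L, L > threshold
```

The threshold θ sets the operating point: raising it trades false accepts for false rejects, exactly the high-security vs. high-convenience trade-off on the DET curve.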
• Statistic computed on the test utterance S as a log-likelihood ratio:
  L = log [ Likelihood(S came from the speaker model) / Likelihood(S did not come from the speaker model) ]
  – Feature extraction feeds both the speaker model (+) and the impostor model (-); their difference gives L
  – Decision: L > θ → accept; L < θ → reject

Decision Making
• Identification: pick the model (of N) with the best score
• Verification: the usual approach is via likelihood ratio tests, hypothesis testing, i.e.:
  – By Bayes: P(target|x)/P(nontarget|x) = P(x|target)P(target) / [P(x|nontarget)P(nontarget)]
  – Accept if > threshold, reject otherwise
• Can’t sum over all non-target talkers -- the whole world, for speaker verification! Instead:
  – Use “cohorts” (a collection of impostors), or
  – Train a “universal”/“world”/“background” model (speaker-independent, trained on many speakers)

Spectral Based Approach
• Traditional speaker recognition systems use:
  – Cepstral features
  – Gaussian Mixture Models (GMMs)
• Feature extraction: sliding window → Fourier transform → magnitude → log → cosine transform
• The speaker model is adapted from a background model; the score is the log-likelihood ratio
• D.A. Reynolds, T.F. Quatieri, R.B. Dunn.
“Speaker Verification using Adapted Gaussian Mixture Models,” Digital Signal Processing, 10(1-3), January/April/July 2000.

Features: Levels of Information
• Hierarchy of perceptual cues, from high-level cues (learned behaviors: socio-economic status, education, place of birth, parental influence) to low-level cues (physical characteristics: anatomical structure of the vocal apparatus):
  – Semantic: semantics, idiolects, pronunciations, idiosyncrasies
  – Dialogic/Idiolectal: personality type
  – Prosodic: prosody, rhythm, speed, intonation, volume modulation
  – Phonetic: acoustic aspects of speech (nasal, deep, breathy, rough)
  – Spectral

Low Level Features
• Speech production model: source-filter interaction
• Anatomical structure (vocal tract/glottis) is conveyed in the speech spectrum: glottal pulses → vocal tract → speech signal

Word N-gram Features
• Idea (Doddington 2001):
  – Word usage can be idiosyncratic to a speaker
  – Model speakers by relative frequencies of word N-grams (e.g., I_shall 0.002, I_think 0.025, I_would 0.012, ...)
  – Reflects vocabulary AND grammar
  – Cf. similar approaches for authorship and plagiarism detection on text documents
  – First (unpublished) use in speaker recognition: Heck et al. (1998)
• Implementation:
  – Get 1-best word recognition output
  – Extract N-gram frequencies
  – Model the likelihood ratio, OR model the frequency vectors by SVM

Phone N-gram Features
• Model the pattern of phone usage or “short-term pronunciation” for a speaker
• Open-loop phone recognition produces a phone lattice; phone relative N-gram frequencies
  (e.g., jh 0.0254, zh eh 0.0068, k 0.0198) feed a Support Vector Machine (SVM), which outputs a score

MLLR Transform Vectors as Features
• Per phone class, MLLR transforms map speaker-independent models to speaker-dependent ones
• The MLLR transforms themselves are used as features

Models
• HMMs:
  – text-dependent (could use whole word/phone models)
  – prompted (phone models)
  – text-independent (use LVCSR) -- or GMMs!
• Templates: DTW (if text-dependent system)
• Nearest neighbor: frame level, training data as the “model”, non-parametric
• Neural nets: train explicitly discriminating models
• SVMs

Speaker Models -- HMM
• Speaker models (voiceprints) represent the voice biometric in compact and generalizable form
• Modern speaker verification systems use Hidden Markov Models (HMMs)
  – HMMs are statistical models of how a speaker produces sounds (e.g., h-a-d)
  – HMMs represent the underlying statistical variations in the speech state (e.g., phoneme) and the temporal changes of speech between the states
  – Fast training algorithms (EM) exist for HMMs, with guaranteed convergence properties.
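The “adapt the background model” step of the GMM-UBM approach cited above (Reynolds et al. 2000) can be sketched as mean-only MAP adaptation; this is a simplified illustration under that paper's general recipe, with a hypothetical relevance factor and no weight/variance adaptation.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, r=16.0):
    """Mean-only MAP adaptation of a GMM-UBM (sketch after Reynolds et al. 2000).
    frames: (T, D); weights: (M,); means, variances: (M, D); r: relevance factor."""
    # Per-frame, per-component log-likelihoods under diagonal Gaussians
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
             - 0.5 * np.sum((frames[:, None, :] - means[None]) ** 2
                            / variances[None], axis=2))
    # Posteriors (responsibilities) gamma[t, m]
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    n = gamma.sum(axis=0)                                   # soft counts per component
    Ex = gamma.T @ frames / np.maximum(n, 1e-10)[:, None]   # per-component data means
    alpha = (n / (n + r))[:, None]                          # adaptation coefficient
    return alpha * Ex + (1 - alpha) * means                 # interpolate toward the UBM
```

Components that see many enrollment frames (large n) move toward the speaker's data; components that see few stay at the UBM, which is what makes the adapted model robust with little enrollment speech.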
Speaker Models -- HMM/GMM
• The form of the HMM depends on the application:
  – Fixed phrase (“Open sesame”): word/phrase models
  – Prompted phrases/passwords: phoneme models (/s/ /i/ /x/)
  – Text-independent (general speech): single-state HMM, i.e., a GMM

Word N-gram Modeling: Likelihood Ratios
• Model the N-gram token log-likelihood ratio
• Numerator: speaker language model estimated from enrollment data
• Denominator: background language model estimated from a large speaker population
• Normalize by the token count N:

  Score = (1/N) Σ_j log [ L_Speaker(j) / L_Background(j) ]

• Choose all reasonably frequent bigrams or trigrams, or a weighted combination of both

Speaker Recognition with SVMs
• Each speech sample (training or test) generates a point in a derived feature space
• The SVM is trained to separate the target sample from the impostor (= UBM) samples
• Scores are computed as the Euclidean distance from the decision hyperplane to the test sample point
• SVM training is biased against misclassifying positive examples (typically very few, often just 1)

Feature Transforms for SVMs
• SVMs have been a boon for higher-level (as well as cepstral) speaker recognition research -- they allow great flexibility in the choice of features
• However, we need a “sequence kernel”
• Dominant approach: transform the variable-length feature stream into a fixed, finite-dimensional feature space
• Then use a linear kernel
• All the action is in the feature transform!

Combination of Systems
• Systems work best in combination, especially ones using “higher level” features
• Need to estimate the optimal combination weight.
E.g., use a neural network
• Combination weights trained on a held-out development dataset
• Subsystems (GMM, MLLR, Word-HMM, Phone-N-gram) feed a neural network combiner

Variability: The Achilles Heel...
• Variability (extrinsic & intrinsic) in the spectrum can cause error
• The data of focus has mainly been extrinsic
• “Channel” mismatch:
  – Microphone: carbon-button, hands-free, ...
  – Acoustic environment: office, car, airport, ...
  – Transmission channel: landline, cellular, VoIP, ...
• Error rates: mismatched handsets were a factor of 20 worse than matched handsets in ’96, and a factor of 2.5 worse by ’99; compensation techniques help reduce error
• Three compensation approaches:
  – Feature-based
  – Model-based
  – Score-based

NIST Speaker Verification Evaluations
• Annual NIST evaluations of speaker verification technology (since 1996)
• Aim: provide a common paradigm for comparing technologies
• Focus: conversational telephone speech (text-independent)
• The Linguistic Data Consortium provides the data; NIST coordinates the evaluation; technology developers evaluate on the common task, compare, and improve

The NIST Evaluation Task
• Conversational telephone speech, interview
• Landline, cellular, hands-free, multiple mics in room
• 5 min of conversation between two speakers
• Various conditions, e.g.:
  – Training: 8, 1, or another number of conversation sides
  – Test: 1 conversation side, 30 secs, etc.
• Evaluation metrics:
  – Equal Error Rate (EER)
  – Decision Cost Function (DCF), with (C_Miss, C_FalseAlarm, P_Target) = (10, 1, 0.01)

The End
• What’s one interesting thing you learned today that you may share with a friend over dinner conversation?

Backup slides

Word Conditional Models -- Example
• Boakye et al.
(2004)
• 19 words and bigrams:
  – Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}
  – Filled pauses: {um, uh}
  – Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}
• Trained whole-word HMMs, instead of GMMs, to model the evolution of speech in time
• Combines well with the low-level (i.e., cepstral GMM) system, especially with more training data

Phone N-Grams -- Example
• Idea (Hatch et al., ’05): model the pattern of phone usage or “short-term pronunciation” for a speaker
  – Use open-loop phone recognition to obtain phone hypotheses
  – Create models of the relative frequencies of the speaker’s phone N-grams vs. “others”
  – Use an SVM for modeling
• Combines well, especially with increased data
• Works across languages
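The phone-N-gram feature extraction described above can be sketched as follows: relative bigram frequencies computed from a 1-best phone sequence, scored against a linear weight vector. The linear scoring stands in for a trained linear-kernel SVM, and all weights in the example are hypothetical.

```python
from collections import Counter

def phone_bigram_freqs(phones):
    """Relative frequencies of phone bigrams from a 1-best phone sequence."""
    bigrams = list(zip(phones, phones[1:]))
    counts = Counter(bigrams)
    total = len(bigrams)
    return {bg: c / total for bg, c in counts.items()}

def linear_score(freqs, weights, bias=0.0):
    """Dot product with a (hypothetical) trained linear-SVM weight vector."""
    return sum(f * weights.get(bg, 0.0) for bg, f in freqs.items()) + bias
```

For example, the sequence ["jh", "eh", "k", "jh", "eh"] yields 4 bigrams, of which ("jh", "eh") occurs twice, so its relative frequency is 0.5; in a real system the frequency vector would be mapped into the SVM feature space and scored against the target-vs.-background hyperplane.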