Intro to Speaker Verification (and Diarization)

An Intro to Speaker Recognition
Nikki Mirghafori
Acknowledgment: some slides borrowed from the Heck & Reynolds tutorial, and A. Stolcke.
Today’s class
• Interactive
• Measures of success for today:
  • You talk at least as much as I do
  • You learn and remember the basics
  • You feel you can do this stuff
  • We all have fun with the material!
A 10-minute “Project Design”
• You are experts with different backgrounds. Your previous startup companies were wildly successful. A large VC firm in the valley wants to fund YOUR next creation, as long as the project is in speaker recognition.
• The VC funding is yours, if you come up with some kind of a coherent plan/list of issues:
  • What is your proposed application?
  • What will be the sources of error and variability, i.e., technology challenges?
  • What types of features will you use?
  • What sorts of statistical modeling tools/techniques?
  • What will be your data needs?
  • Any other issues you can think of along your path?
Extracting Information from Speech
• Goal: automatically extract information transmitted in the speech signal
• What's noise? What's signal?
• The tasks are orthogonal in many ways
• They use many of the same models and tools
[Figure: one speech signal feeds three tasks: Speech Recognition → words ("How are you?"), Language Recognition → language name (English), Speaker Recognition → speaker name (James Wilson)]
Speaker Recognition Applications
• Access control
  – Physical facilities
  – Data and data networks
• Transaction authentication
  – Telephone credit card purchases
  – Bank wire transfers
  – Fraud detection
• Monitoring
  – Remote time and attendance logging
  – Home parole verification
• Information retrieval
  – Customer information for call centers
  – Audio indexing (speech skimming device)
  – Personalization
• Forensics
  – Voice sample matching
Tasks
• Identification vs. verification
• Closed set vs. open set identification
• Also, segmentation, clustering, tracking...
Identification
[Figure: closed-set speaker identification: test speech is scored against a speaker model database to answer "Whose voice is it?"; the answer must be one of the enrolled speakers]
Identification
[Figure: open-set speaker identification: as above, but "none of the above" is also a possible answer]
Verification/Authentication/Detection
[Figure: speaker verification: a claimant asserts an identity ("It's me!"), the test speech is scored against the claimed speaker's model, and the system answers Yes/No. Unlike identification, verification requires a claimant ID]
Speech Modalities
• Text-dependent recognition
– Recognition system knows text spoken by person
– Examples: fixed phrase, prompted phrase
– Used for applications with strong control over user input
– Knowledge of spoken text can improve system performance
• Text-independent recognition
– Recognition system does not know text spoken by person
– Examples: User selected phrase, conversational speech
– Used for applications with less control over user input
– More flexible system but also more difficult problem
– Speech recognition can provide knowledge of spoken text
– Text-constrained recognition: exercise for the reader (see next slide)
Text-constrained Recognition
• Basic idea: build speaker models for words rich in speaker information
• Example: "What time did you say? um... okay, I_think that's a good plan." (tokens such as um, okay, and I_think carry the speaker information)
• A text-dependent strategy in a text-independent context
Voice as a biometric
• Biometric: a human-generated signal or attribute for authenticating a person's identity
• Voice is a popular biometric:
  – natural signal to produce
  – does not require a specialized input device
  – ubiquitous: telephones and microphone-equipped PCs
• Voice biometric can be combined with other forms of security:
  – Something you have: e.g., badge
  – Something you know: e.g., password
  – Something you are: e.g., voice
[Figure: Venn diagram of Have/Know/Are: combining all three gives the strongest security]
How to build a system?
• Feature choices: low level (MFCC, PLP, LPC, F0, ...) and high level (words, phones, prosody, ...)
• Types of models: HMM, GMM, Support Vector Machines (SVM), DTW, Nearest Neighbor, Neural Nets
• Making decisions: log likelihood ratio thresholds; threshold setting for the desired operating point
• Other issues: normalization (znorm, tnorm), optimal data selection to match expected conditions, channel variability, noise, etc.
Verification Performance
• There are many factors to consider in the design of an evaluation of a speaker verification system:
  – Speech quality: channel and microphone characteristics; noise level and type; variability between enrollment and verification speech
  – Speech modality: fixed/prompted/user-selected phrases; free text
  – Speech duration: duration and number of sessions of enrollment and verification speech
  – Speaker population: size and composition
• Most importantly: the evaluation data and design should match the target application domain of interest
Verification Performance
[Figure: DET curves (probability of false reject in % vs. probability of false accept in %) across a range of conditions, roughly from hardest to easiest:
  – Text-independent (read sentences): military radio data, multiple radios & microphones, moderate amount of training data
  – Text-independent (conversational): telephone data, multiple microphones, moderate amount of training data
  – Text-dependent (combinations): telephone data, multiple microphones, small amount of training data
  – Text-dependent (digit strings): clean data, single microphone, large amount of train/test speech]
Verification Performance
[Figure: example DET curve with Equal Error Rate (EER) = 1%. The application operating point depends on the relative costs of the two error types:
  – High security (e.g., wire transfer): false acceptance is very costly; users may tolerate rejections for security
  – Balance: operate near the EER
  – High convenience (e.g., customization): false rejections alienate customers; any customization is beneficial]
Human vs. Machine
• Motivation for comparing human to machine: evaluating speech coders and potential forensic applications
• Schmidt-Nielsen and Crystal compared humans against a NIST evaluation system (DSP Journal, January 2000):
  – Same amount of training data
  – Used 3-sec conversational utterances from telephone speech
  – Matched handset-type tests: humans about 15% worse
  – Mismatched handset-type tests: humans about 44% better
Features
• Desirable attributes of features for an automatic system (Wolf '72):
  – Practical: occur naturally and frequently in speech; easily measurable
  – Robust: do not change over time or with the speaker's health; not affected by reasonable background noise; do not depend on specific transmission characteristics
  – Secure: not subject to mimicry
• No feature has all these attributes
Training & Test Phases
[Figure: the two phases of a verification system:
  – Enrollment phase: training speech for each speaker → feature extraction → model training → a model for each speaker
  – Recognition phase: test speech with a claim ("It's me!") → feature extraction → verification decision against the claimed speaker's model → accepted or rejected]
Decision making
Verification decision approaches have roots in signal detection theory.
• 2-class hypothesis test:
  – H0: the speaker is an impostor
  – H1: the speaker is indeed the claimed speaker
• Statistic computed on test utterance S as a log likelihood ratio:

  L = log [ Likelihood(S came from speaker model) / Likelihood(S did not come from speaker model) ]

  Decision: accept if L > θ, reject if L < θ
[Figure: S → feature extraction → scored against the speaker model (+) and an impostor model (−); the difference L is compared against the threshold θ]
Decision making
• Identification: pick the model (of N) with the best score
• Verification: the usual approach is a likelihood ratio test, i.e., hypothesis testing:
  • By Bayes: P(target|x) / P(nontarget|x) = P(x|target) P(target) / [ P(x|nontarget) P(nontarget) ]
  • Accept if the ratio exceeds a threshold, reject otherwise
  • We can't sum over all non-target talkers (the whole world!) for SV, so instead:
    • Use "cohorts" (a collection of impostors), or
    • Train a "universal"/"world"/"background" model (speaker-independent, trained on many speakers)
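To make the ratio test concrete, here is a minimal sketch of background-model likelihood-ratio scoring. It assumes scikit-learn's GaussianMixture (the slides don't prescribe a toolkit), and random arrays stand in for cepstral feature frames:

```python
# Sketch: likelihood-ratio scoring against a speaker GMM and a background GMM.
# All data here is fabricated; rows are frames, columns are cepstral coefficients.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
speaker_feats = rng.normal(0.5, 1.0, size=(2000, 13))      # enrollment speech
background_feats = rng.normal(0.0, 1.0, size=(10000, 13))  # many other speakers

speaker_gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(speaker_feats)
ubm = GaussianMixture(n_components=8, covariance_type="diag").fit(background_feats)

test_feats = rng.normal(0.5, 1.0, size=(300, 13))  # test utterance
# score() returns the average per-frame log likelihood, so the difference
# is the normalized log likelihood ratio L, compared against threshold theta.
L = speaker_gmm.score(test_feats) - ubm.score(test_feats)
theta = 0.0
print("accept" if L > theta else "reject", L)
```

In a full GMM-UBM system (next slides) the speaker model would be MAP-adapted from the background model rather than trained from scratch.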
Spectral Based Approach
• Traditional speaker recognition systems use:
  • Cepstral features
  • Gaussian Mixture Models (GMMs)
[Figure: feature extraction (sliding window → Fourier transform → magnitude → log → cosine transform) feeds a speaker model adapted from a background model; the output is a log likelihood ratio]
D.A. Reynolds, T.F. Quatieri, R.B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, 10(1-3), January/April/July 2000
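A minimal numpy/scipy sketch of the front end in the block diagram (sliding window → FFT → magnitude → log → cosine transform). A production MFCC front end would also apply pre-emphasis and a mel filterbank before the log, which are omitted here:

```python
# Sketch of the cepstral front end: window -> FFT -> magnitude -> log -> DCT.
import numpy as np
from scipy.fftpack import dct

def cepstral_features(signal, sr=8000, frame_ms=25, hop_ms=10, n_ceps=13):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    feats = []
    for start in range(0, len(signal) - frame, hop):        # sliding window
        windowed = signal[start:start + frame] * np.hamming(frame)
        magnitude = np.abs(np.fft.rfft(windowed))           # Fourier transform + magnitude
        log_spectrum = np.log(magnitude + 1e-10)            # log compression
        feats.append(dct(log_spectrum, norm="ortho")[:n_ceps])  # cosine transform
    return np.array(feats)

feats = cepstral_features(np.random.randn(8000))  # one second of fake audio
print(feats.shape)                                # (n_frames, n_ceps)
```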
Features: Levels of Information
[Figure: hierarchy of perceptual cues, from high-level cues (learned behaviors) down to low-level cues (physical characteristics): Semantic, Dialogic, Idiolectal, Phonetic, Prosodic, Spectral.
  – High-level: semantics, idiolect, pronunciations, idiosyncrasies; shaped by socio-economic status, education, and place of birth
  – Mid-level: prosody, rhythm, speed, intonation, volume modulation; shaped by personality type and parental influence
  – Low-level: acoustic aspects of speech (nasal, deep, breathy, rough); determined by the anatomical structure of the vocal apparatus]
Low-level features
• Speech production model: source-filter interaction
• The anatomical structure (vocal tract/glottis) is conveyed in the speech spectrum
[Figure: glottal pulses (source) → vocal tract (filter) → speech signal]
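A toy illustration of the source-filter idea: an impulse train at F0 stands in for the glottal pulses, and a hypothetical all-pole filter with two resonances stands in for the vocal tract (the frequencies and pole radius are made up for the example):

```python
# Toy source-filter synthesis: glottal pulse train filtered by an all-pole
# "vocal tract" filter, as in the figure above.
import numpy as np
from scipy.signal import lfilter

sr = 8000
f0 = 100                                 # glottal pulse rate (Hz)
source = np.zeros(sr)                    # one second of excitation
source[::sr // f0] = 1.0                 # impulse train at F0

# Two resonances standing in for formants (hypothetical values).
poles = [0.97 * np.exp(2j * np.pi * f / sr) for f in (500, 1500)]
roots = sum(([p, np.conj(p)] for p in poles), [])  # conjugate pairs
a = np.poly(roots).real                  # denominator (all-pole) coefficients
speech = lfilter([1.0], a, source)       # speech = source through vocal tract
```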
Word N-gram Features
Idea (Doddington 2001):
• Word usage can be idiosyncratic to a speaker
• Model speakers by relative frequencies of word N-grams
• Reflects vocabulary AND grammar
• Cf. similar approaches for authorship and plagiarism detection on text documents
• First (unpublished) use in speaker recognition: Heck et al. (1998)
Implementation (see the sketch below):
• Get 1-best word recognition output
• Extract N-gram frequencies, e.g., I_shall 0.002, I_think 0.025, I_would 0.012, ...
• Model the likelihood ratio, OR
• Model frequency vectors by SVM
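A minimal sketch of the frequency-extraction step, using plain Python on a fake 1-best transcript:

```python
# Sketch: relative word-bigram frequencies from 1-best recognizer output.
from collections import Counter

def bigram_freqs(words):
    bigrams = ["_".join(b) for b in zip(words, words[1:])]
    counts = Counter(bigrams)
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

hyp = "i think that is a good plan i think".split()  # fake 1-best output
print(bigram_freqs(hyp))  # e.g., {'i_think': 0.25, ...}
```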
Phone N-gram features
Model the pattern of phone usage, or "short-term pronunciation," for a speaker.
[Figure: open-loop phone recognition produces a phone lattice; relative phone n-gram frequencies (e.g., jh 0.0254, zh eh 0.0068, k 0.0198) are collected into vectors and fed to a Support Vector Machine (SVM), which outputs a score]
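A sketch of how recognizer output might be turned into a fixed-length relative-frequency vector for the SVM; the n-gram vocabulary here is hypothetical, seeded with the slide's example entries:

```python
# Sketch: phone hypotheses -> relative n-gram frequency vector over a fixed vocabulary.
import numpy as np
from collections import Counter

VOCAB = ["jh", "zh eh", "k"]  # toy n-gram vocabulary (from the slide's example)

def phone_ngram_vector(phones):
    grams = Counter()
    for n in (1, 2):  # unigrams and bigrams
        grams.update(" ".join(phones[i:i + n]) for i in range(len(phones) - n + 1))
    total = sum(grams.values()) or 1
    return np.array([grams[g] / total for g in VOCAB])  # relative frequencies

print(phone_ngram_vector("jh ih s t zh eh k".split()))
```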
MLLR transform vectors as features
[Figure: MLLR transforms map speaker-independent phone-class models (e.g., phone class A, phone class B) to speaker-dependent ones; the transform parameters themselves are used as features: MLLR transforms = features]
Models
• HMMs:
  • text-dependent (could use whole word/phone models)
  • prompted (phone models)
  • text-independent (use LVCSR), or GMMs!
• Templates: DTW (if text-dependent system)
• Nearest neighbor: frame level, training data as the "model", non-parametric
• Neural nets: train explicitly discriminating models
• SVMs
Speaker Models -- HMM
• Speaker models (voiceprints) represent the voice biometric in a compact and generalizable form
• Modern speaker verification systems use Hidden Markov Models (HMMs)
  – HMMs are statistical models of how a speaker produces sounds (e.g., the word h-a-d)
  – HMMs represent the underlying statistical variations within a speech state (e.g., a phoneme) and the temporal changes of speech between states
  – Fast training algorithms (EM) with guaranteed convergence properties exist for HMMs
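As an illustration only (the slides don't name a toolkit), a per-speaker HMM can be trained with EM in a few lines using the third-party hmmlearn package:

```python
# Sketch: EM (Baum-Welch) training of a small HMM on stand-in feature frames,
# assuming the third-party hmmlearn package.
import numpy as np
from hmmlearn.hmm import GaussianHMM

frames = np.random.randn(500, 13)            # fabricated cepstral frames
hmm = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
hmm.fit(frames)                              # EM training
print(hmm.score(np.random.randn(100, 13)))   # log likelihood of a test segment
```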
Speaker Models – HMM/GMM
The form of the HMM depends on the application:
• Fixed phrase ("Open sesame"): word/phrase models
• Prompted phrases/passwords: phoneme models (e.g., /s/, /i/)
• Text-independent (general speech): single-state HMM (i.e., a GMM)
Word N-gram Modeling: Likelihood Ratios
• Model the per-token log likelihood ratio
• Numerator: speaker language model estimated from enrollment data
• Denominator: background language model estimated from a large speaker population
• Normalize by token count:

  Score = (1/N) Σ_j log [ L_Speaker(j) / L_Background(j) ]

  where j ranges over the N word-N-gram tokens in the test utterance
• Choose all reasonably frequent bigrams or trigrams, or a weighted combination of both
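A sketch of the score computation with hypothetical per-token language-model probabilities:

```python
# Sketch: normalized token log likelihood ratio score from two language models,
# each a dict mapping token -> probability (values fabricated).
import math

def ngram_llr_score(tokens, speaker_lm, background_lm):
    llrs = [math.log(speaker_lm[t] / background_lm[t])
            for t in tokens if t in speaker_lm and t in background_lm]
    return sum(llrs) / len(llrs) if llrs else 0.0  # normalize by token count

speaker_lm    = {"i_think": 0.025, "i_would": 0.012}
background_lm = {"i_think": 0.010, "i_would": 0.015}
print(ngram_llr_score(["i_think", "i_would", "i_think"], speaker_lm, background_lm))
```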
Speaker Recognition with SVMs
• Each speech sample (training or test) generates a point in a derived feature space
• The SVM is trained to separate the target sample from the impostor (= UBM) samples
• Scores are computed as the Euclidean distance from the decision hyperplane to the test sample point
• SVM training is biased against misclassifying positive examples (typically very few, often just one)
[Figure: background samples on one side of the hyperplane, the target sample on the other; a test sample is scored by its distance from the hyperplane]
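A sketch of the biased training using scikit-learn (an assumed toolkit), with the single positive example up-weighted via class_weight so the SVM is heavily penalized for misclassifying it:

```python
# Sketch: target-vs-background SVM with a single, heavily weighted positive example.
import numpy as np
from sklearn.svm import SVC

background = np.random.randn(200, 50)        # fabricated impostor/UBM vectors
target = np.random.randn(1, 50) + 0.5        # one fabricated enrollment vector
X = np.vstack([background, target])
y = np.array([0] * len(background) + [1])

svm = SVC(kernel="linear", class_weight={1: 100.0})  # bias against missing the target
svm.fit(X, y)
# decision_function gives the signed margin; divide by ||w|| for Euclidean distance.
score = svm.decision_function(np.random.randn(1, 50))
print(score)
```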
Feature Transforms for SVMs
• SVMs have been a boon for higher-level (as well as cepstral) speaker recognition research: they allow great flexibility in the choice of features
• However, we need a "sequence kernel"
• Dominant approach: transform the variable-length feature stream into a fixed, finite-dimensional feature space
• Then use a linear kernel
• All the action is in the feature transform!
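A minimal example of such a transform: stack the per-dimension mean and standard deviation so that utterances of any length map to the same fixed dimension, after which a plain linear kernel applies. (This particular transform is an illustration, not the specific one used in the systems above.)

```python
# Sketch: variable-length feature stream -> fixed-dimensional vector -> linear kernel.
import numpy as np

def sequence_to_vector(frames):
    # frames: (n_frames, n_dims); output dimension is fixed at 2 * n_dims
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

utt_a = np.random.randn(420, 13)   # two utterances of different lengths
utt_b = np.random.randn(987, 13)
va, vb = sequence_to_vector(utt_a), sequence_to_vector(utt_b)
linear_kernel = va @ vb            # all the "action" was in the transform
```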
Combination of Systems
• Systems work best in combination, especially ones using "higher-level" features
• Need to estimate optimal combination weights, e.g., using a neural network
• Combination weights are trained on a held-out development dataset
[Figure: scores from GMM, MLLR, word-HMM, and phone-N-gram systems feed a neural network combiner]
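A sketch of a neural-network combiner trained on held-out development scores, with scikit-learn's MLPClassifier as a stand-in and all scores fabricated:

```python
# Sketch: fusing per-system scores with a small neural network.
import numpy as np
from sklearn.neural_network import MLPClassifier

# Columns: scores from the GMM, MLLR, word-HMM, and phone-n-gram systems.
dev_scores = np.random.randn(1000, 4)
dev_labels = np.random.randint(0, 2, 1000)   # 1 = target trial, 0 = impostor

combiner = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500)
combiner.fit(dev_scores, dev_labels)         # weights learned on held-out dev data
fused = combiner.predict_proba(np.random.randn(5, 4))[:, 1]  # combined scores
```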
Variability: The Achilles Heel...
• Variability (extrinsic & intrinsic) in the spectrum can cause errors
• The focus has mainly been on extrinsic variability
• "Channel" mismatch:
  • Microphone: carbon-button, hands-free, ...
  • Acoustic environment: office, car, airport, ...
  • Transmission channel: landline, cellular, VoIP, ...
• Three compensation approaches:
  • Feature-based
  • Model-based
  • Score-based
[Figure: error rates for matched vs. mismatched handsets. Compensation techniques help reduce error: mismatched handsets went from a factor of 20 worse than matched ('96) to a factor of 2.5 worse ('99)]
NIST Speaker Verification Evaluations
• Annual NIST evaluations of speaker verification technology (since 1996)
• Aim: provide a common paradigm for comparing technologies
• Focus: conversational telephone speech (text-independent)
[Figure: the evaluation cycle: the Linguistic Data Consortium (data provider) supplies data, NIST (evaluation coordinator) runs the comparison of technologies on a common task, and technology developers evaluate and improve]
The NIST Evaluation Task
• Conversational telephone speech, interview
• Landline, cellular, hands-free, multiple mics in room
• 5 minutes of conversation between two speakers
• Various conditions, e.g.:
  • Training: 8, 1, or some other number of conversation sides
  • Test: 1 conversation side, 30 secs, etc.
• Evaluation metrics (see the sketch below):
  • Equal Error Rate (EER)
  • Decision Cost Function (DCF): C_Miss · P_Miss · P_Target + C_FA · P_FA · (1 − P_Target), with (C_Miss, C_FA, P_Target) = (10, 1, 0.01)
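A sketch of computing both metrics from target and impostor trial scores (the scores here are fabricated):

```python
# Sketch: EER and minimum DCF from target and impostor trial scores.
import numpy as np

def eer_and_min_dcf(tgt, imp, c_miss=10.0, c_fa=1.0, p_tgt=0.01):
    thresholds = np.sort(np.concatenate([tgt, imp]))
    p_miss = np.array([(tgt < t).mean() for t in thresholds])   # false rejects
    p_fa = np.array([(imp >= t).mean() for t in thresholds])    # false accepts
    eer_idx = np.argmin(np.abs(p_miss - p_fa))                  # where curves cross
    dcf = c_miss * p_miss * p_tgt + c_fa * p_fa * (1 - p_tgt)   # NIST cost function
    return (p_miss[eer_idx] + p_fa[eer_idx]) / 2, dcf.min()

tgt = np.random.randn(500) + 2.0   # fabricated target-trial scores
imp = np.random.randn(5000)        # fabricated impostor-trial scores
eer, min_dcf = eer_and_min_dcf(tgt, imp)
print(f"EER = {eer:.3f}, minDCF = {min_dcf:.4f}")
```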
The End
• What's one interesting thing you learned today that you might share with a friend over dinner conversation?
Backup slides
Word Conditional Models: example
• Boakye et al. (2004)
• 19 words and bigrams:
  • Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}
  • Filled pauses: {um, uh}
  • Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}
• Trained whole-word HMMs, instead of GMMs, to model the evolution of speech in time
• Combines well with the low-level (i.e., cepstral GMM) system, especially with more training data
Phone N-Grams -- example
• Idea (Hatch et al., '05): model the pattern of phone usage, or "short-term pronunciation," for a speaker
  • Use open-loop phone recognition to obtain phone hypotheses
  • Create models of the relative frequencies of the speaker's phone n-grams vs. "others"
  • Use an SVM for modeling
• Combines well, especially with increased data
• Works across languages