Audio Signal Recognition for Speech, Music, and Environmental Sounds

1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/
Dan Ellis
Audio Signal Recognition
2003-11-13 - 1 / 25
1 Pattern Recognition for Sounds
• Pattern recognition is abstraction
  - continuous signal → discrete labels
  - an essential part of understanding?
    "information extraction"
• Sound is a challenging domain
  - sounds can be highly variable
  - human listeners are extremely adept
Pattern classification
• Classes are defined as distinct regions
  in some feature space
  - e.g. formant frequencies to define vowels
[Figure: F1/F2 formant trajectories (Hz) over time (s), with vowel tokens "ay" and "ao" and unknown observations "x"]
• Issues
  - finding segments to classify
  - transforming to an appropriate feature space
  - defining the class boundaries
[Figure: Pols vowel formants: "u" (x), "o" (o), "a" (+), plotted as F1/Hz vs. F2/Hz, with a new observation x to classify]
Classification system parts
    Sensor → signal
    Pre-processing/segmentation (STFT, locate vowels) → segment
    Feature extraction (formant extraction) → feature vector
    Classification → class
    Post-processing (context constraints, costs/risk)
Feature extraction
• Feature choice is critical to performance
  - make important aspects explicit,
    remove irrelevant details
  - 'equivalent' representations
    can perform very differently in practice
  - major opening for domain knowledge
    ("cleverness")
• Mel-Frequency Cepstral Coefficients (MFCCs):
  ubiquitous speech features
  - DCT of log spectrum on 'auditory' scale
  - approximately decorrelated ...
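The "DCT of log spectrum" step can be sketched in a few lines of pure Python. This is a minimal illustration only: the mel filterbank itself is omitted, and the band energies below are made-up numbers, not real speech data.

```python
import math

def dct(logspec, n_coef):
    """DCT-II of a log-spectrum vector: the cepstral transform used for MFCCs."""
    n = len(logspec)
    return [sum(logspec[m] * math.cos(math.pi * k * (m + 0.5) / n)
                for m in range(n))
            for k in range(n_coef)]

# hypothetical log-energies from a 10-band mel ('auditory'-scale) filterbank
log_mel = [2.0, 2.5, 3.1, 2.8, 2.2, 1.9, 1.5, 1.2, 1.0, 0.9]
mfcc = dct(log_mel, 5)   # keep only the first few coefficients
```

Keeping only the low-order coefficients retains the smooth spectral envelope and discards fine detail, which is one reason the coefficients come out approximately decorrelated.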
[Figure: Mel Spectrogram (freq. channel vs. time/sec) and the corresponding MFCCs (cep. coef. vs. time/sec)]
Statistical Interpretation
• Observations are random variables
  whose distribution depends on the class:

    Class ωi (hidden, discrete) → Observation x (continuous)
    generative direction: p(x|ωi); inference direction: Pr(ωi|x)
• Source distributions p(x|ωi)
  - reflect variability in the features
  - reflect noise in the observation
  - generally have to be estimated from data
    (rather than known in advance)

[Figure: overlapping class-conditional densities p(x|ωi) for classes ω1..ω4 along x]
Priors and posteriors
• Bayesian inference can be interpreted as
  updating prior beliefs with new information, x:

    Pr(ωi|x) = Pr(ωi) · p(x|ωi) / Σj p(x|ωj)·Pr(ωj)

  where Pr(ωi) is the prior probability, p(x|ωi) the likelihood,
  the denominator Σj p(x|ωj)·Pr(ωj) = p(x) the 'evidence',
  and Pr(ωi|x) the posterior probability
• Posterior is prior scaled by likelihood
  & normalized by evidence (so Σ(posteriors) = 1)
• Minimize the probability of error by
  choosing the maximum a posteriori (MAP) class:

    ω̂ = argmax_ωi Pr(ωi|x)
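The MAP rule is short enough to write out directly. A minimal sketch, assuming 1-D Gaussian class models; the vowel labels, priors, means, and standard deviations below are invented for illustration:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Likelihood p(x|omega_i) under a 1-D Gaussian class model."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def map_classify(x, classes):
    """classes: {label: (prior, mu, sigma)}. Returns (MAP label, posteriors)."""
    # unnormalized posteriors: prior * likelihood
    scores = {w: pr * gauss_pdf(x, mu, sg) for w, (pr, mu, sg) in classes.items()}
    evidence = sum(scores.values())          # p(x) = sum_j p(x|w_j) Pr(w_j)
    posteriors = {w: s / evidence for w, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

# two hypothetical vowel classes defined by F1 (Hz): (prior, mean, std dev)
classes = {"u": (0.5, 350.0, 60.0), "a": (0.5, 750.0, 90.0)}
label, post = map_classify(500.0, classes)
```

Dividing by the evidence makes the posteriors sum to one, but the argmax is unchanged if that normalization is skipped.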
Practical implementation
• Optimal classifier is ω̂ = argmax_ωi Pr(ωi|x),
  but we don't know Pr(ωi|x)
•
So, model conditional distributions
p ( x ω i ) then use Bayes’ rule to find MAP class
    Labeled training examples {xn, ωxn}
      → sort according to class
      → estimate conditional pdf for each class, e.g. p(x|ω1)
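The sort-then-estimate training step can be sketched in pure Python, assuming 1-D Gaussian class-conditional models; the F1 values and vowel labels below are toy data, not a real corpus:

```python
import math

def fit_gaussian(xs):
    """ML estimate of (mu, sigma) for one class's training examples."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)

def train(labeled):
    """labeled: list of (x, omega). Sort by class, fit p(x|omega) per class."""
    by_class = {}
    for x, w in labeled:
        by_class.setdefault(w, []).append(x)
    n = len(labeled)
    # per-class prior Pr(omega) plus Gaussian likelihood parameters
    return {w: (len(xs) / n, *fit_gaussian(xs)) for w, xs in by_class.items()}

# hypothetical F1 measurements (Hz) labeled with vowel classes
data = [(310, "u"), (360, "u"), (340, "u"), (700, "a"), (760, "a"), (820, "a")]
models = train(data)
```

The class frequencies double as estimates of the priors, so the output plugs straight into a MAP decision rule.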
Gaussian models
• Model data distributions via a parametric model
  - assume a known form, estimate a few parameters
• E.g. Gaussian in 1 dimension:

    p(x|ωi) = 1/(√(2π)·σi) · exp( -1/2 · ((x - µi)/σi)² )

  with the leading factor normalizing the density
• For higher dimensions, need mean vector µi
  and d x d covariance matrix Σi
[Figure: 1-D Gaussian pdf and a 2-D Gaussian density surface]
• Fit more complex distributions with mixtures...
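The mixture idea can be sketched as a toy EM fit of a 2-component 1-D Gaussian mixture. Everything here is illustrative: the component count, initialization scheme, and data are all made up, and a real system would use many more safeguards.

```python
import math

def gpdf(x, mu, sg):
    return math.exp(-0.5 * ((x - mu) / sg) ** 2) / (math.sqrt(2 * math.pi) * sg)

def em_gmm2(xs, iters=50):
    """Fit a 2-component 1-D Gaussian mixture by EM (toy sketch)."""
    # crude initialization: split the sorted data in half
    xs = sorted(xs)
    half = len(xs) // 2
    comp = []  # [weight, mu, sigma] per component
    for part in (xs[:half], xs[half:]):
        mu = sum(part) / len(part)
        sg = max(1e-3, math.sqrt(sum((x - mu) ** 2 for x in part) / len(part)))
        comp.append([0.5, mu, sg])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w * gpdf(x, mu, sg) for w, mu, sg in comp]
            s = sum(p)
            resp.append([pi / s for pi in p])
        # M-step: re-estimate weights, means, variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            comp[k][0] = nk / len(xs)
            comp[k][1] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - comp[k][1]) ** 2 for r, x in zip(resp, xs)) / nk
            comp[k][2] = max(1e-3, math.sqrt(var))
    return comp

# bimodal toy data: two clusters that a single Gaussian would fit poorly
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.8, 5.1, 4.9]
mix = em_gmm2(data)
```

With enough components, a mixture like this can approximate the multimodal feature distributions that a single Gaussian cannot.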
Gaussian models for formant data
• Single Gaussians: a reasonable fit for this data
• Extrapolation of decision boundaries can be
  surprising
Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
- How it’s done
- What works, and what doesn’t
3 Other Audio Applications
4 Observations and Conclusions
How to recognize speech?
• Cross correlate templates?
  - waveform?
  - spectrogram?
  - time-warp problems
• Classify short segments as phones (or ...),
  handle time-warp later
  - model with slices of ~10 ms
  - pseudo-piecewise-stationary model of words:
[Figure: spectrogram (0-4000 Hz, ~0.45 s) of a word with piecewise-stationary phone labels: sil g w eh n sil]
Speech Recognizer Architecture
• Almost all current systems are the same:

    sound → Feature calculation → feature vectors
          → Acoustic classifier → phone probabilities
          → HMM decoder → phone / word sequence
          → Understanding / application...

  with models trained from DATA:
  - acoustic model parameters
  - word models (e.g. "sat" = s ah t)
  - language model (e.g. p("sat"|"the","cat"), p("saw"|"the","cat"))
• Biggest source of improvement is increase in
  training data
  - .. along with algorithms to take advantage
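The HMM decoder stage finds the most likely state path given the classifier's per-frame scores; the standard algorithm is Viterbi dynamic programming. A minimal sketch with a toy two-state left-to-right model; the transition values and frame log-likelihoods below are invented:

```python
import math

def viterbi(obs_logprob, log_trans, log_init):
    """Most likely state sequence through an HMM.

    obs_logprob[t][s]: log p(frame t | state s), e.g. from the acoustic classifier
    log_trans[s][s2]:  log transition probability s -> s2
    log_init[s]:       log initial-state probability
    """
    T, S = len(obs_logprob), len(log_init)
    score = [log_init[s] + obs_logprob[0][s] for s in range(S)]
    back = []
    for t in range(1, T):
        prev, new = [], []
        for s in range(S):
            best = max(range(S), key=lambda s0: score[s0] + log_trans[s0][s])
            prev.append(best)
            new.append(score[best] + log_trans[best][s] + obs_logprob[t][s])
        back.append(prev)
        score = new
    # trace back from the best final state
    path = [max(range(S), key=lambda s: score[s])]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return list(reversed(path))

# toy 2-state left-to-right model: state 0 then state 1 (hypothetical phones)
NEG = -1e9
log_trans = [[math.log(0.6), math.log(0.4)], [NEG, 0.0]]
log_init = [0.0, NEG]
# frames favor state 0 first, then state 1
obs = [[-0.1, -3.0], [-0.2, -2.5], [-2.8, -0.1], [-3.0, -0.2]]
path = viterbi(obs, log_trans, log_init)   # -> [0, 0, 1, 1]
```

Working in log probabilities turns the products of the HMM into sums and avoids numerical underflow on long utterances.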
Speech: Progress
• Annual NIST evaluations

  [Figure: benchmark word error rates falling from roughly 30% toward 3% over 1990-2005]

  - steady progress (?), but still order(s) of
    magnitude worse than human listeners
Speech: Problems
• Natural, spontaneous speech is weird!
  → coarticulation, deletions, disfluencies
  - is word transcription even a sensible approach?
• Other major problems
  - speaking style, rate, accent
  - environment / background...
Speech: What works, what doesn’t
• What works:
  Techniques:
  - MFCC features + GMM/HMM systems
    trained with Baum-Welch (EM)
  - using lots of training data
  Domains:
  - controlled, low-noise environments
  - constrained, predictable contexts
  - motivated, co-operative users
• What doesn't work:
  Techniques:
  - rules based on 'insight'
  - perceptual representations
    (except when they do...)
  Domains:
  - spontaneous, informal speech
  - unusual accents, voice quality, speaking style
  - variable, high-noise background / environment
Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
- Meeting recordings
- Alarm sounds
- Music signal processing
4 Observations and Conclusions
Other Audio Applications:
ICSI Meeting Recordings corpus
• Real meetings, 16-channel recordings, 80 hrs
  - released through NIST/LDC
• Classification example: detecting emphasized utterances
  based on f0 contour (Kennedy & Ellis '03)
  - per-speaker normalized f0 as a one-dimensional
    feature → simple threshold classification
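The normalize-then-threshold idea can be sketched in a few lines. This is only an illustration of the scheme: the speaker IDs, f0 values, and the threshold of one standard deviation are invented here, not the settings of the Kennedy & Ellis system.

```python
def zscore_per_speaker(frames):
    """frames: list of (speaker, f0). Normalize f0 within each speaker."""
    by_spk = {}
    for spk, f0 in frames:
        by_spk.setdefault(spk, []).append(f0)
    stats = {}
    for spk, xs in by_spk.items():
        mu = sum(xs) / len(xs)
        sd = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
        stats[spk] = (mu, sd)
    return [(spk, (f0 - stats[spk][0]) / stats[spk][1]) for spk, f0 in frames]

def emphasized(frames, thresh=1.0):
    """Flag frames whose speaker-normalized f0 exceeds a simple threshold."""
    return [spk for spk, z in zscore_per_speaker(frames) if z > thresh]

# toy data: speaker s1 has one high-pitch (emphasized) frame, s2 is flat
frames = [("s1", 100), ("s1", 105), ("s1", 95), ("s1", 180),
          ("s2", 200), ("s2", 200), ("s2", 200), ("s2", 200)]
flagged = emphasized(frames)
```

Normalizing per speaker is what makes a single global threshold meaningful across talkers with very different pitch ranges.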
[Figure: per-speaker f0 distributions, Speaker 1 (55-440 Hz) and Speaker 2 (110-1760 Hz)]
Personal Audio
• LifeLog / MyLifeBits /
  Remembrance Agent:
  - easy to record everything you hear
• Then what?
  - prohibitive to review
  - applications if access were easier?
• Automatic content analysis / indexing...
[Figure: several hours of personal audio visualized as spectrogram features (freq / Bark) against clock time]
- find features to classify into e.g. locations
Alarm sound detection
• Alarm sounds have particular structure
  - clear even at low SNRs
  - potential applications...

[Figure: spectrogram (freq / kHz) of restaurant noise plus alarms (snr 0 ns 6 al 8)]
• Contrast two systems (Ellis '01):
  - standard, global features, P(X|M)
  - sinusoidal model, fragments, P(M,S|Y)
[Figure: MLP classifier output vs. sound-object classifier output (freq / kHz over time / sec)]

  - error rates high, but interesting comparisons...
Music signal modeling
• Use a "machine listener" to navigate large music
  collections
  - e.g. unsigned bands on MP3.com
• Classification to label:
  - notes, chords, singing, instruments
  - .. information to help cluster music
• "Artist models" based on feature distributions

  [Figure: distributions of third vs. fifth cepstral coefficients for madonna and bowie tracks, against Electronica and Country models]

  - measure similarity between users' collections
    and new music? (Berenzweig & Ellis '03)
Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions
- Model complexity
- Sound mixtures
Observations and Conclusions:
Training and test data
• Balance model/data size to avoid overfitting:

  [Figure: error rate vs. training/parameters — training-data error keeps falling while test-data error turns upward past the overfitting point]
• Diminishing returns from more data:

  [Figure: WER for PLP12N-8k nets vs. net size & training data — WER% falls from ~44 to ~32 as training data grows from 9.25 to 74 hours and the hidden layer from 500 to 4000 units; more training data and more parameters both help, with an optimal parameter/data ratio along the constant-training-time contour]
Beyond classification
• "No free lunch":
  the classifier can only do so much
  - always need to consider other parts of the system
• Features
  - impose a ceiling on system performance
  - improved features allow simpler classifiers
• Segmentation / mixtures
  - e.g. speech-in-noise:
    only a subset of feature dimensions available
    → missing-data approaches...

  [Figure: sources S1(t) and S2(t) mixing into the observed signal Y(t)]
Summary
• Statistical pattern recognition
  - exploit training data for probabilistically-correct
    classifications
• Speech recognition
  - successful application of statistical PR
  - .. but many remaining frontiers
• Other audio applications
  - meetings, alarms, music
  - classification is information extraction
• Current challenges
  - variability in speech
  - acoustic mixtures
Extra slides
Neural network classifiers
• Instead of estimating p(x|ωi) and using Bayes,
  can also try to estimate the posteriors Pr(ωi|x)
  directly (the decision boundaries)
• Sums over nonlinear functions of sums
  give a large range of decision surfaces...
• E.g. multi-layer perceptron (MLP):

    yk = F[ Σj wjk · F[ Σi wij · xi ] ]
[Figure: MLP with input layer (x1, x2, x3), hidden layer (h1, h2 via weights wij and nonlinearity F[·]), and output layer (y1, y2 via weights wjk)]

• Problem is finding the weights wij ... (training)
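The MLP forward pass, yk = F[ Σj wjk · F[ Σi wij · xi ] ], translates almost line-for-line into code. A minimal sketch with F taken as the sigmoid; the weights and input values here are hypothetical, not trained:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def mlp_forward(x, w_ij, w_jk):
    """y_k = F[ sum_j w_jk * F[ sum_i w_ij * x_i ] ] with F = sigmoid."""
    # hidden layer: one nonlinear sum per hidden unit j
    h = [sigmoid(sum(w_ij[i][j] * x[i] for i in range(len(x))))
         for j in range(len(w_ij[0]))]
    # output layer: one nonlinear sum per output unit k
    return [sigmoid(sum(w_jk[j][k] * h[j] for j in range(len(h))))
            for k in range(len(w_jk[0]))]

# hypothetical weights: 3 inputs -> 2 hidden units -> 2 outputs
w_ij = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
w_jk = [[1.0, -1.0], [-0.5, 0.9]]
y = mlp_forward([0.2, -0.4, 0.9], w_ij, w_jk)
```

The forward pass is the easy half; the training problem the slide names is finding wij and wjk, typically by gradient descent on an error criterion.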
Neural net classifier
• Models boundaries, not the density p(x|ωi)
• Discriminant training
  - concentrate on boundary regions
  - needs to see all classes at once
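The discriminant-training idea can be illustrated with something simpler than a full MLP: a logistic model of the boundary, fit by stochastic gradient steps. The data, learning rate, and epoch count are made up; the point is only that training sees both classes together and adjusts the boundary, never modeling p(x|ωi) itself.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_discriminant(data, lr=0.5, epochs=200):
    """Fit w, b so that sigmoid(w*x + b) approximates Pr(class=1 | x)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w * x + b)
            # gradient step on the log-likelihood of the labels
            w += lr * (y - p) * x
            b += lr * (y - p)
    return w, b

# two toy 1-D classes: label 0 near x ~ -1, label 1 near x ~ +1
data = [(-1.2, 0), (-0.9, 0), (-1.1, 0), (1.0, 1), (1.3, 1), (0.8, 1)]
w, b = train_discriminant(data)
```

Points far from the boundary contribute almost nothing to the gradient once they are classified confidently, which is the sense in which discriminant training concentrates on the boundary regions.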
Why is Speech Recognition hard?
• Why not match against a set of waveforms?
  - waveforms are never (nearly!) the same twice
  - speakers minimize information/effort in speech
• Speech variability comes from many sources:
  - speaker-dependent (SD) recognizers must
    handle within-speaker variability
  - speaker-independent (SI) recognizers must also
    deal with variation between speakers
  - all recognizers are afflicted by background noise,
    variable channels
→ Need recognition models that:
- generalize i.e. accept variations in a range, and
- adapt i.e. ‘tune in’ to a particular variant
Within-speaker variability
• Timing variation:
  - word duration varies enormously

  [Figure: spectrogram (0-4000 Hz, ~3 s) of a spontaneous utterance with phone- and word-level alignment, showing large duration variation]
- fast speech ‘reduces’ vowels
• Speaking style variation:
  - careful/casual articulation
  - soft/loud speech
• Contextual effects:
  - speech sounds vary with context, role:
    "How do you do?"
Between-speaker variability
• Accent variation
  - regional / mother tongue
• Voice quality variation
  - gender, age, huskiness, nasality
• Individual characteristics
  - mannerisms, speed, prosody
[Figure: spectrograms (0-8000 Hz, ~2.5 s) of the same utterance from two speakers, mbma0 and fjdm2]
Environment variability
• Background noise
  - fans, cars, doors, papers
• Reverberation
  - 'boxiness' in recordings
• Microphone channel
  - huge effect on relative spectral gain
[Figure: spectrograms (0-4000 Hz, ~1.5 s) of the same speech from a close mic and a tabletop mic]