Biologically Inspired Noise-Robust Speech Recognition
for Both Man and Machine
Mark D. Skowronski
Computational NeuroEngineering Lab
University of Florida
March 26, 2004
Speech Recognition Motivation
Speech is the #1 real-time communication
medium among humans.
Advantages of voice interface to machines:
• Hands-free operation
• Speed
• Ease of use
Man vs. Machine
Man is a high-performance
existence proof for speech
processing in noisy
environments.
Can we emulate man’s
performance by leveraging
expert information into our
systems?
Wall Street Journal/Broadcast news readings, 5000 words
Untrained human listeners vs. Cambridge HTK LVCSR system
Biologically Inspired Algorithms
Expert information is added in three applications:
• Speech enhancement for human listeners
• Feature extraction for automatic speech recognition
• Classification for automatic speech recognition
Speech Enhancement
Motivations:
• Noisy cell phone conversations
• Public address systems
• Aircraft cockpit
What can we do to increase
intelligibility when turning up the
volume is not an option?
Biology:
Lombard effect
This work funded by the iDEN Technology Group of Motorola
The Lombard Effect
Psychophysical changes in vocal characteristics, produced
by a speaker in the presence of background acoustic noise:
• Vocal effort (amplitude) increases
• Duration increases
• Pitch increases
• Formant frequencies increase
• Energy center of gravity increases
• Consonant-to-noise ratio increases
Result: Intelligibility increases
Psychoacoustic Experiments
Miller and Nicely (1955): AWGN to speech affects place of articulation
and frication most, less so for voicing and nasality.
Furui (1986): Truncated vowels in consonant-vowel pairs dramatically
decreased in intelligibility beyond a certain point of truncation. These
points correspond to spectrally dynamic regions.
Bottom Line:
Speech contains regions of relatively high phonetic information, and
emphasis of these regions increases intelligibility.
Solution: Energy Redistribution
We redistribute energy from regions of low information content to regions
of high information content while conserving overall energy.
From Miller and Nicely:
ER for Voiced/Unvoiced (ERVU) regions.
Voicing is determined by the Spectral Flatness Measure (SFM),
the ratio of the geometric mean to the arithmetic mean of the
magnitude spectrum:

SFM_j = ( prod_{k=1..N} X_j(k) )^(1/N) / ( (1/N) sum_{k=1..N} X_j(k) )

X_j(k) is the magnitude of the short-term Fourier
transform of the jth speech window of length N.

[Plot: SFM of the word "clarification"]
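A minimal sketch of the SFM computation (window framing and the voiced/unvoiced threshold are omitted; the 440 Hz test tone, sampling rate, and window length are illustrative, not the dissertation's values):

```python
import numpy as np

def spectral_flatness(frame):
    """Spectral Flatness Measure: ratio of the geometric mean to the
    arithmetic mean of the short-term magnitude spectrum X_j(k).
    Near 1 for flat (noise-like/unvoiced) spectra, much smaller for
    peaky (harmonic/voiced) spectra."""
    X = np.abs(np.fft.rfft(frame))
    X = np.maximum(X, 1e-12)                 # guard against log(0)
    geo_mean = np.exp(np.mean(np.log(X)))
    arith_mean = np.mean(X)
    return geo_mean / arith_mean

# A pure tone (peaky spectrum) vs. white noise (flat spectrum)
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(512) / 8000)
noise = rng.standard_normal(512)
```

A voiced/unvoiced decision would then compare `spectral_flatness` of each window against a threshold.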
M. D. Skowronski, J. G. Harris, and T. Reinke, J. Acoust. Soc. Am., 2002
Listening Tests
Confusable set test, from Junqua*
I: f, s, x, yes
II: a, h, k, 8
III: b, c, d, e, g, p, t, v, z, 3
IV: m, n
• 500 trials forced decision
• 3 algorithms (control, ERVU, HPF)
• 0 dB and -10 dB SNR, AWGN
• unlimited playback over headphones
• 25 participants, 30-45 minutes
* J. C. Junqua, J. Acoust. Soc. Am., 1993
Listening Tests Results
-10 dB SNR, white noise
Errors decreased 20% compared to the control.
[Figure: results for "S", "A", "E", "M"]
Energy Redistribution Summary
• Developed a real-time algorithm for cell phone
applications using biological inspiration,
• Increased intelligibility while maintaining
naturalness and conserving energy,
• Effective because everyday speech is not clearly
enunciated,
• ERVU is a novel approach to speech enhancement
that works on either clean speech or noise-reduced
speech.
M. D. Skowronski and J. G. Harris, J. Acoust. Soc. Am., 2004b (in preparation)
Feature Extraction
ASR pipeline: Input → Feature Extraction → Classification
Information in the speech input: phonetic, gender, age,
emotion, pitch, accent, physical state, additive/channel noise.
[Figure: HFCC filter bank]
Existing Algorithms
Goal: emphasize phonetic information over other info streams.
Feature algorithms:
• Acoustic: formant frequencies, bandwidths
• Model based: linear prediction
• Filter bank based: mel freq cepstral coeff (MFCC)
Provides dimensionality reduction on quasi-stationary windows.
[Figure: feature trajectories over time for the word "seven"]
MFCC Filter Bank
• Design parameters: FB freq range, number of filters
• Center freqs equally-spaced in mel frequency
• Triangle endpoints set by center freqs of adjacent filters
Although filter spacing is determined by perceptual mel
frequency scale, bandwidth is set more for convenience
than by biological arguments.
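The coupling of MFCC bandwidth to spacing can be sketched directly (the 2595·log10 mel approximation and the 0–4 kHz range are common conventions, assumed here rather than taken from the slides):

```python
import numpy as np

def hz_to_mel(f):
    # A common mel-scale approximation
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_triangles(f_lo, f_hi, n_filters):
    """Center frequencies equally spaced in mel; each triangle's
    endpoints are the center frequencies of its neighbors, so the
    bandwidth is fixed by the spacing (the MFCC convention)."""
    mels = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    edges = mel_to_hz(mels)              # n_filters + 2 boundary points
    # filter i: (left, center, right) = (edges[i], edges[i+1], edges[i+2])
    return [(edges[i], edges[i + 1], edges[i + 2])
            for i in range(n_filters)]

tris = mfcc_triangles(0.0, 4000.0, 20)
```

Note that changing the number of filters changes every bandwidth; HFCC (next slide) removes this coupling.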
HFCC Filter Bank
HFCC: human factor cepstral coefficients
• Decouples filter bandwidth from filter spacing,
• Sets filter width according to the critical bandwidth of the
human auditory system,
• Uses Moore and Glasberg approximation of critical
bandwidth, defined in Equivalent Rectangular Bandwidth
(ERB).
ERB = 6.23 fc^2 + 93.39 fc + 28.52 (Hz)
fc is the critical-band center frequency in kHz.
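The ERB formula, with an E-factor scaling, can be sketched as follows (the `hfcc_bandwidth` helper is hypothetical, illustrating the decoupling of bandwidth from spacing described on the next slide):

```python
def erb_hz(fc_khz):
    """Moore & Glasberg critical-bandwidth (ERB) approximation.
    Center frequency fc_khz is in kHz; the result is in Hz."""
    return 6.23 * fc_khz ** 2 + 93.39 * fc_khz + 28.52

def hfcc_bandwidth(fc_khz, e_factor=1.0):
    """Hypothetical helper: HFCC-E scales the perceptual bandwidth
    by a linear E-factor, independently of the filter spacing."""
    return e_factor * erb_hz(fc_khz)

print(round(erb_hz(1.0), 2))   # classic ERB at 1 kHz → 128.14
```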
M. D. Skowronski and J. G. Harris, ICASSP, 2002
HFCC with E-factor
Linear ERB scale factor (E-factor) controls filter bandwidth
[Figure: HFCC filter banks for E-factor = 1 and E-factor = 3]
• Controls tradeoff between local SNR and spectral resolution,
• Exemplifies the benefits of decoupling filter bandwidth from filter
spacing.
M. D. Skowronski and J. G. Harris, J. Acoust. Soc. Am., 2004a (submitted)
ASR Experiments
• Isolated English digits “zero” through “nine” from
TI-46 corpus, 8 male speakers,
• HMM word models, 8 states per model, diagonal
covariance matrix,
• Three MFCC versions (different filter banks),
• Linear ERB scale factor (E-factor),
• HFCC with E-factor (HFCC-E).
Total: 37.9 million frames of speech (>100 hours)
ASR Results
[Figure: recognition accuracy vs. global SNR in white noise:
HFCC-E (varying E-factor) vs. D&M (Davis & Mermelstein) MFCC]
M. D. Skowronski and J. G. Harris, ISCAS, 2003
HFCC Summary
• Adds biologically inspired bandwidth to filter bank
of popular speech feature extractor,
• Provides superior noise-robust performance over
MFCC and variants,
• Allows for further filter bank design modifications,
demonstrated by HFCC with E-factor,
• HFCC has the same computational cost as MFCC,
only the filter bank coefficients are adjusted: easy to
implement.
Classification
• HMM Limitations & Variations
• Freeman Model Introduction
• Model Hierarchy
• Associative Memory
• ASR Experiments
[Figure: Freeman's Reduced KII Network]
This work funded by the Office of Naval Research grant N00014-1-1-0405
HMM Limitations & Variations
Limitations:
• HMM is piece-wise stationary; speech is nonstationary,
• Assumes frames are i.i.d.; speech is coarticulated,
• State PDFs are data-driven; curse of dimensionality.
Variations:
Deng (1992): trended HMM
Rabiner (1986): autoregressive HMM
Morgan & Bourlard (1995): HMM/MLP hybrid
Robinson (1994): context-dependent RNN
Herrmann (1993): transient attractor network
Liaw & Berger (1996): dynamic synapse RNN
(ordered roughly from HMM-like to fully nonlinear dynamic)
Freeman (1997): non-convergent dynamic biological model
Freeman Model
Hierarchical nonlinear dynamic model of cortical signal
processing from rabbit olfactory neo-cortex.
K0 cell, H(s) 2nd order low pass filter
Reduced KII (RKII) cell (stable oscillator)
(1/ab)(m'' + (a+b)m' + ab*m) = -K_gm * Q(g) + I
(1/ab)(g'' + (a+b)g' + ab*g) = +K_mg * Q(m)
Q(.) is Freeman's asymmetric sigmoid nonlinearity.
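A minimal forward-Euler sketch of the coupled RKII equations (the rate constants a and b, the coupling gains, the time step, and the sigmoid parameterization are illustrative assumptions, not the dissertation's values; the low-side cutoff of Freeman's sigmoid is omitted):

```python
import numpy as np

def Q(x, qm=5.0):
    """One common parameterization of Freeman's asymmetric sigmoid;
    qm is an illustrative value."""
    return qm * (1.0 - np.exp(-(np.exp(x) - 1.0) / qm))

def rkii_step(state, I, dt=1e-4, a=220.0, b=720.0, Kgm=2.0, Kmg=2.0):
    """One Euler step of the reduced KII (M-cell / G-cell) pair:
       (1/ab)(m'' + (a+b)m' + ab*m) = -Kgm*Q(g) + I
       (1/ab)(g'' + (a+b)g' + ab*g) = +Kmg*Q(m)
    a, b are the K0 low-pass rate constants."""
    m, dm, g, dg = state
    ddm = -(a + b) * dm - a * b * m + a * b * (-Kgm * Q(g) + I)
    ddg = -(a + b) * dg - a * b * g + a * b * (Kmg * Q(m))
    return (m + dt * dm, dm + dt * ddm, g + dt * dg, dg + dt * ddg)

state = (0.0, 0.0, 0.0, 0.0)
trace = []
for _ in range(20000):        # 2 s of simulated time
    state = rkii_step(state, I=1.0)
    trace.append(state[0])    # M-cell output
```

Because the sigmoid saturates, the M-cell trajectory stays bounded; whether the pair settles or oscillates depends on the coupling gains and input.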
RKII Network
High-dimensional, scalable network of stable oscillators.
Fully connected M-cell and G-cell weight matrices (zero diagonal).
Capable of several dynamic behaviors:
• Stable attractors (limit cycle, fixed point)
• Chaos
• Spatio-temporal patterns
• Synchronization
• Generalization
Associative Memory
Two regimes of operation of the oscillator network as an
associative memory of binary patterns:
• Energy
• Synchronization Through Stimulation (STS)
Network weights for each regime are set by an outer product
rule variation and by hand.
M. D. Skowronski and J. G. Harris, Phys. Rev. E, 2004 (in preparation)
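The slides use a variation of the outer product rule adapted to the oscillator network; as a generic sketch, the classic Hopfield-style outer-product rule for bipolar patterns looks like this (the oscillator-specific variation is not shown):

```python
import numpy as np

def outer_product_weights(patterns):
    """Hopfield-style outer-product rule for bipolar (+/-1) patterns.
    Diagonal zeroed, matching the zero-diagonal RKII weight matrices."""
    P = np.asarray(patterns, dtype=float)   # shape: (n_patterns, n_cells)
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

pats = np.array([[ 1, -1, 1, -1,  1, 1],
                 [-1, -1, 1,  1, -1, 1]])
W = outer_product_weights(pats)
# one synchronous recall step: stored patterns should be fixed points
recalled = np.sign(W @ pats[0])
```

With few, weakly correlated patterns the stored patterns are fixed points of the recall step; crosstalk between overlapping patterns produces the spurious outputs discussed below.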
Associative Memory
[Figure: input/output pattern pairs for full, partial, noisy,
and spurious inputs]
ASR with RKII Network
Two-Class Case
• \IY\ from “she”
• \AA\ from “dark”
• 10 HFCC-E coeffs.
converted to binary
• Energy-based RKII
associative memory
• No overlap between
learned centroids
Classifier          True class   \IY\   \AA\   % Correct
Bayes, continuous   \IY\         2705      0      99.9
                    \AA\            8   4340
Bayes, binary       \IY\         2701      4      98.4
                    \AA\          110   4238
Hamming distance    \IY\         2658     47      93.7
                    \AA\          394   3954
RKII, exact         \IY\         2593      6      87.3
                    \AA\          202   3564
RKII, Hamming       \IY\         2666     39      92.7
                    \AA\          479   3869

(Rows: true class; columns: predicted class; 7053 trials total.
The "RKII, exact" rows sum to fewer than 7053 trials; the
remaining trials produced no exact match and count as errors,
consistent with the 87.3% figure.)
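The percent-correct column can be reproduced from the confusion counts; a quick check (assuming 7053 total trials, taken from the Bayes-continuous row sums, with unmatched RKII trials counted as errors):

```python
TOTAL_TRIALS = 2705 + 0 + 8 + 4340   # 7053, from the Bayes-continuous rows

def percent_correct(cm, total=TOTAL_TRIALS):
    """On-diagonal counts divided by the total trial count. For the
    'RKII, exact' classifier the rows sum to fewer than 7053 trials,
    so the missing (unmatched) trials count as errors."""
    return 100.0 * sum(cm[i][i] for i in range(len(cm))) / total

results = {
    "Bayes, continuous": [[2705,  0], [  8, 4340]],
    "Bayes, binary":     [[2701,  4], [110, 4238]],
    "Hamming distance":  [[2658, 47], [394, 3954]],
    "RKII, exact":       [[2593,  6], [202, 3564]],
    "RKII, Hamming":     [[2666, 39], [479, 3869]],
}
for name, cm in results.items():
    print(f"{name}: {percent_correct(cm):.1f}")
```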
ASR with RKII Network
Three-Class Case
• \IY\ from "she"
• \AA\ from "dark"
• \AE\ from "ask"
• 18 HFCC-E coeffs. converted to binary
• Energy-based RKII associative memory
• Variable overlap between learned centroids
Overlap is controlled by the binary feature conversion:
more overlap → more spurious outputs.
Freeman Model Summary
Contributions:
• Documented impulse invariance discretization,
• Developed software tools, enabling large-scale
experiments,
• Demonstrated stable attractors in Freeman model,
• Explained attractor instability by transient chaos,
• Proposed two regimes of associative memory,
• Invented novel synchronization mechanism (STS),
• Devised variation of outer product rule for oscillator
network learning rule,
• Proved practical probability results concerning overlap in
the three-class case,
• Applied novel static pattern classifier to ASR.
Conclusions
Developed novel speech enhancement algorithm,
- Lombard effect indicates what to modify,
- Psychoacoustic experiments indicate where to modify,
- ERVU reduces human recognition error 20-40% in noisy
environments.
Extended existing speech feature extraction algorithm,
- Critical bandwidth used to decouple filter bandwidth and spacing,
- HFCC-E demonstrates a research direction for independent filter bandwidth,
- HFCC-E improves ASR by 7 dB SNR.
Advanced knowledge of nonlinear dynamic (NLD) models for information processing.
- Applied model to ASR of static speech features,
- Near-optimum performance of RKII network associative memory
using first-order statistics.