Biologically Inspired Noise-Robust Speech Recognition for Both Man and Machine

Mark D. Skowronski
Ph.D. Proposal
University of Florida
Gainesville, FL, USA
Outline
• Introduction
• Biologically inspired algorithms
– Speech: Energy Redistribution
– Features: Human Factor Cepstral Coefficients
– Classifier: Nonlinear dynamic systems
• Future work
Biological Inspiration
Example of read speech in additive white Gaussian noise (AWGN), 10 dB SNR
Wall Street Journal/Broadcast news readings
Untrained human listeners vs Cambridge HTK LVCSR system
Speech Enhancement
Motivations:
• Noisy cell phone conversations
• Power-constrained transducers
• Public address systems in noisy environments
What can you do when turning up the volume is not an option?
Biology: Lombard Effect
The Lombard Effect
Lombard Effect: changes in vocal characteristics produced by a speaker in the presence of background noise.
• Amplitude increases.
• Duration increases.
• Pitch increases.
• Formant frequencies increase.
• High-freq to low-freq energy ratio increases.
• Intelligibility increases.
Psychoacoustic Experiments
Speech contains regions of relatively high information content,
and emphasis of these regions increases perceived intelligibility.
• Fletcher (1953): phonemes varied in robustness to low-pass or
high-pass filtering, with vowels being the most robust.
• Miller and Nicely (1955): adding AWGN to speech affects place
of articulation and frication most, voicing and nasality
less.
• Furui (1986): Truncated vowels in consonant-vowel pairs
dramatically decreased in intelligibility beyond a certain
point of truncation. These points correspond to spectrally
dynamic regions.
Solution: Energy Redistribution
We redistribute energy from regions of low information content
to regions of high information content while conserving overall
energy across words.
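The energy-conserving constraint can be illustrated with a minimal sketch. It assumes the speech has already been cut into frames and that the caller supplies a boolean mask marking the frames to emphasize; the function name, the frame layout, and the 3 dB default are hypothetical and not taken from the slides.

```python
import numpy as np

def redistribute_energy(frames, boost_mask, boost_db=3.0):
    """Boost the masked frames and attenuate the rest so total energy is unchanged.

    frames     : 2-D array, one row per analysis frame (assumed layout)
    boost_mask : boolean array, True for frames to emphasize
    boost_db   : gain in dB for the emphasized frames (illustrative value)
    """
    frames = np.asarray(frames, dtype=float)
    boost_mask = np.asarray(boost_mask, dtype=bool)

    energy = np.sum(frames ** 2, axis=1)
    e_boost = energy[boost_mask].sum()
    e_rest = energy[~boost_mask].sum()

    g_boost = 10.0 ** (boost_db / 20.0)
    # Energy left over for the non-boosted frames once the boost is applied.
    remaining = e_boost + e_rest - g_boost ** 2 * e_boost
    if e_rest <= 0.0 or remaining < 0.0:
        raise ValueError("not enough energy in the attenuated frames for this boost")
    g_rest = np.sqrt(remaining / e_rest)

    gains = np.where(boost_mask, g_boost, g_rest)
    return frames * gains[:, None]
```

Passing the frames of a single word keeps the energy constant across that word, which matches the constraint stated above.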
[Figure: SFM of “clarification”]
We partition speech into voiced/unvoiced regions using the Spectral Flatness Measure (SFM):
$$\mathrm{SFM}_j = \frac{\left(\prod_{k=1}^{N} X_j(k)\right)^{1/N}}{\frac{1}{N}\sum_{k=1}^{N} X_j(k)}$$
X_j(k) is the magnitude of the short-term Fourier
transform of the jth speech window of length N.
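As a rough sketch of the SFM, the ratio of geometric to arithmetic mean of the magnitude spectrum can be computed per window as below. The use of a one-sided FFT (`rfft`) and the small `eps` floor that keeps the logarithm finite are assumptions of this sketch, not details given in the slides.

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of one window's magnitude spectrum."""
    mag = np.abs(np.fft.rfft(frame)) + eps       # magnitude spectrum of the window
    geo_mean = np.exp(np.mean(np.log(mag)))      # geometric mean via the log domain
    arith_mean = np.mean(mag)
    return geo_mean / arith_mean

# A pure tone has a peaky spectrum (SFM near 0); white noise is much flatter.
n = 256
tone = np.sin(2 * np.pi * 16 * np.arange(n) / n)
noise = np.random.default_rng(0).standard_normal(n)
sfm_tone = spectral_flatness(tone)
sfm_noise = spectral_flatness(noise)
```

Harmonic (voiced) windows tend toward low SFM and noise-like (unvoiced) windows toward high SFM, so a threshold on the value gives a voiced/unvoiced partition; the threshold itself would have to be tuned.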
Listening Test
Confusable set test, from Junqua:
Set I: f, s, x, yes
Set II: a, h, k, 8
Set III: b, c, d, e, g, p, t, v, z, 3
Set IV: m, n
• 500 trials forced decision
• 3 algorithms: control, energy redistribution voiced/unvoiced (ERVU), high-pass filter (HPF)
• 0 dB and -10 dB SNR, AWGN
• unlimited playback over headphones
• 26 participants, 30-45 minutes
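For reference, test stimuli at a given SNR can be produced along the following lines. This is a sketch that assumes SNR is measured globally over the whole utterance; the function name and its return convention are hypothetical.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise scaled to hit a target global SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    noise = rng.standard_normal(signal.shape)

    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    target_noise_power = p_signal / 10.0 ** (snr_db / 10.0)
    noise = noise * np.sqrt(target_noise_power / p_noise)   # exact rescaling
    return signal + noise, noise
```

At -10 dB SNR the noise power is ten times the speech power, which is the harder of the two conditions listed above.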
Listening Test Results
-10 dB SNR, white noise: errors decreased 20% compared to control.
[Figure: results panels “S”, “A”, “E”, “M”]
Energy Redistribution Summary
• Biologically inspired
– Lombard Effect says how to modify.
– Psychoacoustic experiments say where to modify.
• Increases intelligibility while maintaining
naturalness and conserving energy.
• Naturalness elegantly preserved by retaining
spectral and temporal cues.
• Effective because everyday speech is not clearly
enunciated.
ASR Introduction
Automatic Speech Recognition is the extraction of linguistic
information from an utterance of speech (speech-to-text).
• Isolated/Continuous speech
• Dependent/Independent speaker operation
• Word/Phoneme recognition unit
• Vocabulary size and perplexity
Input → Feature Extraction → Classification
Input
Example utterance: “seven”
Information: phonetic, gender, age, emotion,
pitch, accent, physical state, additive/channel noise
Feature Extraction
Goal: emphasize phonetic information over other characteristics.
• Acoustic: formant frequencies, bandwidths
• Model based: linear prediction
• Filter-bank based: mel frequency cepstral coefficients (mfcc)
Provides dimensionality reduction on quasi-stationary windows.
[Figure: features of “seven” over time]
Hidden Markov Model
[Figure: “one” in the time domain, state space, and feature space]
MFCC Algorithm
MFCC: the most widely used speech feature extractor.
“seven” → x(t) → Fourier transform (F) → Mel-scaled filter bank → Log energy → DCT → Cepstral domain
[Figure axes: filter # vs. time]
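The chain above can be sketched in a few lines. The filter bank is passed in as a matrix so the sketch stays independent of how the filters are designed; the frame length, hop, Hamming window, number of coefficients, and the unnormalized DCT-II are illustrative assumptions rather than settings from the slides.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """Unnormalized type-II DCT of each row of x, keeping the first n_coeffs terms."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n)    # shape (n_coeffs, n)
    return x @ basis.T

def cepstral_coeffs(signal, fbank, frame_len=256, hop=128, n_coeffs=13, eps=1e-10):
    """Frames -> power spectrum -> filter bank -> log energy -> DCT.

    fbank : (n_filters, frame_len // 2 + 1) array of non-negative filter weights.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # short-term power spectrum
    log_energy = np.log(power @ fbank.T + eps)         # log filter-bank energies
    return dct_ii(log_energy, n_coeffs)                # one cepstral vector per frame
```

Each row of the result is one cepstral feature vector; the DCT step both decorrelates the log energies and reduces dimensionality from the number of filters to `n_coeffs`.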
DCT vs Eigenvectors
[Figure: spectra of DCT basis vectors compared with spectra of eigenvectors from log energy of filtered speech; axes: basis # vs. frequency]
Average spectral difference < 15%
MFCC Filter Bank
• Design parameters: filter bank frequency range, number of filters.
• Center frequencies equally spaced in mel frequency.
• Triangle endpoints set by center frequencies of adjacent filters.
Although filter spacing is determined by the perceptual mel frequency
scale, bandwidth is set more for convenience than by biological
motivation.
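The three design rules can be sketched as follows. The Hz-to-mel conversion used here, 2595·log10(1 + f/700), is one common form and is an assumption of this sketch, since the slides do not say which mel formula was used; the function names are also hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(n_filters, f_low, f_high, n_fft, sample_rate):
    """Triangular filters whose endpoints are the centers of the adjacent filters."""
    # n_filters + 2 points equally spaced in mel: the interior points are the
    # center frequencies, the outer two close off the first and last triangles.
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(f_low), hz_to_mel(f_high),
                                     n_filters + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    fbank = np.zeros((n_filters, bin_hz.size))
    for i in range(n_filters):
        lo, center, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - lo) / (center - lo)
        falling = (hi - bin_hz) / (hi - center)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank, edges_hz[1:-1]
```

Note how the width of every triangle falls out of the frequency range and the number of filters: changing either one changes all bandwidths, which is the coupling the slide points out.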
Human Factor Cepstral Coefficients
• Decouple filter bandwidth from filter bank design parameters.
• Set filter width according to the critical bandwidth of the human
auditory system.
• Use Moore and Glasberg approximation of critical bandwidth,
defined in Equivalent Rectangular Bandwidth (ERB).
$$\mathrm{ERB} = 6.23 f_c^2 + 93.39 f_c + 28.52 \quad \text{(Hz)}$$
f_c is the critical band center frequency (kHz).
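A small sketch of the bandwidth rule, taking the center frequency in Hz and converting to kHz internally. The `e_factor` argument is an assumption of this sketch: a plain multiplicative scale on the ERB, intended to mirror a linear ERB scale factor.

```python
import numpy as np

def erb_hz(fc_hz, e_factor=1.0):
    """Moore and Glasberg ERB approximation in Hz for center frequency fc_hz (Hz)."""
    fc_khz = np.asarray(fc_hz, dtype=float) / 1000.0
    return e_factor * (6.23 * fc_khz ** 2 + 93.39 * fc_khz + 28.52)

# Bandwidth depends only on the center frequency (and the optional scale),
# not on how many filters the bank has or on the spacing of its neighbors.
centers_hz = np.array([250.0, 1000.0, 4000.0])
bandwidths = erb_hz(centers_hz)
```

At a 1 kHz center frequency the formula gives 6.23 + 93.39 + 28.52 = 128.14 Hz.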
ASR Experiments Review
• Isolated English digits “zero” through “nine” from the TI-46 corpus, 8 male speakers.
• HMM word models, 8 states per model, diagonal covariance matrix.
• Three mfcc versions (different filter banks).
• Several degrees of freedom.
• Linear ERB scale factor.
ASR Results
White noise (local SNR), hfcc vs D&M
ASR Results
White noise (global SNR), hfcc vs D&M, linear ERB scale factor (E-factor)
HFCC Conclusions
• Added biologically inspired bandwidth to
filter bank of popular speech feature
extractor.
• Decoupled bandwidth from other filter
bank design parameters.
• Demonstrated superior noise-robust
performance of new feature extractor.
• Demonstrated advantages of wider filters.
HMM Limitations
• HMMs are piecewise-stationary, while
speech is continuous and nonstationary.
• HMMs assume frames of speech are i.i.d.
• State pdf estimates are data-driven.
HMMs make no claim of modeling biology.
Novel Classifiers
• Deng's trended HMM.
• Rabiner's autoregression HMM.
• Morgan's HMM/neural network hybrid.
• Robinson's recurrent neural network.
• Wismüller's self-organizing map.
• Herrmann's transient attractor network.
• Maass' dynamic synapse MLP.
• Berger's dynamic synapse RNN.
Freeman's Chaotic Model
• Biologically inspired nonlinear dynamic model of
cortical signal processing, from rabbit olfactory
neo-cortex experiments.
• A hierarchical network of oscillators that are
locally stable and globally chaotic.
• Demonstrated as classifier of static patterns.
• Represents a radical departure from current
classifier paradigms.
KI Model
• Smallest element in network hierarchy.

$$\frac{1}{a \cdot b}\left[\frac{d^2}{dt^2}x_i(t) + (a+b)\,\frac{d}{dt}x_i(t) + (a \cdot b)\,x_i(t)\right] = \sum_{j \neq i}^{N}\left[W_{ij} \cdot Q\big(x_j(t), q_j\big)\right] + I_i(t), \qquad i = 1, \ldots, N$$
• a,b constants
• state variable xi(t)
• N states
• Wij weight from state i to state j
• asymmetric sigmoid Q
• input Ii(t) to state i.
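A minimal numerical sketch of these dynamics, using explicit Euler integration. Several things here are assumptions not given in the slides: the closed form of the asymmetric sigmoid Q, a single shared q for all states, the values a = 0.22, b = 0.72 (per ms) and q = 5, and the step size. In the code, `W[i, j]` is the weight that multiplies Q(x_j) in the equation for x_i.

```python
import math
import numpy as np

def freeman_sigmoid(x, q=5.0):
    """Asymmetric sigmoid (assumed form): saturates at -1 below and at q above."""
    x0 = math.log(1.0 - q * math.log(1.0 + 1.0 / q))
    if x <= x0:
        return -1.0
    return q * (1.0 - math.exp(-(math.exp(x) - 1.0) / q))

def simulate_ki(weights, inputs, a=0.22, b=0.72, q=5.0, dt=0.05):
    """Integrate N coupled second-order units.

    weights : (N, N) array; the diagonal is ignored (the sum runs over j != i)
    inputs  : (T, N) array of external inputs I_i(t), one row per time step
    returns : (T, N) array of states x_i(t)
    """
    w = np.array(weights, dtype=float)
    np.fill_diagonal(w, 0.0)
    inputs = np.asarray(inputs, dtype=float)
    n_steps, n = inputs.shape
    x = np.zeros(n)      # state
    v = np.zeros(n)      # first derivative of the state
    out = np.empty((n_steps, n))
    for t in range(n_steps):
        q_out = np.array([freeman_sigmoid(xj, q) for xj in x])
        drive = w @ q_out + inputs[t]
        # Rearranged from the equation: x'' = a*b*(drive - x) - (a + b)*x'
        accel = a * b * (drive - x) - (a + b) * v
        x, v = x + dt * v, v + dt * accel
        out[t] = x
    return out
```

With all weights set to zero each unit is a linear second-order low-pass filter, so a constant input I_i drives x_i to I_i; this makes a convenient sanity check before adding coupling.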
Reduced KII Network
• Locally stable element is KII network.
• m(t) excitatory mitral cell
• g(t) inhibitory granule cell
• Weights Kmg > 0, Kgm < 0
• N pairs in parallel
• Mitral cells fully connected
• Granule cells fully connected
• Input I(t) into excitatory cell.
KII Simulations
[Figure: simulated g(t) and m(t)]
The reduced KII network reaches either a steady-state point attractor
or a limit cycle, depending on |Kmg · Kgm|.
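A single mitral/granule pair can be simulated with the sketch below to explore this behavior. It rests on several assumptions: Kmg is read as the mitral-to-granule weight and Kgm as the granule-to-mitral weight, the sigmoid has the closed form shown, and the rate constants, q, and step size are illustrative. The full reduced KII network couples N such pairs, which this sketch does not attempt.

```python
import math

def freeman_sigmoid(x, q=5.0):
    """Asymmetric sigmoid (assumed form): saturates at -1 below and at q above."""
    x0 = math.log(1.0 - q * math.log(1.0 + 1.0 / q))
    if x <= x0:
        return -1.0
    return q * (1.0 - math.exp(-(math.exp(x) - 1.0) / q))

def simulate_kii_pair(k_mg, k_gm, stimulus, a=0.22, b=0.72, q=5.0, dt=0.05):
    """Explicit-Euler simulation of one excitatory/inhibitory pair.

    k_mg     : mitral -> granule weight (> 0, excitatory)
    k_gm     : granule -> mitral weight (< 0, inhibitory)
    stimulus : sequence of input values I(t) fed to the mitral cell
    returns  : list of (m, g) tuples, one per time step
    """
    m = dm = g = dg = 0.0
    trace = []
    for i_t in stimulus:
        drive_m = k_gm * freeman_sigmoid(g, q) + i_t   # inhibition plus input
        drive_g = k_mg * freeman_sigmoid(m, q)         # excitation from mitral cell
        acc_m = a * b * (drive_m - m) - (a + b) * dm
        acc_g = a * b * (drive_g - g) - (a + b) * dg
        m, dm = m + dt * dm, dm + dt * acc_m
        g, dg = g + dt * dg, dg + dt * acc_g
        trace.append((m, g))
    return trace
```

Sweeping the product of the two weights and plotting g against m is one way to look for the transition between a fixed point and a closed orbit; where that transition falls for these assumed constants is not something the sketch asserts.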
Work Completed
1. Developed biologically inspired algorithms:
• Energy redistribution: combines Lombard Effect (how) with psychoacoustic experimental results (where) to increase speech intelligibility.
• Human factor cepstral coefficients: combines existing speech front end (mfcc) with critical bandwidth information (ERB).
2. Published 3 papers, and submitted 3 more, on novel algorithms.
3. Literature survey on novel speech classifiers, and simulations of nonlinear Freeman model.
Work Proposed
1. Compare hfcc to human speech recognition using rhyming test in ASR experiments.
2. Measure effects of ERVU in ASR experiments.
3. Analyze hfcc algorithm, accounting for nonlinear log(·) function.
4. Experiment with other bandwidth functions besides ERB or scaled ERB.
5. Quantify tradeoff between spectral resolution and noise smoothing for hfcc using synthetic data.
Work Proposed, Cont'd
6. Build on the reduced KII network results recently reported by CNEL suggesting the network can operate as a content-addressable memory (CAM).
7. Investigate alternative information storage strategies to CAM, focusing on the inherent time-varying nature of dynamic systems (coupling theory is intriguing).
8. Expand literature search to areas outside speech recognition experiments that use nonlinear dynamic (chaotic) systems for information processing/storage, with emphasis on applications with time-varying signals.
Work Proposed, Cont'd
9. Consider alternative roles for nonlinear dynamics:
embedded extracted features for hfcc/HMM
system, trajectory tracking in the spirit of Deng’s
trended HMM.
10. Demonstrate classification of static vowel patterns
(vowel phonemes) with novel classifier, in presence
of noise.
11. Demonstrate classification of time-varying signals
(isolated English digits, rhyming test corpus), in
noisy environments.