Auditory Scene Analysis and Automatic Speech Recognition in Adverse Conditions
Phil Green
Speech and Hearing Research Group,
Department of Computer Science,
University of Sheffield
With thanks to Martin Cooke, Guy Brown, Jon Barker...
HCSNet December 2005
Overview
• Visual and Auditory Scene Analysis
• 'Glimpsing' in Speech Perception
• Missing Data ASR
• Finding the glimpses
• Current Sheffield Work
  • Dealing with Reverberation
  • Identifying Musical Instruments
  • Multisource Decoding
  • Speech Separation Challenge
Visual Scenes and Auditory Scenes
• Objects are opaque
• Each spatial pixel images a single object
• Object recognition has to cope with occlusion
• Sound is additive
• Each time/frequency pixel receives contributions from many sound sources
• Sound source recognition apparently requires 'glimpsing'

'Glimpsing' in auditory scenes: the dominance effect (Cooke)

Although audio signals combine additively, the occlusion metaphor is a good approximation, due to log-like compression in the auditory system.
Consequently, most regions in a mixture are dominated by one or other source, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB.
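To make the dominance effect concrete, here is a minimal numpy sketch with synthetic data (the 3 dB threshold and lognormal spectra are illustrative): since log(a+b) exceeds max(log a, log b) by at most about 3 dB, the log-spectrum of a mixture is close to the elementwise max of the sources, and most time-frequency pixels are strongly dominated by one source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic sources: random time-frequency energy patterns
# standing in for speech spectrograms (linear power domain).
s1 = rng.lognormal(mean=0.0, sigma=2.0, size=(100, 64))
s2 = rng.lognormal(mean=0.0, sigma=2.0, size=(100, 64))

mix = s1 + s2  # sound is additive in the power domain

# Log compression: the mixture is close to the elementwise max of
# the individual log spectra (the "log-max" approximation).
log_mix = 10 * np.log10(mix)
log_max = 10 * np.log10(np.maximum(s1, s2))
print("mean |log-mix - log-max|: %.2f dB" % np.abs(log_mix - log_max).mean())

# Fraction of time-frequency pixels dominated by one source by at
# least 3 dB: these are the 'glimpses'.
local_snr = 10 * np.log10(s1 / s2)
print("dominated pixels: %.0f%%" % (100 * (np.abs(local_snr) > 3.0).mean()))
```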
Can listeners handle glimpses?
The robustness problem in Automatic Speech Recognition
• Current ASR devices cannot tolerate additive noise, particularly if it's unpredictable
• Listeners' noise tolerance is 1 or 2 orders of magnitude better in equivalent conditions (Lippmann 97)
• Can glimpsing be used as the basis for robust ASR? Requirements:
  • Adapt statistical ASR to the incomplete data case
  • Identify the glimpses
[Figure: spectrograms of clean speech, speech + noise, and the missing data mask (oracle).]
Classification with Missing Data
A common problem: visual occlusion, sensor failure, transmission losses...
We need to evaluate the likelihood f(x|C) that observation vector x was generated by class C.
Assume x has been partitioned into reliable and unreliable parts, (xr, xu).
Two approaches:
• Imputation: estimate xu, then proceed as normal
• Marginalisation: integrate over the possible range of xu
Marginalisation is preferable if there is no need to reconstruct x.
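A minimal sketch of the two options, assuming a single diagonal-Gaussian class model (function names are mine, for illustration):

```python
import numpy as np
from scipy.stats import norm

def marginal_loglik(x, mean, var, reliable):
    """Marginalisation: with a diagonal covariance, integrating out
    the unreliable dimensions just drops them from the product of
    per-dimension densities."""
    r = np.asarray(reliable, dtype=bool)
    return norm.logpdf(x[r], mean[r], np.sqrt(var[r])).sum()

def imputed_loglik(x, mean, var, reliable):
    """Imputation: fill the unreliable dimensions with an estimate
    (here, crudely, the class mean) and evaluate the full density."""
    r = np.asarray(reliable, dtype=bool)
    x_hat = np.where(r, x, mean)
    return norm.logpdf(x_hat, mean, np.sqrt(var)).sum()
```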
The Missing Data Likelihood Computation
In ASR by continuous density HMMs:
• State distributions are Gaussian mixtures with diagonal covariance
• The marginal is just the reduced-dimensionality distribution
• The integral can be approximated by erfs
• This is computed independently for each mixture component in the state distribution
Cooke et al 2001
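A hedged sketch of the per-state computation, assuming diagonal-covariance Gaussian mixtures and a bounded integral from zero up to the observed (masked) energy, evaluated with the Gaussian CDF (i.e. via erf); the function name and bound convention are illustrative:

```python
import numpy as np
from scipy.stats import norm

def md_gmm_loglik(x, weights, means, vars_, reliable):
    """Missing-data log-likelihood of one HMM state modelled by a
    diagonal-covariance Gaussian mixture. Reliable dimensions
    contribute their ordinary marginal density; unreliable ones
    contribute the integral of the density from 0 up to the observed
    energy. x, means[k] and vars_[k] are length-D arrays; 'reliable'
    is a boolean length-D array."""
    r = np.asarray(reliable, dtype=bool)
    comp = np.log(np.asarray(weights, dtype=float))
    for k in range(len(weights)):
        sd = np.sqrt(vars_[k])
        # Reliable dimensions: reduced-dimensionality Gaussian.
        comp[k] += norm.logpdf(x[r], means[k][r], sd[r]).sum()
        # Unreliable dimensions: bounded integral over [0, x].
        p = (norm.cdf(x[~r], means[k][~r], sd[~r])
             - norm.cdf(0.0, means[k][~r], sd[~r]))
        comp[k] += np.log(np.maximum(p, 1e-300)).sum()
    m = comp.max()  # log-sum-exp over mixture components
    return m + np.log(np.exp(comp - m).sum())
```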
Counter-evidence from bounds
[Figure: mean spectrum for class C compared with an observed spectrum x, with reliable and unreliable regions marked along the frequency axis.]
Class C matches the reliable evidence well, but there is insufficient energy in the unreliable components: the bounded integral from zero up to the observed (masked) energy is small where the class expects high energy, so such matches are penalised.
Finding the glimpses
Auditory scene analysis identifies spectral regions dominated by a single source:
• Harmonicity
• Common amplitude modulation
• Sound source location
Local SNR estimates can be used to compensate for predictable noise sources.
Cooke 91
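As an illustration, a local-SNR mask might be computed along these lines; this is a sketch assuming a separate noise estimate is available (e.g. from a noise tracker or speech-free frames), with an illustrative threshold:

```python
import numpy as np

def snr_mask(noisy, noise_est, thresh_db=0.0):
    """Binary missing-data mask from local SNR estimates: a
    time-frequency pixel is 'reliable' when the estimated local SNR
    exceeds the threshold. 'noisy' and 'noise_est' are power
    spectrograms (frames x channels)."""
    local_snr = 10 * np.log10(noisy / np.maximum(noise_est, 1e-12))
    return local_snr > thresh_db
```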
Harmonicity Masks
• Only meaningful in voiced segments
• Can be combined with SNR masks
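A rough sketch of how a soft harmonicity value per frame could be derived from the normalised autocorrelation peak in the pitch range (illustrative, not the exact published recipe):

```python
import numpy as np

def harmonicity_mask(frames, sr, fmin=80.0, fmax=400.0):
    """Soft harmonicity values for one frequency channel of a rate
    map. The normalised autocorrelation peak in the pitch range
    measures how strongly the channel is driven by a single periodic
    source; values near 1 suggest voiced, single-source dominance.
    'frames' is an (n_frames, frame_len) array of filtered samples."""
    lag_lo, lag_hi = int(sr / fmax), int(sr / fmin)
    mask = np.zeros(len(frames))
    for t, f in enumerate(frames):
        f = f - f.mean()
        ac = np.correlate(f, f, mode="full")[len(f) - 1:]
        if ac[0] > 0:
            mask[t] = (ac[lag_lo:lag_hi] / ac[0]).max()
    return np.clip(mask, 0.0, 1.0)
```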
Aurora Results (Sept 2001)
Barker et al 2001
Average gain over clean baseline under all conditions: 65%
Missing data masks from spatial location
Sue Harding, Guy Brown
• Cues for spatial location are used to separate a target source from masking sources
• Interaural Time Difference (ITD) from cross-correlation between left and right binaural signals
• Interaural Level Difference (ILD) from the ratio of energy in the left and right ears
• Soft masks (a sketch of the mask computation follows this list)
• Task:
  • Target source: male speaker straight ahead
  • One or two masking sources (also male speakers) at other positions
  • Added reverberation
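As referenced above, a toy sketch of how ITD and ILD cues could be turned into a soft mask for one frequency channel; the target's expected cues, the tolerances and the Gaussian weighting are all illustrative assumptions, not Harding and Brown's exact method:

```python
import numpy as np

def itd_ild_mask(left, right, sr, target_itd=0.0, target_ild=0.0,
                 itd_tol=1e-4, ild_tol_db=3.0):
    """Soft localisation mask for one frequency channel. 'left' and
    'right' are (n_frames, frame_len) arrays of filtered samples.
    ITD comes from the lag of the cross-correlation peak, ILD from
    the left/right energy ratio; frames whose cues sit close to the
    target's expected cues (straight ahead: ITD = 0, ILD = 0 dB)
    get mask values near 1."""
    n = left.shape[1]
    lags = (np.arange(2 * n - 1) - (n - 1)) / sr
    mask = np.zeros(len(left))
    for t in range(len(left)):
        xc = np.correlate(left[t], right[t], mode="full")
        itd = lags[np.argmax(xc)]
        ild = 10 * np.log10(
            (left[t] ** 2).sum() / max((right[t] ** 2).sum(), 1e-12))
        # Gaussian-shaped closeness scores, combined multiplicatively.
        s_itd = np.exp(-0.5 * ((itd - target_itd) / itd_tol) ** 2)
        s_ild = np.exp(-0.5 * ((ild - target_ild) / ild_tol_db) ** 2)
        mask[t] = s_itd * s_ild
    return mask
```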
[Figure: localisation masks (frequency channel vs time in frames): oracle, ITD only, ILD only, and combined ITD and ILD.]

Missing data masks from spatial location (2)
Best performance is with combined ITD and ILD:
[Figure: % accuracy against azimuth of the masker (5 to 40 degrees) for ITD-only, ILD-only and combined masks.]
MD for reverberant conditions (1)
Palomäki, Brown and Barker have applied MD to the problem of room reverberation:
• Use spectral normalization to deal with distortion caused by early reflections
• Treat late reverberation as additive noise, and apply standard MD techniques
• Select features which are uncontaminated by reverberation and contain strong speech energy
Approach based on modulation filtering (sketched below):
• Each rate map channel is passed through a modulation filter
• Identify periods with enough energy in the filtered output
• Use these to define a mask on the original rate map
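A hedged sketch of the modulation-filtering step; the band edges, filter order and threshold are illustrative placeholders, not the published settings:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def reverb_mask(rate_map, frame_rate, band=(4.0, 16.0), thresh=0.2):
    """Modulation-filtering mask for reverberant speech. Each channel
    of the rate map (frames x channels) is band-pass filtered in the
    modulation domain: speech energy is concentrated around
    syllable-rate modulations of a few Hz, while late reverberation
    is comparatively smooth. Frames where the filtered output is
    strong, relative to each channel's peak, are kept as reliable."""
    b, a = butter(2, [band[0] / (frame_rate / 2),
                      band[1] / (frame_rate / 2)], btype="band")
    filtered = filtfilt(b, a, rate_map, axis=0)
    peak = np.maximum(np.abs(filtered).max(axis=0, keepdims=True), 1e-12)
    return np.abs(filtered) / peak > thresh
```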
MD for reverberant conditions (2)
K. J. Palomäki, G. J. Brown and J. Barker (2004) Speech Communication 43 (1-2), pp. 123-142
• Recognition of connected digits (Aurora 2)
• Reverberated using recorded room impulse responses
• Performance comparable with Brian Kingsbury's hybrid HMM-MLP recognizer
[Figure: recognition accuracy (%) for the HMM-MLP baseline, the MD a priori mask and the MD reverb mask, from clean speech through increasingly reverberant conditions (T60 from 0.7s to 1.5s, source-receiver distances from 1.2m to 18.3m).]
MD for music analysis (1)
• Eggink and Brown have used MD techniques to identify concurrent musical instrument sounds
• Part of a system for transcribing chamber music
• Identify the F0 of the target note, and keep only its harmonics in the MD mask (sketched below)
• Uses a GMM classifier for each instrument, trained on isolated tones and short phrases
• Tested on tones, phrases and commercial CDs
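A minimal sketch of the harmonic-selection step referenced above; the channel geometry, tolerance and harmonic count are illustrative:

```python
import numpy as np

def harmonic_mask(freqs, f0, width=0.03, n_harmonics=20):
    """Given the F0 of a target note, keep only spectral channels
    lying close to one of its harmonics; everything else is treated
    as missing. 'freqs' gives the centre frequency of each channel;
    'width' is a fractional tolerance around each harmonic."""
    harmonics = f0 * np.arange(1, n_harmonics + 1)
    # A channel is reliable if it falls within the tolerance band
    # around any harmonic of the target F0.
    dist = np.abs(freqs[:, None] - harmonics[None, :])
    return (dist < width * harmonics[None, :]).any(axis=1)
```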
MD for music analysis (2)
J. Eggink and G. J. Brown (2003) Proc. ICASSP, Hong Kong, IV, pp. 553-556
J. Eggink and G. J. Brown (2004) Proc. ICASSP, Montreal, V, pp. 217-220
• Example: duet for flute and clarinet
• All instrument tones correctly identified in this example
[Figure: fundamental frequency (Hz) of the identified flute and clarinet tones over time (frames).]
Multisource Decoding
Use primitive ASA and local SNR to identify time-frequency regions (fragments) dominated by a single source, i.e. possible segregations S...
...but NOT to decide what the best segregation is.
Instead, jointly optimise over the word sequence W and the segregation S.
The decoding algorithm finds the best subset of fragments to match the speech source.
Based on missing data techniques: regions hypothesised as non-speech are missing.
Barker, Cooke & Ellis 2003
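In symbols, the joint optimisation could be written as follows; this is a hedged paraphrase in my own notation, not necessarily the exact formulation in Barker, Cooke & Ellis:

```latex
% Joint search over word sequences W and segregations S given the
% observed time-frequency data X; each (W, S) pair is scored with
% the missing-data likelihood, treating the regions S labels as
% non-speech as missing.
(\hat{W}, \hat{S}) \;=\; \operatorname*{arg\,max}_{W,\,S} \; P(W, S \mid X)
```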
Multisource decoding algorithm
Work forward in time, maintaining a set of alternative decodings: Viterbi searches based on a choice of speech fragments.
When a new fragment arrives, split the decodings: is the fragment speech or non-speech?
When a fragment ends, merge decoders which differ only in its interpretation.
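A toy skeleton of the split/merge bookkeeping, with a stand-in frame score in place of the real Viterbi search; all names and the event representation are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One partial decoding: the set of currently active fragments
    labelled as speech, plus an accumulated score (a stand-in for
    the Viterbi path score of the best word sequence)."""
    speech_frags: frozenset
    score: float

def decode(events, frame_score):
    """'events' is a time-ordered list of ('start', frag),
    ('end', frag) or ('frame', t) tuples; frame_score(t, speech_frags)
    is a user-supplied missing-data frame score."""
    hyps = [Hypothesis(frozenset(), 0.0)]
    for ev, arg in events:
        if ev == "start":
            # Split: each decoding branches on whether the new
            # fragment is speech or non-speech.
            hyps = ([Hypothesis(h.speech_frags | {arg}, h.score) for h in hyps]
                    + [Hypothesis(h.speech_frags, h.score) for h in hyps])
        elif ev == "end":
            # Merge: decodings that differ only in the finished
            # fragment's interpretation collapse to the best one.
            best = {}
            for h in hyps:
                k = h.speech_frags - {arg}
                if k not in best or h.score > best[k].score:
                    best[k] = Hypothesis(k, h.score)
            hyps = list(best.values())
        else:  # 'frame': advance every live decoding by one frame
            hyps = [Hypothesis(h.speech_frags,
                               h.score + frame_score(arg, h.speech_frags))
                    for h in hyps]
    return max(hyps, key=lambda h: h.score)
```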
Multisource Decoding on Aurora
Multisource decoding with a competing speaker
Andre Coy and Jon Barker
• Utterances of male and female speakers mixed at 0 dB
• Voiced regions: soft harmonicity masks from autocorrelation peaks
• Voiceless regions: fragments from 'image processing'
• Gender-dependent HMMs
• Separate decoding for male and female speakers
• 73.7% accuracy on a connected digit task
Informing Multisource Decoding – Work in progress
Ning Ma, Andre Coy, Phil Green
• HMM Duration constraints
• Links between fragments – pitch continuity
• ‘Speechiness’
Speech separation challenge
Organisers: Martin Cooke (University of Sheffield, UK), Te-Won Lee (UCSD, USA)
• See http://www.dcs.shef.ac.uk/~martin
• Global comparison of techniques for separating and recognising speech
• Special session of Interspeech 2006 in Pittsburgh (USA), 17-21 September 2006
• Task: recognise speech from a target talker in the presence of either stationary noise or other speech
• Training and test data supplied
• One signal per mixture (i.e. the task is "single microphone")
• Speech material: simple sentences from the 'Grid Task', e.g. "place white at L 3 now"