An Auditory Scene Analysis Approach to
Speech Segregation
DeLiang Wang
Perception and Neurodynamics Lab
The Ohio State University
Outline of presentation
• Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
• Voiced speech segregation based on pitch tracking and amplitude modulation analysis
  • Ideal binary mask as CASA goal
• Unvoiced speech segregation
  • Auditory segmentation
• Neurobiological basis of ASA
Real-world audition
• What?
  • Source type: speech (message; speaker: age, gender, linguistic origin, mood, …), music, a car passing by
• Where?
  • Left, right, up, down
  • How close?
• Channel characteristics
• Environment characteristics
  • Room configuration
  • Ambient noise
Humans versus machines
• Car noise is not a very effective speech masker (examples at 10 dB and at 0 dB SNR)
• Human word error rate at 0 dB SNR is around 1%, as opposed to 100% for unmodified recognisers (around 40% with noise adaptation)
Source: Lippmann (1997)
Speech segregation problem
• In a natural environment, speech is usually corrupted by acoustic interference. Speech segregation is critical for many applications, such as automatic speech recognition and hearing prosthesis
• Most speech separation techniques, e.g. beamforming and blind source separation via independent component analysis, require multiple sensors. However, such techniques have clear limits
  • Suffer from configuration stationarity
  • Can’t deal with single-microphone mixtures or situations where multiple sounds arrive from close directions
• Most speech enhancement methods developed for the monaural situation can deal with only stationary acoustic interference
Auditory scene analysis (Bregman’90)
• Listeners are able to parse the complex mixture of sounds arriving at the ears in order to retrieve a mental representation of each sound source
  • Ball-room problem, Helmholtz, 1863 (“complicated beyond conception”)
  • Cocktail-party problem, Cherry’53
• Two conceptual processes of auditory scene analysis (ASA):
  • Segmentation. Decompose the acoustic mixture into sensory elements (segments)
  • Grouping. Combine segments into groups, so that segments in the same group are likely to have originated from the same environmental source
Computational auditory scene analysis
• Computational ASA (CASA) systems approach sound separation based on ASA principles
  • Weintraub’85, Cooke’93, Brown & Cooke’94, Ellis’96, Wang & Brown’99
• CASA progress: monaural segregation with minimal assumptions
• CASA challenges
  • Broadband high-frequency mixtures
  • Reliable pitch tracking of noisy speech
  • Unvoiced speech
Outline of presentation
• Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
• Voiced speech segregation based on pitch tracking and amplitude modulation analysis
  • Ideal binary mask as CASA goal
• Unvoiced speech segregation
  • Auditory segmentation
• Neurobiological basis of ASA
Resolved and unresolved harmonics
• For voiced speech, lower harmonics are resolved while higher harmonics are not
• For unresolved harmonics, the envelopes of filter responses fluctuate at the fundamental frequency of speech
• Our model (Hu & Wang’04) applies different grouping mechanisms for low-frequency and high-frequency signals:
  • Low-frequency signals are grouped based on periodicity and temporal continuity
  • High-frequency signals are grouped based on amplitude modulation (AM) and temporal continuity
Diagram of the Hu-Wang model
Mixture → Peripheral and mid-level processing → Initial segregation → Pitch tracking → Unit labeling → Final segregation → Resynthesis → Segregated speech
Cochleogram: Auditory peripheral model
Spectrogram
• Plot of log energy across time and frequency (linear frequency scale)
Cochleogram
• Cochlear filtering by the gammatone filterbank (or other models of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, using either a hair cell model or simple compression operations (log and cube root)
• Quasi-logarithmic frequency scale, and filter bandwidth is frequency-dependent
• Previous work suggests better resilience to noise than the spectrogram
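As an illustration, here is a minimal cochleogram sketch in Python, not the authors' implementation: a 4th-order gammatone filterbank with ERB-spaced center frequencies, half-wave rectification, and cube-root compression of frame energies. The channel count, frame size, and compression choice are assumptions.

import numpy as np

def erb(fc):
    # Equivalent rectangular bandwidth (Glasberg & Moore), in Hz
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.128, order=4, b=1.019):
    # Impulse response of one gammatone channel centered at fc
    t = np.arange(int(duration * fs)) / fs
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def cochleogram(x, fs, n_channels=64, fmin=80.0, fmax=5000.0, frame=0.020, hop=0.010):
    # Center frequencies spaced uniformly on the ERB-rate scale (quasi-logarithmic)
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    cfs = (10.0 ** (np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels) / 21.4) - 1.0) * 1000.0 / 4.37
    flen, fhop = int(frame * fs), int(hop * fs)
    n_frames = 1 + (len(x) - flen) // fhop
    cg = np.zeros((n_channels, n_frames))
    for c, fc in enumerate(cfs):
        y = np.convolve(x, gammatone_ir(fc, fs), mode="same")
        y = np.maximum(y, 0.0)                    # half-wave rectification (hair cell stage)
        for m in range(n_frames):
            seg = y[m * fhop : m * fhop + flen]
            cg[c, m] = np.cbrt(np.sum(seg ** 2))  # cube-root compression of frame energy
    return cfs, cg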
Mid-level auditory representations
• Mid-level representations form the basis for segment formation and subsequent grouping
• Correlogram extracts periodicity and AM from simulated auditory nerve firing patterns
• Summary correlogram is used to identify global pitch
• Cross-channel correlation between adjacent correlogram channels identifies regions that are excited by the same harmonic or formant
Correlogram
• Short-term autocorrelation of the output of each frequency channel of the cochleogram
• Peaks in the summary correlogram indicate pitch periods (F0)
• A standard model of pitch perception
(Figure: correlogram and summary correlogram of a double vowel, showing the F0s)
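A minimal correlogram sketch under assumed parameters (window, lag range), not the authors' code: the short-term autocorrelation of each channel at one time frame, plus the summary correlogram whose dominant peak indicates the global pitch period.

import numpy as np

def correlogram(responses, fs, t_center, win=0.020, max_lag=0.0125):
    # responses: array (n_channels, n_samples) of gammatone filter outputs
    n_win, n_lag = int(win * fs), int(max_lag * fs)
    start = int(t_center * fs)
    acg = np.zeros((responses.shape[0], n_lag))
    for c in range(responses.shape[0]):
        seg = responses[c, start:start + n_win]
        for lag in range(n_lag):
            acg[c, lag] = np.dot(seg[:n_win - lag], seg[lag:])  # short-term autocorrelation
    summary = acg.sum(axis=0)                                   # summary correlogram (pooled over channels)
    lag_min = int(0.002 * fs)                                    # ignore lags below ~2 ms (F0 above ~500 Hz)
    pitch_period = (lag_min + np.argmax(summary[lag_min:])) / fs
    return acg, summary, pitch_period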
Cross-channel correlation
(a) Correlogram and cross-channel correlation of hair cell response to clean
speech
(b) Corresponding representations for response envelopes
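A sketch of the cross-channel correlation computation; the normalization used here (zero-mean, unit-norm autocorrelation functions) is an assumption. High values mark adjacent channels excited by the same harmonic or formant.

import numpy as np

def cross_channel_correlation(acg):
    # acg: correlogram (n_channels, n_lags) for one time frame
    a = acg - acg.mean(axis=1, keepdims=True)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    return np.sum(a[:-1] * a[1:], axis=1)   # correlation between each channel and its upper neighbor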
Initial segregation
• Segments are formed based on temporal continuity and cross-channel correlation
• Segments generated in this stage tend to reflect resolved harmonics, but not unresolved ones
• Initial grouping into a foreground (target) stream and a background stream according to global pitch, using the oscillatory correlation model of Wang and Brown (1999)
Pitch tracking
• Pitch periods of target speech are estimated from the segregated speech stream
• Estimated pitch periods are checked and re-estimated using two psychoacoustically motivated constraints:
  • Target pitch should agree with the periodicity of the time-frequency units in the initial speech stream
  • Pitch periods change smoothly, thus allowing for verification and interpolation
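A sketch of the second constraint only (smoothness check with interpolation of rejected frames); the tolerance is an assumed value, and the first constraint (agreement with unit periodicity) would be checked against the labeled T-F units.

import numpy as np

def smooth_pitch_track(periods, max_rel_change=0.2):
    # periods: per-frame pitch period estimates in seconds (0 for unvoiced frames)
    p = np.asarray(periods, dtype=float)
    reliable = np.zeros(len(p), dtype=bool)
    for m in range(1, len(p)):
        if p[m] > 0 and p[m - 1] > 0:
            # accept a frame only if its period changes slowly from the previous frame
            reliable[m] = abs(p[m] - p[m - 1]) / p[m - 1] < max_rel_change
    out = p.copy()
    good = np.flatnonzero(reliable)
    bad = np.flatnonzero((p > 0) & ~reliable)
    if len(good) >= 2 and len(bad) > 0:
        # re-estimate rejected voiced frames by interpolation from reliable neighbors
        out[bad] = np.interp(bad, good, p[good])
    return out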
Pitch tracking example
(a) Global pitch (Line: pitch track of clean speech) for a
mixture of target speech and ‘cocktail-party’ intrusion
(b) Estimated target pitch
T-F unit labeling
• In the low-frequency range:
  • A time-frequency (T-F) unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch
• In the high-frequency range:
  • Due to their wide bandwidths, high-frequency filters respond to multiple harmonics. These responses are amplitude modulated due to beats and combination tones (Helmholtz, 1863)
  • A T-F unit in the high-frequency range is labeled by comparing its AM repetition rate with the estimated target pitch
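A sketch of the low-frequency labeling rule: a unit is assigned to the target if its autocorrelation at the estimated target pitch period is close to the unit's own maximum over plausible pitch lags. The threshold and lag range are assumptions, not the authors' exact values.

import numpy as np

def label_low_freq_unit(acf, fs, target_period, theta=0.85, min_period=0.002, max_period=0.0125):
    # acf: autocorrelation of one low-frequency T-F unit, indexed by lag in samples
    lag = int(round(target_period * fs))
    lo, hi = int(min_period * fs), int(max_period * fs)
    peak = np.max(acf[lo:hi])                    # strongest periodicity in the plausible pitch range
    return peak > 0 and acf[lag] / peak > theta  # True: unit labeled as target-dominated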
AM example
(a) The output of a gammatone filter (center frequency: 2.6 kHz) in
response to clean speech
(b) The corresponding autocorrelation function
AM repetition rates
• To obtain AM repetition rates, a filter response is half-wave rectified and bandpass filtered
• The resulting signal within a T-F unit is modeled by a single sinusoid using the gradient descent method. The frequency of the sinusoid indicates the AM repetition rate of the corresponding response
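A sketch of fitting a single sinusoid to the processed signal in one T-F unit by gradient descent on the mean squared error. The parameterization (amplitude, frequency, phase), the initialization (e.g. from the estimated target F0), and the learning rates are assumptions.

import numpy as np

def fit_am_sinusoid(env, fs, f_init, n_iter=500, lr_a=0.01, lr_f=0.1, lr_phi=0.01):
    # env: half-wave rectified, bandpass-filtered response within one T-F unit
    t = np.arange(len(env)) / fs
    a, f, phi = np.std(env) * np.sqrt(2.0), float(f_init), 0.0
    for _ in range(n_iter):
        arg = 2.0 * np.pi * f * t + phi
        err = a * np.cos(arg) - env                                 # residual of the single-sinusoid model
        # gradient steps on the mean squared error with respect to a, phi, f
        a   -= lr_a   * np.mean(err * np.cos(arg))
        phi -= lr_phi * np.mean(err * (-a) * np.sin(arg))
        f   -= lr_f   * np.mean(err * (-a) * 2.0 * np.pi * t * np.sin(arg))
    return f                                                        # estimated AM repetition rate in Hz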
Final segregation
• New segments corresponding to unresolved harmonics are formed based on temporal continuity and cross-channel correlation of response envelopes (i.e. common AM). They are then grouped into the foreground stream according to AM repetition rates
• Other units are grouped according to temporal and spectral continuity
Ideal binary mask for performance evaluation
• Within a T-F unit, the ideal binary mask is 1 if target energy is stronger than interference energy, and 0 otherwise
• Motivation: auditory masking - a stronger signal masks a weaker one within a critical band
• We have suggested using ideal binary masks as ground truth for CASA performance evaluation
  • Consistent with recent speech intelligibility results (Roman et al.’03; Brungart et al.’05)
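The definition on this slide translates directly into code. A minimal sketch, assuming the per-unit energy maps of the premixed target and interference (e.g. cochleogram frames) are available as inputs:

import numpy as np

def ideal_binary_mask(target_energy, interference_energy):
    # both inputs: arrays (n_channels, n_frames) of per-unit energy from the premixed signals
    return (target_energy > interference_energy).astype(np.uint8)   # 1 where target dominates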
Ideal binary mask illustration
Voiced speech segregation example
Systematic SNR results
(Figure: output SNR in dB for intrusions N0–N9, comparing the unprocessed mixture, spectral subtraction, the Wang-Brown model, and the Hu-Wang model)
• Evaluation on a corpus of 100 mixtures (Cooke, 1993): 10 voiced utterances x 10 noise intrusions (see next slide)
• Average SNR gain: 12.3 dB; 5.2 dB better than the Wang-Brown model (1999), and 6.4 dB better than the spectral subtraction method
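For reference, a sketch of the SNR metric implied here; which reference signal the authors resynthesize against is not stated on the slide, so treating the clean target waveform as the reference is an assumption.

import numpy as np

def snr_db(reference, estimate):
    # SNR of an estimated signal against a reference signal, in dB
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))

# SNR gain = SNR of the resynthesized (segregated) speech minus SNR of the unprocessed mixture:
# gain = snr_db(target, segregated) - snr_db(target, mixture)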
CASA progress on voiced speech segregation
• 100 mixture set used by Cooke (1993)
• 10 voiced utterances mixed with 10 noise intrusions (N0: tone, N1: white noise, N2: noise bursts, N3: ‘cocktail party’, N4: rock music, N5: siren, N6: telephone, N7: female utterance, N8: male utterance, N9: female utterance)
(Audio demos: original mixtures of voiced speech with the telephone, male, and female intrusions, and the corresponding outputs of Cooke (1993), Ellis (1996), Wang & Brown (1999), and Hu & Wang (2004))
Outline of presentation
• Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
• Voiced speech segregation based on pitch tracking and amplitude modulation analysis
  • Ideal binary mask as CASA goal
• Unvoiced speech segregation
  • Auditory segmentation
• Neurobiological basis of ASA
Segmentation and unvoiced speech segregation
• To deal with unvoiced speech segregation, we (Hu & Wang’04) proposed a model of auditory segmentation that applies to both voiced and unvoiced speech
• The task of segmentation is to decompose an auditory scene into contiguous T-F regions, each of which should contain signal from the same sound source
  • The definition of segmentation does not distinguish between voiced and unvoiced sounds
• This is equivalent to identifying onsets and offsets of individual T-F regions, which generally correspond to sudden changes of acoustic energy
• The segmentation strategy is based on onset and offset analysis
Scale-space analysis for auditory segmentation
• From a computational standpoint, auditory segmentation is similar to image (visual) segmentation
  • Visual segmentation: finding bounding contours of visual objects
  • Auditory segmentation: finding onset and offset fronts of segments
• Onset/offset analysis employs scale-space theory, a multiscale analysis commonly used in image segmentation:
  • Smoothing
  • Onset/offset detection and onset/offset front matching
  • Multiscale integration
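A sketch of the smoothing and onset/offset detection steps at a single scale; the Gaussian width and the peak threshold are assumed values, and front matching and multiscale integration are not shown. The per-channel intensity is smoothed over time, and peaks (valleys) of its time derivative are taken as onset (offset) candidates.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def onsets_offsets(intensity, sigma=3.0, threshold=0.05):
    # intensity: array (n_channels, n_frames) of log energy
    smoothed = gaussian_filter1d(intensity, sigma=sigma, axis=1)           # scale-space smoothing over time
    d = np.diff(smoothed, axis=1)                                          # time derivative of the smoothed envelope
    mid = d[:, 1:-1]
    onsets  = (mid > d[:, :-2]) & (mid > d[:, 2:]) & (mid >  threshold)    # local maxima of the derivative
    offsets = (mid < d[:, :-2]) & (mid < d[:, 2:]) & (mid < -threshold)    # local minima of the derivative
    return onsets, offsets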
Example of auditory segmentation
(Figure: segments shown in the time-frequency plane; frequency axis 50–8000 Hz, time axis 0–2.5 s)
Speech segregation
• The general strategy for speech segregation is to first
segregate voiced speech using the pitch cue, and then
deal with unvoiced speech
• To segregate unvoiced speech, we perform auditory
segmentation, and then group segments that correspond
to unvoiced speech
Segment classification
• For nonspeech interference, grouping is in fact a classification task: to classify segments as either speech or non-speech
• The following features are used for classification:
  • Spectral envelope
  • Segment duration
  • Segment intensity
• Training data
  • Speech: training part of the TIMIT database
  • Interference: 90 natural intrusions including street noise, crowd noise, wind, etc.
• A Gaussian mixture model is trained for each phoneme, and for interference as well, which provides the basis for a likelihood ratio test
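A sketch of how such a likelihood ratio test might look; the feature extraction, GMM sizes, and decision threshold here are illustrative assumptions rather than the authors' settings.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(phoneme_features, interference_features, n_components=8):
    # phoneme_features: dict mapping phoneme label -> array (n_samples, n_dims) of features
    phoneme_gmms = {ph: GaussianMixture(n_components).fit(X) for ph, X in phoneme_features.items()}
    noise_gmm = GaussianMixture(n_components).fit(interference_features)
    return phoneme_gmms, noise_gmm

def is_speech(segment_features, phoneme_gmms, noise_gmm, threshold=0.0):
    # Classify a segment as speech if the best phoneme model beats the
    # interference model in average log-likelihood by more than the threshold
    speech_ll = max(g.score(segment_features) for g in phoneme_gmms.values())
    noise_ll = noise_gmm.score(segment_features)
    return speech_ll - noise_ll > threshold     # log likelihood ratio test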
Example of segregating fricatives/affricates
Utterance: “That noise problem grows more annoying each day”
Interference: Crowd noise with music (IBM: Ideal binary mask)
Example of segregating stops
Utterance: “A good morrow to you, my boy”
Interference: Rain
Outline of presentation
• Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
• Voiced speech segregation based on pitch tracking and amplitude modulation analysis
  • Ideal binary mask as CASA goal
• Unvoiced speech segregation
  • Auditory segmentation
• Neurobiological basis of ASA
How does the auditory system perform ASA?
• Information about acoustic features (pitch, spectral shape, interaural differences, AM, FM) is extracted in distributed areas of the auditory system
• Binding problem: how are these features combined to form a perceptual whole (stream)?
  • Hierarchies of feature-detecting cells exist, but do not seem to constitute a solution to the binding problem
Oscillatory correlation theory for ASA
• Neural oscillators are used to represent auditory features
• Oscillators representing features of the same source are synchronized, and are desynchronized from those representing different sources
• Originally proposed by von der Malsburg & Schneider (1986), and further developed by Wang (1996)
• Supported by growing experimental evidence
Oscillatory correlation representation
(Figure legend: FD = feature detector)
Oscillatory correlation for ASA
• LEGION dynamics (Terman & Wang’95) provides a computational foundation for the oscillatory correlation theory
• The utility of oscillatory correlation has been demonstrated for speech segregation (Wang-Brown’99), modeling auditory attention (Wrigley-Brown’04), etc.
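As a rough illustration of the underlying dynamics, below is a minimal sketch of a single Terman-Wang relaxation oscillator, the building block of LEGION; the parameter values are typical choices from the literature, not necessarily those used in the cited systems. In a full LEGION network, local excitatory coupling and a global inhibitor synchronize oscillators within a segment and desynchronize different segments.

import numpy as np

def terman_wang_oscillator(I=0.8, eps=0.02, gamma=6.0, beta=0.1, dt=0.01, steps=20000):
    x, y = -1.0, 0.0
    trace = np.empty(steps)
    for k in range(steps):
        dx = 3.0 * x - x ** 3 + 2.0 - y + I                    # fast excitatory variable
        dy = eps * (gamma * (1.0 + np.tanh(x / beta)) - y)     # slow recovery variable
        x, y = x + dt * dx, y + dt * dy                        # forward Euler integration
        trace[k] = x
    return trace   # x(t): cycles between active and silent phases when the input I is positive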
Summary
• CASA approach to monaural speech segregation
• Performs substantially better than previous CASA systems for voiced speech segregation
  • AM cue and target pitch tracking are important for the performance improvement
• Early steps for unvoiced speech segregation
  • Auditory segmentation based on onset/offset analysis
  • Segregation using speech classification
• Oscillatory correlation theory for ASA
Acknowledgment
• Joint work with Guoning Hu
• Funded by AFOSR/AFRL and NSF