The Weft: Auditory Scene Analysis of periodic sounds
Dan Ellis • International Computer Science Institute, Berkeley CA • <dpwe@icsi.berkeley.edu>
For ICASSP'97
[Figure: block diagram of a mask-based CASA system: sound → cochlea model → common-period objects (with onset/offset cues) → grouping algorithm → time-frequency mask → resynthesis.]
[Figure: block diagram of the prediction-driven architecture: sound → front-end analysis (peripheral channels) → observed features → reconciliation against the predicted features of a blackboard world model (abstraction hierarchy, predict & combine); the prediction error drives actions, and the sound elements support resynthesis.]
• The prediction-driven approach [Ellis96] looks for consistency between the input and the aggregate predictions of internal models, which can reflect high-level context (a toy reconciliation loop is sketched below).
• Mid-level representations form a critical bridge between observable features and abstract structure. Wefts fill a part of this role.
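As an illustration only, here is a toy Python sketch of such a reconciliation loop; the WeftModel class, its constant-level prediction and its update rule are invented for this example and are not the architecture of [Ellis96].

import numpy as np

class WeftModel:
    # Toy internal model: predicts a constant energy in every channel.
    def __init__(self, level):
        self.level = level
    def predict(self, n_channels):
        return np.full(n_channels, self.level)
    def update(self, error_share):
        self.level += 0.5 * float(error_share.mean())

def reconcile(models, observed):
    # Compare the aggregate prediction of all models with the input;
    # the prediction error drives updates to the internal models.
    predicted = sum(m.predict(len(observed)) for m in models)
    error = observed - predicted
    for m in models:
        m.update(error / len(models))
    return error

# Two hypothesized sources jointly explaining a flat observed spectrum.
models = [WeftModel(0.3), WeftModel(0.2)]
observed = np.full(8, 1.0)
for _ in range(50):
    error = reconcile(models, observed)
print(round(float(np.abs(error).mean()), 4))   # residual shrinks towards 0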
Preprocessing
[Figure: sound → cochlea filterbank → short-time autocorrelation of each frequency channel (the correlogram: frequency × lag × time); the channels are normalized and added to give the periodogram (lag vs. time).]
• The correlogram represents sound as a three-dimensional function of time, frequency and autocorrelation lag.
• In a time slice of the correlogram, each cell is the short-time autocorrelation of the envelope of one of the cochlea filterbank frequency channels.
• The periodogram summarizes the ‘prominence’ of each periodicity: it is the sum across all frequency channels of the normalized autocorrelations (lag vs. time).
• We use log-time (equivalently, log-frequency) sampling on the lag axis to match human pitch acuity (see the sketch after this list).
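The following Python sketch computes a rough correlogram and periodogram. It is a simplified stand-in under stated assumptions: second-order Butterworth bandpass filters and Hilbert envelopes replace the approximately constant-Q cochlea model of [SlaneyL93], and the lag axis is left linear rather than log-sampled.

import numpy as np
from scipy.signal import butter, lfilter, hilbert

def correlogram(x, sr, center_freqs, frame=0.025, hop=0.010, max_lag=0.0125):
    # Per-channel envelope short-time autocorrelation: (freq, time, lag).
    n_frame, n_hop, n_lag = int(frame * sr), int(hop * sr), int(max_lag * sr)
    starts = range(0, len(x) - n_frame, n_hop)
    C = np.zeros((len(center_freqs), len(starts), n_lag))
    for ci, fc in enumerate(center_freqs):
        lo, hi = 0.8 * fc / (sr / 2), min(1.25 * fc / (sr / 2), 0.99)
        b, a = butter(2, [lo, hi], btype='band')   # crude cochlea channel
        env = np.abs(hilbert(lfilter(b, a, x)))    # channel envelope
        for ti, s0 in enumerate(starts):
            seg = env[s0:s0 + n_frame]
            ac = np.correlate(seg, seg, 'full')    # autocorrelation
            C[ci, ti] = ac[n_frame - 1:n_frame - 1 + n_lag] / n_frame
    return C

def periodogram(C):
    # Normalize each channel by its zero-lag value, then sum over channels.
    return (C / (C[:, :, :1] + 1e-12)).sum(axis=0)   # (time, lag)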
Results and conclusions
• Extracting the spectrum of a ‘flat’ periodic signal, with and without added noise, verifies the basic algorithm.
• A mixture of two periodic signals does not completely separate their spectra; moreover, only one signal is extracted when the periods collide.
• Male and female voices result in multiple wefts for each region of voicing. Octave collisions are successfully resolved.
• Wefts form a compact and plausible representation of periodic sounds as part of the vocabulary of a computational auditory scene analysis system.
Analysis
• A prominent peak in the periodogram triggers the creation of a new weft, which will track that ‘pitch’ through subsequent time frames. Autocorrelation aliases of tracked periods are suppressed (see the peak-picking sketch after this list).
• For a channel excited only by a single periodicity, the autocorrelation maximum at that period is the average squared envelope.
• Noise adds in the power domain, but has a nonlinear effect on the autocorrelation.
• The ratio of peak-to-average autocorrelation varies from 1 for aperiodic input to a maximum specific to each channel/periodicity pair. We can work backwards from this robust feature to eliminate the effect of noise on the autocorrelation peak.
• Tracking wefts through time gives a temporal context that can be used for interpolation through masking.
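A minimal sketch of the peak-picking step with alias suppression; the relative threshold and tolerance are invented for illustration and do not come from the actual system.

import numpy as np

def pick_weft_periods(summary_row, lags, rel_thresh=0.7, tol=0.05):
    # Consider the strongest periodogram peaks first; suppress peaks whose
    # lag is an integer multiple (autocorrelation alias) of a kept period.
    periods = []
    for i in np.argsort(summary_row)[::-1]:
        if summary_row[i] < rel_thresh * summary_row.max():
            break
        p = float(lags[i])
        ratios = [p / q for q in periods]
        if any(round(r) >= 2 and abs(r - round(r)) < tol for r in ratios):
            continue                       # alias of a tracked period
        periods.append(p)
    return periods

# Toy usage: a peak at lag 50 and its octave alias at lag 100.
row = np.zeros(128)
row[50], row[100] = 1.0, 0.9
print(pick_weft_periods(row, np.arange(128)))   # -> [50.0]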
[Figure: idealized channel envelope, a pulse train of level L, duty cycle d and period T, and its autocorrelation, with triangular peaks of base width 2d·T at lags 0, T, 2T, ….]
For this envelope, the average power and the autocorrelation peak at lag T are
P = d·L², A = (d·L)², so d = A/P.
With noise (level M during the pulse, N between pulses):
P' = d·M² + (1-d)·N², A' = (d·M + (1-d)·N)².
Observed P' and A', together with d, can be inverted to give M, N and hence L.
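The sketch below is a direct transcription of these relations (an illustration, not the actual weft code); the final line additionally assumes, per the Analysis note that noise adds in the power domain, that the clean level is L = sqrt(M² - N²).

import math

def invert_noise_model(P, A, d):
    # P = d*M^2 + (1-d)*N^2 and sqrt(A) = d*M + (1-d)*N reduce to a
    # quadratic in N; take the root with N <= sqrt(A) (the noise floor).
    s = math.sqrt(A)
    N = s - math.sqrt(max(d * (P - s * s) / (1 - d), 0.0))
    M = (s - (1 - d) * N) / d              # envelope level during the pulse
    L = math.sqrt(max(M * M - N * N, 0.0)) # clean signal level
    return M, N, L

# Check on a noiseless pulse train with L = 2, d = 0.25:
d, L = 0.25, 2.0
print(invert_noise_model(d * L**2, (d * L)**2, d))   # -> (2.0, 0.0, 2.0)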
[Figure: results. Panels ‘v1 − alone’ and ‘v2 − alone’ show the spectra (f/Hz vs. time/s, level in dB) of two periodic voices in isolation; ‘v1 − from mix’ and ‘v2 − from mix’ show the corresponding weft spectra recovered from their mixture. Panels ‘Voices’, ‘Weft1’ and ‘Wefts2−5’ show the male/female voice example with the period tracks and smooth spectra of the extracted wefts.]
Why not harmonics?
Weft extraction is complicated. Why not use a fixed narrow-bandwidth analysis to track individual harmonics, as has proven successful in the past?
• The ear does not resolve harmonics above the first few; presumably, this bias towards finer time resolution has some evolutionary benefit.
• Tracking the upper spectrum of periodic signals can be difficult, and leads to bulky representations.
• Finding periodicity within broad channels gives different detectability than decisions made on individual harmonics.
Resynthesis
• A mid-level representation should be perceptually adequate, implying sufficient detail for independent resynthesis, an important goal.
• A weft is defined by its period track (the time-varying excitation period) and its smooth spectrum (the energy in each frequency channel for each time frame).
• Resynthesis is through a simple source-filter process: the period track drives an impulse-train generator, the filterbank outputs are weighted by channel gains derived from the smooth spectrum, and the channels are summed.
• Overlap of the adjacent frequency bands is precompensated through non-negative least squares (NNLS) inversion (see the sketch after this list).
[Figure: resynthesis block diagram: period track → impulse train generator → filterbank → channel gains (NNLS compensation from the smooth spectrum) → sum channels → resynthesized weft.]
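A minimal sketch of the NNLS step using scipy.optimize.nnls: find non-negative channel gains g with overlap·g ≈ target smooth spectrum. The 3-channel overlap matrix is a made-up example; in practice it would be measured from the filterbank.

import numpy as np
from scipy.optimize import nnls

# overlap[i, j]: energy channel i picks up from unit gain in channel j.
overlap = np.array([[1.0, 0.3, 0.0],
                    [0.3, 1.0, 0.3],
                    [0.0, 0.3, 1.0]])
target = np.array([0.5, 1.0, 0.2])       # desired smooth spectrum

gains, residual = nnls(overlap, target)  # non-negative least squares
print(gains)                             # precompensated channel gains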
[Figure: recovered weft spectrum, noiseless (solid) vs. 0 dB SNR (dashed), level in dB.]
Computational Auditory Scene Analysis (CASA)
... is computer modeling of listeners’ ability to organize sound mixtures according to their independent sources.
• A popular approach is to calculate a time-frequency ‘mask’ based on periodicity and common-onset cues, and use it to filter out the energy of a single source [Brown92] (a minimal illustration follows this list).
• Listeners work at a more abstract level, able to interpret the energy in a single channel as several sources.
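For concreteness, one way such a periodicity mask might be computed from a correlogram slice; the normalization, threshold and per-channel binary gating are illustrative assumptions, not the actual system of [Brown92].

import numpy as np

def periodicity_mask(corr_slice, period_lag, threshold=0.8):
    # corr_slice: (channels, lags) autocorrelations for one time frame.
    # Keep channels whose normalized autocorrelation at the target lag is
    # strong, i.e. channels dominated by the target periodicity.
    strength = corr_slice[:, period_lag] / (corr_slice[:, 0] + 1e-12)
    return (strength > threshold).astype(float)   # 0/1 gain per channel

Multiplying each channel by its 0/1 gain before resynthesis keeps only the energy attributed to that source, which is exactly the all-or-nothing limitation noted in the next bullet.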
Motivation: Auditory separation of multiple pitches
• The ‘pitch cue’ (source periodicity) is central to our ability to attend to individual sounds in a mixture [AssmS90]. The signal processing behind this ability is not well understood.
• Psychoacoustic pitch phenomena (the missing fundamental etc.) are well modeled by autocorrelation of unresolved harmonics in broadening frequency channels [MeddH91].
• As part of a computational auditory scene analysis system, we would like to recover the energy in different frequency bands due to each periodicity present.
• The auditory system is very successful at this task. Hoping to uncover its secrets, we base our signal separation on the correlogram [SlaneyL93], an auditory model including approximately-constant-Q filtering and autocorrelation.
• The Weft is a discrete element describing individual periodic sources, together with an algorithm for extracting them from the correlogram representation.
References
[AssmS90] Assmann, P. F., Summerfield, Q. (1990). “Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies,” J. Acoust. Soc. Am. 88(2).
[Brown92] Brown, G. J. (1992). “Computational auditory scene analysis: a representational approach,” Ph.D. thesis, Sheffield University.
[Ellis96] Ellis, D. P. W. (1996). “Prediction-driven computational auditory scene analysis,” Ph.D. dissertation, M.I.T.
[MeddH91] Meddis, R., Hewitt, M. J. (1991). “Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification,” J. Acoust. Soc. Am. 89(6).
[SlaneyL93] Slaney, M., Lyon, R. F. (1993). “On the importance of time - a temporal representation of sound,” in Visual Representations of Speech Signals, ed. M. Cooke et al., Wiley.