The Weft: Auditory Scene Analysis of periodic sounds
Dan Ellis • International Computer Science Institute, Berkeley CA • <dpwe@icsi.berkeley.edu>
for ICASSP'97 • 1997apr15

Computational Auditory Scene Analysis (CASA)
• ... is computer modeling of listeners' ability to organize sound mixtures according to their independent sources.
• A popular approach is to calculate a time-frequency 'mask' based on periodicity and common-onset cues, and use it to filter out the energy of a single source [Brown92].
• Listeners work at a more abstract level, and can interpret the energy in a single channel as several sources.
• The prediction-driven approach [Ellis96] looks for consistency between the input and the aggregate predictions of internal models, which can reflect high-level context.
• Mid-level representations form a critical bridge between observable features and abstract structure. Wefts fill part of this role.
[Figures: data-driven CASA block diagram (sound → cochlea model → common-period objects, onset/offset and frequency-transition cues → grouping algorithm → mask → resynthesis); prediction-driven architecture (sound → front-end analysis → observed features → reconciliation of prediction error against a blackboard world model and abstraction hierarchy → predict & combine → predicted features → sound elements → resynthesis).]

Motivation: Auditory separation of multiple pitches
• The 'pitch cue' (source periodicity) is central to our ability to attend to individual sounds in a mixture [AssmS90]. The signal processing behind this ability is not well understood.
• Psychoacoustic pitch phenomena (the missing fundamental etc.) are well modeled by autocorrelation of unresolved harmonics in broadening frequency channels [MeddH91].
• As part of a computational auditory scene analysis system, we would like to recover the energy in different frequency bands due to each periodicity present.
• The auditory system is very successful at this task. Hoping to uncover its secrets, we base our signal separation on the correlogram [SlaneyL93], an auditory model combining approximately constant-Q filtering and autocorrelation.
• The weft is a discrete element describing an individual periodic source, together with an algorithm for extracting such elements from the correlogram representation.

Preprocessing
[Figure: sound → cochlea filterbank → envelope follower per channel → short-time autocorrelation → correlogram (frequency channel × lag × time); normalize & add across channels → periodogram (lag vs. time).]
• The correlogram represents sound as a 3-dimensional function of time, frequency and autocorrelation lag.
• In a time-slice of the correlogram, each cell is the short-time autocorrelation of the envelope of one of the cochlea-filterbank frequency channels.
• The periodogram summarizes the 'prominence' of each periodicity. It is the sum across all frequency channels of the normalized autocorrelations (lag vs. time).
• We use log-time (= log-frequency) sampling on the lag axis to match human pitch acuity.

Analysis
• A prominent peak in the periodogram triggers the creation of a new weft, which then tracks that 'pitch' through subsequent time frames. Autocorrelation aliases of tracked periods are suppressed.
• For a channel excited only by a single periodicity, the autocorrelation maximum at that period is the average squared envelope.
• Noise adds in the power domain, but has a nonlinear effect on the autocorrelation.
• The ratio of peak-to-average autocorrelation varies from 1 for aperiodic input up to a maximum specific to each channel/periodicity pair. We can work backwards from this robust feature to eliminate the effect of noise on the autocorrelation peak.
[Figure: pulse-train envelope (level L, period T, pulses of width d·T) and its autocorrelation (peaks of width 2d·T at lags 0, T, 2T, ...).]
• Pulse-train envelope model (level L, period T, duty cycle d): the peak autocorrelation is P = d·L² and the average autocorrelation is A = (d·L)², so d = A/P. With the gaps filled by noise at level N, and level M during the pulses, P' = d·M² + (1-d)·N² and A' = (d·M + (1-d)·N)². Given d for the channel/periodicity pair, P' and A' can be solved for M, N and hence the noise-free level L (sketched in code below).
• Tracking wefts through time gives a temporal context that can be used for interpolation through masking.
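To make the analysis concrete, the following is a minimal sketch in Python/NumPy, not the poster's actual implementation: the array `correlogram` (frames × channels × lags), the `duty_cycle` table and all function names are hypothetical, and details such as alias suppression and period tracking are omitted.

```python
import numpy as np


def periodogram(correlogram):
    """'Prominence' of each candidate period: sum over channels of each
    channel's autocorrelation normalized by its zero-lag (power) value."""
    power = correlogram[..., :1] + 1e-12            # zero-lag value per channel
    return (correlogram / power).sum(axis=1)        # shape (n_frames, n_lags)


def noise_compensated_levels(P, A, d):
    """Invert the pulse-train envelope model above:
         P = d*M**2 + (1 - d)*N**2      (peak autocorrelation)
         A = (d*M + (1 - d)*N)**2       (average autocorrelation)
    for the in-pulse level M, the between-pulse (noise) level N, and the
    noise-free periodic level L, given the duty cycle d (0 < d < 1)."""
    s = np.sqrt(np.clip(A, 0.0, None))              # s = d*M + (1 - d)*N
    rad = np.clip((1.0 - d) * (P - s ** 2) / d, 0.0, None)
    M = s + np.sqrt(rad)                            # larger root, so M >= s >= N
    N = np.clip((s - d * M) / (1.0 - d), 0.0, None)
    L = np.sqrt(np.clip(M ** 2 - N ** 2, 0.0, None))  # noise adds in the power domain
    return M, N, L


def weft_spectrum_column(corr_frame, lag_index, duty_cycle):
    """One frame of a weft's smooth spectrum: noise-compensated level of the
    tracked periodicity in every frequency channel."""
    P = corr_frame[:, lag_index]                    # peak at the tracked period
    A = corr_frame.mean(axis=1)                     # average over lag
    _, _, L = noise_compensated_levels(P, A, duty_cycle)
    return L


# Hypothetical usage for one frame t:
#   pgram = periodogram(correlogram)
#   lag = int(np.argmax(pgram[t]))                  # most prominent period
#   col = weft_spectrum_column(correlogram[t], lag, duty_cycle[:, lag])
```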
Resynthesis
• A mid-level representation should be perceptually adequate, implying sufficient detail for independent resynthesis, an important goal.
• A weft is defined by its period track (time-varying excitation period) and its smooth spectrum (energy in each frequency channel for each time frame).
• Resynthesis is through a simple source-filter process (see the code sketch after the References).
• Overlap of adjacent frequency bands is precompensated by non-negative least-squares (NNLS) inversion.
[Figure: resynthesis block diagram (period track → impulse-train generator → filterbank → channel gains with NNLS compensation → sum channels); recovered weft spectrum, noiseless vs. 0 dB SNR.]

Why not harmonics?
Weft extraction is complicated. Why not use a fixed narrow-bandwidth analysis to track individual harmonics, as has proven successful in the past?
• The ear does not resolve harmonics above the first few. Presumably, this bias towards finer time resolution has some evolutionary benefit.
• Tracking the upper spectrum of periodic signals can be difficult and leads to bulky representations.
• Finding periodicity within broad channels gives different detectability than decisions made on individual harmonics.

Results and conclusions
• Extracting the spectrum of a 'flat' periodic signal, with and without added noise, verifies the basic algorithm.
• A mixture of two periodic signals is not completely separated into its component spectra. Also, only one signal is extracted when the periods collide.
• A mixture of male and female voices results in multiple wefts for each region of voicing. Octave collisions are successfully resolved.
[Figures: spectra of voices v1 and v2 alone vs. extracted from the mixture (frequency in Hz vs. time in s, level in dB); weft extraction for the voice mixture (Voices, Weft1, Wefts 2-5).]
• Wefts form a compact and plausible representation of periodic sounds as part of the vocabulary of a computational auditory scene analysis system.

References
[AssmS90] Assmann, P. F., Summerfield, Q. (1990). "Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies," J. Acoust. Soc. Am. 88(2).
[Brown92] Brown, G. J. (1992). "Computational auditory scene analysis: a representational approach," Ph.D. thesis, Sheffield University.
[Ellis96] Ellis, D. P. W. (1996). "Prediction-driven computational auditory scene analysis," Ph.D. dissertation, M.I.T.
[MeddH91] Meddis, R., Hewitt, M. J. (1991). "Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification," J. Acoust. Soc. Am. 89(6).
[SlaneyL93] Slaney, M., Lyon, R. F. (1993). "On the importance of time: a temporal representation of sound," in Visual Representations of Speech Signals, ed. M. Cooke et al., Wiley.
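For concreteness, a minimal sketch of the source-filter resynthesis described in the Resynthesis section (Python with NumPy/SciPy). It is a simplification, not the poster's implementation: a Butterworth bandpass bank stands in for the cochlea filterbank, the NNLS precompensation of band overlap is omitted, and the names `period_track`, `smooth_spectrum`, `centre_freqs` and `frame_hop` are hypothetical.

```python
import numpy as np
from scipy.signal import butter, lfilter


def impulse_train(period_track, frame_hop, sr):
    """Excitation: one unit impulse per pitch period, following the per-frame
    period track (periods in seconds, frame_hop in samples, sr in Hz)."""
    n = len(period_track) * frame_hop
    x = np.zeros(n)
    t = 0.0
    while t < n:
        x[int(t)] = 1.0
        frame = min(int(t) // frame_hop, len(period_track) - 1)
        t += max(period_track[frame], 1.0 / sr) * sr   # advance at least one sample
    return x


def resynthesize(period_track, smooth_spectrum, centre_freqs, frame_hop, sr):
    """Source-filter resynthesis of one weft.

    period_track    : (n_frames,) excitation period in seconds per frame
    smooth_spectrum : (n_frames, n_channels) energy per channel per frame
    centre_freqs    : (n_channels,) band centre frequencies in Hz (well below sr/2)
    """
    excitation = impulse_train(period_track, frame_hop, sr)
    out = np.zeros_like(excitation)
    frame_starts = np.arange(len(period_track)) * frame_hop
    samples = np.arange(len(excitation))
    for c, fc in enumerate(centre_freqs):
        lo, hi = 0.8 * fc / (sr / 2), min(1.2 * fc / (sr / 2), 0.99)
        b, a = butter(2, [lo, hi], btype="band")        # stand-in for one cochlea channel
        band = lfilter(b, a, excitation)
        # Impose the weft's per-frame channel gain: energy -> amplitude,
        # linearly interpolated between frame start times.
        gain = np.interp(samples, frame_starts, np.sqrt(smooth_spectrum[:, c]))
        out += gain * band
    return out
```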