Augmenting and Exploiting Auditory Perception for Complex Scene Analysis
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu   http://labrosa.ee.columbia.edu/   2013-03-27

1. Human Auditory Scene Analysis
2. Computational Acoustic Scene Analysis
3. Virtual and Augmented Audio
4. Future Audio Analysis & Display

1. Human Auditory Scene Analysis
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)
• Now: hearing as the model for machine perception
• Future: machines to enhance human perception

Auditory Scene Analysis (Bregman '90; Darwin & Carlyon '95)
• Listeners organize sound mixtures into discrete perceived sources
• Organization is based on within-signal cues (audio + ...):
  common onset + continuity, harmonicity, spatial cues, modulation, learned "schema"
  [figure: spectrogram of reynolds-mcadams-dpwe.wav, 0-4 kHz over about 4 s, showing a tone complex emerging from the mixture]

Spatial Hearing (Blauert '96)
• People perceive sources based on spatial (binaural) cues: ITD and ILD
  interaural time difference (ITD) from the path-length difference to the two ears
  interaural level difference (ILD) from the head shadow (mainly at high frequencies)
  [figure: left and right waveforms of shatr78m3 around t = 2.2 s, showing the interaural delay]

Human Performance: Spatial Info (Brungart et al. '02)
• Task: Coordinate Response Measure (CRM)
  "Ready Baron go to green eight now"
  256 variants, 16 speakers; correct = color and number for "Baron"   (example: crm-11737+16515.wav)
• Accuracy as a function of spatial separation
  [figure: performance vs. separation, including the condition where talkers A and B are the same speaker]

Human Performance: Source Info (Brungart et al. '01)
• CRM with target and masker at the same spatial location, varying the level and voice character
• Monaural informational masking and level differences
  a level difference can help in speech segregation: "listen for the quieter talker"
• Distinguishes energetic vs. informational masking

Human Hearing: Limitations
• Sensor number: just 2 ears
• Sensor location: short, horizontal baseline
• Sensor performance: local dynamic range
  [figure: intensity vs. log frequency, showing the absolute threshold and the masked threshold around a masking tone]
• Processing: attention & memory limits, integration time

2. Computational Scene Analysis (Brown & Cooke '94; Okuno et al. '99; Hu & Wang '04; ...)
• Central idea: segment the time-frequency plane into sources based on perceptual grouping cues
  front end: signal features (maps) such as onset, period, and frequency modulation across time and frequency
  object formation: grouping rules collect the segments into discrete objects (source groups)
  the principal cue is harmonicity

Spatial Info: Microphone Arrays (Benesty, Chen & Huang '08)
• If interference is diffuse, can simply boost energy from the target direction
  e.g. shotgun mic; delay-and-sum beamforming (a minimal sketch follows below)
  [figure: delay-and-sum array of elements spaced D apart, with directivity patterns for wavelengths λ = 4D, 2D, and D; the path difference Δx between elements sets the steering delay]
  off-axis sounds suffer spectral coloration
  many variants: filter-and-sum, sidelobe cancellation, ...
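The delay-and-sum idea above reduces to a few lines of signal processing. Below is a minimal numpy sketch, not any particular system from the talk: it assumes a uniform linear array with element spacing D, a far-field target at a known steering angle, and integer-sample delays; the function name delay_and_sum and the 4-mic test signal are purely illustrative.

import numpy as np

def delay_and_sum(x, fs, spacing, theta_deg, c=343.0):
    """Steer a uniform linear array toward theta_deg by delay-and-sum.

    x         : (n_mics, n_samples) array of microphone signals
    fs        : sample rate in Hz
    spacing   : distance between adjacent mics in meters
    theta_deg : steering angle from broadside, in degrees
    c         : speed of sound in m/s
    """
    n_mics, n_samp = x.shape
    # Far-field path difference per mic: tau_m = m * spacing * sin(theta) / c
    delays_s = np.arange(n_mics) * spacing * np.sin(np.radians(theta_deg)) / c
    delays_n = np.round(delays_s * fs).astype(int)   # integer-sample delays
    delays_n -= delays_n.min()                        # keep all delays non-negative
    out = np.zeros(n_samp)
    for m in range(n_mics):
        d = delays_n[m]
        # advance each channel so the target direction adds coherently
        out[:n_samp - d] += x[m, d:]
    return out / n_mics

if __name__ == "__main__":
    fs, D = 16000, 0.05                   # 16 kHz, 5 cm spacing (illustrative values)
    t = np.arange(fs) / fs
    target = np.sin(2 * np.pi * 500 * t)  # on-axis 500 Hz tone
    rng = np.random.default_rng(0)
    mics = np.stack([target + 0.5 * rng.standard_normal(fs) for _ in range(4)])
    y = delay_and_sum(mics, fs, D, theta_deg=0.0)
    # For independent noise, summing M aligned channels improves SNR by ~10*log10(M) dB.
    print("beamformed output ready; expect roughly 6 dB noise reduction for 4 mics")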
Independent Component Analysis (Bell & Sejnowski '95; Smaragdis '98)
• Separate "blind" combinations by maximizing the independence of the outputs
  mixing model: [m1; m2] = [a11 a12; a21 a22] [s1; s2]
  unmixing parameters adapted by following -dMutInfo/da
• Kurtosis, kurt(y) = E[(y - µ)^4], as a measure of independence?
  [figure: scatter plot of the two mixture channels, and kurtosis of the output as the unmixing weights are varied]
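To make the kurtosis criterion above concrete, here is a small illustrative sketch. It uses a fixed-point kurtosis-maximization update rather than Bell & Sejnowski's infomax gradient, so it is a stand-in for the idea, not the cited algorithm: two Laplacian sources are mixed by a 2x2 matrix, the mixtures are whitened, and one unmixing direction is recovered by maximizing excess kurtosis.

import numpy as np

rng = np.random.default_rng(0)

# Two independent super-Gaussian (Laplacian) sources
n = 20000
s = rng.laplace(size=(2, n))

# Unknown 2x2 mixing matrix: m = A @ s   (the [a11 a12; a21 a22] on the slide)
A = np.array([[0.8, 0.6],
              [0.4, 0.9]])
m = A @ s

# Whiten the mixtures (zero mean, identity covariance)
m = m - m.mean(axis=1, keepdims=True)
eigval, eigvec = np.linalg.eigh(np.cov(m))
z = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ m

def excess_kurtosis(y):
    y = y - y.mean()
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3.0

# Fixed-point iteration (FastICA-style, kurtosis nonlinearity) for one component
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(50):
    y = w @ z
    w_new = (z * y ** 3).mean(axis=1) - 3.0 * w   # fixed-point update for the kurtosis contrast
    w_new /= np.linalg.norm(w_new)
    converged = np.abs(w_new @ w) > 1 - 1e-9
    w = w_new
    if converged:
        break

print("kurtosis of each mixture :", [round(excess_kurtosis(ch), 2) for ch in m])
print("kurtosis of recovered y  :", round(excess_kurtosis(w @ z), 2))
# Mixing pushes kurtosis toward 0 (more Gaussian); maximizing it recovers a source direction.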
Environmental Scene Analysis (Ellis '96)
• Find the pieces a listener would report
  [figure: "City" street-ambience spectrogram decomposed into noise beds, clicks, and periodic "weft" elements, labeled Horn1-Horn5, Crash, Squeal, and Truck, with the fraction of listeners reporting each object (e.g. Horn1 10/10, Squeal 6/10, Truck 7/10)]

"Superhuman" Speech Analysis (Kristjansson, Hershey et al. '06)
• IBM's 2006 Iroquois speech separation system
  key features: detailed state combinations; a large speech recognizer that exploits grammar constraints; 34 per-speaker models
• Model-based enhancement using a high-frequency-resolution speech model in the log-spectrum domain, exploiting the harmonic structure of voiced speech
  reported gains: 8.38 dB SNR for enhancement at 0 dB, and a 43.75% relative word-error-rate reduction over the baseline on Aurora 2 at 0 dB SNR (15.90% over the equivalent low-resolution algorithm)
• "Superhuman" performance ... in some conditions
  [figure: same-gender-speaker task accuracy from +6 dB down to -9 dB target-to-masker ratio, comparing the SDL recognizer with no dynamics, acoustic dynamics, and grammar dynamics against human listeners]

Meeting Recorders (Janin et al. '03; Ellis & Liu '04)
• Distributed microphones in a meeting room
• Between-mic correlations locate sources
  [figure: PZM cross-correlation lags for meeting mr-2000-11-02-1440 and the inferred talker positions relative to the mics]

Environmental Sound Classification (Ellis, Zheng & McDermott '11)
• Trained models using e.g. "texture" features
  automatic gain control, mel filterbank (18 channels), subband envelopes
  envelope statistics: modulation energy in octave bins at 0.5, 1, 2, 4, 8, 16 Hz (18 x 6); mean, variance, skew, kurtosis (18 x 4); cross-band correlations and histograms (318 samples)
• Paradoxical results

Audio Lifelogs (Lee & Ellis '04)
• Body-worn continuous recording
• Long time windows for episode-scale segmentation, clustering, and classification
  [figure: two days of recordings (2004-09-13/14), 09:00-18:00, automatically segmented and labeled into episodes such as preschool, cafe, lecture, meeting, office, outdoor, and lab]

Machines: Current Limitations
• Separating overlapping sources: blind source separation
• Separating individual events: segmentation
• Learning & classifying source categories: recognition of individual sounds and classes

3. Virtual and Augmented Audio (Brown & Duda '98)
• Audio signals can be effectively spatialized by convolving with Head-Related Impulse Responses (HRIRs)
  [figure: left- and right-ear HRIRs for subject 021 at 0° elevation, shown as a function of azimuth and over the first 1.5 ms]
• Auditory localization also uses head-motion cues
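Since HRIR rendering is just a pair of convolutions, a binaural spatializer can be sketched directly. The example below is illustrative only: hrir_left and hrir_right stand in for measured responses (e.g. from a database like the one plotted above); here they are crude delay-plus-attenuation impulses so the script runs without external data, ignoring the pinna and torso detail that real HRIRs carry.

import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono, hrir_left, hrir_right):
    """Render a mono signal at the direction encoded by an HRIR pair."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=1)    # (n_samples, 2) stereo

if __name__ == "__main__":
    fs = 44100
    t = np.arange(fs) / fs
    mono = 0.3 * np.sin(2 * np.pi * 440 * t)

    # Stand-in "HRIRs": a source off to the right arrives ~0.6 ms later and
    # quieter at the left (shadowed) ear.
    itd_samples = int(0.0006 * fs)
    hrir_right = np.zeros(256)
    hrir_right[0] = 1.0
    hrir_left = np.zeros(256)
    hrir_left[itd_samples] = 0.5

    stereo = spatialize(mono, hrir_left, hrir_right)
    print("stereo shape:", stereo.shape)   # write out with e.g. soundfile.write(...)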
Augmented Audio Reality (Härmä et al. '04)
• Pass-through and/or mix-in
  augmented-reality audio (ARA) headset: external microphones feed a "pseudo-acoustic" representation of the real environment to the earphones, while virtual sounds are spatialized and mixed in by an ARA mixer; the mic signals can also be preprocessed for transmission to a distant user
  leakage between the earpiece and the ear lets real low frequencies through, which colors the pseudo-acoustic signal, so the pass-through has to be equalized
  usable practically anywhere: hands-free and eyes-free
  example devices: Hearium '12, QuietPro+

4. Future Audio Analysis & Display
• Better scene analysis
  overcoming the limitations of human hearing: sensors, geometry
  sound mixture analyzed into separate object percepts (object 1, 2, 3, ...)
• Challenges
  fine source discrimination, modeling & classification (language ID?)
  integrating through time: single location, sparse sounds

Ad-Hoc Mic Array
• Multiple sensors with real-time sharing: long-baseline beamforming
• Challenges (see the delay-estimation sketch below)
  precise relative (dynamic) localization
  precise absolute registration
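Both the between-mic correlations of the meeting-recorder work and the relative localization needed for an ad-hoc array come down to estimating inter-microphone time delays. The sketch below uses PHAT-weighted generalized cross-correlation (GCC-PHAT); it is a generic illustration assuming two roughly synchronized channels at a known sample rate, not the method of any specific system cited here.

import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the delay of y relative to x in seconds (positive = y lags x)."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = np.conj(X) * Y
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(max_shift, int(max_tau * fs))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(1)
    src = rng.standard_normal(fs)           # 1 s of wideband test signal
    true_delay = 23                          # samples (~1.4 ms, ~0.5 m extra path)
    mic1 = src + 0.05 * rng.standard_normal(fs)
    delayed = np.concatenate([np.zeros(true_delay), src])[:fs]
    mic2 = delayed + 0.05 * rng.standard_normal(fs)
    tau = gcc_phat(mic1, mic2, fs, max_tau=0.005)
    print(f"estimated delay: {tau * 1000:.2f} ms (true {true_delay / fs * 1000:.2f} ms)")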
Sound Visualization (O'Donovan et al. '07)
• Making acoustic information visible: "synesthesia"
  e.g. microphone arrays used as acoustic cameras; gunfire locators (www.ultra-gunfirelocator.com)
• Challenges
  source formation & classification
  registration: sensors, display

Auditory Display
• The acoustic channel complements vision
  acoustic alarms, verbal information, "zoomed" ambience, instant replay
  http://www.youtube.com/watch?v=v1uyQZNg2vE
• Challenges
  information management & prioritization
  maximally exploit perceptual organization

Summary
• Human scene analysis: spatial & source information
• Computational scene analysis: spatial & source information, world knowledge
• Augmented audition: selective pass-through + insertion

References
• A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
• C. Darwin & R. Carlyon, "Auditory grouping," in Handbook of Perception & Cognition 6: Hearing, 387-424, Academic Press, 1995.
• J. Blauert, Spatial Hearing (revised ed.), MIT Press, 1996.
• D. Brungart & B. Simpson, "The effects of spatial separation in distance on the informational and energetic masking of a nearby speech signal," JASA 112(2), Aug. 2002.
• D. Brungart, B. Simpson, M. Ericson & K. Scott, "Informational and energetic masking effects in the perception of multiple simultaneous talkers," JASA 110(5), Nov. 2001.
• G. Brown & M. Cooke, "Computational auditory scene analysis," Computer Speech & Language 8(4), 297-336, 1994.
• H. Okuno, T. Nakatani & T. Kawabata, "Listening to two simultaneous speeches," Speech Communication 27, 299-310, 1999.
• G. Hu & D.L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Networks 15(5), Sep. 2004.
• J. Benesty, J. Chen & Y. Huang, Microphone Array Signal Processing, Springer, 2008.
• A. Bell & T. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation 7(6), 1129-1159, 1995.
• P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Intl. Workshop on Independence & Artificial Neural Networks, Tenerife, Feb. 1998.
• D. Ellis, "Prediction-Driven Computational Auditory Scene Analysis," Ph.D. thesis, MIT EECS, 1996.
• T. Kristjansson, J. Hershey, P. Olsen, S. Rennie & R. Gopinath, "Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system," Proc. Interspeech, 97-100, 2006.
• A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke & C. Wooters, "The ICSI Meeting Corpus," Proc. ICASSP-03, pp. I-364-367, 2003.
• D. Ellis & J. Liu, "Speaker turn segmentation based on between-channel differences," NIST Meeting Recognition Workshop @ ICASSP, pp. 112-117, Montreal, May 2004.
• D. Ellis, X. Zheng & J. McDermott, "Classifying soundtracks with audio texture features," Proc. IEEE ICASSP, pp. 5880-5883, Prague, May 2011.
• D. Ellis & K.S. Lee, "Minimal-Impact Audio-Based Personal Archives," First ACM Workshop on Continuous Archiving and Recording of Personal Experiences (CARPE-04), New York, pp. 39-47, Oct. 2004.
• C.P. Brown & R.O. Duda, "A structural model for binaural sound synthesis," IEEE Trans. Speech & Audio Processing 6(5), 476-488, 1998.
• A. Härmä, J. Jakka, M. Tikander, M. Karjalainen, T. Lokki, J. Hiipakka & G. Lorho, "Augmented reality audio for mobile and wearable appliances," J. Audio Eng. Soc. 52(6), 618-639, 2004.
• A. O'Donovan, R. Duraiswami & J. Neumann, "Microphone arrays as generalized cameras for integrated audio visual processing," Proc. IEEE CVPR, 1-8, 2007.