Audio Features
CS498

Today’s lecture
• Audio Features
• How we hear sound
• How we represent sound
  – In the context of this class

Why features?
• Features are a very important area
  – Bad features make problems unsolvable
  – Good features make problems trivial
• Learning how to pick features is the key
  – So is understanding what they mean

A simple example
• Compare two numbers:
  x, y = {3, 3}
  x, z = {3, 100}

A simple example
• Compare two numbers:
  |x − y| = 0
  |x − z| = 97
  – x, y similar but x, z not so much
• Best way to represent a number is itself!

Moving up a level
• Compare two vectors: x, y and x, z
[Figure: plots of the vector pairs x, y and x, z]

Moving up a level
• Compare two vectors:
  ∠(x, y) = 0.03 rad    ∠(x, z) = 0.7 rad
  ‖x − y‖ = 0.16        ‖x − z‖ = 1.07
  – Simply generalizing the numbers concept

Moving up again
• Compare two longer vectors:
[Figure: two 100-sample vectors]

Look similar but are not!
• Oops! ∠(x, y) = 1.57 rad, ‖x − y‖ = 7.64
[Figure: the same two 100-sample vectors]

How about this?
• Are these two vectors the same?
[Figure: two waveforms of about 7 × 10^4 samples each]
  – Not if you look at their norm or angle …

Data norms won’t get you far!
• You need to articulate what matters
  – You need to know what matters
• Features are the means to do so
• Let’s examine what matters to our ears
  – Our bodies sorta know best

Hearing
• Sounds and hearing
• Human hearing aspects
  – Physiology and psychology
• Lessons learned

The hardware (outer/middle ear)
• The pinna (auricle)
  – Aids sound collection
  – Does directional filtering
  – Holds earrings, etc …
[Figure labels: Middle ear, Outer ear, Ear canal]
• The ear canal
  – About 25mm x 7mm
  – Amplifies sound at ~3kHz by ~10dB
  – Helps clarify a lot of sounds!
• Ear drum
  – End of outer ear, start of middle ear
  – Transmits sound as a vibration to the middle ear
[Figure labels: Ear drum, Pinna]

More hardware (middle/inner ear)
• Ear drum (tympanum)
  – Excites the ossicles (ear bones)
• Ossicles
  – Malleus (hammer), incus (anvil), stapes (stirrup)
  – Transfer vibrations from ear drum to the oval window
  – Amplify sound by ~14dB (peak at ~1kHz)
  – Muscles connected to ossicles control the acoustic reflex (damping in presence of loud sounds)
• The oval window
  – Transfers vibrations to the cochlea
• Eustachian tube
  – Used for pressure equalization
[Figure labels: Ossicles, Oval window, Auditory nerve, Cochlea, Ear drum, Eustachian tube]

The cochlea
• The “A/D converter”
  – Translates oval window vibrations to a neural signal
  – Fluid filled, with the basilar membrane in the middle
  – Each section of the basilar membrane resonates with a different sound frequency
  – Vibrations of the basilar membrane move sections of hair cells which send off neural signals to the brain
• The cochlea acts like the equalizer display in your stereo
  – Frequency domain decomposition
• Neural signals from the hair cells go to the auditory nerve
[Figure: Microscope photograph of hair cells (yellow)]

Masking & Critical bands
• When two different sounds excite the same section of the basilar membrane one is masked
• This is observed at the micro-level
  – E.g. two tones at 150Hz and 170Hz, if one tone is loud enough the other will be inaudible
  – A tone can also hide a noise band when loud enough
• There are 24 distinct bands throughout the cochlea
  – a.k.a. critical bands
  – Simultaneous excitation on a band by multiple sources results in a single source percept
• There is also some temporal masking
  – Preceding sounds mask what’s next
• This is a feature which is taken advantage of by a lot of audio compression
  – Throws away stuff you won’t hear due to masking
[Audio demo: Masking for close frequency tones vs distant tones]

The neural pathways
• A series of neural stops
• Cochlear nuclei
  – Prepping/distribution of neural data from cochlea
• Superior Olivary Complex
  – Coincidence detection across ear signals
  – Localization functions
• Inferior Colliculus
  – Last place where we have most original data
  – Probably initiates first auditory images in brain
• Medial Geniculate Body
  – Relays various sound features (frequency, intensity, etc) to the auditory cortex
• Auditory Cortex
  – Reasoning, recognition, identification, etc
  – High-level processing
[Figure: pathway diagram: Ears, Cochleas, Cochlear nuclei, Superior olivary complex, Inferior colliculus, Medial geniculate body, Auditory cortex, ? (Stream of consciousness …)]

The limits of hearing
• Frequency
  – 20Hz to 20kHz (upper limit decreases with age/trauma)
  – Infrasound (< 20Hz) can be felt through skin, also as events
  – Ultrasound (> 20kHz) can be “emotionally” perceived (discomfort, nausea, etc)
• Loudness
  – Low limit is 2×10^-10 atm
  – 0dB SPL to 130dB SPL (but also frequency dependent)
    • A dynamic range of 3×10^6 to 1!
  – 130dB SPL threshold of pain
  – 194dB SPL is definition of a shock wave, sound stops!
[Figure: intensity (dB, −10 to 130) vs. frequency (Hz, 16 to 16000), with regions marked Inaudibility, Audible sounds, Speech, Music and Pain!]
[Audio demo: Tones at various frequencies, how high can you hear?]

Perception of loudness
• Loudness is subjective
  – Perceived loudness changes with frequency
  – Perception of “twice as loud” is not really that!
  – Ditto for equal loudness
• Fletcher-Munson curves
  – Equal loudness perception curves through frequencies
• Just noticeable difference is about 1dB SPL
• 1kHz to 5kHz are the loudest heard frequencies
  – What the ear canal and ossicles amplify!
• Low limit shifts up with age!

Perception of pitch
• Pitch is another subjective (and arbitrary) measure
• Perception of pitch doubling doesn’t imply doubling of Hz
  – Mel scale is the perceptual pitch scale
  – Twice as many Mels correspond to a perceived pitch doubling
• Musically useful range varies from 30Hz to 4kHz
• Just noticeable difference is about 0.5% of frequency
  – Varies with training though
“Pitch is that attribute of auditory sensation in terms of which sounds may be ordered from low to high”
  - American National Standards Institute

Perception of timbre
• Timbre is what distinguishes sounds outside of loudness & pitch
  – Another bogus ANSI description
• Timbre is dynamical and can have many facets which can often include pitch and loudness variations
  – E.g. music instrument identification is guided largely by intensity fluctuations through time
• There is not a coherent body of literature examining human timbre perception
  – But there is a huge bibliography on computational timbre perception!
[Figure: Gray’s timbre space of musical instruments]
[Audio demo: Examples of successive timbre changes. Loudness and pitch are constant]

So how do we use all that?
• All these processes are meaningful
  – They encapsulate statistics of sounds
  – They suggest features to use
• To make machines that cater to our needs
  – We need to learn from our perception

A lesson from the cochlea
• Sounds are not vectors
• Sounds are “frequency ensembles”
• That’s the “perceptual feature” we care about
[Figure callout: Like this!]
  – But how do we get this?
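Before getting to the how, the claim that sounds are frequency ensembles rather than plain vectors can be checked numerically. The sketch below is only an illustration under assumed values (a 100-sample frame holding 5 full periods, names such as `angle_between` and `magnitude_spectrum` are made up here): it compares a sine with the same tone shifted by a quarter period, first as raw vectors and then through the magnitudes of a naive Fourier decomposition. The sign convention of the exponent does not affect the magnitudes.

```python
import cmath
import math

def angle_between(x, y):
    """Angle in radians between two real vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against rounding before acos.
    return math.acos(max(-1.0, min(1.0, dot / (norm_x * norm_y))))

def magnitude_spectrum(x):
    """Magnitudes of a naive O(N^2) DFT of x, bins 0..N/2."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

N = 100      # assumed frame length
CYCLES = 5   # assumed number of full periods in the frame

x = [math.sin(2 * math.pi * CYCLES * t / N) for t in range(N)]
# Same tone, shifted by a quarter period: it would sound the same.
y = [math.cos(2 * math.pi * CYCLES * t / N) for t in range(N)]

raw_angle = angle_between(x, y)
spec_angle = angle_between(magnitude_spectrum(x), magnitude_spectrum(y))

print(f"angle between waveforms:         {raw_angle:.2f} rad")
print(f"angle between magnitude spectra: {spec_angle:.2f} rad")
```

As raw vectors the two signals are orthogonal, while their magnitude spectra point the same way, which is the kind of invariance a perceptually motivated feature should have.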
The “simplest” sound
• Sinusoids are special
  – Simplest waveform
  – An isolated frequency
• A sinusoid has three parameters
  – Frequency, amplitude & phase
• s(t) = a(t) sin(f t + φ)
• This simplicity makes sinusoids an excellent building block for most time series
[Figure: a sinusoid over 100 samples]

Making a square wave with sines
[Figure: sums of sines approximating a square wave]

Frequency domain representation
• Time series can be decomposed in terms of “sinusoid presence”
  – See how many sinusoids you can add up to get to a good approximation
  – Informally called the spectrum
• No temporal information in this representation, only frequency information
  – So a sine with a changing frequency is a smeared spike
• Not that great of a representation for dynamically changing sounds
[Figure: three time series and their spectra]

Time/frequency representation
• Many names/varieties
  – Spectrogram, sonogram, periodogram, …
• A time ordered series of frequency compositions
  – Can help show how things move in both time and frequency
• The most useful representation so far!
  – Reveals information about the frequency content without sacrificing the time info
[Figure: three time series and their time/frequency plots (frequency vs. time)]

A real example
• Time domain
  – We can see the events
  – We don’t know what they sound like though!
• Spectrum
  – We can see a lot of bass and few middle freqs
  – But where in time are they?
• Spectrogram
  – We can “see” each individual sound
  – And we know what it sounds like!
[Figure: time domain waveform, frequency domain spectrum, and spectrogram (frequency 0 to 8000 vs. time) of the same sound]

The Discrete Fourier Transform
• So how do we get from time domain to frequency domain?
  – It is a matrix multiplication (a rotation in fact)
• The Fourier matrix is square, orthogonal (unitary, since it is complex) and has complex-valued elements
  $$F_{j,k} = \frac{1}{\sqrt{N}} e^{i j k 2\pi / N} = \frac{1}{\sqrt{N}} \left( \cos\frac{j k 2\pi}{N} + i \sin\frac{j k 2\pi}{N} \right)$$
• Multiply a vectorized time series with the Fourier matrix and voila!
[Figure: The Fourier matrix (real part)]

How does the DFT work?
• Multiplying with the Fourier matrix
  – We dot product each Fourier row vector with the input
  – If two vectors point the same way their dot product is maximized
• Each Fourier row picks out a single sinusoid from the signal
  – In fact a complex sinusoid
  – Since all the Fourier sinusoids are orthogonal there is no overlap
• The resulting vector contains how much of each Fourier sinusoid the original vector had in it

The DFT in a little more detail
• The DFT features complex numbers
  – Doesn’t have to, but it is convenient for other things
• The DFT result for real signals is conjugate symmetric
  – The middle value is the highest frequency (Nyquist)
  – Working towards the edges we traverse all frequencies downwards
  – The two sides are mutually conjugate complex numbers
• The interesting parts of the DFT are the magnitude and the phase
  – Abs(F) = |F|
  – Arg(F) = ∡F
• To go back we apply the inverse DFT (the conjugate of the Fourier matrix, with some scaling)
[Figure: Real and imaginary parts of the DFT of a sine]
[Figure: Corresponding magnitude and phase]

Size of a DFT
• The bigger the DFT input the more frequency resolution
  – But the more data we need!
• Zero padding helps
  – Stuff a lot of zeros at the end of the input to make up for few data
  – But we don’t really infuse any more information, we just make prettier plots

From the DFT to a spectrogram
• The spectrogram is a series of consecutive magnitude DFTs on a signal
  – This series is taken off consecutive segments of the input
• It is best to taper the ends of the segments
  – This reduces “fake” broadband noise estimates
• It is wise to make the segments overlap
  – Due to windowing
• The parameters to use are
  – The DFT size
  – The overlap amount
  – The windowing function
[Figure: Input sound; Magnitude of DFT of every segment (segments can overlap); Time series of magnitude spectra; Looks nicer as an image (Spectrogram)]

Why window?
• Discontinuities at ends cause noise
  – Start and end point must taper to zero
• Windowing
  – Eliminates the sharp edges that cause broadband noise
• Overlap
  – Since we have windowed we need to take overlapping segments to make up for the attenuated parts of the input
[Figure: a windowed segment vs. a not windowed segment (nasty sharp edges), and the resulting spectrograms (frequency vs. time); the not windowed one shows non-existent broadband content]

Time/Frequency tradeoff
• Heisenberg’s uncertainty principle
  – We can’t accurately know both the frequency and the time position of a wave
  – Also in particle physics with speed and position of a particle
• Spectrogram problems
  – Big DFTs sacrifice temporal resolution
  – Small DFTs have lousy frequency resolution
• We can use a denser overlap to compensate
  – Ok solution, not great

The Fast Fourier Transform (FFT)
[Figure: The Fourier matrix, N = 32]
• The Fourier matrix is special
  – Many repeating values
  – Unique repeating structure
• We can decompose a Fourier transform to two Fourier transforms of half the size
  – Also includes some twiddling with the data
  – Two smaller Fourier transforms are faster than one big one
  – We keep decomposing it until we have a very small DFT
• This results in a really fast algorithm that has driven communications forward!
  – The constraint is that the transform size is best if a power of two so that we can decompose it repeatedly
[Figure: Example FFT, N = 8]

Emulating the cochlea
• Using the time/frequency domain
• Take successive Fourier transforms
• Keep their magnitude
• Stack them in time
• Now you can visually compare sounds!
[Figure: a waveform and its spectrogram (frequency vs. time in units of 1k samples)]

Back to our example
[Figure: the two waveforms of about 7 × 10^4 samples each]

Corresponding spectrograms
[Figure: spectrograms of the two waveforms]

A lesson from loudness perception
• We don’t perceive loudness linearly
• How much louder is the second “test”?
• The magnitude we plot should be logarithmic, not linear

Log spectrograms
[Figure: log spectrograms of the two waveforms]

A lesson from pitch perception
• Frequencies are not “linear”
  – Perceived scale is called mel
• Use that spacing instead
  – i.e. warp the frequency axis

“Mel spectra”
[Figure: log mel spectrograms of the two waveforms]

One more trick
• Mel cepstra
  – Smooth the log mel spectra using one more frequency transform (the DCT)
[Figure: mel cepstra of the two waveforms]

Adding some temporal info
• Deltas and delta-deltas
  – In sounds order is important
  – Using “delta features” we pay attention to change
[Figure: mel cepstra of the two waveforms (coefficient vs. time)]

What more is there?
• Tons!
  – Spectral features
  – Waveform features
  – Higher level features
  – Perceptual parameter features
  – …

Sound recap
• Go to time/frequency domain
  – We do so in the cochlea
• Frequencies are not linear
  – We perceive them in another scale
• Amplitude is not linear either
  – Use log scale instead
• Resulting features are used a lot
  – Further minor tweaks exist (more later)

Next lecture
• Principal Component Analysis
• How to find features automatically
• How to “compress” data without info loss
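The steps in the sound recap (time/frequency domain, mel-warped frequency axis, log amplitude, DCT smoothing, deltas) can be strung together in a few lines. What follows is a minimal sketch, not the lecture's implementation: the mel formula, the triangular filter shape, the Hann window, the frame sizes and every function name are assumptions chosen for illustration, and the naive O(N^2) DFT stands in for an FFT to keep the example dependency-free.

```python
import cmath
import math

def hz_to_mel(f):
    # One commonly used mel formula (an assumption; the lecture gives none).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hann(n):
    # Tapers both ends of a segment towards zero.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def magnitude_dft(frame):
    # Naive DFT; keep bins 0..N/2 because real input is conjugate symmetric.
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose edges are equally spaced on the mel axis.
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank

def dct2(v):
    # Unnormalized DCT-II.
    n = len(v)
    return [sum(v[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

def mel_cepstra(signal, sample_rate, n_fft=128, hop=64, n_filters=12, n_ceps=8):
    window = hann(n_fft)
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):  # overlapping segments
        seg = [signal[start + i] * window[i] for i in range(n_fft)]
        mag = magnitude_dft(seg)
        mel = [sum(w * a for w, a in zip(filt, mag)) for filt in bank]
        log_mel = [math.log(e + 1e-10) for e in mel]  # small offset avoids log(0)
        frames.append(dct2(log_mel)[:n_ceps])
    return frames

def deltas(frames):
    # Simplest delta feature: frame-to-frame difference, zeros for the first frame.
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for c, p in zip(cur, prev)])
    return out

SAMPLE_RATE = 8000  # assumed sampling rate in Hz
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(1024)]
ceps = mel_cepstra(tone, SAMPLE_RATE)
d = deltas(ceps)
print(len(ceps), "frames x", len(ceps[0]), "coefficients")
```

Each row of `ceps` is one time frame of mel cepstra and `d` holds the matching deltas; applying `deltas` to `d` again would give delta-deltas. In practice a library FFT would replace `magnitude_dft`.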