Audio Signal Recognition for Speech, Music, and Environmental Sounds

1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions

Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/

Dan Ellis, Audio Signal Recognition, 2003-11-13 - 1 / 25

1 Pattern Recognition for Sounds

• Pattern recognition is abstraction
- continuous signal → discrete labels
- an essential part of understanding? “information extraction”
• Sound is a challenging domain
- sounds can be highly variable
- human listeners are extremely adept

Pattern classification

• Classes are defined as distinct regions in some feature space
- e.g. formant frequencies to define vowels
[Figure: spectrogram with formant tracks for “ay” and “ao”; scatter of Pols vowel formants “u” (x), “o” (o), “a” (+) in the F1/F2 plane, with a new observation x]
• Issues
- finding segments to classify
- transforming to an appropriate feature space
- defining the class boundaries

Classification system parts

sensor → signal
→ pre-processing / segmentation (STFT; locate vowels) → segment
→ feature extraction (formant extraction) → feature vector
→ classification → class
→ post-processing (context constraints; costs/risk)

Feature extraction

• Feature choice is critical to performance
- make important aspects explicit, remove irrelevant details
- ‘equivalent’ representations can perform very differently in practice
- major opening for domain knowledge (“cleverness”)
• Mel-Frequency Cepstral Coefficients (MFCCs): ubiquitous speech features
- DCT of log spectrum on an ‘auditory’ scale
- approximately decorrelated ...
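The MFCC recipe above (a DCT of the log spectrum on a mel, i.e. ‘auditory’, scale) can be sketched in a few lines of Python. This is a minimal illustration, not a production front end: the mel formula, the toy frame, and starting from given mel-band energies (rather than an STFT plus a triangular filterbank) are all simplifying assumptions:

```python
import math

def hz_to_mel(f):
    # one common formula for the 'auditory' mel scale (an assumption here)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def dct_ii(x, n_out):
    # type-II DCT: approximately decorrelates the log mel energies
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(n_out)]

def mfcc_from_mel_energies(mel_energies, n_ceps=13):
    # MFCCs = DCT of the log mel-band energies
    log_e = [math.log(e + 1e-10) for e in mel_energies]
    return dct_ii(log_e, n_ceps)

# toy frame of 20 mel-band energies (illustrative values)
frame = [abs(math.sin(0.3 * i)) + 0.1 for i in range(20)]
ceps = mfcc_from_mel_energies(frame, n_ceps=5)
```

The DCT both compacts the spectral envelope into a few coefficients and approximately decorrelates them, which is what makes the diagonal-covariance Gaussian models used later workable.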
[Figure: Mel spectrogram (freq. channel vs. time) and the resulting MFCCs (cepstral coef. vs. time)]

Statistical Interpretation

• Observations are random variables whose distribution depends on the class:
- class ωi: discrete, hidden
- observation x: continuous, with distribution p(x|ωi); inference gives Pr(ωi|x)
• Source distributions p(x|ωi)
- reflect variability in the feature
- reflect noise in the observation
- generally have to be estimated from data (rather than known in advance)
[Figure: overlapping class-conditional densities p(x|ωi) for classes ω1 ... ω4 along x]

Priors and posteriors

• Bayesian inference can be interpreted as updating prior beliefs with new information, x:

Pr(ωi|x) = p(x|ωi) · Pr(ωi) / Σj p(x|ωj) · Pr(ωj)

- p(x|ωi) is the likelihood, Pr(ωi) the prior probability, Pr(ωi|x) the posterior probability
- the denominator is the ‘evidence’ Σj p(x|ωj) · Pr(ωj) = p(x)
• Posterior is the prior scaled by the likelihood and normalized by the evidence (so Σ(posteriors) = 1)
• Minimize the probability of error by choosing the maximum a posteriori (MAP) class:

ω̂ = argmaxωi Pr(ωi|x)

Practical implementation

• The optimal classifier is ω̂ = argmaxωi Pr(ωi|x), but we don’t know Pr(ωi|x)
• So, model the conditional distributions p(x|ωi), then use Bayes’ rule to find the MAP class:
labeled training examples {xn, ωxn} → sort according to class → estimate the conditional pdf for each class ωi

Gaussian models

• Model data distributions via a parametric model
- assume a known form, estimate a few parameters
• E.g. a Gaussian in 1 dimension:

p(x|ωi) = (1 / (√(2π) σi)) · exp(−½ ((x − µi) / σi)²)

- the leading term is the normalization
• For higher dimensions, need a mean vector µi and a d × d covariance matrix Σi
[Figure: 2-D Gaussian density surface and its elliptical contours]
• Fit more complex distributions with mixtures...
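The MAP rule and the 1-D Gaussian class models above combine into a tiny classifier; the two classes here, with their priors, means, and variances, are hypothetical numbers for illustration:

```python
import math

def gauss_pdf(x, mu, sigma):
    # 1-D Gaussian likelihood p(x | omega_i)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def map_classify(x, classes):
    # classes: {name: (prior, mu, sigma)}
    # Bayes' rule: posterior = likelihood * prior / evidence
    joint = {w: p * gauss_pdf(x, mu, s) for w, (p, mu, s) in classes.items()}
    evidence = sum(joint.values())
    post = {w: j / evidence for w, j in joint.items()}
    return max(post, key=post.get), post

# two illustrative classes with equal priors
classes = {"w1": (0.5, 0.0, 1.0), "w2": (0.5, 3.0, 1.0)}
label, post = map_classify(2.0, classes)
```

The posteriors sum to 1 by construction, since the evidence p(x) is just the sum of the per-class joint terms.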
Gaussian models for formant data

• Single Gaussians are a reasonable fit for this data
• Extrapolation of the decision boundaries can be surprising

Outline

1 Pattern Recognition for Sounds
2 Speech Recognition
- How it’s done
- What works, and what doesn’t
3 Other Audio Applications
4 Observations and Conclusions

How to recognize speech?

• Cross-correlate templates?
- waveform? spectrogram?
- time-warp problems
• Classify short segments as phones (or ...), handle the time-warp later
- model with slices of ~10 ms
- pseudo-piecewise-stationary model of words: sil g w eh n sil
[Figure: spectrogram (0-4000 Hz, 0-0.45 s) of an utterance segmented into those phones]

Speech Recognizer Architecture

• Almost all current systems are the same:
sound → feature calculation → feature vectors
→ acoustic classifier (acoustic model parameters; word models, e.g. s ah t) → phone probabilities
→ HMM decoder (language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence
→ understanding / application...
• Biggest source of improvement is the increase in training data
- .. along with algorithms to take advantage of it

Speech: Progress

• Annual NIST evaluations
[Figure: evaluation error rates falling from ~30% in 1990 toward 3% by 2005]
- steady progress (?), but still order(s) of magnitude worse than human listeners

Speech: Problems

• Natural, spontaneous speech is weird!
- coarticulation, deletions, disfluencies
→ is word transcription even a sensible approach?
• Other major problems
- speaking style, rate, accent
- environment / background...
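The HMM decoder in the architecture above picks the most probable state sequence given the per-frame phone probabilities from the acoustic classifier. A minimal Viterbi sketch, with a toy two-state model of my own rather than real phone models:

```python
import math

def viterbi(obs_logprob, log_trans, log_init):
    # obs_logprob[t][s]: log p(observation at t | state s), from the acoustic classifier
    # log_trans[s0][s1]: log transition probability; log_init[s]: log initial probability
    n = len(log_init)
    delta = [log_init[s] + obs_logprob[0][s] for s in range(n)]
    back = []
    for t in range(1, len(obs_logprob)):
        back_t, new_delta = [], []
        for s in range(n):
            best = max(range(n), key=lambda s0: delta[s0] + log_trans[s0][s])
            new_delta.append(delta[best] + log_trans[best][s] + obs_logprob[t][s])
            back_t.append(best)
        delta = new_delta
        back.append(back_t)
    # trace the best path backwards through the stored pointers
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for back_t in reversed(back):
        state = back_t[state]
        path.append(state)
    return list(reversed(path))

# toy model: two states that prefer to stay put
L = math.log
obs = [[L(0.9), L(0.1)], [L(0.8), L(0.2)], [L(0.1), L(0.9)]]
trans = [[L(0.8), L(0.2)], [L(0.2), L(0.8)]]
path = viterbi(obs, trans, [L(0.5), L(0.5)])
```

A real decoder searches over phone models concatenated into words under the language model; the dynamic program itself is the same shape as this sketch.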
Speech: What works, what doesn’t

• What works:
Techniques:
- MFCC features + GMM/HMM systems trained with Baum-Welch (EM)
- using lots of training data
Domains:
- controlled, low-noise environments
- constrained, predictable contexts
- motivated, co-operative users
• What doesn’t work:
Techniques:
- rules based on ‘insight’
- perceptual representations (except when they do...)
Domains:
- spontaneous, informal speech
- unusual accents, voice quality, speaking style
- variable, high-noise background / environment

Outline

1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
- Meeting recordings
- Alarm sounds
- Music signal processing
4 Observations and Conclusions

Other Audio Applications: ICSI Meeting Recordings corpus

• Real meetings, 16-channel recordings, 80 hrs
- released through NIST/LDC
• Classification, e.g. detecting emphasized utterances based on the f0 contour (Kennedy & Ellis ’03)
- per-speaker normalized f0 as a unidimensional feature
→ simple threshold classification
[Figure: f0 distributions for Speaker 1 (55-440 Hz) and Speaker 2 (110-1760 Hz)]

Personal Audio

• LifeLog / MyLifeBits / Remembrance Agent:
- easy to record everything you hear
• Then what?
- prohibitive to review
- applications if access were easier?
• Automatic content analysis / indexing...
[Figure: spectrogram of several hours of personal audio, 14:30-18:30 clock time]
- find features to classify into e.g. locations

Alarm sound detection

• Alarm sounds have particular structure
- clear even at low SNRs
- potential applications...
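The threshold classification on per-speaker normalized f0 (Kennedy & Ellis ’03, meeting-recordings slide above) reduces to very little code. A sketch under my own assumptions: normalizing by the speaker’s median f0 and the 1.3 threshold are illustrative choices, not the parameters from the paper:

```python
def normalized_f0(f0_track, speaker_median):
    # per-speaker normalization: express each f0 value relative to the
    # speaker's median pitch (an assumed choice of normalizer)
    return [f / speaker_median for f in f0_track]

def is_emphasized(f0_track, speaker_median, threshold=1.3):
    # flag the utterance if its peak normalized f0 exceeds the threshold
    return max(normalized_f0(f0_track, speaker_median)) > threshold

# toy f0 tracks in Hz; the medians and the threshold are illustrative
excited = is_emphasized([110, 180, 150], speaker_median=120)
flat = is_emphasized([100, 110, 105], speaker_median=120)
```

The appeal of a unidimensional feature is exactly this: the classifier degenerates to a single comparison, and all the work is in the feature (pitch tracking and per-speaker normalization).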
[Figure: spectrogram of restaurant noise plus alarms (snr 0), with the MLP classifier output and the sound-object classifier output below, 20-50 s]
• Contrast two systems (Ellis ’01):
- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)
- error rates are high, but the comparisons are interesting...

Music signal modeling

• Use a “machine listener” to navigate large music collections
- e.g. unsigned bands on MP3.com
• Classification to label:
- notes, chords, singing, instruments
- .. information to help cluster music
• “Artist models” based on feature distributions
[Figure: distributions of the third and fifth cepstral coefficients for madonna vs. bowie, and for Electronica vs. Country]
- measure similarity between users’ collections and new music? (Berenzweig & Ellis ’03)

Outline

1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions
- Model complexity
- Sound mixtures

Observations and Conclusions: Training and test data

• Balance model/data size to avoid overfitting:
[Figure: error rate vs. training data or parameters; training error keeps falling while test error turns up past the overfitting point, giving an optimal parameter/data ratio]
• Diminishing returns from more data:
[Figure: WER for PLP12N-8k nets vs. net size (500-4000 hidden-layer units) and training set (9.25-74 hours); more parameters and more training data both lower WER (~44% down to ~32%), with constant-training-time trade-offs]

Beyond classification

• “No free lunch”: the classifier can only do so much
- always need to consider the other parts of the system
• Features
- impose a ceiling on system performance
- improved features allow simpler classifiers
• Segmentation / mixtures
- e.g.
speech-in-noise: only a subset of the feature dimensions is available
→ missing-data approaches...
[Figure: observed mixture Y(t) of sources S1(t) and S2(t)]

Summary

• Statistical pattern recognition
- exploit training data for probabilistically-correct classifications
• Speech recognition
- a successful application of statistical PR
- .. but many remaining frontiers
• Other audio applications
- meetings, alarms, music
- classification is information extraction
• Current challenges
- variability in speech
- acoustic mixtures

Extra slides

Neural network classifiers

• Instead of estimating p(x|ωi) and using Bayes, can also try to estimate the posteriors Pr(ωi|x) directly (the decision boundaries)
• Sums over nonlinear functions of sums give a large range of decision surfaces...
• e.g. multi-layer perceptron (MLP):

yk = F[ Σj wjk · F[ Σi wij · xi ] ]

[Figure: MLP with inputs x1..x3, hidden units h1, h2 (weights wij, nonlinearity F[·]), outputs y1, y2 (weights wjk)]
• Problem is finding the weights wij ... (training)

Neural net classifier

• Models the boundaries, not the density p(x|ωi)
• Discriminant training
- concentrates on the boundary regions
- needs to see all classes at once

Why is Speech Recognition hard?

• Why not match against a set of waveforms?
- waveforms are never (nearly!) the same twice
- speakers minimize information/effort in speech
• Speech variability comes from many sources:
- speaker-dependent (SD) recognizers must handle within-speaker variability
- speaker-independent (SI) recognizers must also deal with variation between speakers
- all recognizers are afflicted by background noise and variable channels
→ Need recognition models that:
- generalize, i.e. accept variations within a range, and
- adapt, i.e.
‘tune in’ to a particular variant

Within-speaker variability

• Timing variation:
- word duration varies enormously
[Figure: spectrogram (0-4000 Hz, ~3 s) with word- and phone-level alignment of a spontaneous utterance]
- fast speech ‘reduces’ vowels
• Speaking style variation:
- careful/casual articulation
- soft/loud speech
• Contextual effects:
- speech sounds vary with context, role: “How do you do?”

Between-speaker variability

• Accent variation
- regional / mother tongue
• Voice quality variation
- gender, age, huskiness, nasality
• Individual characteristics
- mannerisms, speed, prosody
[Figure: spectrograms (0-8000 Hz, ~2.5 s) of the same utterance by speakers mbma0 and fjdm2]

Environment variability

• Background noise
- fans, cars, doors, papers
• Reverberation
- ‘boxiness’ in recordings
• Microphone channel
- huge effect on relative spectral gain
[Figure: spectrograms (0-4000 Hz, ~1.5 s) of the same utterance from a close mic and a tabletop mic]
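The MLP forward pass from the extra slides, yk = F[Σj wjk · F[Σi wij · xi]], maps directly to code. A sketch assuming a sigmoid for F[·] and arbitrary illustrative weights (like the slide’s equation, it omits bias terms):

```python
import math

def F(a):
    # sigmoid nonlinearity F[.] (an assumed choice)
    return 1.0 / (1.0 + math.exp(-a))

def mlp_forward(x, w_in, w_out):
    # hidden units: h_j = F(sum_i w_ij * x_i)
    h = [F(sum(w_in[i][j] * x[i] for i in range(len(x))))
         for j in range(len(w_in[0]))]
    # outputs: y_k = F(sum_j w_jk * h_j)
    return [F(sum(w_out[j][k] * h[j] for j in range(len(h))))
            for k in range(len(w_out[0]))]

# 3 inputs -> 2 hidden -> 2 outputs; the weights are arbitrary illustrations
w_in = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
w_out = [[1.0, -1.0], [-0.5, 0.7]]
y = mlp_forward([1.0, 0.5, -0.5], w_in, w_out)
```

Training (finding wij and wjk) is the hard part, as the slide notes; this sketch only shows why sums over nonlinear functions of sums give such a rich family of decision surfaces.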