Audio Signal Recognition for Speech, Music, and Environmental Sounds

1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions

Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/

Dan Ellis, Audio Signal Recognition, 2003-11-13 - 1 / 25

1 Pattern Recognition for Sounds

• Pattern recognition is abstraction
- continuous signal → discrete labels
- an essential part of understanding? “information extraction”
• Sound is a challenging domain
- sounds can be highly variable
- human listeners are extremely adept

Pattern classification

• Classes are defined as distinct regions in some feature space
- e.g. formant frequencies to define vowels
[Figure: spectrogram with formant tracks for “ay” and “ao”; scatter of Pols vowel formants “u” (x), “o” (o), “a” (+) in the F1/F2 plane, with a new observation x]
• Issues
- finding segments to classify
- transforming to an appropriate feature space
- defining the class boundaries

Classification system parts

sensor → signal
→ pre-processing / segmentation (STFT; locate vowels) → segment
→ feature extraction (formant extraction) → feature vector
→ classification → class
→ post-processing (context constraints; costs/risk)

Feature extraction

• Feature choice is critical to performance
- make important aspects explicit, remove irrelevant details
- ‘equivalent’ representations can perform very differently in practice
- major opening for domain knowledge (“cleverness”)
• Mel-Frequency Cepstral Coefficients (MFCCs): ubiquitous speech features
- DCT of log spectrum on an ‘auditory’ scale
- approximately decorrelated ...
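The MFCC recipe above (a DCT of the log spectrum on a mel, i.e. ‘auditory’, scale) can be sketched in a few lines of Python. This is a minimal illustration, not a production front end: the mel formula, the toy frame, and starting from given mel-band energies (rather than an STFT plus a triangular filterbank) are all simplifying assumptions:

```python
import math

def hz_to_mel(f):
    # one common formula for the 'auditory' mel scale (an assumption here)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def dct_ii(x, n_out):
    # type-II DCT: approximately decorrelates the log mel energies
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(n_out)]

def mfcc_from_mel_energies(mel_energies, n_ceps=13):
    # MFCCs = DCT of the log mel-band energies
    log_e = [math.log(e + 1e-10) for e in mel_energies]
    return dct_ii(log_e, n_ceps)

# toy frame of 20 mel-band energies (illustrative values)
frame = [abs(math.sin(0.3 * i)) + 0.1 for i in range(20)]
ceps = mfcc_from_mel_energies(frame, n_ceps=5)
```

The DCT both compacts the spectral envelope into a few coefficients and approximately decorrelates them, which is what makes the diagonal-covariance Gaussian models used later workable.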
[Figure: Mel spectrogram (freq. channel vs. time) and the resulting MFCCs (cepstral coef. vs. time)]

Statistical Interpretation

• Observations are random variables whose distribution depends on the class:
- class ωi: discrete, hidden
- observation x: continuous, with distribution p(x|ωi); inference gives Pr(ωi|x)
• Source distributions p(x|ωi)
- reflect variability in the feature
- reflect noise in the observation
- generally have to be estimated from data (rather than known in advance)
[Figure: overlapping class-conditional densities p(x|ωi) for classes ω1 ... ω4 along x]

Priors and posteriors

• Bayesian inference can be interpreted as updating prior beliefs with new information, x:

Pr(ωi|x) = p(x|ωi) · Pr(ωi) / Σj p(x|ωj) · Pr(ωj)

- p(x|ωi) is the likelihood, Pr(ωi) the prior probability, Pr(ωi|x) the posterior probability
- the denominator is the ‘evidence’ Σj p(x|ωj) · Pr(ωj) = p(x)
• Posterior is the prior scaled by the likelihood and normalized by the evidence (so Σ(posteriors) = 1)
• Minimize the probability of error by choosing the maximum a posteriori (MAP) class:

ω̂ = argmaxωi Pr(ωi|x)

Practical implementation

• The optimal classifier is ω̂ = argmaxωi Pr(ωi|x), but we don’t know Pr(ωi|x)
• So, model the conditional distributions p(x|ωi), then use Bayes’ rule to find the MAP class:
labeled training examples {xn, ωxn} → sort according to class → estimate the conditional pdf for each class ωi

Gaussian models

• Model data distributions via a parametric model
- assume a known form, estimate a few parameters
• E.g. a Gaussian in 1 dimension:

p(x|ωi) = (1 / (√(2π) σi)) · exp(−½ ((x − µi) / σi)²)

- the leading term is the normalization
• For higher dimensions, need a mean vector µi and a d × d covariance matrix Σi
[Figure: 2-D Gaussian density surface and its elliptical contours]
• Fit more complex distributions with mixtures...
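The MAP rule and the 1-D Gaussian class models above combine into a tiny classifier; the two classes here, with their priors, means, and variances, are hypothetical numbers for illustration:

```python
import math

def gauss_pdf(x, mu, sigma):
    # 1-D Gaussian likelihood p(x | omega_i)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def map_classify(x, classes):
    # classes: {name: (prior, mu, sigma)}
    # Bayes' rule: posterior = likelihood * prior / evidence
    joint = {w: p * gauss_pdf(x, mu, s) for w, (p, mu, s) in classes.items()}
    evidence = sum(joint.values())
    post = {w: j / evidence for w, j in joint.items()}
    return max(post, key=post.get), post

# two illustrative classes with equal priors
classes = {"w1": (0.5, 0.0, 1.0), "w2": (0.5, 3.0, 1.0)}
label, post = map_classify(2.0, classes)
```

The posteriors sum to 1 by construction, since the evidence p(x) is just the sum of the per-class joint terms.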
Gaussian models for formant data

• Single Gaussians are a reasonable fit for this data
• Extrapolation of the decision boundaries can be surprising

Outline

1 Pattern Recognition for Sounds
2 Speech Recognition
- How it’s done
- What works, and what doesn’t
3 Other Audio Applications
4 Observations and Conclusions

How to recognize speech?

• Cross-correlate templates?
- waveform? spectrogram?
- time-warp problems
• Classify short segments as phones (or ...), handle the time-warp later
- model with slices of ~10 ms
- pseudo-piecewise-stationary model of words: sil g w eh n sil
[Figure: spectrogram (0-4000 Hz, 0-0.45 s) of an utterance segmented into those phones]

Speech Recognizer Architecture

• Almost all current systems are the same:
sound → feature calculation → feature vectors
→ acoustic classifier (acoustic model parameters; word models, e.g. s ah t) → phone probabilities
→ HMM decoder (language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence
→ understanding / application...
• Biggest source of improvement is the increase in training data
- .. along with algorithms to take advantage of it

Speech: Progress

• Annual NIST evaluations
[Figure: evaluation error rates falling from ~30% in 1990 toward 3% by 2005]
- steady progress (?), but still order(s) of magnitude worse than human listeners

Speech: Problems

• Natural, spontaneous speech is weird!
- coarticulation, deletions, disfluencies
→ is word transcription even a sensible approach?
• Other major problems
- speaking style, rate, accent
- environment / background...
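The HMM decoder in the architecture above picks the most probable state sequence given the per-frame phone probabilities from the acoustic classifier. A minimal Viterbi sketch, with a toy two-state model of my own rather than real phone models:

```python
import math

def viterbi(obs_logprob, log_trans, log_init):
    # obs_logprob[t][s]: log p(observation at t | state s), from the acoustic classifier
    # log_trans[s0][s1]: log transition probability; log_init[s]: log initial probability
    n = len(log_init)
    delta = [log_init[s] + obs_logprob[0][s] for s in range(n)]
    back = []
    for t in range(1, len(obs_logprob)):
        back_t, new_delta = [], []
        for s in range(n):
            best = max(range(n), key=lambda s0: delta[s0] + log_trans[s0][s])
            new_delta.append(delta[best] + log_trans[best][s] + obs_logprob[t][s])
            back_t.append(best)
        delta = new_delta
        back.append(back_t)
    # trace the best path backwards through the stored pointers
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for back_t in reversed(back):
        state = back_t[state]
        path.append(state)
    return list(reversed(path))

# toy model: two states that prefer to stay put
L = math.log
obs = [[L(0.9), L(0.1)], [L(0.8), L(0.2)], [L(0.1), L(0.9)]]
trans = [[L(0.8), L(0.2)], [L(0.2), L(0.8)]]
path = viterbi(obs, trans, [L(0.5), L(0.5)])
```

A real decoder searches over phone models concatenated into words under the language model; the dynamic program itself is the same shape as this sketch.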
Speech: What works, what doesn’t

• What works:
Techniques:
- MFCC features + GMM/HMM systems trained with Baum-Welch (EM)
- using lots of training data
Domains:
- controlled, low-noise environments
- constrained, predictable contexts
- motivated, co-operative users
• What doesn’t work:
Techniques:
- rules based on ‘insight’
- perceptual representations (except when they do...)
Domains:
- spontaneous, informal speech
- unusual accents, voice quality, speaking style
- variable, high-noise background / environment

Outline

1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
- Meeting recordings
- Alarm sounds
- Music signal processing
4 Observations and Conclusions

Other Audio Applications: ICSI Meeting Recordings corpus

• Real meetings, 16-channel recordings, 80 hrs
- released through NIST/LDC
• Classification, e.g. detecting emphasized utterances based on the f0 contour (Kennedy & Ellis ’03)
- per-speaker normalized f0 as a unidimensional feature
→ simple threshold classification
[Figure: f0 distributions for Speaker 1 (55-440 Hz) and Speaker 2 (110-1760 Hz)]

Personal Audio

• LifeLog / MyLifeBits / Remembrance Agent:
- easy to record everything you hear
• Then what?
- prohibitive to review
- applications if access were easier?
• Automatic content analysis / indexing...
[Figure: spectrogram of several hours of personal audio, 14:30-18:30 clock time]
- find features to classify into e.g. locations

Alarm sound detection

• Alarm sounds have particular structure
- clear even at low SNRs
- potential applications...
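The threshold classification on per-speaker normalized f0 (Kennedy & Ellis ’03, meeting-recordings slide above) reduces to very little code. A sketch under my own assumptions: normalizing by the speaker’s median f0 and the 1.3 threshold are illustrative choices, not the parameters from the paper:

```python
def normalized_f0(f0_track, speaker_median):
    # per-speaker normalization: express each f0 value relative to the
    # speaker's median pitch (an assumed choice of normalizer)
    return [f / speaker_median for f in f0_track]

def is_emphasized(f0_track, speaker_median, threshold=1.3):
    # flag the utterance if its peak normalized f0 exceeds the threshold
    return max(normalized_f0(f0_track, speaker_median)) > threshold

# toy f0 tracks in Hz; the medians and the threshold are illustrative
excited = is_emphasized([110, 180, 150], speaker_median=120)
flat = is_emphasized([100, 110, 105], speaker_median=120)
```

The appeal of a unidimensional feature is exactly this: the classifier degenerates to a single comparison, and all the work is in the feature (pitch tracking and per-speaker normalization).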
[Figure: spectrogram of restaurant noise plus alarms (snr 0), with the MLP classifier output and the sound-object classifier output below, 20-50 s]
• Contrast two systems (Ellis ’01):
- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)
- error rates are high, but the comparisons are interesting...

Music signal modeling

• Use a “machine listener” to navigate large music collections
- e.g. unsigned bands on MP3.com
• Classification to label:
- notes, chords, singing, instruments
- .. information to help cluster music
• “Artist models” based on feature distributions
[Figure: distributions of the third and fifth cepstral coefficients for madonna vs. bowie, and for Electronica vs. Country]
- measure similarity between users’ collections and new music? (Berenzweig & Ellis ’03)

Outline

1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions
- Model complexity
- Sound mixtures

Observations and Conclusions: Training and test data

• Balance model/data size to avoid overfitting:
[Figure: error rate vs. training data or parameters; training error keeps falling while test error turns up past the overfitting point, giving an optimal parameter/data ratio]
• Diminishing returns from more data:
[Figure: WER for PLP12N-8k nets vs. net size (500-4000 hidden-layer units) and training set (9.25-74 hours); more parameters and more training data both lower WER (~44% down to ~32%), with constant-training-time trade-offs]

Beyond classification

• “No free lunch”: the classifier can only do so much
- always need to consider the other parts of the system
• Features
- impose a ceiling on system performance
- improved features allow simpler classifiers
• Segmentation / mixtures
- e.g.
speech-in-noise: only a subset of the feature dimensions is available
→ missing-data approaches...
[Figure: observed mixture Y(t) of sources S1(t) and S2(t)]

Summary

• Statistical pattern recognition
- exploit training data for probabilistically-correct classifications
• Speech recognition
- a successful application of statistical PR
- .. but many remaining frontiers
• Other audio applications
- meetings, alarms, music
- classification is information extraction
• Current challenges
- variability in speech
- acoustic mixtures

Extra slides

Neural network classifiers

• Instead of estimating p(x|ωi) and using Bayes, can also try to estimate the posteriors Pr(ωi|x) directly (the decision boundaries)
• Sums over nonlinear functions of sums give a large range of decision surfaces...
• e.g. multi-layer perceptron (MLP):

yk = F[ Σj wjk · F[ Σi wij · xi ] ]

[Figure: MLP with inputs x1..x3, hidden units h1, h2 (weights wij, nonlinearity F[·]), outputs y1, y2 (weights wjk)]
• Problem is finding the weights wij ... (training)

Neural net classifier

• Models the boundaries, not the density p(x|ωi)
• Discriminant training
- concentrates on the boundary regions
- needs to see all classes at once

Why is Speech Recognition hard?

• Why not match against a set of waveforms?
- waveforms are never (nearly!) the same twice
- speakers minimize information/effort in speech
• Speech variability comes from many sources:
- speaker-dependent (SD) recognizers must handle within-speaker variability
- speaker-independent (SI) recognizers must also deal with variation between speakers
- all recognizers are afflicted by background noise and variable channels
→ Need recognition models that:
- generalize, i.e. accept variations within a range, and
- adapt, i.e.
‘tune in’ to a particular variant

Within-speaker variability

• Timing variation:
- word duration varies enormously
[Figure: spectrogram (0-4000 Hz, ~3 s) with word- and phone-level alignment of a spontaneous utterance]
- fast speech ‘reduces’ vowels
• Speaking style variation:
- careful/casual articulation
- soft/loud speech
• Contextual effects:
- speech sounds vary with context, role: “How do you do?”

Between-speaker variability

• Accent variation
- regional / mother tongue
• Voice quality variation
- gender, age, huskiness, nasality
• Individual characteristics
- mannerisms, speed, prosody
[Figure: spectrograms (0-8000 Hz, ~2.5 s) of the same utterance by speakers mbma0 and fjdm2]

Environment variability

• Background noise
- fans, cars, doors, papers
• Reverberation
- ‘boxiness’ in recordings
• Microphone channel
- huge effect on relative spectral gain
[Figure: spectrograms (0-4000 Hz, ~1.5 s) of the same utterance from a close mic and a tabletop mic]
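The MLP forward pass from the extra slides, yk = F[Σj wjk · F[Σi wij · xi]], maps directly to code. A sketch assuming a sigmoid for F[·] and arbitrary illustrative weights (like the slide’s equation, it omits bias terms):

```python
import math

def F(a):
    # sigmoid nonlinearity F[.] (an assumed choice)
    return 1.0 / (1.0 + math.exp(-a))

def mlp_forward(x, w_in, w_out):
    # hidden units: h_j = F(sum_i w_ij * x_i)
    h = [F(sum(w_in[i][j] * x[i] for i in range(len(x))))
         for j in range(len(w_in[0]))]
    # outputs: y_k = F(sum_j w_jk * h_j)
    return [F(sum(w_out[j][k] * h[j] for j in range(len(h))))
            for k in range(len(w_out[0]))]

# 3 inputs -> 2 hidden -> 2 outputs; the weights are arbitrary illustrations
w_in = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
w_out = [[1.0, -1.0], [-0.5, 0.7]]
y = mlp_forward([1.0, 0.5, -0.5], w_in, w_out)
```

Training (finding wij and wjk) is the hard part, as the slide notes; this sketch only shows why sums over nonlinear functions of sums give such a rich family of decision surfaces.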