From last time … ASR System Architecture
• Block diagram: Speech Signal -> Signal Processing -> Cepstrum -> Probability Estimator -> Probabilities (e.g., "z" = 0.81, "th" = 0.15, "t" = 0.03) -> Decoder -> Recognized Words ("zero", "three", "two")
• The Decoder also draws on a Grammar and a Pronunciation Lexicon

A Few Points about Human Speech Recognition
(See Chapter 18 for much more on this)

Human Speech Recognition
• Experiments dating from 1918 dealing with noise and reduced bandwidth (Fletcher)
• Statistics of CVC perception
• Comparisons between human and machine speech recognition
• A few thoughts

The Ear (figure)

The Cochlea (figure)

Assessing Recognition Accuracy
• Intelligibility
• Articulation – Fletcher experiments
  – CVC, VC, CV syllables in carrier sentences
  – Tests over different SNRs and frequency bands
  – Example: "The first group is `mav'" (forced choice between mav and nav)
  – Used sharp lowpass and/or highpass filtering. For equal energy, the crossover is 450 Hz; for equal articulation, 1550 Hz.

Results
• S = v · c² (CVC syllable accuracy as the product of vowel accuracy and consonant accuracy squared)
• Articulation Index (the original "AI")
• Error independence between bands
  – Articulatory band ≈ 1 mm along the basilar membrane
  – 20 filters between 300 and 8000 Hz
  – A single zero-error band -> no error overall!
  – Robustness to a range of problems
  – AI = (1/K) Σ_k (SNR_k / 30), where each band's SNR saturates at 0 and 30 dB

AI additivity
• s(a,b) = phone accuracy for the band from a to b, with a < b < c
• (1 - s(a,c)) = (1 - s(a,b)) (1 - s(b,c))
• log10(1 - s(a,c)) = log10(1 - s(a,b)) + log10(1 - s(b,c))
• AI(s) = log10(1 - s) / log10(1 - s_max)
• AI(s(a,c)) = AI(s(a,b)) + AI(s(b,c))

Jont Allen interpretation: The Big Idea
• Humans don't use frame-like spectral templates
• Instead, partial recognition in bands
• Combined for phonetic (syllabic?) recognition
• Important for 3 reasons:
  – Based on decades of listening experiments
  – Based on a theoretical structure that matched the results
  – Different from what ASR systems do

Questions about AI
• Based on phones – the right unit for fluent speech?
• Lost correlation between distant bands?
• Lippmann experiments with disjoint bands
  – Signal above 8 kHz helps a lot in combination with signal below 800 Hz

Human SR vs ASR: Quantitative Comparisons
• Lippmann compilation (see book): typically about a factor of 10 difference in word error rate, with humans lower
• Hasn't changed too much since his study
• Keep in mind this caveat: "human" scores are ideal – under sustained real conditions people don't pay perfect attention (especially after lunch)

Human SR vs ASR: Quantitative Comparisons (2)

  System                      10 dB SNR   16 dB SNR   "Quiet"
  Baseline HMM ASR            77.4%       42.2%       7.2%
  ASR w/ noise compensation   12.8%       10.0%       -
  Human listener              1.1%        1.0%        0.9%

Word error rates for a 5000-word Wall Street Journal read-speech task with additive automotive noise (old numbers – ASR would be a bit better now)

Human SR vs ASR: Qualitative Comparisons
• Signal processing
• Subword recognition
• Temporal integration
• Higher-level information

Human SR vs ASR: Signal Processing
• Many maps vs. one
• Sampled across time-frequency vs. sampled in time only
• Some hearing-based signal processing is already used in ASR

Human SR vs ASR: Subword Recognition
• Knowing what is important (from the maps)
• Combining it optimally

Human SR vs ASR: Temporal Integration
• Using or ignoring duration (e.g., VOT)
• Compensating for rapid speech
• Incorporating multiple time scales

Human SR vs ASR: Higher Levels
• Syntax
• Semantics
• Pragmatics
• Getting the gist
• Dialog to learn more

Human SR vs ASR: Conclusions
• When we pay attention, human SR is much better than ASR
• Some aspects of human models are making their way into ASR
• Probably much more to do, once we learn how to do it right
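
As a supplement to the AI slides above, here is a minimal Python sketch of the band-SNR form of the Articulation Index, the error-independence combination across bands, and a numerical check of the additivity relation. The band SNRs, number of bands, and the s_max value are illustrative assumptions, not Fletcher's measured data.

```python
import math

def articulation_index(band_snrs_db, snr_floor=0.0, snr_ceiling=30.0):
    """AI = (1/K) * sum_k (SNR_k / 30), with each band SNR clipped to [0, 30] dB."""
    clipped = [min(max(snr, snr_floor), snr_ceiling) for snr in band_snrs_db]
    return sum(snr / snr_ceiling for snr in clipped) / len(clipped)

# Example: 20 articulation bands (as on the slide), with made-up per-band SNRs;
# the last five bands are fully masked by noise.
band_snrs = [25.0] * 10 + [10.0] * 5 + [-5.0] * 5
print(f"AI = {articulation_index(band_snrs):.2f}")

def combine_band_accuracies(accuracies):
    """Error independence between bands: combined error is the product of
    per-band errors, so a single zero-error band gives zero overall error."""
    error = 1.0
    for s in accuracies:
        error *= (1.0 - s)
    return 1.0 - error

print(combine_band_accuracies([0.6, 0.7, 1.0]))  # -> 1.0 (one perfect band)

def ai_from_accuracy(s, s_max=0.985):
    """AI(s) = log10(1 - s) / log10(1 - s_max); s_max is the full-band accuracy."""
    return math.log10(1.0 - s) / math.log10(1.0 - s_max)

# Additivity check: AI of the combined band equals the sum of the band AIs.
s_ab, s_bc = 0.60, 0.70
s_ac = combine_band_accuracies([s_ab, s_bc])
assert abs(ai_from_accuracy(s_ac)
           - (ai_from_accuracy(s_ab) + ai_from_accuracy(s_bc))) < 1e-9
```

The additivity check passes by construction: since (1 - s(a,c)) is the product of the per-band error terms, taking log10 turns that product into a sum, and dividing by the common log10(1 - s_max) preserves it.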