Introduction

C.V.
Juan Arturo Nolazco-Flores
Associate Professor, Computer Science Department, ITESM, Campus Monterrey, México
Courses:
– Speech Processing, Computer Networks.
Ph.D. in Speech Recognition
M.Phil. in Speech and Language Processing
M.Sc. in Control Engineering
B.Sc. in Electronic Systems

Useful Information
E-mail: jnolazco@campus.mty.itesm.mx
Office: VII-426
Phone: 83-582000, ext. 4535, subext. 114

Plan of Work
Fundamentals of Speech Science (8 hours)
– Speech Production System
– Acoustic-Phonetic Characterisation
Modelling Speech Production
Signal Processing and Analysis Methods (8 hours)
– Bank-of-filters
– Windowing
– LPC
– Cepstral Coefficients
– Vector Quantization

Plan of Work
Speech Recognition (12 hours)
– Distances
– Time Alignment and Normalisation
– DTW
– Discrete HMM
– Continuous HMM
Speech Coding

Speech Recognition
Chapter 1 (Rabiner & Juang)
Introduction

What is speech recognition?
– It is the identification of the words in an utterance (speech -> orthographic transcription).
– It is based on pattern matching techniques.
– Knowledge is learned from data, usually using stochastic techniques.
– It uses powerful algorithms to optimise a mathematical model for a given task.

Notes
Do not confuse with speech understanding, which is the identification of the meaning of the utterance.
Do not confuse with speaker recognition, which is the identification of a speaker within a set of speakers.
– Main problem: the speaker may not want to be recognised.
Do not confuse with speaker verification, which verifies whether a speaker is who he or she claims to be.
– Main problem: the speaker may have a pharyngeal problem.

Word Speech Recognition
[Block diagram. Labels: User, Speech, Word Recognition System, Syntax, Set of valid words.]

Model for Speech Understanding
[Block diagram. Labels: User, Speech, Word Recognition Model, Higher Level Processing, Syntax, Semantics, Pragmatics, Voice Output, Task Description.]

Speech Recognition Disciplines
Signal Processing: spectral analysis.
Physics (Acoustics): human hearing studies.
Pattern Recognition: data clustering.
Communication and Information Theory: statistical models, Viterbi algorithms, etc.
Linguistics: grammar and language parsing.
Physiology: knowledge-based systems.
Computer Science: efficient algorithms, UNIX, C language.

History (50’s)
Speaker-dependent isolated digit recognition system (Bell Labs, 1952).
Phone recogniser (4 vowels and 9 consonants) (UCL, 1959).
– Statistical recognition
Speaker-independent recognition of 10 vowels (MIT, 1959).

History (60’s)
Hardware vowel recogniser (Radio Research Lab. in Tokyo, 1960).
Hardware phoneme recogniser (Kyoto University, 1962).
Realistic solution to the problem of nonuniformity of time scales in speech events (RCA Labs, 1964).
DTW (Soviet Union, 1968). Rediscovered in the West in the 80’s.
Continuous tracking of phonemes (CMU, 1966).

History (70’s)
Research effort in isolated word recognition.
Dynamic programming methods successfully applied to speech recognition.
Use of LPC in speech recognition.
Start of work on speaker-independent speech recognition.

History (80’s)
Research effort in connected-word recognition.
Shift from the template-based approach to statistical modelling methods (especially HMMs).
Applications of neural networks to speech recognition.
Major impetus to large-vocabulary, continuous speech recognition.
DARPA (Defense Advanced Research Projects Agency) sponsored a large research programme aimed at high recognition performance on a 1000-word database.

History (90’s)
DARPA project.
Emphasis on natural language.
– Spontaneous speech
Speech technology used within telephone networks.

Why is it difficult?
Speech is a complex combination of information from different levels, used together to convey a message.
Signal variability:
– Intra-speaker variability: emotional state, environment (Lombard effect).
– Inter-speaker variability: physiological differences, accent, dialect, etc.
– Acoustic channel: telephone channel, background noise/speech, etc.

Task classification

Mode of speaking   Speaker set          Environment   Vocabulary
Isolated word      Speaker dependent    noise free    small (<50)
Connected word     Multi-speaker        office        medium (<500)
Continuous         Independent          telephone     large (<5000)
                                        high noise    very large (>5000)
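
The vocabulary column of the task classification can be read as a set of size thresholds. The Python sketch below is only an illustration of that reading, not part of the course material: the function name `vocabulary_class` is hypothetical, and placing a vocabulary of exactly 5000 words in the "very large" class is an assumption, since the slide lists only "<5000" and ">5000".

```python
def vocabulary_class(n_words: int) -> str:
    """Map a vocabulary size to the size class from the task classification.

    Thresholds follow the slide: small (<50), medium (<500),
    large (<5000), very large (>5000).
    """
    if n_words < 0:
        raise ValueError("vocabulary size must be non-negative")
    if n_words < 50:
        return "small"
    if n_words < 500:
        return "medium"
    if n_words < 5000:
        return "large"
    # Assumption: the slide does not say where exactly 5000 belongs;
    # this sketch treats it as "very large".
    return "very large"


if __name__ == "__main__":
    # Example sizes: a digit task, a command set, a 1000-word task, dictation.
    for size in (10, 200, 1000, 20000):
        print(size, vocabulary_class(size))
```

Under this reading, a 1000-word task such as the one targeted by the DARPA programme in the 80's falls in the "large" class.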