Part 1

Introduction
C.V.



Juan Arturo Nolazco-Flores
Associate Professor
Computer Science Department,
ITESM, campus Monterrey, México
Courses:
– Speech Processing, Computer Networks.




Ph.D. in Speech Recognition
M.Phil. in Speech and Language Processing.
M.Sc. in Control Engineering
B.Sc. in Electronic Systems
Useful Information
E-mail: jnolazco@campus.mty.itesm.mx
Office: VII-426
Phone: 83-582000, ext. 4535, subext. 114

Plan of Work

Fundamentals of Speech Science (8 hours)
– Speech Production System
– Acoustic-Phonetic Characterisation


Modelling Speech Production
Signal Processing and Analysis Methods (8 hours)
– Bank-of-filters
– Windowing
– LPC
– Cepstral Coefficients
– Vector Quantization
Plan of Work (continued)

Speech Recognition (12 hours)
– Distances
– Time Alignment and Normalisation
– DTW
– Discrete HMM
– Continuous HMM

Speech Coding
Speech Recognition
Chapter 1 (Rabiner & Juang)
Introduction

What is speech recognition?
– It is the identification of the words in an utterance (speech -> orthographic transcription).
– It is based on pattern-matching techniques.
– Knowledge is learned from data, usually with stochastic techniques.
– It uses powerful algorithms to optimise a mathematical model for a given task.
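As a rough illustration of pattern matching with knowledge learned from data, the sketch below builds one template per word by averaging training feature vectors and recognises an input by picking the nearest template. The function names (`train`, `recognise`), the two-dimensional feature vectors, and the Euclidean distance are all assumptions made for this example; real systems use richer features and the stochastic models covered later in the course.

```python
import math

def euclidean(a, b):
    # Plain Euclidean distance between two equal-length feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train(examples):
    # examples: dict mapping a word to a list of feature vectors (hypothetical data).
    # The "knowledge learned from data" here is simply the mean vector per word.
    templates = {}
    for word, vectors in examples.items():
        n = len(vectors)
        templates[word] = [sum(column) / n for column in zip(*vectors)]
    return templates

def recognise(features, templates):
    # Pattern matching: return the word whose template is closest to the input.
    return min(templates, key=lambda word: euclidean(features, templates[word]))

# Toy training set with made-up feature values.
examples = {
    "yes": [[1.0, 0.2], [0.8, 0.4]],
    "no": [[0.1, 0.9], [0.3, 0.7]],
}
templates = train(examples)
print(recognise([0.85, 0.25], templates))
```

The set of keys in `templates` plays the role of the vocabulary: the recogniser can only ever output one of the words it was trained on.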
Notes


Do not confuse it with speech understanding, which is the identification of the meaning of an utterance.
Do not confuse it with speaker recognition, which is the identification of a speaker within a set of speakers.
– Main problem: the speaker may not want to be recognised.

Do not confuse it with speaker verification, which verifies whether a speaker is who he (or she) claims to be.
– Main problem: the speaker may have a pharyngeal problem.
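The difference between speaker recognition (identification) and speaker verification can be sketched as two decision rules over the same similarity scores. Everything here is hypothetical: the speaker names, the score values, and the threshold are invented for illustration, and how the scores are computed is outside the scope of this sketch.

```python
def identify(scores):
    # Speaker recognition: choose the best-matching speaker from a closed set.
    return max(scores, key=scores.get)

def verify(scores, claimed, threshold):
    # Speaker verification: accept or reject a single claimed identity.
    return scores[claimed] >= threshold

# Made-up similarity scores between an utterance and each enrolled speaker.
scores = {"ana": 0.62, "luis": 0.91, "rosa": 0.40}

print(identify(scores))             # best match among the enrolled speakers
print(verify(scores, "luis", 0.8))  # claim accepted if the score clears the threshold
print(verify(scores, "ana", 0.8))   # claim rejected otherwise
```

Identification always returns some speaker from the set, whereas verification is a yes/no decision whose error rates depend on where the threshold is placed.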
Word Speech Recognition
[Figure: block diagram. Labels: User; Speech; Word Recognition System; Syntax; Set of valid words.]
Model for Speech Understanding
[Figure: block diagram. Labels: User; Speech; Word Recognition Model; Higher Level Processing; Syntax, Semantics, Pragmatics; Voice Output; Task Description.]
Speech Recognition Disciplines




Signal Processing: spectral analysis.
Physics (Acoustics): human hearing studies.
Pattern Recognition: data clustering.
Communication and Information Theory: statistical models, Viterbi algorithm, etc.
Linguistics: grammar and language parsing.
Physiology: knowledge-based systems.
Computer Science: efficient algorithms, UNIX, C language.
History (50’s)
Speaker-dependent isolated digit recognition system (Bell Labs, 1952).
Phone recogniser (4 vowels and 9 consonants) (UCL, 1959).
– Statistical recognition
Speaker-independent recognition of 10 vowels (MIT, 1959).
History (60’s)





Hardware vowel recogniser (Radio Research Lab. in Tokyo, 1960).
Hardware phoneme recogniser (Kyoto University, 1962).
Realistic solution to the problem of nonuniformity of time scales in speech events (RCA Labs., 1964).
DTW (Soviet Union, 1968). Rediscovered in the 80’s in the West.
Continuous tracking of phonemes (CMU, 1966).
History (70’s)
Research effort in isolated word recognition.
Dynamic programming methods successfully applied in speech recognition.
Use of LPC in speech recognition.
Work starts on speaker-independent speech recognition.

History (80’s)





Research effort in connected-word recognition.
Shift from the template-based approach to statistical modelling methods (especially HMMs).
Applications of neural networks to speech recognition.
Large impetus to large-vocabulary, continuous speech recognition.
DARPA (Defense Advanced Research Projects Agency) project, which sponsored a large research programme to obtain high recognition performance on a 1000-word database.
History (90’s)
DARPA project
Emphasis on natural language.
– Spontaneous speech
Speech technology used within telephone networks.
Why is it difficult?


Speech is a complex combination of information from different levels, used together to convey a message.
Signal variability:
– Intra-speaker variability: emotional state, environment (Lombard effect).
– Inter-speaker variability: physiological differences, accent, dialect, etc.
– Acoustic channel: telephone channel, background noise/speech, etc.
Task classification
Mode of speaking   Speaker set           Environment   Vocabulary
Isolated word      Speaker dependent     noise free    small (<50)
Connected-word     Multi-speaker         office        medium (<500)
Continuous         Speaker independent   telephone     large (<5000)
                                         high noise    very large (>5000)