Introduction

C.V.



Juan Arturo Nolazco-Flores
Associate Professor
Computer Science Department,
ITESM, campus Monterrey, México
Courses: Speech Processing, Computer Networks.
E-mail: jnolazco@itesm.mx
Office: Biocomputing
Phone: ext. 2726
Plan of Work
Introduction (0.5 hours)
Signal Processing and Analysis Methods (1 hour)
– Bank-of-filters
– Windowing
– LPC
– Cepstral Coefficients
– Vector Quantization
Speech Recognition (6 hours)
– HMM Basics (0.5 hours)
– Isolated Word Recognition (1.5 hours)
    Acoustic Modeling using HMM
    Evaluation
    Training D-HMM, CD-HMM
    Language Model
– Continuous Word Recognition Using HMM (1.0 hour)
    Evaluation
    Training D-HMM, CD-HMM
    Language Model
Introduction

What is speech recognition?
– It is the identification of the words in an utterance (speech -> orthographic transcription).
– It is based on pattern-matching techniques.
– Knowledge is learned from data, usually with stochastic techniques.
– It uses powerful algorithms to optimise a mathematical model for a given task.
Notes



Do not confuse speech recognition with speech understanding, which is the identification of the meaning of an utterance.
Do not confuse it with speaker recognition, which covers two tasks:
Speaker identification, which is the identification of a speaker within a set of speakers.
– Main problem: the speaker may not want to be identified.

Speaker verification, which verifies whether a speaker is who he or she claims to be.
– Main problem: the speaker may have a pharyngeal problem.
Speech Recognition System Modules
[Figure: ASR architecture. Blocks: speech, database, text, scoring.]
Speech Recognition Disciplines




Signal Processing: spectral analysis.
Physics (Acoustics): human hearing studies.
Pattern Recognition: data clustering.
Communication and Information Theory: statistical models, Viterbi algorithm, etc.
Linguistics: grammar and language parsing.
Physiology: knowledge-based systems.
Computer Science: efficient algorithms, UNIX, C language.
Task classification
Mode of speaking   Speaker set           Environment   Vocabulary
Isolated word      Speaker dependent     Noise free    Small (<50)
Connected word     Multi-speaker         Office        Medium (<500)
Continuous         Speaker independent   Telephone     Large (<5000)
                                         High noise    Very large (>5000)
History (1950s and 1960s)

Speaker dependent:
– Isolated digit recognition system (Bell Labs, 1952).
– Phone recogniser for 4 vowels and 9 consonants (UCL, 1959).

Speaker independent:
– Recognition of 10 vowels (MIT, 1959).

Hardware implementation of small I-SD (1960s, Japan).
History





DTW (variability) and LPC in ASR (1970s).
Connected-word recognition (1980s).
HMMs and neural networks in ASR.
Large-vocabulary continuous ASR.
Standard databases (1990s):
– DARPA (Defense Advanced Research Projects Agency) project, 1000-word database.
– Wall Street Journal (read-speech database).
– http://www.ldc.upenn.edu/Catalog/

Spontaneous speech (1990s).
Database

Contains:
– Waveform files stored in a specific format (e.g. PCM, mu-law, A-law, GSM).
  SPHINX: /net/alf33/usr1/databases/TELEFONICA_1/WAV_FILES
– Every waveform file has a transcription file (either phonemes or words).
  SPHINX: ../../training_input/train.lsn

History (at Present)
Domain-dependent ASR.
Experimentation with new stochastic modelling.
Speech recognition in noise.
Speech recognition for distorted speech (cellular phones, VoIP).
Experimentation with new ways to characterize the speech.

Why is ASR difficult?

Speech is a complex combination of information from different levels that is used to convey a message.
Signal variability:
– Intra-speaker variability: emotional state, environment (Lombard effect).
– Inter-speaker variability: physiological differences, accent, dialect, etc.
– Acoustic channel: telephone channel, background noise or speech, etc.
Acoustic Processing
(Speech Processing Front End)
Converts the speech waveform into some type of parametric representation.
[Diagram: s_k -> Speech Processing Front End -> parametric representation]
Parametric representations include:
– Zero-crossing rate
– Short-time energy
– Short-time spectral envelope, etc.
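Two of the parametric representations listed above can be sketched in a few lines. This is only an illustration: the function names, the 8 kHz sampling rate, the 20 ms segment and the sine test signals are assumptions, not part of the course material.

```python
import numpy as np

def short_time_energy(frame):
    """Sum of squared samples in one short segment."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    frame = np.asarray(frame, dtype=float)
    signs = np.sign(frame)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

# Hypothetical test signals: 20 ms of a 100 Hz and a 1000 Hz sine at 8 kHz.
fs = 8000
t = np.arange(int(0.020 * fs)) / fs
low = np.sin(2 * np.pi * 100 * t)
high = np.sin(2 * np.pi * 1000 * t)

print(zero_crossing_rate(low), zero_crossing_rate(high))
print(short_time_energy(low))
```

The higher-frequency tone crosses zero more often, which is why the zero-crossing rate is sometimes used as a crude indicator of spectral content.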
Speech Analysis

What can we observe from the speech waveform?

The speech signal is a non-stationary signal.
Acoustic Processing
(Signal Processing Front End)
The front end converts the speech waveform s_k into a time-dependent parametric representation, O(t, features).
[Diagram: s_k -> Speech Processing Front End -> O(t, features)]
Examples



Articulation position vs. time.
Signal power vs. time (Cohen time-frequency analysis).
However, obtaining these changes over continuous feature and time spaces is impossible in general. It becomes possible under some assumptions, but many of the results are not useful from an engineering point of view (e.g. negative power spectra).

Fortunately, if we take small segments of speech, we can treat the speech as stationary within each segment (quasi-stationary).
Short Time Analysis
(Discrete-time time-frequency function)
[Figure: segments o(1), o(2), o(3), o(4)]
Changing the time resolution
[Figure: segments o(1), o(2), ..., o(8)]
In speech, normally:
– The size of the segments is between 15 and 25 ms.
– A new segment is taken every 10 ms (the sampling time of the segment sequence).
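The segmentation described above can be sketched as follows. The function name `frame_signal`, the 8 kHz sampling rate and the placeholder signal are assumptions for illustration; the 25 ms segment size and 10 ms step come from the figures quoted on this slide.

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping segments, one per row."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    shift = int(round(fs * shift_ms / 1000.0))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Row t holds samples [t*shift, t*shift + frame_len)
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]

fs = 8000                            # assumed sampling rate in Hz
signal = np.arange(fs, dtype=float)  # one second of placeholder samples
frames = frame_signal(signal, fs)
print(frames.shape)                  # (number of segments, samples per segment)
```

Because the step is shorter than the segment, consecutive segments overlap. Shortening the segment improves time resolution at the cost of frequency resolution.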

Acoustic Processing
(Signal Processing Front End)
[Diagram: s_k -> Signal Processing Front End -> O = o(1) o(2) ... o(T)]
where
o(t) = [o(t,f_1), o(t,f_2), ..., o(t,f_P)]
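As a rough illustration of how a sequence O = o(1) ... o(T) of P-dimensional vectors can arise, the sketch below uses the log power spectrum of each Hamming-windowed segment as o(t). The function name, the 8 kHz sampling rate, the 256-point FFT and the choice of log power spectrum as the feature are all assumptions for illustration. The slides do not prescribe them.

```python
import numpy as np

def log_spectrum_features(signal, fs, frame_ms=25.0, shift_ms=10.0, n_fft=256):
    """Return O with one row o(t) per segment and P = n_fft // 2 + 1 columns."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    shift = int(round(fs * shift_ms / 1000.0))
    window = np.hamming(frame_len)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2
        rows.append(np.log(power + 1e-10))  # small floor avoids log(0)
    return np.array(rows)

fs = 8000                                          # assumed sampling rate in Hz
tone = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)  # 1 s test tone at 1 kHz
O = log_spectrum_features(tone, fs)
print(O.shape)              # (T, P)
print(int(np.argmax(O[0]))) # index f of the strongest component in o(1)
```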
Acoustic Processing
(Signal Processing Front End)
[Diagram: s_k -> Signal Processing Front End -> O = o(1) o(2) ... o(T)]
Front-end processing options shown in the diagram:
– Physiological modelling processing
– MFCC processing
– LP cepstra processing
Dynamic Features
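In order to incorporate dynamic features of the speech (context information), the first and/or second time derivatives of the features can be used.
For example, the first derivative of the m-th coefficient c_m can be approximated as:

$$\Delta c_m(t) \approx \mu \sum_{k=-K}^{K} k \, c_m(t+k)$$

where \mu is a normalization constant and the sum runs over a window of 2K+1 segments.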
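The first-derivative estimate can be sketched as a weighted sum over neighbouring segments. In this sketch the normalising constant is set to 1 / sum(k^2), and the first and last segments are repeated at the edges. Both are common choices but are assumptions here, as is the function name `delta`.

```python
import numpy as np

def delta(features, K=2):
    """Regression-based first derivative of a (T, P) feature matrix."""
    features = np.asarray(features, dtype=float)
    T = features.shape[0]
    # Repeat edge rows so that t + k is always defined (an assumption).
    padded = np.pad(features, ((K, K), (0, 0)), mode="edge")
    # Assumed normalisation: 1 / sum_{k=-K..K} k^2
    mu = 1.0 / (2.0 * sum(k * k for k in range(1, K + 1)))
    out = np.zeros_like(features)
    for k in range(-K, K + 1):
        out += k * padded[K + k:K + k + T]  # k * c(t + k)
    return mu * out

# A coefficient that grows by 3 per segment should have a derivative of 3.
ramp = 3.0 * np.arange(10, dtype=float).reshape(-1, 1)
d = delta(ramp, K=2)
print(d[2:8, 0])
```

With this normalisation a linear trend is recovered exactly away from the edges. The first and last K rows are affected by the edge padding.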
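Acoustic Processing
(Signal Processing Front End)
[Diagram: s_k -> Signal Processing Front End -> O = o(1) o(2) ... o(T)]
where
o(t) = [o(t,f_1), o(t,f_2), ..., o(t,f_P),
        o'(t,f_1), o'(t,f_2), ..., o'(t,f_P),
        o''(t,f_1), o''(t,f_2), ..., o''(t,f_P)]
Here o' and o'' denote the first and second time derivatives of the features.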
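Building the extended observation vector amounts to concatenating the static features with their first and second derivatives, giving 3P values per segment. The sketch below uses `numpy.gradient` (simple finite differences) as a stand-in for the derivative estimate. The function name, the use of finite differences and the 13-dimensional random test data are assumptions for illustration only.

```python
import numpy as np

def add_dynamic_features(static):
    """Stack static, first- and second-derivative features column-wise."""
    static = np.asarray(static, dtype=float)
    d1 = np.gradient(static, axis=0)  # finite-difference first derivative
    d2 = np.gradient(d1, axis=0)      # finite-difference second derivative
    return np.hstack([static, d1, d2])

# Hypothetical static features: T = 98 segments, P = 13 coefficients.
static = np.random.default_rng(0).normal(size=(98, 13))
O = add_dynamic_features(static)
print(O.shape)  # (T, 3 * P)
```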