ch8.1 (recognition principles).ppt

7-Speech Recognition
Speech Recognition Concepts
Speech Recognition Approaches
Recognition Theories
Bayes Rule
Simple Language Model
P(A|W) Computing Approaches
1
7-Speech Recognition (Cont’d)
HMM Calculating Approaches
Neural Components
Three Basic HMM Problems
Viterbi Algorithm
State Duration Modeling
Training In HMM
2
Recognition Tasks
Isolated Word Recognition (IWR)
Connected Word (CW) and Continuous Speech Recognition (CSR)
Speaker Dependent, Multiple Speaker, and Speaker Independent
Vocabulary Size
– Small: < 20
– Medium: > 100, < 1,000
– Large: > 1,000, < 10,000
– Very Large: > 10,000
3
Speech Recognition Concepts
Speech recognition is the inverse of speech synthesis
[Diagram: speech synthesis pipeline (Text → NLP → Phone Sequence → Speech Processing → Speech) contrasted with the speech recognition pipeline (Speech → Speech Processing → NLP → Text → Speech Understanding)]
4
Speech Recognition
Approaches
Bottom-Up Approach
Top-Down Approach
Blackboard Approach
5
Bottom-Up Approach
[Diagram: bottom-up processing chain — Signal Processing → Feature Extraction → Segmentation → ... → Recognized Utterance, guided at each stage by knowledge sources: Voiced/Unvoiced/Silence detection, Sound Classification Rules, Phonotactic Rules, Lexical Access, and the Language Model]
6
Top-Down Approach
[Diagram: top-down recognition — Feature Analysis → Unit Matching System → Lexical Hypothesis → Syntactic Hypothesis → Semantic Hypothesis → Utterance Verifier/Matcher → Recognized Utterance, driven by an inventory of speech recognition units, a word dictionary, a grammar, and a task model]
7
Blackboard Approach
[Diagram: blackboard architecture — Acoustic, Environmental, Lexical, Syntactic, and Semantic processes all communicate through a shared blackboard]
8
Recognition Theories
Articulatory-Based Recognition
– Uses the articulatory system for recognition
– This theory has been the most successful so far
Auditory-Based Recognition
– Uses the auditory system for recognition
Hybrid-Based Recognition
– A hybrid of the above theories
Motor Theory
– Models the intended gestures of the speaker
9
Recognition Problem
We have a sequence of acoustic symbols and want to find the words expressed by the speaker
Solution: find the most probable word sequence given the acoustic symbols
10
Recognition Problem
A : Acoustic Symbols
W : Word Sequence
We should find \hat{W} such that
P(\hat{W} \mid A) = \max_{W} P(W \mid A)
11
Bayes Rule
P(x \mid y)\, P(y) = P(x, y) = P(y \mid x)\, P(x)
P(x \mid y) = \frac{P(y \mid x)\, P(x)}{P(y)}
\Rightarrow\ P(W \mid A) = \frac{P(A \mid W)\, P(W)}{P(A)}
12
Bayes Rule (Cont’d)
P(\hat{W} \mid A) = \max_{W} P(W \mid A) = \max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)
13
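As a toy illustration of this decision rule (not from the original slides), the sketch below scores a few hypothetical candidate word sequences with made-up values for P(A|W) and P(W) and picks the argmax; since P(A) does not depend on W, it is dropped.

```python
# Toy illustration of the Bayes decision rule: W_hat = argmax_W P(A|W) P(W).
# The candidate sentences and all probabilities are invented for illustration only.
candidates = {
    "recognize speech":   {"p_a_given_w": 0.020, "p_w": 0.0010},
    "wreck a nice beach": {"p_a_given_w": 0.025, "p_w": 0.0001},
    "recognise peach":    {"p_a_given_w": 0.010, "p_w": 0.0002},
}

def score(entry):
    # P(A|W) * P(W); the common factor 1/P(A) is ignored in the argmax
    return entry["p_a_given_w"] * entry["p_w"]

w_hat = max(candidates, key=lambda w: score(candidates[w]))
print("Recognized word sequence:", w_hat)
```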
Simple Language Model
w = w_1 w_2 w_3 \cdots w_n
P(w) = \prod_{i=1}^{n} P(w_i \mid w_{i-1} w_{i-2} \cdots w_1)
= P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2, w_1)\, P(w_4 \mid w_3, w_2, w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_1)
= P(w_n, w_{n-1}, w_{n-2}, \ldots, w_1)
Computing this probability directly is very difficult and requires a very large corpus, so trigram and bigram models are used instead.
14
Simple Language Model
(Cont’d)
Trigram: P(w) = \prod_{i=1}^{n} P(w_i \mid w_{i-1} w_{i-2})
Bigram: P(w) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})
Monogram (unigram): P(w) = \prod_{i=1}^{n} P(w_i)
15
Simple Language Model
(Cont’d)
Computing method:
P(w_3 \mid w_2 w_1) = \frac{\text{number of occurrences of } w_1 w_2 w_3}{\text{number of occurrences of } w_1 w_2}
Ad hoc (interpolation) method:
P(w_3 \mid w_2 w_1) = \lambda_1 f(w_3 \mid w_2 w_1) + \lambda_2 f(w_3 \mid w_2) + \lambda_3 f(w_3)
where f denotes a relative-frequency estimate and the interpolation weights satisfy \lambda_1 + \lambda_2 + \lambda_3 = 1.
16
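A minimal sketch of the two estimates above, assuming a tiny illustrative corpus; the corpus, the relative-frequency helpers f, and the λ weights (0.6, 0.3, 0.1) are hypothetical choices, not values from the slides.

```python
from collections import Counter

# Toy corpus; in practice the counts come from a very large text corpus.
corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def f_tri(w1, w2, w3):
    # relative frequency of w3 following w1 w2
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

def f_bi(w2, w3):
    return bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0

def f_uni(w3):
    return unigrams[w3] / total

def p_interp(w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    # Interpolated estimate: l1*f(w3|w1 w2) + l2*f(w3|w2) + l3*f(w3)
    l1, l2, l3 = lambdas
    return l1 * f_tri(w1, w2, w3) + l2 * f_bi(w2, w3) + l3 * f_uni(w3)

print(p_interp("the", "cat", "sat"))
```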
Error-Producing Factors
Prosody (recognition should be prosody-independent)
Noise (noise should be suppressed)
Spontaneous Speech
17
P(A|W) Computing
Approaches
Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Hybrid Systems
18
Dynamic Time Warping
Search Limitations:
- First & end point constraints
- Global limitation
- Local limitation
Dynamic Time Warping
Global Limitation: [figure showing the global path constraint region]
Dynamic Time Warping
Local Limitation: [figure showing the allowed local path moves]
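The DTW slides themselves are figures only; the function below is a hedged sketch of a basic DTW alignment between two feature sequences, with the usual symmetric local constraint (horizontal, vertical, and diagonal steps) and an optional Sakoe–Chiba style global band standing in for the global limitation; the frame distance and band width are illustrative.

```python
import math

def dtw(ref, test, band=None):
    """Basic DTW between two 1-D feature sequences.

    ref, test : lists of frame features (scalars here for simplicity)
    band      : optional global constraint |i - j| <= band (Sakoe-Chiba style)
    Returns the accumulated alignment cost.
    """
    n, m = len(ref), len(test)
    INF = math.inf
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # outside the global search region
            cost = abs(ref[i - 1] - test[j - 1])  # local frame distance
            # local constraint: horizontal, vertical, or diagonal step
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4], band=2))
```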
Artificial Neural Network
[Diagram: simple computational element — inputs x_0, x_1, \ldots, x_{N-1}, weights w_0, w_1, \ldots, w_{N-1}, a summing junction, and an output nonlinearity]
y = f\left( \sum_{i=0}^{N-1} w_i x_i - \theta \right)
(f: nonlinearity, \theta: internal threshold)
Simple Computational Element of a Neural Network
26
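A small sketch of the computational element above; the slides do not specify the nonlinearity, so a sigmoid is assumed here for f.

```python
import math

def neuron(x, w, theta):
    # y = f(sum_i w_i * x_i - theta), with a sigmoid assumed as the nonlinearity f
    activation = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1.0 / (1.0 + math.exp(-activation))

# Example: 3 inputs with arbitrary weights and threshold
print(neuron(x=[0.5, -1.0, 0.25], w=[0.8, 0.1, -0.4], theta=0.2))
```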
Artificial Neural Network
(Cont’d)
Neural Network Types
– Perceptron
– Time Delay
– Time Delay Neural Network Computational Element (TDNN)
27
Artificial Neural Network
(Cont’d)
Single Layer Perceptron
[Diagram: single-layer perceptron mapping inputs x_0, \ldots, x_{N-1} to outputs y_0, \ldots, y_{M-1}]
28
Artificial Neural Network
(Cont’d)
Three Layer Perceptron
[Diagram: three-layer perceptron with successive fully connected layers of units]
29
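As a hedged sketch of the perceptron slides, the code below chains the computational element into a small three-layer network (input, one hidden layer, output) and runs a forward pass; the layer sizes and random weights are illustrative only.

```python
import math
import random

random.seed(0)

def layer(x, weights, thetas):
    # one layer of computational elements: y_j = f(sum_i w_ji * x_i - theta_j)
    return [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) - th)))
            for row, th in zip(weights, thetas)]

def init(n_out, n_in):
    # random weights and thresholds for one layer (illustrative initialization)
    return ([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [random.uniform(-1, 1) for _ in range(n_out)])

x = [0.2, 0.7, -0.5]                 # input feature vector (x_0 ... x_{N-1})
w_hid, th_hid = init(4, len(x))      # hidden layer
w_out, th_out = init(2, 4)           # output layer (y_0 ... y_{M-1})

hidden = layer(x, w_hid, th_hid)
output = layer(hidden, w_out, th_out)
print(output)
```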
2.5.4.2 Neural Network Topologies
30
TDNN
31
2.5.4.6 Neural Network Structures for
Speech Recognition
32
2.5.4.6 Neural Network Structures for
Speech Recognition
33
Hybrid Methods
Hybrid Neural Network and Matched Filter For
Recognition
[Diagram: speech features, passed through delays, feed a pattern classifier that produces acoustic output units]
34
Neural Network Properties
The system is simple, but many training iterations are needed
It does not determine a specific network structure
Despite its simplicity, the results are good
The training set is large, so training should be done offline
Accuracy is relatively good
35
Pre-processing
Different preprocessing techniques are
employed as the front end for speech
recognition systems
The choice of preprocessing method is
based on the task, the noise level, the
modeling tool, etc.
36
The MFCC Method
The MFCC method is based on how the human ear perceives sounds.
Compared with other features, MFCC performs better in noisy environments.
MFCC was originally proposed for speech recognition applications, but it also performs well for speaker recognition.
The auditory unit of the human ear is the mel, which is obtained from the following relation:
43
Steps of the MFCC Method
Step 1: map the signal from the time domain to the frequency domain using a short-time FFT.
Z(n): the speech signal
W(n): the window function (e.g., a Hamming window)
W_F = e^{-j 2\pi / F}
m = 0, \ldots, F - 1
F: the length of the speech frame
44
Steps of the MFCC Method
Step 2: find the energy of each channel of the filter bank.
M: the number of mel-scale filter banks
k = 0, 1, \ldots, M - 1
W_k(j): the filter-bank filter functions
45
Filter Distribution Based on the Mel Scale
46
Steps of the MFCC Method
Step 4: compress the spectrum and apply the DCT to obtain the MFCC coefficients.
In the relation above, n = 0, \ldots, L is the order of the MFCC coefficients.
47
The Mel-Cepstrum Method
[Diagram: time signal → framing → |FFT|² → mel-scaling → logarithm → IDCT → cepstra; a differentiator then produces the delta and delta-delta cepstra]
48
Mel-Cepstrum Coefficients (MFCC)
[Diagram: the low-order cepstral coefficients are retained as the MFCC features]
49
Properties of Mel-Cepstrum (MFCC) Features
Maps the mel filter-bank energies onto the directions of maximum variance (using the DCT)
Makes the speech features partially, though not completely, independent of one another (an effect of the DCT)
Good performance in clean environments
Reduced performance in noisy environments
50
Time-Frequency analysis
Short-term Fourier Transform
– Standard way of frequency analysis: decompose the
incoming signal into the constituent frequency components.
– W(n): windowing function
– N: frame length
– p: step size
51
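A minimal sketch of the short-term Fourier transform described above, with W(n) a Hamming window, N the frame length, and p the step size; the parameter values and the test tone are illustrative.

```python
import numpy as np

def stft(x, N=256, p=128):
    # Short-term Fourier transform: windowed frames -> one FFT spectrum per frame
    w = np.hamming(N)                                     # W(n): windowing function
    frames = [x[m:m + N] * w for m in range(0, len(x) - N + 1, p)]
    return np.array([np.fft.rfft(f) for f in frames])

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)       # synthetic 1 kHz tone
X = stft(x)
print(X.shape)                          # (n_frames, N//2 + 1)
```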
Critical band integration
Related to masking phenomenon: the
threshold of a sinusoid is elevated when its
frequency is close to the center frequency of
a narrow-band noise
Frequency components within a critical band are not resolved; the auditory system interprets the signals within a critical band as a whole
52
Bark scale
53
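The Bark-scale figure did not survive extraction; as a hedged sketch, the conversion below uses Zwicker's commonly cited approximation of the critical-band (Bark) scale, which is assumed here rather than taken from the slide.

```python
import math

def hz_to_bark(f_hz):
    # Zwicker's approximation of the critical-band (Bark) scale
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

for f in (100, 500, 1000, 4000, 8000):
    print(f, "Hz ->", round(hz_to_bark(f), 2), "Bark")
```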
Feature orthogonalization
Spectral values in adjacent frequency
channels are highly correlated
The correlation forces a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated
Decorrelation is useful to improve the parameter estimation
54
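A small sketch of why decorrelation helps: synthetic, highly correlated "log filter-bank energies" are passed through a DCT (the same transform used in MFCC), and the fraction of covariance mass off the diagonal drops, so a diagonal-covariance Gaussian needs far fewer parameters; the data here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "log filter-bank energies": neighbouring channels are highly correlated
n_frames, n_ch = 1000, 20
base = rng.normal(size=(n_frames, n_ch))
correlated = np.cumsum(base, axis=1)      # running sum -> strong channel correlation

# DCT-II basis used for decorrelation (same transform as in MFCC)
k = np.arange(n_ch)
dct = np.cos(np.pi * np.outer(k, (2 * k + 1) / (2 * n_ch)))
decorrelated = correlated @ dct.T

def offdiag_fraction(x):
    # fraction of the total absolute covariance mass lying off the diagonal
    c = np.cov(x, rowvar=False)
    off = np.abs(c).sum() - np.abs(np.diag(c)).sum()
    return off / np.abs(c).sum()

print("off-diagonal covariance fraction before DCT:", round(offdiag_fraction(correlated), 3))
print("off-diagonal covariance fraction after  DCT:", round(offdiag_fraction(decorrelated), 3))
```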