"What Does Speech 'Look' Like to an Automatic Speech Recognition System?" - Jiang Wu

Jiang Wu
jiang.wu@binghamton.edu
Electrical Engineering Department

Suppose you are allowed only 39 values
to represent a speech segment
1 second long…

What are good features?
◦ Discriminative
◦ “Curse of Dimensionality”
For each frame of prerecorded speech, we
try to extract features that compress
its spectrum.
Figure: basis vectors BV0, BV1, and BV2; amplitude vs. frequency [kHz], 0 to 8 kHz.
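The idea of compressing a frame's spectrum with a few basis vectors can be sketched as follows. The slides do not name the transform, so this sketch assumes cosine basis vectors applied to the log-magnitude spectrum; the function names, the 13-coefficient count, the Hamming window, and the 16 kHz synthetic frame are all illustrative assumptions.

```python
import numpy as np

def cosine_basis(num_coeffs, num_bins):
    """Rows are cosine basis vectors (BV0, BV1, ...) sampled over frequency bins."""
    k = np.arange(num_coeffs)[:, None]
    n = (np.arange(num_bins) + 0.5)[None, :]
    return np.cos(np.pi * k * n / num_bins)

def frame_features(frame, num_coeffs=13):
    """Compress one frame's log-magnitude spectrum into num_coeffs values."""
    windowed = frame * np.hamming(len(frame))
    # Small constant avoids log(0) on empty bins.
    log_spec = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    basis = cosine_basis(num_coeffs, len(log_spec))
    # Each feature is the average of the spectrum weighted by one basis vector.
    return basis @ log_spec / len(log_spec)

# Synthetic 25 ms frame at 16 kHz (assumed values), made of two sinusoids.
fs = 16000
t = np.arange(400) / fs
frame = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
feats = frame_features(frame)
print(feats.shape)  # (13,)
```

BV0 is constant, so the first feature is the average log-spectral level; higher-order vectors capture progressively finer spectral shape.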

Recent research has
shown that spectral
trajectories over time also
play an important role in
ASR.
Thus, we also want to let
computers see what happens
over time, around the center
of each static feature.
Figure: basis vectors BV0, BV1, and BV2; amplitude vs. time [ms], from -60 to 60 ms.
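A minimal sketch of looking at what happens over time around a center frame: each static feature's trajectory inside a window is projected onto a few cosine basis vectors over time. The function name, the window of 6 frames on each side (about 60 ms at an assumed 10 ms frame step), and the 3 time terms are assumptions for illustration, not details taken from the slides.

```python
import numpy as np

def trajectory_features(static, center, half_width=6, num_terms=3):
    """Describe how each static feature moves over time around `center`.

    static: array of shape (num_frames, num_static), one row per frame.
    The caller must keep center - half_width >= 0 and
    center + half_width < num_frames.
    """
    block = static[center - half_width:center + half_width + 1]
    window_len = block.shape[0]
    k = np.arange(num_terms)[:, None]
    n = (np.arange(window_len) + 0.5)[None, :]
    basis = np.cos(np.pi * k * n / window_len)   # (num_terms, window_len)
    coeffs = basis @ block / window_len          # (num_terms, num_static)
    # One group of num_terms values per static feature.
    return coeffs.T.ravel()

# Random stand-in for 50 frames of 13 static features (assumed sizes).
rng = np.random.default_rng(0)
static = rng.standard_normal((50, 13))
feats = trajectory_features(static, center=25)
print(feats.shape)  # (39,)
```

With 13 static features and 3 time terms, each center frame yields 39 numbers. The slides do not say whether the 39 values mentioned earlier are composed this way.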
Figure: Original Spectrogram and Rebuilt Spectrogram; frequency (Hz), 0 to 8000, vs. time (sec), 0.5 to 1.5.
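The original-versus-rebuilt comparison can be sketched for a single spectral slice: compress it to a few cosine coefficients, then rebuild it from those coefficients alone. The cosine (DCT-II style) expansion, the function names, and the synthetic two-peak spectrum are assumptions for illustration; the slides do not state how the rebuilt spectrogram was produced.

```python
import numpy as np

def dct_basis(num_coeffs, num_bins):
    """Cosine basis vectors over frequency bins, one per row."""
    k = np.arange(num_coeffs)[:, None]
    n = (np.arange(num_bins) + 0.5)[None, :]
    return np.cos(np.pi * k * n / num_bins)

def compress(log_spec, num_coeffs):
    """Keep only the first num_coeffs cosine coefficients of one spectral slice."""
    return dct_basis(num_coeffs, len(log_spec)) @ log_spec * (2.0 / len(log_spec))

def rebuild(coeffs, num_bins):
    """Rebuild a smoothed spectral slice from the kept coefficients."""
    weights = np.array(coeffs, dtype=float)
    weights[0] *= 0.5  # the constant term carries half weight in the inverse
    return weights @ dct_basis(len(weights), num_bins)

# Synthetic smooth spectrum with two broad peaks (a stand-in for real data).
x = np.linspace(0.0, 1.0, 128)
log_spec = np.exp(-((x - 0.2) / 0.1) ** 2) + 0.5 * np.exp(-((x - 0.6) / 0.15) ** 2)

for num_coeffs in (3, 13, 128):
    approx = rebuild(compress(log_spec, num_coeffs), len(log_spec))
    rms_error = np.sqrt(np.mean((approx - log_spec) ** 2))
    print(num_coeffs, rms_error)
```

Keeping more coefficients lowers the reconstruction error, and keeping all 128 recovers the slice up to rounding.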
Pitch Contour:
◦ Strongly related to “tones”
◦ Very popular feature type for
tonal languages: Mandarin,
Cantonese, some Korean
dialects, etc.
Figure: speech waveform (amplitude vs. time in seconds), pitch contour (frequency in Hz, 0 to 300, vs. time), and spectrogram (frequency in Hz, 0 to 8000, vs. time).
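One pitch value per frame, tracked over time, gives a pitch contour. The slides do not say which pitch tracker is used; the sketch below assumes a plain autocorrelation method, and the function name, the 60 to 400 Hz search range, and the synthetic 200 Hz frame are illustrative assumptions.

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch of one voiced frame from its autocorrelation peak."""
    frame = frame - np.mean(frame)
    # Keep non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)   # shortest period considered
    lag_max = int(fs / fmin)   # longest period considered
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / lag

# Synthetic 40 ms frame at 16 kHz: 200 Hz fundamental plus one harmonic.
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)
pitch = estimate_pitch(frame, fs)
print(pitch)  # close to 200 Hz for this synthetic frame
```

Real speech also needs a voiced/unvoiced decision and some smoothing across frames, which this sketch leaves out.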
Perceptual Features:
◦ Analyze the speech signal the way
the human auditory system
“perceptually” processes sound
◦ Frequency resolution and time
resolution both depend on
frequency and time.
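As one example of frequency-dependent frequency resolution, the widely used mel-scale approximation can be sketched as below. The slides do not name a particular perceptual scale, so the mel formula, the function names, and the choice of 11 bands up to 8000 Hz are assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Common mel-scale approximation: roughly linear below 1 kHz, log above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Band edges equally spaced on the mel scale between 0 and 8000 Hz.
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 12))
widths_hz = np.diff(edges_hz)
print(np.round(edges_hz))
print(np.round(widths_hz))
```

The bands are equally wide in mel but grow wider in Hz as frequency rises, so low frequencies are resolved more finely than high ones.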






Project 1: To create an open-source multilingual audio database for spoken
language processing applications.
Project 2: To understand tonal languages.
Dr. Montri Karnjanadecha
Chandra Vootkuri (Ph.D.)
Brian Wong (M.S.)
Andrew Hwang (M.S.)
(Force Alignment)
(Landmark Theory)
(Freq. Non-linearity)
(Feature Transform)
◦ Signal processing (A/D)
◦ Probability theory, pattern recognition, and
machine learning
◦ An understanding of the human auditory system,
linguistics, or music will be a bonus!
Tons of interesting
applications!
◦ Pronunciation therapy
◦ Hearing aids
◦ Singing voice processing
◦ Not just the Microsoft speech-to-text software on
your PC.
Questions?