CMSC5707 Topics in A.I.
CMSC Assignment 1
Audio signal processing
Assignment 1 of CMSC5707 V4c
Task 1
• (5%) Recording of the templates: Use your own sound
recording device (e.g. a mobile phone, Windows Sound Recorder or
http://www.goldwave.com/) to record the
numbers 1, 2, 3, 4 and name these files s1A.wav, s2A.wav,
s3A.wav and s4A.wav, respectively. Each word should last
about 0.6 to 0.8 seconds. Use http://formatfactory.en.softonic.com/
to convert your files to .wav if
necessary. (You may pronounce the words in English, Cantonese or
Mandarin.) These four files are
called set A and are used as the templates of our speech
recognition system. You may use any sampling rate (Fs) and
any number of bits per sample (bps). Typical values are
Fs=22050 Hz (or lower) and bps=16 bits per sample.
Task 2
• (5%) Recording of the testing data: Repeat
the above recording procedure for the same
four numbers 1, 2, 3 and 4, and save the four
files as s1B.wav, s2B.wav, s3B.wav and
s4B.wav, respectively. These files are called set B and are used as
testing data in our speech recognition system.
Task 3
• (5%) Plotting:
– Pick one wav file out of your sound files (e.g.
x.wav), read the file and plot the time-domain
signal. (Hint: you may use “wavread” and “plot” in
MATLAB or OCTAVE. Type “>help wavread” or
“>help plot” in MATLAB to learn how to use
them. Recent MATLAB releases have removed “wavread”;
use “audioread” there instead.)
– Save the plot of x.wav in a picture file “x.jpg”.
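The plotting task can be sketched as follows. This is only an illustration under stated assumptions: it uses audioread (the replacement for wavread), a synthetic 0.7 s tone stands in for the recording so that the script runs without any existing file, and the temporary file names are hypothetical. Replace wavFile with your own x.wav and jpgFile with 'x.jpg'.

```matlab
% Sketch: read a WAV file, plot the time-domain signal, save the plot as a JPEG.
Fs0 = 22050;                                  % sampling rate of the stand-in signal (Hz)
t0  = (0:round(0.7*Fs0)-1)' / Fs0;            % 0.7 s time axis (column vector)
wavFile = [tempname() '.wav'];                % hypothetical temporary file; use 'x.wav' instead
audiowrite(wavFile, 0.5*sin(2*pi*220*t0), Fs0);

[x, Fs] = audioread(wavFile);                 % x: samples in [-1, 1], Fs: sampling rate in Hz
x = x(:, 1);                                  % keep the first channel if the file is stereo
t = (0:numel(x)-1)' / Fs;                     % time of each sample in seconds

fig = figure('visible', 'off');               % draw off-screen; drop the option to see the window
plot(t, x);
xlabel('Time (s)');
ylabel('Amplitude');
title('Time-domain signal');
jpgFile = [tempname() '.jpg'];                % hypothetical name; use 'x.jpg' for the assignment
print(fig, jpgFile, '-djpeg');                % write the figure as a JPEG file
close(fig);
```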
Task 4
• (35%) Signal analysis:
– From “x.wav”, write a program to find the start (T1) and stop (T2)
locations in time (ms) of your four recorded sounds automatically.
– Extract one 20 ms segment called Seg1 (at a location of your choice) from
the voiced vowel part of x.wav between T1 and T2. Seg1 can be saved
as an array in C++ or a vector in MATLAB/OCTAVE. You may choose
the segment by manual inspection and hardcode its location in your
program.
– Find and plot the Fourier transform (energy against frequency) of
Seg1. The energy is equal to |Square_root([real]^2+[imaginary]^2)|,
i.e. the magnitude of each complex Fourier coefficient.
The horizontal axis is frequency and the vertical axis is energy. Label
the axes of the plot. Save the plot as “fourier_x.jpg”.
– Find the pre-emphasized signal (pem_Seg1) of Seg1 if the pre-emphasis
constant α is 0.945. Plot Seg1 and pem_Seg1. Submit your program.
– Find the 10 LPC parameters if the order of LPC for pem_Seg1 is 10. You
should write your own autocorrelation code, but you may use the inverse
function (inv) in MATLAB/OCTAVE to solve the linear matrix equation.
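For the first three items of Task 4, one possible approach is sketched below. The slides do not prescribe an endpoint-detection method, so the short-time energy threshold (10% of the peak frame energy), the 10 ms frame length and the synthetic test signal are all assumptions made for illustration; real recordings usually need a tuned threshold.

```matlab
% Sketch: energy-based endpoint detection, a 20 ms segment, and its magnitude spectrum.
% Synthetic stand-in for x.wav: faint hum, with a 200 Hz "vowel" from 200 ms to 800 ms.
Fs = 16000;                                   % assumed sampling rate in Hz
n  = (0:Fs-1)';                               % one second of sample indices
x  = 0.001 * sin(2*pi*50*n/Fs);               % faint background hum
voiced = (n >= Fs/5) & (n < 4*Fs/5);
x(voiced) = x(voiced) + 0.5 * sin(2*pi*200*n(voiced)/Fs);

frameLen  = Fs/100;                           % 10 ms frames, no overlap (assumption)
numFrames = floor(numel(x)/frameLen);
frames = reshape(x(1:numFrames*frameLen), frameLen, numFrames);
energy = sum(frames.^2, 1);                   % short-time energy of each frame
active = find(energy > 0.1*max(energy));      % assumed threshold: 10% of the peak energy
T1 = 1000 * (active(1)-1) * frameLen / Fs;    % start time in ms
T2 = 1000 * active(end) * frameLen / Fs;      % stop time in ms

% Seg1: 20 ms starting at the midpoint between T1 and T2 (a hardcoded choice).
segLen   = round(0.020*Fs);
segStart = round((T1 + T2)/2/1000*Fs) + 1;
Seg1 = x(segStart : segStart+segLen-1);

% Energy as defined in the task: sqrt(real^2 + imaginary^2), i.e. abs(fft(Seg1)).
X = fft(Seg1);
energySpec = sqrt(real(X).^2 + imag(X).^2);
half = 1:floor(segLen/2)+1;                   % keep the non-negative frequencies only
f = (half-1) * Fs / segLen;                   % frequency axis in Hz

fig = figure('visible', 'off');               % drop the option to see the window
plot(f, energySpec(half));
xlabel('Frequency (Hz)');
ylabel('Energy');
% print(fig, 'fourier_x.jpg', '-djpeg');      % uncomment to save the plot
```

With a 20 ms segment the frequency resolution is Fs/segLen = 50 Hz, so nearby spectral peaks closer than that cannot be separated.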
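For the pre-emphasis and LPC items of Task 4, a minimal sketch is given below. It assumes the usual first-order pre-emphasis s'(n) = s(n) - α·s(n-1) with s(0) = 0 and the autocorrelation method, where the LPC parameters a solve R·a = r with R(i,j) = r(|i-j|). The stand-in segment (an impulse response of a resonant filter) is hypothetical; use your own Seg1 instead.

```matlab
% Sketch: pre-emphasis followed by order-10 LPC via the autocorrelation method.
% Stand-in for Seg1: 320 samples (20 ms at 16 kHz) of a decaying resonance.
Seg1 = filter(1, [1 -1.2 0.8], [1; zeros(319, 1)]);

alpha = 0.945;                                  % pre-emphasis constant from Task 4
pem_Seg1 = Seg1 - alpha * [0; Seg1(1:end-1)];   % s'(n) = s(n) - alpha*s(n-1), with s(0) = 0

p = 10;                                         % LPC order
N = numel(pem_Seg1);
r = zeros(p+1, 1);                              % r(k+1) holds the autocorrelation at lag k
for k = 0:p
    r(k+1) = sum(pem_Seg1(1:N-k) .* pem_Seg1(1+k:N));
end

R = zeros(p, p);                                % Toeplitz matrix with R(row,col) = r(|row-col|)
for row = 1:p
    for col = 1:p
        R(row, col) = r(abs(row-col) + 1);
    end
end

a = inv(R) * r(2:p+1);                          % the 10 LPC parameters: solve R*a = [r(1); ...; r(10)]
```

Here a(k) is the weight of s'(n-k) in the linear prediction of s'(n). Seg1 and pem_Seg1 can be plotted in two subplot panels to compare them.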
Task 5
• (50%) Build a speech recognition system: You may use any
MATLAB/Octave functions you like in this part. Use the tool
at
http://www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab
to extract the MFCC parameters
(Mel-frequency cepstrum,
http://en.wikipedia.org/wiki/Mel-frequency_cepstrum)
from your sound files. Each sound file (.wav) will give one
set of MFCC parameters. See “A tutorial of using the htk-mfcc tool”
in the appendix for how to extract the MFCC
parameters. Build a dynamic programming (DP) based
four-numeral speech recognition system. Use set A as
templates and set B as testing inputs. You may follow the
steps below to complete your assignment.
MFCC parameter extraction
From
http://en.wikipedia.org/wiki/Mel-frequency_cepstrum
• MFCCs are very popular in music and speech analysis. They are commonly derived as follows:
• Take the Fourier transform of (a windowed excerpt of) a signal.
• Map the powers of the spectrum obtained above onto the mel scale, using
triangular overlapping windows.
• Take the logs of the powers at each of the mel frequencies.
• Take the discrete cosine transform of the list of mel log powers, as if it were a
signal.
• The MFCCs are the amplitudes of the resulting spectrum.
MFCC (inside MFCC.m)
:
% Pre-emphasize the whole signal
% Framing and windowing (frames as columns)
frames = vec2frames( speech, Nw, Ns, 'cols', window, false );
% Magnitude spectrum computation (as column vectors)
MAG = abs( fft(frames,nfft,1) );
% Triangular filterbank with uniformly spaced filters on mel scale
H = trifbank( M, K, R, fs, hz2mel, mel2hz ); % size of H is M x K
% Filterbank application to unique part of the magnitude spectrum
FBE = H * MAG(1:K,:); % FBE( FBE<1.0 ) = 1.0; % apply mel floor
% DCT matrix computation
DCT = dctm( N, M );
% Conversion of logFBEs to cepstral coefficients through DCT
CC = DCT * log( FBE );
% Cepstral lifter computation
lifter = ceplifter( N, L );
% Cepstral liftering gives liftered cepstral coefficients
CC = diag( lifter ) * CC; % ~ HTK's MFCCs
:
Step (a) of Task 5
• Convert the sound files in set A and set B into MFCC
parameters, so that each sound file gives an MFCC
matrix of size 13x70 (no_of_MFCCs_parameters=13
and no_of_frame_segments=70): if
the time shift is 10 ms, a 0.7-second sound has
70 frame segments, and there are 13 MFCC
parameters for each frame. Here we use M(j,t)
to represent the MFCC parameters, where ‘j’ is
the index of the MFCC parameter, ranging from 1 to
13, and ‘t’ is the index of the time segment, ranging from
1 to 70. Therefore a (13-parameter) sound
segment at time index t is M(1:13,t).
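The frame-count arithmetic and the M(j,t) indexing can be checked with a few lines. This is only a sketch: the matrix below is a numbered placeholder rather than real MFCC output, and the 25 ms window length Tw is an assumption used to show why a tool that drops partial frames may return slightly fewer than 70 columns.

```matlab
% Sketch: frame count for a 0.7 s sound and indexing of an MFCC matrix M(j,t).
Ts = 10;                                   % time shift between frames in ms
durationMs = 700;                          % a 0.7 s recording
numFrames = durationMs / Ts;               % 70 frames, counting one frame per shift

Tw = 25;                                   % assumed analysis window length in ms
exactFrames = floor((durationMs - Tw)/Ts) + 1;   % 68 if only complete windows are kept

numCoeffs = 13;
M = reshape(1:numCoeffs*numFrames, numCoeffs, numFrames);   % placeholder 13x70 matrix
frame28 = M(1:13, 28);                     % the 13 parameters of time segment t = 28
```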
Step (b) of Task 5
• Assume we have two short time segments (e.g. 25 ms each):
one is the t-th (t=28) segment of sound X, represented by the
13 MFCC parameters Mx(1:13,t=28), and the other is the
t’-th (t’=32) time segment of sound Y, represented by the MFCC
parameters My(1:13,t’=32). The distortion (dist) between
these two segments is
\[
\mathrm{dist} = \sum_{j=2}^{13} \bigl( \mathrm{Mx}(j,t) - \mathrm{My}(j,t') \bigr)^2
             = \sum_{j=2}^{13} \bigl( \mathrm{Mx}(j,28) - \mathrm{My}(j,32) \bigr)^2
\]
• Note: The first row of the MFCC matrix (M(1,t)) is the
energy term and is not recommended for use in the
comparison procedure because it does not contain the
relevant spectral information. So the summation starts from j=2.
• Use dynamic programming to find the minimum accumulated
distance (minimum accumulated score) between sound X and
sound Y.
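The two bullets above can be sketched as follows, assuming the distortion is the sum of squared differences over j = 2..13. The slides do not state which local path constraint to use, so the common three-predecessor recursion D(t,u) = d(t,u) + min(D(t-1,u), D(t,u-1), D(t-1,u-1)) is an assumption here, and the two small matrices are placeholders for real MFCC output.

```matlab
% Sketch: frame-to-frame distortion and minimum accumulated distance by dynamic programming.
% Placeholder MFCC matrices (13 x frames); sound Y is sound X with some frames repeated.
Mx = repmat((1:13)', 1, 5) + repmat([0 1 2 3 4], 13, 1);       % sound X, 5 frames
My = repmat((1:13)', 1, 7) + repmat([0 0 1 2 2 3 4], 13, 1);   % sound Y, 7 frames

Tx = size(Mx, 2);
Ty = size(My, 2);
d = zeros(Tx, Ty);                  % d(t,u): distortion between frame t of X and frame u of Y
for t = 1:Tx
    for u = 1:Ty
        delta = Mx(2:13, t) - My(2:13, u);      % rows 2..13 only: skip the energy term
        d(t, u) = sum(delta.^2);
    end
end

D = inf(Tx, Ty);                    % D(t,u): minimum accumulated distance up to (t,u)
D(1, 1) = d(1, 1);
for t = 1:Tx
    for u = 1:Ty
        if t == 1 && u == 1
            continue;               % start cell already initialised
        end
        best = inf;
        if t > 1,          best = min(best, D(t-1, u));   end
        if u > 1,          best = min(best, D(t, u-1));   end
        if t > 1 && u > 1, best = min(best, D(t-1, u-1)); end
        D(t, u) = d(t, u) + best;
    end
end
minDist = D(Tx, Ty);                % minimum accumulated distance between X and Y
```

Because Y only repeats frames of X, the warping absorbs the difference in length and minDist is zero for this placeholder pair.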
Step (c) of Task 5
• Build a speech recognition system: You should show
a 4x4 comparison-matrix-table as the result. Each entry
of this matrix-table is the minimum accumulated
distance between a sound in set A and a sound in set
B. You may use the above steps to find the minimum
accumulated distance for each sound pair (there
are 4x4 pairs, because there are four sound
files in set A and four sound files in set B) and enter
the values into the comparison-matrix-table manually or by a
program.
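Filling the comparison-matrix-table by program could look like the sketch below. The feature matrices are synthetic placeholders (set B is set A with some frames repeated), the three-predecessor DP recursion is an assumption, and names such as distTable and recognized are hypothetical. With real data, each cell entry would hold the MFCC matrix of one recording.

```matlab
% Sketch: 4x4 table of minimum accumulated distances between set A and set B.
shape   = [0 1 2 3 2 1];             % contour shared by the placeholder words (6 frames)
stretch = [1 1 2 3 4 4 5 6];         % set B repeats some frames (slower speech, 8 frames)
setA = cell(1, 4);
setB = cell(1, 4);
for k = 1:4
    setA{k} = k * repmat((1:13)', 1, 6) + repmat(shape, 13, 1);   % placeholder for skA.wav
    setB{k} = setA{k}(:, stretch);                                % placeholder for skB.wav
end

distTable = zeros(4, 4);             % rows: templates (set A), columns: test inputs (set B)
for a = 1:4
    for b = 1:4
        X = setA{a}(2:13, :);        % drop row 1 (energy term)
        Y = setB{b}(2:13, :);
        Tx = size(X, 2);
        Ty = size(Y, 2);
        D = inf(Tx+1, Ty+1);         % accumulated matrix padded with one row and column
        D(1, 1) = 0;                 % zero cost before the first frame pair
        for t = 1:Tx
            for u = 1:Ty
                cost = sum((X(:, t) - Y(:, u)).^2);
                D(t+1, u+1) = cost + min([D(t, u+1), D(t+1, u), D(t, u)]);
            end
        end
        distTable(a, b) = D(Tx+1, Ty+1);
    end
end

[~, recognized] = min(distTable, [], 1);   % best-matching template for each test input
disp(distTable);
```

A test input is recognized correctly when the smallest entry in its column lies on the diagonal of the table.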
Step (d) of Task 5
• Pick any one sound file from set A (e.g. the
sound of ‘one’) and the corresponding sound
file from set B (e.g. the sound of ‘one’),
compare these two files using dynamic
programming, and plot the optimal path on the
accumulated matrix diagram.
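One way to obtain and draw the optimal path is to backtrack through the accumulated matrix from its last cell to its first, as in the sketch below. The placeholder matrices, the three-predecessor recursion and the tie-breaking order (diagonal first) are assumptions; optPath is a hypothetical name.

```matlab
% Sketch: backtrack the optimal warping path and draw it on the accumulated matrix.
Mx = repmat((1:13)', 1, 5) + repmat([0 1 2 3 4], 13, 1);       % placeholder sound X, 5 frames
My = repmat((1:13)', 1, 7) + repmat([0 0 1 2 2 3 4], 13, 1);   % placeholder sound Y, 7 frames
Tx = size(Mx, 2);
Ty = size(My, 2);

D = inf(Tx+1, Ty+1);                 % padded accumulated matrix: cell (t,u) is D(t+1,u+1)
D(1, 1) = 0;
for t = 1:Tx
    for u = 1:Ty
        cost = sum((Mx(2:13, t) - My(2:13, u)).^2);            % skip the energy row
        D(t+1, u+1) = cost + min([D(t, u+1), D(t+1, u), D(t, u)]);
    end
end

t = Tx;
u = Ty;
optPath = [t u];                     % path as rows [frame of X, frame of Y]
while t > 1 || u > 1
    % Predecessors of cell (t,u): diagonal (t-1,u-1), (t-1,u) and (t,u-1).
    [~, move] = min([D(t, u), D(t, u+1), D(t+1, u)]);
    if move == 1
        t = t - 1;
        u = u - 1;
    elseif move == 2
        t = t - 1;
    else
        u = u - 1;
    end
    optPath = [t u; optPath];        % prepend so the path runs from start to end
end

fig = figure('visible', 'off');      % drop the option to see the window
imagesc(D(2:end, 2:end));            % accumulated matrix without the padding
axis xy;
colorbar;
hold on;
plot(optPath(:, 2), optPath(:, 1), 'w-o');   % x axis: frames of Y, y axis: frames of X
xlabel('Frame index of sound Y');
ylabel('Frame index of sound X');
title('Accumulated distance matrix with optimal path');
hold off;
```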
What to submit:
– All your programs with a readme file showing how
to run them
– All sound files of your recordings
– The picture files
– The 4x4 comparison-matrix-table of the speech
recognition system
– Zip all (as student_number.zip) and submit it to
cmsc5707.14@gmail.com