CMSC5707 Topics in A.I.
CMSC Assignment 1
Audio signal processing
Assignment 1 of CMSC5707 V4c

Task 1
• (5%) Recording of the templates: Use your own sound recording device (e.g. mobile phone, Windows Sound Recorder or http://www.goldwave.com/) to record the numbers 1, 2, 3, 4 and name the files s1A.wav, s2A.wav, s3A.wav and s4A.wav, respectively. Each word should last about 0.6 to 0.8 seconds. Use http://formatfactory.en.softonic.com/ to convert your files to .wav if necessary. (You may pronounce the words in English, Cantonese or Mandarin.) These four files are called set A and are used as the templates of our speech recognition system. You may use any sampling rate (Fs) and bits-per-sample (bps) value. However, typical values are Fs = 22050 Hz (or lower) and bps = 16 bits per sample.

Task 2
• (5%) Recording of the testing data: Repeat the above recording procedure for the same four numbers (1, 2, 3 and 4) and save the four files as s1B.wav, s2B.wav, s3B.wav and s4B.wav, respectively. These four files are called set B and are used as testing data in our speech recognition system.

Task 3
• (5%) Plotting:
– Pick one wav file out of your sound files (e.g. x.wav), read the file and plot the time domain signal. (Hint: you may use “wavread” and “plot” in MATLAB or OCTAVE. Type “>help wavread” and “>help plot” in MATLAB to learn how to use them. Recent MATLAB releases no longer provide “wavread”; use “audioread” instead.)
– Save the plot of x.wav in a picture file “x.jpg”.

Task 4
• (35%) Signal analysis:
– From “x.wav”, write a program to find the start (T1) and stop (T2) locations in time (ms) of your four recorded sounds automatically.
– Extract one segment called Seg1 (20 ms long, at a location of your choice) from the voiced vowel part of x.wav between T1 and T2. Seg1 can be saved as an array in C++ or a vector in MATLAB/OCTAVE. You may choose the segment by manual inspection and hardcode the locations in your program.
– Find and plot the Fourier transform (energy against frequency) of Seg1. The energy of each frequency component is sqrt([real]^2 + [imaginary]^2), i.e. the magnitude of the complex Fourier coefficient. The horizontal axis is frequency and the vertical axis is energy. Label the axes of the plot. Save the plot as “fourier_x.jpg”.
– Find the pre-emphasized signal (Pem_Seg1) of Seg1 using the pre-emphasis constant α = 0.945. Plot Seg1 and Pem_Seg1. Submit your program.
– Find the 10 LPC parameters of Pem_Seg1, using an LPC order of 10. You should write your own autocorrelation code, but you may use the inverse function (inv) in MATLAB/OCTAVE to solve the linear matrix equation.

Task 5
• (50%) Build a speech recognition system: You may use any MATLAB/Octave functions you like in this part. Use the tool at http://www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab to extract the MFCC parameters (Mel-frequency cepstrum, http://en.wikipedia.org/wiki/Mel-frequency_cepstrum) from your sound files. Each sound file (.wav) gives one set of MFCC parameters. See “A tutorial of using the htkmfcc tool” in the appendix for how to extract MFCC parameters. Build a dynamic programming (DP) based four-numeral speech recognition system. Use set A as templates and set B as testing inputs. You may follow the steps below to complete your assignment.

MFCC parameter extraction
From http://en.wikipedia.org/wiki/Mel-frequency_cepstrum
• MFCCs are very popular in music and speech analysis.
• Take the Fourier transform of (a windowed excerpt of) a signal.
• Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
• Take the logs of the powers at each of the mel frequencies.
• Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
• The MFCCs are the amplitudes of the resulting spectrum.
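The mel mapping, log and DCT steps in the list above can be illustrated with a small sketch. It is written in plain Python purely for illustration (the assignment itself uses MATLAB/Octave and the htk-mfcc tool), and it makes several assumptions that do not come from the handout: the function names are hypothetical, the mel formula is the common 2595·log10(1 + f/700) variant, and the DCT is an unnormalised DCT-II without liftering. The tool's own conventions may differ.

```python
import math

def hz_to_mel(f_hz):
    """Map Hz to mels (assumed variant: 2595 * log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_centres(n_filters, f_low, f_high):
    """Centre frequencies (Hz) of filters spaced uniformly on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

def dct_of_logs(filterbank_energies, n_coeffs):
    """Unnormalised DCT-II of the log filterbank energies."""
    logs = [math.log(e) for e in filterbank_energies]
    m = len(logs)
    return [
        sum(logs[k] * math.cos(math.pi * i * (k + 0.5) / m) for k in range(m))
        for i in range(n_coeffs)
    ]

if __name__ == "__main__":
    # Filter centres bunch together at low frequencies and spread out higher up.
    print(mel_centres(5, 0.0, 4000.0))
    # Made-up filterbank energies, only to show the call.
    print(dct_of_logs([1.0, 2.0, 4.0, 8.0], 3))
```

With a flat set of filterbank energies, every coefficient except the first is zero, which is one way to see that the higher coefficients describe the shape of the spectrum rather than its overall level.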
MFCC (inside MFCC.m)
    :
    % Pre-emphasis of the whole signal
    % Framing and windowing (frames as columns)
    frames = vec2frames( speech, Nw, Ns, 'cols', window, false );
    % Magnitude spectrum computation (as column vectors)
    MAG = abs( fft(frames,nfft,1) );
    % Triangular filterbank with uniformly spaced filters on mel scale
    H = trifbank( M, K, R, fs, hz2mel, mel2hz ); % size of H is M x K
    % Filterbank application to unique part of the magnitude spectrum
    FBE = H * MAG(1:K,:); % FBE( FBE<1.0 ) = 1.0; % apply mel floor
    % DCT matrix computation
    DCT = dctm( N, M );
    % Conversion of logFBEs to cepstral coefficients through DCT
    CC = DCT * log( FBE );
    % Cepstral lifter computation
    lifter = ceplifter( N, L );
    % Cepstral liftering gives liftered cepstral coefficients
    CC = diag( lifter ) * CC; % ~ HTK's MFCCs
    :

Step (a) of task 5
• Convert the sound files in set A and set B into MFCC parameters. Each sound file gives an MFCC matrix of size about 13x70 (no_of_MFCCs_parameters = 13 and no_of_frame_segments = 70): with a time shift of 10 ms, a 0.7-second sound has 70 frame segments, and each frame has 13 MFCC parameters. Here we use M(j,t) to represent the MFCC parameters, where ‘j’ is the index of the MFCC parameter, ranging from 1 to 13, and ‘t’ is the index of the time segment, ranging from 1 to 70. Therefore the (13-parameter) sound segment at time index t is M(1:13,t).

Step (b) of task 5
• Assume we have two short time segments (e.g. 25 ms each): one from the t-th (t=28) time segment of sound X (represented by the 13 MFCC parameters Mx(1:13,t=28)), and another from the t'-th (t'=32) time segment of sound Y (represented by the MFCC parameters My(1:13,t'=32)).
The distortion (dist) between these two segments is

$$\mathrm{dist} = \sum_{j=2}^{13} \left( Mx(j,t) - My(j,t') \right)^2 = \sum_{j=2}^{13} \left( Mx(j,28) - My(j,32) \right)^2$$

• Note: The first row of the MFCC matrix, M(1,t), is the energy term. It is not recommended for use in the comparison because it does not contain the relevant spectral information, so the summation starts from j=2.
• Use dynamic programming to find the minimum accumulated distance (minimum accumulated score) between sound X and sound Y.

Step (c) of task 5
• Build a speech recognition system: You should show a 4x4 comparison-matrix-table as the result. Each entry of this table is the minimum accumulated distance between a sound in set A and a sound in set B. You may use the above steps to find the minimum accumulated distance for each sound pair (there are 4x4 pairs, because there are four sound files in set A and four sound files in set B), and fill in the comparison-matrix-table manually or by a program.

Step (d) of task 5
• Pick any one sound file from set A (e.g. the sound of ‘one’) and the corresponding sound file from set B (e.g. the sound of ‘one’), compare these two files using dynamic programming, and plot the optimal path on the accumulated matrix diagram.

What to submit:
– All your programs, with a readme file showing how to run them
– All sound files of your recordings
– The picture files
– The 4x4 comparison-matrix-table of the speech recognition system
– Zip everything (as student_number.zip) and submit it to cmsc5707.14@gmail.com
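Steps (b) to (d) of task 5 can be sketched as follows. This is a minimal illustration in plain Python (not the required MATLAB/Octave solution) and rests on assumptions the handout does not fix: the local distortion is the sum of squared differences over rows 2 to 13, the path may move from the three neighbours (t-1,u), (t,u-1) and (t-1,u-1), both sounds are aligned at their first and last frames, and the accumulated distance is not normalised by path length. The names `frame_dist`, `dtw` and `recognise` are hypothetical, and each MFCC matrix is stored as a list of rows, so `mx[j][t]` corresponds to Mx(j+1,t+1).

```python
def frame_dist(mx, my, t, u):
    """Sum of squared differences between frame t of mx and frame u of my.

    Row 0 (the energy term, row 1 in the handout) is skipped.
    """
    return sum((mx[j][t] - my[j][u]) ** 2 for j in range(1, len(mx)))

def dtw(mx, my):
    """Return (accumulated-distance matrix, optimal path) for two MFCC matrices."""
    n, m = len(mx[0]), len(my[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for t in range(n):
        for u in range(m):
            d = frame_dist(mx, my, t, u)
            if t == 0 and u == 0:
                acc[t][u] = d
                continue
            best = min(
                acc[t - 1][u] if t > 0 else inf,
                acc[t][u - 1] if u > 0 else inf,
                acc[t - 1][u - 1] if t > 0 and u > 0 else inf,
            )
            acc[t][u] = d + best
    # Backtrack from the last cell to (0, 0), always taking the cheapest predecessor.
    t, u = n - 1, m - 1
    path = [(t, u)]
    while (t, u) != (0, 0):
        candidates = []
        if t > 0 and u > 0:
            candidates.append((acc[t - 1][u - 1], t - 1, u - 1))
        if t > 0:
            candidates.append((acc[t - 1][u], t - 1, u))
        if u > 0:
            candidates.append((acc[t][u - 1], t, u - 1))
        _, t, u = min(candidates)
        path.append((t, u))
    path.reverse()
    return acc, path

def recognise(templates, tests):
    """Build the comparison table and pick the closest template for each test sound.

    table[a][b] is the minimum accumulated distance between template a and test b.
    """
    table = [[dtw(a, b)[0][-1][-1] for b in tests] for a in templates]
    best = [
        min(range(len(templates)), key=lambda a: table[a][b])
        for b in range(len(tests))
    ]
    return table, best
```

For the assignment, `templates` would hold the four MFCC matrices of set A and `tests` those of set B, so `table` is the 4x4 comparison-matrix-table. The `path` returned by `dtw` is the list of (t, u) cells to draw on the accumulated matrix diagram.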