Comparing Audio Signals

What makes it difficult?
• Phase misalignment
• Deeper peaks and valleys
• Pitch misalignment
• Energy misalignment
• Embedded noise
• Length of vowels
• Phoneme variance

Review: Minimum Distance Algorithm

         E  X  E  C  U  T  I  O  N
      0  1  2  3  4  5  6  7  8  9
   I  1  1  2  3  4  5  6  6  7  8
   N  2  2  2  3  4  5  6  7  7  7
   T  3  3  3  3  4  5  5  6  7  8
   E  4  3  4  3  4  5  6  6  7  8
   N  5  4  4  4  4  5  6  7  7  7
   T  6  5  5  5  5  5  5  6  7  8
   I  7  6  6  6  6  6  6  5  6  7
   O  8  7  7  7  7  7  7  6  5  6
   N  9  8  8  8  8  8  8  7  6  5

Array[i,j] = min{1 + Array[i-1,j], cost(i,j) + Array[i-1,j-1], 1 + Array[i,j-1]}

Pseudo Code (minDistance(target, source))
   n = number of characters in source
   m = number of characters in target
   Create array, distance, with dimensions n+1, m+1
   FOR r = 0 TO n
      distance[r,0] = r
   FOR c = 0 TO m
      distance[0,c] = c
   FOR r = 1 TO n
      FOR c = 1 TO m
         IF source[r] = target[c] cost = 0 ELSE cost = 1
         distance[r,c] = minimum of
            distance[r-1,c] + 1,        //deletion (drop source[r])
            distance[r,c-1] + 1,        //insertion (add target[c])
            distance[r-1,c-1] + cost    //substitution
   Result is in distance[n,m]

Is Minimum Distance Applicable?
• Maybe?
  – The optimal distance at indices [a,b] is a function of the costs at smaller indices.
  – This suggests that a dynamic programming approach may work.
• Problems
  – The cost function is more complex. A binary equal or not equal doesn’t work.
  – Need to define a distance metric. But what should that metric be? Answer: It depends on which audio features we use.
  – Longer vowels may still represent the same speech. The classical solution is not to apply a cost when going from index [i-1,j] or [i,j-1] to [i,j]. Unfortunately, this assumption can lead to singularities, which result in incorrect comparisons.

Complexity of Minimum Distance
• The basic algorithm is O(m*n), where m is the length (in samples) of one audio signal and n is the length of the other. If m=n, the algorithm is O(n²). Why? Count the number of cells that need to be filled in.
• O(n²) may be too slow. Alternate solutions have been devised.
  – Don’t fill in all of the cells.
  – Use a multi-level approach
• Question: Are the faster approaches needed for our purposes? Perhaps not!

Don’t Fill in all of the Cells
Problem: May miss the optimal minimum distance path

The Multilevel Approach
Concept
1. Down sample to coarsen the array
2. Run the algorithm
3. Refine the array (up sample)
4. Adjust the solution
5. Repeat steps 3-4 till the original sample rate is restored
Notes
• The multilevel approach is a common technique for reducing the complexity of many algorithms from O(n²) to O(n lg n)
• An example is partitioning a graph to balance work loads among threads or processors

Singularities
• Assumption
  – The minimum distance comparing two signals depends only on the previous adjacent entries
  – The cost function accounts for the varied length of a particular phoneme, which causes the cost at particular array indices to no longer be well-defined
• Problem: The algorithm can compute incorrectly due to mismatched alignments
• Possible solutions:
  – Compare based on the change of feature values between windows instead of the values themselves
  – Pre-process to eliminate the causes of the mismatches

Possible Preprocessing
• Remove the phase from the audio:
  – Compute the Fourier transform
  – Perform a discrete cosine transform on the amplitudes
• Normalize the energy of voiced audio:
  – Compute the energy of both signals
  – Multiply the larger by the percentage difference
• Remove the DC offset: Subtract the average amplitude from all samples
• Brick Wall Normalize the peaks and valleys:
  – Find the average peak and valley value
  – Set values larger than the average equal to the average
• Normalize the pitch: Use PSOLA to align the pitch of the two signals
• Remove duplicate frames: Auto correlate frames at pitch points
• Remove noise from the signal: Implement a noise removal algorithm
• Normalize the speed of the speech:

Which Audio Features?
• Cepstrals: They are statistically independent and phase differences are removed
• ΔCepstrals, or ΔΔCepstrals: Reflect how the signal is changing from one frame to the next
• Energy: Distinguishes the frames that are voiced versus those that are unvoiced
• Normalized LPC Coefficients: Represent the shape of the vocal tract, normalized by vocal tract length for different speakers
These are the popular features used for speech recognition

Which Distance Metric?
• General Formula: array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1], array[i,j-1]}
• Assumption: There is no cost assessed for duplicate or eliminated frames.
• Distance Formula:
  – Euclidean: sum the squared differences between corresponding features
  – Linear: sum the absolute values of the differences between corresponding features
• Weighting the features: Multiply each metric’s difference by a weighting factor to give greater/lesser emphasis to certain features
• Example of a distance metric using linear distance:
  ∑ w[i] · |fa[i] - fb[i]|
  where fa[i] and fb[i] are the values of a particular audio feature for signals a and b, and w[i] is that feature’s weight
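
The general formula and the weighted linear distance can be combined in a short sketch. This is only an illustration under stated assumptions, not the method used in the course: each signal is assumed to be already split into frames, each frame is a plain list of feature values, and the names `weighted_linear_distance` and `alignment_cost` are hypothetical.

```python
from typing import Sequence

Frame = Sequence[float]  # assumed: one list of feature values per frame


def weighted_linear_distance(fa: Frame, fb: Frame, weights: Sequence[float]) -> float:
    """Sum of w[i] * |fa[i] - fb[i]| over the features of two frames."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, fa, fb))


def alignment_cost(a: Sequence[Frame], b: Sequence[Frame],
                   weights: Sequence[float]) -> float:
    """Fill array[i][j] = distance(i,j) + min of the three previous entries."""
    n, m = len(a), len(b)
    inf = float("inf")
    # Row 0 and column 0 are padding; only array[0][0] is a valid start.
    array = [[inf] * (m + 1) for _ in range(n + 1)]
    array[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = weighted_linear_distance(a[i - 1], b[j - 1], weights)
            # No extra cost is charged for duplicate or eliminated frames.
            array[i][j] = d + min(array[i - 1][j],      # frame of b matched again
                                  array[i - 1][j - 1],  # both signals advance
                                  array[i][j - 1])      # frame of a matched again
    return array[n][m]


if __name__ == "__main__":
    short = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
    # Same frames, but the middle one is held longer (a "longer vowel").
    stretched = [[1.0, 0.0], [2.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
    print(alignment_cost(short, stretched, [1.0, 1.0]))
```

Because no cost is applied for repeated frames, the stretched signal in the usage example aligns with the short one at zero total cost. That is the intended behavior for longer vowels, and also the property that makes singularities possible.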