Comparing Audio Signals

What makes it difficult?
• Phase misalignment
• Deeper peaks and valleys
• Pitch misalignment
• Energy misalignment
• Embedded noise
• Length of vowels
• Phoneme variance
Review: Minimum Distance Algorithm
      E  X  E  C  U  T  I  O  N
   0  1  2  3  4  5  6  7  8  9
I  1  1  2  3  4  5  6  6  7  8
N  2  2  2  3  4  5  6  7  7  7
T  3  3  3  3  4  5  5  6  7  8
E  4  3  4  3  4  5  6  6  7  8
N  5  4  4  4  4  5  6  7  7  7
T  6  5  5  5  5  5  5  6  7  8
I  7  6  6  6  6  6  6  5  6  7
O  8  7  7  7  7  7  7  6  5  6
N  9  8  8  8  8  8  8  7  6  5
Array[i,j] = min{1 + Array[i-1,j], cost(i,j) + Array[i-1,j-1], 1 + Array[i,j-1]}
Pseudo Code (minDistance(target, source))
n = number of characters in source
m = number of characters in target
Create array, distance, with dimensions n+1, m+1
FOR r=0 TO n distance[r,0] = r
FOR c=0 TO m distance[0,c] = c
FOR each row r from 1 TO n
  FOR each column c from 1 TO m
    IF source[r]=target[c]
      cost = 0
    ELSE cost = 1
    distance[r,c] = minimum of (
      distance[r-1,c] + 1,       //deletion of source[r]
      distance[r,c-1] + 1,       //insertion of target[c]
      distance[r-1,c-1] + cost)  //substitution
Result is in distance[n,m]
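A minimal Python sketch of the pseudocode above, assuming unit costs for insertion, deletion, and substitution. The name min_distance and the 0-based string indexing are choices made for this sketch, not part of the slide.

```python
def min_distance(target: str, source: str) -> int:
    """Minimum edit distance with unit insert/delete/substitute costs."""
    n, m = len(source), len(target)
    # distance[r][c] = cost of turning source[:r] into target[:c]
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        distance[r][0] = r          # delete r source characters
    for c in range(m + 1):
        distance[0][c] = c          # insert c target characters
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost = 0 if source[r - 1] == target[c - 1] else 1
            distance[r][c] = min(
                distance[r - 1][c] + 1,         # deletion
                distance[r][c - 1] + 1,         # insertion
                distance[r - 1][c - 1] + cost,  # substitution (or match)
            )
    return distance[n][m]


print(min_distance("execution", "intention"))
```

For the INTENTION/EXECUTION table in this review, the call returns 5, the value in the bottom-right cell.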
Is Minimum Distance Applicable?
• Maybe?
– The optimal distance at indices [a,b] is a function of the costs
at smaller indices.
– This suggests that a dynamic programming approach may work.
• Problems
– The cost function is more complex. A binary equal/not-equal test
doesn’t work for audio features.
– Need to define a distance metric. But what should that metric
be? Answer: It depends on which audio features we use.
– Longer vowels may still represent the same speech. The classical
solution is not to apply a cost when going from index [i-1,j] or
[i,j-1] to [i,j]. Unfortunately, this assumption can lead to
singularities, which result in incorrect comparisons.
Complexity of Minimum Distance
• The basic algorithm is O(m*n), where m is the
length (in samples) of one audio signal and n is the
length of the other. If m=n, the algorithm is O(n^2).
Why? Count the number of cells that need to be
filled in.
• O(n^2) may be too slow. Alternate solutions have
been devised.
– Don’t fill in all of the cells.
– Use a multi-level approach.
• Question: Are the faster approaches needed for
our purposes? Perhaps not!
Don’t Fill in all of the Cells
Problem: May miss the optimal minimum-distance path
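The slide does not say which cells to skip. One common choice, assumed here, is to fill only a band of fixed half-width around the main diagonal. The sketch below applies that idea to the edit-distance review example; banded_min_distance and its band parameter are hypothetical names.

```python
def banded_min_distance(target: str, source: str, band: int) -> float:
    """Edit distance that fills only the cells with |r - c| <= band."""
    n, m = len(source), len(target)
    inf = float("inf")
    # Cells outside the band stay at infinity and are never filled.
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        for c in range(max(0, r - band), min(m, r + band) + 1):
            if r == 0:
                d[r][c] = c
            elif c == 0:
                d[r][c] = r
            else:
                cost = 0 if source[r - 1] == target[c - 1] else 1
                d[r][c] = min(d[r - 1][c] + 1,
                              d[r][c - 1] + 1,
                              d[r - 1][c - 1] + cost)
    # If the lengths differ by more than band, the result stays infinite.
    return d[n][m]


print(banded_min_distance("eabcd", "abcde", band=0))  # 5: diagonal only
print(banded_min_distance("eabcd", "abcde", band=1))  # 2: the true distance
```

With band=0 only the diagonal is filled, so the cheaper path that inserts "e" first and deletes it at the end is never seen. This is the problem named on the slide: a band that is too narrow can miss the optimal path.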
The Multilevel Approach
Concept
1. Down sample to coarsen the array
2. Run the algorithm
3. Refine the array (up sample)
4. Adjust the solution
5. Repeat steps 3-4 till the original sample rate is restored
Notes
• The multilevel approach is a common
technique for reducing many algorithms’
complexity from O(n^2) to O(n lg n)
• Example: partitioning a graph to balance
workloads among threads or processors
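A rough sketch of the five steps, under several assumptions that are not on the slide: signals are lists of numbers, coarsening averages neighbouring pairs, the per-cell cost is the absolute difference, and "adjust the solution" means re-running the recurrence only in cells near the up-sampled coarse path. The names coarsen, constrained_dtw, multilevel_dtw, radius, and min_size are hypothetical.

```python
def coarsen(x):
    """Step 1: halve the resolution by averaging pairs (odd tail kept as is)."""
    return [sum(x[k:k + 2]) / len(x[k:k + 2]) for k in range(0, len(x), 2)]


def constrained_dtw(a, b, window):
    """Fill only the (i, j) cells in window; return (total cost, path)."""
    inf = float("inf")
    acc = {}  # (i, j) -> (accumulated cost, predecessor cell)
    for i, j in sorted(window):  # row-major order: predecessors come first
        d = abs(a[i] - b[j])
        if (i, j) == (0, 0):
            acc[(i, j)] = (d, None)
            continue
        candidates = [(i - 1, j), (i - 1, j - 1), (i, j - 1)]
        prev = min(candidates, key=lambda c: acc.get(c, (inf, None))[0])
        acc[(i, j)] = (d + acc.get(prev, (inf, None))[0], prev)
    # Walk back from the last cell to recover the alignment path.
    path, cell = [], (len(a) - 1, len(b) - 1)
    while cell is not None:
        path.append(cell)
        cell = acc[cell][1]
    return acc[(len(a) - 1, len(b) - 1)][0], path[::-1]


def multilevel_dtw(a, b, radius=1, min_size=4):
    """Steps 1-5: solve a coarse problem, then refine near its path."""
    if len(a) <= min_size or len(b) <= min_size:
        full = {(i, j) for i in range(len(a)) for j in range(len(b))}
        return constrained_dtw(a, b, full)            # step 2
    _, coarse_path = multilevel_dtw(coarsen(a), coarsen(b), radius, min_size)
    window = set()
    for ci, cj in coarse_path:                        # step 3: up sample
        for i in range(2 * ci - radius, 2 * ci + 2 + radius):
            for j in range(2 * cj - radius, 2 * cj + 2 + radius):
                if 0 <= i < len(a) and 0 <= j < len(b):
                    window.add((i, j))
    return constrained_dtw(a, b, window)              # step 4: adjust


a = [0.0, 0.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 0.0, 0.0]
b = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
cost, path = multilevel_dtw(a, b)
print(cost, path[0], path[-1])
```

Because each level searches only near the path found one level down, the result is never better than the full-table answer and can be worse. A larger radius trades speed for a better chance of finding the optimum.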
Singularities
• Assumption
– The minimum distance comparing two signals only
depends on the previous adjacent entries
– The cost function accounts for the varied length of a
particular phoneme, which causes the cost in particular
array indices to no longer be well-defined
• Problem: The algorithm can compute incorrectly due
to mismatched alignments
• Possible solutions:
– Compare based on the change of feature values between
windows instead of the values themselves
– Pre-process to eliminate the causes of the mismatches
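The first possible solution, comparing changes between windows rather than raw values, can be sketched as a simple first difference over a sequence of feature frames. This is an illustrative assumption: the slide does not give a formula, and a smoothed estimate over several frames is also common. The name deltas is hypothetical.

```python
def deltas(frames):
    """Frame-to-frame change of every feature (first difference)."""
    return [[cur - prev for prev, cur in zip(f0, f1)]
            for f0, f1 in zip(frames, frames[1:])]


quiet = [[1.0, 2.0], [2.0, 4.0], [4.0, 4.0]]
# Same contour, shifted by a constant offset in each feature.
loud = [[11.0, 22.0], [12.0, 24.0], [14.0, 24.0]]
print(deltas(quiet))
print(deltas(quiet) == deltas(loud))
```

The two sequences differ only by a constant offset per feature, so their deltas are identical even though their raw values are far apart.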
Possible Preprocessing
• Remove the phase from the audio:
– Compute the Fourier transform
– Perform discrete cosine transform on the amplitudes
• Normalize the energy of voiced audio:
– Compute the energy of both signals
– Scale the larger (higher-energy) signal down to match the energy of the smaller
• Remove the DC offset: Subtract the average amplitude from all samples
• Brick-wall normalize the peaks and valleys:
– Find the average peak and valley value
– Clip values that exceed the average peak (or fall below the average valley) to that average
• Normalize the pitch: Use PSOLA to align the pitch of the two signals
• Remove duplicate frames: Autocorrelate frames at pitch points
• Remove noise from the signal: implement a noise removal algorithm
• Normalize the speed of the speech:
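Three of the simpler steps above (DC offset removal, energy normalization, and brick-wall limiting) can be sketched on plain lists of samples. All function names are hypothetical. Treating local maxima of the absolute value as "peaks" and using a symmetric clipping ceiling are assumptions of this sketch, not details given on the slide.

```python
import math


def remove_dc_offset(samples):
    """Subtract the average amplitude from every sample."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def energy(samples):
    """Sum of squared amplitudes."""
    return sum(s * s for s in samples)


def match_energy(a, b):
    """Scale the higher-energy signal down so both energies are equal."""
    ea, eb = energy(a), energy(b)
    if ea == 0 or eb == 0:
        return list(a), list(b)   # nothing sensible to scale
    if ea > eb:
        gain = math.sqrt(eb / ea)  # amplitude gain is the root of the energy ratio
        return [s * gain for s in a], list(b)
    gain = math.sqrt(ea / eb)
    return list(a), [s * gain for s in b]


def average_peak(samples):
    """Mean height of the local maxima of |sample| (assumed peak definition)."""
    mags = [abs(s) for s in samples]
    peaks = [mags[k] for k in range(1, len(mags) - 1)
             if mags[k - 1] < mags[k] >= mags[k + 1]]
    return sum(peaks) / len(peaks) if peaks else max(mags, default=0.0)


def brick_wall(samples, ceiling):
    """Clip every sample into the range [-ceiling, +ceiling]."""
    return [max(-ceiling, min(ceiling, s)) for s in samples]


signal = remove_dc_offset([1.0, 2.0, 3.0, 6.0, 3.0])
print(brick_wall(signal, average_peak(signal)))
```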
Which Audio Features?
• Cepstrals: They are statistically independent
and phase differences are removed
• ΔCepstrals, or ΔΔCepstrals: Reflect how the
signal is changing from one frame to the next
• Energy: Distinguishes the frames that are voiced
versus those that are unvoiced
• Normalized LPC Coefficients: Represent the
shape of the vocal tract, normalized by vocal
tract length for different speakers.
These are the popular features used for speech recognition
Which Distance Metric?
• General Formula:
array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1], array[i,j-1]}
• Assumption: There is no cost assessed for duplicate or
eliminated frames.
• Distance Formula:
– Euclidean: sum the squared differences between corresponding
features, then take the square root
– Linear: sum the absolute values of the differences between features
• Weighting the features: Multiply each feature’s difference by a
weighting factor to give greater/lesser emphasis to certain features
• Example of a distance metric using linear distance:
∑ w[i] * |fa[i] – fb[i]|, where fa[i] and fb[i] are the values of audio
feature i for signals a and b, and w[i] is that feature’s weight
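The general formula and the weighted linear distance can be combined in a short sketch. It assumes each signal is already a list of frames, each frame a list of feature values; the names weighted_linear_distance and compare_signals are hypothetical.

```python
def weighted_linear_distance(fa, fb, weights):
    """Sum of w[i] * |fa[i] - fb[i]| over the features of one frame pair."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, fa, fb))


def compare_signals(a_frames, b_frames, weights):
    """array[i,j] = distance(i,j) + min of the three previous adjacent cells."""
    n, m = len(a_frames), len(b_frames)
    inf = float("inf")
    array = [[inf] * (m + 1) for _ in range(n + 1)]
    array[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = weighted_linear_distance(a_frames[i - 1], b_frames[j - 1], weights)
            # No extra cost for repeating or skipping a frame.
            array[i][j] = d + min(array[i - 1][j],
                                  array[i - 1][j - 1],
                                  array[i][j - 1])
    return array[n][m]


short_vowel = [[0.0], [1.0], [2.0]]
long_vowel = [[0.0], [1.0], [1.0], [2.0]]   # middle frame held one frame longer
print(compare_signals(short_vowel, long_vowel, weights=[1.0]))
```

The two sequences differ only by one duplicated frame, so the total is 0.0. This is the "no cost for duplicate frames" assumption in action.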