Recognizer Issues • Problem: Our recognizer translates the audio to a possible string of text. How do we know the translation is correct. • Problem: How do we handle a string of text containing words that are not in the dictionary. • Problem: How do we handle strings with valid words, but which do not form sentences with semantics that makes sense Correcting Recognizer Ambiguities • Problem: Resolving words not in the dictionary • Question: How different is a recognized word from those that are in the dictionary? • Solution: Count the single step transformations necessary to convert one word into another. • Example: caat cat with removal of one letter • Example: flpc fireplace requires adding the letters ire after f and a before c and e at the end Spelling Error Types • Levenshtein distance: the smallest number of insertion, deletion, or substitution operations that transforms one string into another • Examples: differences from the word, “cat” – Insertion: catt – Deletion: ct – Substitution: car – Transposition: cta Note: There are similar well-defined rules for pronunciation variations Spell Check Algorithm FOR unrecognized words or those out of context Generate a list of candidates – Those that differ by a one step transformation – Those that exist in the lexicon Order possibilities using language-based statistics RETURN most likely word Note: We could extend the algorithm to consider multiple transformations when we cannot find a single step solution Example Mispelled word: acress Context Context * P(c) Candidates – with probabilities of use and use within context Which correction is most likely? Misspelled word: accress • Word frequency percentage is not enough – We need p(typo|candidate) * p(candidate) • How likely is the particular error? – – – – – – – Deletion of a t after a c and before an r Insertion of an a at the beginning Transpose a c and an a Substitute a c for an r Substitute an o for an e Insert an s before the last s, or after the last s Context of the word within a sentence or paragraph Additional Issues to Consider • What if there is more than one error per word? – Possible Solution: Compute minimum edit distance and allow more than one transformation • Some words in the lexicon may not appear in the frequency corpus that we use – Possible Solution: Smoothing algorithms • Is P(typo|candidate) accurate? – Possible Solution: Train the algorithm • Choice often depends on context – Possible Solution: Context sensitive algorithms • Names and places likely do not exist in the lexicon – This is a difficult issue Dynamic Programming Definition: Nesting small decision problems inside larger decisions (Richard Bellman) Two approaches: 1. Top Down: Start out with a method call to compute a final solution. The method proceeds to solve sub-problems of the same kind. Eventually the algorithm reaches and solves the “base case.” The algorithm can avoid repetitive calculations by storing the answers to the sub-problems for later access. 2. Bottom Up: Start with the base cases, and work upwards, using previous results to reach the final answer. The examples on the slides that follow use the bottom up approach 8 Dynamic Programming Algorithms Implement recursive algorithms using arrays • Recursion is popular for solving ‘divide and conquer’ problems – Divide problem into smaller problems of the same kind – Define the base case – Works from the original problem down to the base case • Dynamic programming is another approach that uses arrays – Initialize a variable or table – Use loops fill in entries the table. The algorithm always fills in entries in the table before they are needed – Works top-down or bottom-up towards the final solution – Avoids repetitive calculations because the technique stores results in a table for later use – Eliminates recursion activation record overhead Dynamic programming is a widely used algorithmic technique Fibonacci Sequence {0 1 1 2 3 5 8 13 21 …} where F(n)=F(n-1)+F(n-2) Bottom Up int fibo(n) { if (n <=1) return n; int fMinusTwo = 0; int fMinusOne = 1; for (int i = 2; i <= n; i++) { f = fAMinusTwo + fMinusOne; fMinusTwo = fMinusOne; fAtNMinusOne = f; } return(f) } Top Down int[] fibs = new int[MAX]; fibs[0] = 0; fibs[1] = 1; int max = 1; int fibo(n) { if (n<=max) return fibs[n]; else fibs[n] = fibo(n-1)+fibo(n-2); max = n; return fibs[n]; } Example: Knapsack Problem A thief enters a house with a knapsack of a particular size. Which items of different values does he choose to fill his knapsack. • Values[v] containing values of the N items • Weights[v] contains the weights of the N items • K = knapsack capacity • dynTable[v][w] contains the dynamic algorithm table KnapSack Algorithm knapSack(int[] w, int[] v, int K) { int[][] dynTable[v.length+1][K]; for (int w = 0 to K) a[0][w] = 0; for (int v=1; v<=v.length; v++) { for (int w=0; w<W; w++) { dyn[v][w] = dynTable[v-1][w]; // Copy up if (w[v] <= w && ) dynTable[v][w] = // Try for better choice Math.max(dynTable[v][w], values[v]+dynTable[v-1,w-weights[v]]); } } return dynTable[N,W]; } Note: Row for each item to consider; column for integer weights Knapsack Example Goal: Maximum profit with a 15kg knapsack? Array cells represent value stuffed into knapsack of that weight 0 1 2 3 4 5 Weights ? 2 3 4 5 8 Values ? 3 2 6 10 9 Weight 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 I T 1 E 2 M 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 3 3 3 5 5 5 5 5 5 5 5 5 5 5 0 0 3 3 6 6 9 9 9 11 11 11 11 11 11 11 4 0 0 3 3 6 10 10 13 13 16 16 19 19 19 21 21 5 0 0 3 3 6 10 10 13 13 16 16 19 19 19 21 22 Example: Minimum Edit Distance A useful dynamic programming algorithm • Problem: How can we measure how different one word is from another word (ie spell checker)? – How many operations will transform one word into another? – Examples: caat --> cat, fplc --> fireplace • Definition: – Levenshtein distance: smallest number of insertion, deletion, or substitution operations to transform one string into another – Each insertion, deletion, or substitution is one operation, with a cost of 1. • Requires a two dimension array – Rows: source word positions, Columns: spelled word positions – Cells: distance[r][c] is the distance (cost) up to that point Pseudo Code (minDistance(target, source)) n = character in source m = characters in target Create array, distance, with dimensions n+1, m+1 FOR r=0 TO n distance[r,0] = r FOR c=0 TO m distance[0,c] = c FOR each row r FOR each column c IF source[r]=target[c] cost = 0 ELSE cost = 1 distance[r,c]=minimum of distance[r-1,c] + 1, //insertion distance[r, c-1] + 1, //deletion and distance[r-1,c-1] + cost) //substitution Result is in distance[n,m] Example 0 G 1 A 2 M 3 B 4 O 5 L 6 G U M B O 1 2 3 4 5 • Source: GAMBOL, Target: GUMBO • Algorithm Step: Initialization Example G U M B O 0 1 2 3 4 5 G 1 0 A 2 1 M 3 2 B 4 3 O 5 4 L 6 5 • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 1 Example G U M B O 0 1 2 3 4 5 G 1 0 1 A 2 1 1 M 3 2 2 B 4 3 3 O 5 4 4 L 6 5 5 • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 2 Example G U M B O 0 1 2 3 4 5 G 1 0 1 2 A 2 1 1 2 M 3 2 2 1 B 4 3 3 2 O 5 4 4 3 L 6 5 5 4 • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 3 Example G U M B O 0 1 2 3 4 5 G 1 0 1 2 3 A 2 1 1 2 3 M 3 2 2 1 2 B 4 3 3 2 1 O 5 4 4 3 2 L 6 5 5 4 3 • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 4 Example G U M B O 0 1 2 3 4 5 G 1 0 1 2 3 4 A 2 1 1 2 3 4 M 3 2 2 1 2 3 B 4 3 3 2 1 2 O 5 4 4 3 2 1 L 6 5 5 4 3 2 • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 5 • Result: Distance equals 2 Another Example E X E C U T I O N 0 1 2 3 4 5 6 7 8 9 I 1 1 2 3 4 5 6 6 7 8 N 2 2 2 3 4 5 6 7 7 7 T 3 3 3 3 4 5 5 6 7 8 E 4 3 4 3 4 5 6 6 7 8 N 5 4 4 4 4 5 6 7 7 7 T 6 5 5 5 5 5 5 6 7 8 I 7 6 6 6 6 6 6 5 6 7 O 8 7 7 7 7 7 7 6 5 6 N 9 8 8 8 8 8 8 7 6 5 Comparing Audio Frames • Patterns: Database of audio samples. Match to the ones that are closest • Templates: Database of features extracted from audio samples • Training: Use a training set to create a vector of patterns or templates • Distortion Measure: algorithm to measure how far a pair of templates or patterns are apart Will Minimum Edit Distance Work? • Maybe: Distances to higher array indices are functions of the costs of smaller indices; a dynamic programming approach may work • Issues – – – – – – – The algorithm may be too slow A binary equal or not equal comparison does not work A distance metric is needed Speaking rates change, even with a single speaker Do we compare the raw data or frame-based features? Do we assign cost to adjacent cells or to those further away? Other issues: Phase, energy, pitch misalignment, Presence of noise, length of vowels, Phoneme pronunciation variances, etc. Incorrect comparisons occur when the algorithm isn’t carefully designed Dynamic Time Warping Goal: Find “best” alignment between pairs of audio frames (A) The matrix to the right shows the optimal alignment path (warping) between frames from utterance A with those of utterance B time (frame) of (B) (B) time (frame) of (A) Dynamic Time Warping (DTW) Overview • Computes the “distance” between 2 frames of speech o Measures frame by frame distances to compute dissimilarities between speech o Allows the comparison to warp the comparison to account for differences in speaking rates o Requires a cost function to account for different paths through frames of speech o Uses dynamic programming algorithm to find best warping. o Computes a total “distortion score” for best warped path. • Assumptions o Constrain begin and end times to be (1,1) and (TA,TB) o Allow only monotonically increasing time o Don’t allow too many frames to be skipped o Can express results in terms of “paths” with “slope weights” 26 Assumptions • Does not require that both patterns have the same length • One speech pattern is the “input” and the other speech pattern is the “template” to compare against • Divide speech signal into equally-spaced frames (10-30ms) with approximately 50% overlap and compute a frame-based feature set • The local distance measure (d) is the distance between features at a pair of frames (one from A, one from B). • The Global distortion from beginning of utterance until current pair of frames called G. 27 Algorithm Efficiency • The algorithm complexity is O(m*n) where m and n are the respective number of frames between the two utterances. If m=n, the algorithm is O(n2). Why?: Count the number of cells that need to be filled in. • O(n2) may be too slow. Alternate solutions have been devised. – Don’t fill in all of the cells. – Use a multi-level approach Don’t Fill in all of the Cells Disadvantage: The algorithm may miss the optimal path The Multilevel Approach Concept 1. 2. 3. 4. 5. Coarsen the array of features Run the algorithm Refine the array Adjust the solution Repeat steps 3-4 till the original array of features is restored Notes • The multilevel approach is a common technique for increasing many algorithms’ complexity from O(n2) to O(n lg n) • Example: partitioning a graph to balance work loads among threads or processors Which Audio Features? • Cepstrals: They are statistically independent and phase differences are removed • ΔCepstrals, or ΔΔCepstrals: Reflects how the signal is changing from one frame to the next • Energy: Distinguish the frames that are voiced verses those that are unvoiced • Normalized LPC Coefficients: Represents the shape of the vocal track normalized by vocal tract length for different speakers. These are some of the popular speech recognition features Distance Metric Requirements Definition: Measure similarity of two frames of speech. The vector xt , yt contain the features from frames of two signals A distance measure should have the following properties: 0 d(xt,yt) (positive definiteness) 0 = d(xt,yt) iff xt = yt d(xt,yt) = d(xt,yt) (symmetry) d(xt,yt) d(xt,zt) + d(zt,yt) (triangle inequality) A speech distance metric should correlate with perceived distance. Perceptually-warped spectral features work well in practice 32 Which Distance Metric? • General Formula: array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1],array[i,j-1)} • Assumption : There is no cost assessed for duplicate or eliminated frames. • Distance Formula: – Euclidian: sum the square of one metric minus another squared – Linear: sum the absolute value of the distance between features • Example of a distance metric using linear distance ∑ wi |(fa[i] – fb[i])| where f[i] is a particular audio feature for signals a and b. w[i] is that feature’s weight Which Local Distance Measure? where xt(f), and yt(f) are frame frequency values at time t; f is a feature index • Euclidean distance: • Mahalanobis distance: F 1 x ( f ) y ( f ) 2 t t f 0 Other distance measures Itakura-Saito distortion, COSH, likelihood ratio, etc… Note: we can weight the features by multiplying differences by weighting factors to emphasize/deemphasize certain features Dynamic Time Warping Termination Step Divide the total computed cost by a normalizing factor The normalizing factor is necessary to compare results between input speech and various templates to which it is compared 1. One quick and effective normalizing method divides by the number of frames in the template. 2. Another method is divide the result by the length of the path taken, where we adjust the length by the slope weights at each transition. This requires backtracking to sum the slope values, but can sometimes be more accurate. 35 Frame transition cost heuristics ½ P1=(1,1)(1,0) P2 ½ P2=(1,1) P3=(1,1)(0,1) ½ P1 P1 ½ 1 P2 P3 P1=(1,0) P2=(1,1) P3=(1,2) P3 Heuristic 1 Heuristic 2 • Path P and slope weight m determined heuristically • Paths considered backward from target frame • Larger weight values for less preferable paths • Optimal paths always go up, right (monotonically forward in time) • Only evaluate P if all frames have meaningful values (e.g. don’t evaluate a path if one frame is at time 1, because there 36 is no data for time 1). Dynamic Time Warping (DTW) Algorithm 1. Initialization (time 1 is first time frame) D(1,1) = d(1,1) min 2. Recursion D( x, y) D( x, y) ((x, y), ( x, y)) ( x, y) L ((x, y), ( x, y)) d ( xl , yl )m(l ) l 1 l index along valid pathfrom( x' , y' ) to ( x, y), from1 to L where (=zeta) is a function of previous distances and slopes 3. Termination result D ( xT , yT ) M sometimes defined as Tx, or Tx+Ty, T 2+ T 2)½ or (T x y M m ( k ) k 1 A convenient value for M is the length of the template. 37 Dynamic Time Warping (DTW) Example 2 3 2 2 2 2 3 2 3 1 1 1 1 2 2 2 1 2 2 2 2 1 2 2 1 3 3 1 1 1 1 2 3 3 1 1 3 3 3 3 3 6 7 8 D(1,1) = D(2,1) = D(3,1) = D(4,1) = 6 7 D(1,2) = D(2,2) = D(3,2) = D(4,2) = 7 9 D(2,3) = D(3,3) = D(4,3) = D(5,3) = 5 5 8 11 D(3,4) = D(4,4) = D(5,4) = D(6,4) = 6 9 D(4,5) = D(5,5) = D(6,5) = D(7,5) = D(4,6) = D(5,6) = D(6,6) = D(7,6) = 6 5 5 5 4 4 2 4 2 3 1 2 6 4 5 heuristic paths: 3 8 11 14 12 1 1 1 P1=(1,0) P2=(1,1) P3=(1,2) begin at (1,1), end at (7,6) 17 normalized distortion = 8/6 = 1.33 normalized distortion = 8/7 = 1.14 38 Dynamic Time Warping (DTW) Example 3 2 3 2 2 2 3 2 3 1 1 1 2 2 2 1 2 2 2 8 9 2 1 3 2 1 8 1 2 3 1 2 3 3 3 3 13 11 12 12 12 13 10 heuristic paths: 1 1 1 P1=(1,0) P2=(1,1) P3=(0,1) begin at (1,1), end at (6,6) D(4,1) = 9 … D(1,1) = 1 D(2,1) = 3 D(3,1) = 6 11 11 D(1,2) = 3 D(2,2) = 2 D(3,2) = 10 D(4,2) = 7 … 9 10 10 10 D(1,3) = 5 D(2,3) = 10 D(3,3) = 11 D(4,3) = 9 … D(2,4) = 7 9 10 10 7 7 5 10 11 9 8 11 D(1,4) = 7 3 2 10 7 9 12 D(1,5) = 10 D(2,5) = 9 1 3 6 9 12 15 D(3,4) = 9 D(4,4) = 10 … D(3,5) = 10 D(4,5) = 10 … D(1,6) = 13 D(2,6) = 11 D(3,6) = 12 D(4,6) = 12 … normalized distortion = 13/6 = 2.17 39 Dynamic Time Warping (DTW) Example 9 8 5 3 1 2 7 7 4 1 3 4 8 6 1 2 3 5 7 5 2 1 3 4 5 3 2 2 4 6 4 2 3 1 3 7 heuristic paths: ½ ½ 1 ½ P1=(1,1)(1,0) ½ P2=(1,1) P3=(1,1)(0,1) begin at (1,1), end at (6,6) D(1,1) = D(2,1) = D(3,1) = D(4,1) = D(1,2) = D(2,2) = D(3,2) = D(4,2) = D(2,3) = D(3,3) = D(4,3) = D(5,3) = D(3,4) = D(4,4) = D(5,4) = D(6,4) = D(3,5) = D(4,5) = D(5,5) = D(6,5) = D(3,6) = D(4,6) = D(5,6) = D(6,6) = 40 Singularities • Assumption – The minimum distance comparing two signals only depends on the previous adjacent entries – The cost function accounts for the varied length of a particular phoneme, which causes the cost in particular array indices to no longer be well-defined • Problem: The algorithm can compute incorrectly due to mismatched alignments • Possible solutions: – Compare based on the change of feature values between windows instead of the values themselves – Pre-process to eliminate the causes of the mismatches Possible Preprocessing • Normalize the energy of voiced audio: – Compute the energy of both signals – Multiply the larger by the percentage difference • Brick Wall Normalize the peaks and valleys: – Find the average peak and valley value – Set values larger than the average equal to the average • Normalize the pitch: Use PSOLA to align the pitch of the two signals • Remove duplicate frames: Auto correlate frames at pitch points • Implement a noise removal algorithm • Normalize the speaking rate Dynamic Time Warping/Hidden Markov Models • Dynamic Time Warping (DTW) o Comparing speech with a number of templates o The algorithm selects the template with the lowest normalized distortion to determine the recognized word. • Hidden Markov Models (HMMs) o Refines the DTW technology. o HMMs compare speech against “probabilistic templates” o HMMs compute most likely paths using probabilities 43 Phoneme Marking • Goal: Mark the start and end of phoneme boundaries • Research – Unsupervised text (language) independent algorithms have been proposed – Accuracy: 75% to 80%, which is 5-10% lower than supervised algorithms that make assumptions about the language • If successful, a database of phonemes can be used in conjunction with dynamic time warping to simplify the speech recognition problem