Learning of Word Boundaries in Continuous Speech using Time Delay Neural Networks
Colin Tan
School of Computing, National University of Singapore.
ctank@comp.nus.edu.sg

Motivations
• Humans are able to automatically segment words and sounds in speech with little difficulty.
• The ability to automatically segment words and phonemes is also useful in training speech recognition engines.

Principle
• Time-Delay Neural Network
  – Input nodes have shift registers that allow the TDNN to generalize not only between discrete input-output pairs, but also over time.
  – Ability to learn true word boundaries given reasonably good initial estimations.
  – We make use of this property for our work.

Why TDNN?
• Representational simplicity
  – Intuitively easy to understand what the inputs and outputs of the TDNN represent.
• Ability to generalize over time
• Hidden Markov Models have been left out of this work for now.

Time Delay Neural Networks
• Diagram shows a 2-input TDNN node.
• Constrained weights allow generalization over time.
[Diagram: a TDNN neuron with two inputs, In1 and In2, each presented at times t-2, t-1, t, t+1 and t+2 through constrained weights.]

Boundary Shift Algorithm
• Initially:
  – The TDNN is trained on a small manually segmented set of data.
  – Given the expected number of words in a new, unseen utterance, the cepstral frames in the utterance are distributed evenly over all the words.
    • For example, if there are 2,000 frames and 10 expected words, each word is allocated 200 frames.
    • Convex-Hull and Spectral Variation Function methods may be used to estimate the number of words in the utterance.
    • For our experiments we manually counted the number of words in each utterance.
1. The minimally trained TDNN is retrained using both its original data and the new unseen data.
2. After retraining, a variable-sized window is placed around each boundary.
  – Window is initially +/- 16 frames.
3. A search is made within the window for the highest scoring frame. The boundary is shifted to that frame.
  – This search is allowed to search past boundaries into neighboring words.
4. TDNN is retrained using new boundaries.
5. Windows are adjusted by +/- 2 frames (i.e. reduced by a total of 4 frames), and steps 3 to 5 are repeated.
6. Algorithm ends when boundary shifts are negligible, or windows shrink to 0 frames.

Network Pruning
• Limited training data leads to the problem of overfitting.
• Three parameters are used to decide which TDNN nodes to prune.
  – Significance j(max), which measures how much a particular node contributes to the final answer. A node with a small Significance value contributes little to the final answer and can be pruned.
  – The variance j, which measures how much a particular node changes over all the inputs. A node that changes very little over all the inputs is not contributing to the learning, and can be removed.
  – Pairwise node distance ji, which measures how a node changes with respect to another. A node that follows another node closely in value is redundant and can be removed.
• Thresholds are set for each parameter. Nodes with parameters falling below these thresholds are pruned.
• Selection of thresholds is critical.
• Pruning is performed after the TDNN has been trained on the initial set for about 200 cycles.

Experiments
• TDNN Architecture
  – 27 Inputs
    • 13 dcep coefficients, 13 ddcep coefficients, power.
  – 5 input delays
  – 96 Hidden Nodes
    • Arbitrarily chosen, to be pruned later.
  – 2 Binary Output Nodes
    • Represent word start and end boundaries.
• Data gathered from 6 speakers
  – 3 male, 3 female.
  – Solving task similar to CISD Trains Scheduling Problem (Ferguson 96).
  – About 20-30 minutes of speech used to train TDNN.
  – 20 utterances, previously unseen, chosen to evaluate performance.
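
The Boundary Shift Algorithm described earlier can be sketched in Python as follows. This is a minimal illustration and not the author's implementation. All function names are hypothetical. `score_fn` is an assumed stand-in for "retrain the TDNN on the current boundaries and return one boundary score per frame", and `min_shift` is an assumed threshold for what counts as a negligible shift, since the slides give no value.

```python
from typing import Callable, List, Sequence


def even_boundaries(num_frames: int, num_words: int) -> List[int]:
    """Interior word boundaries when frames are spread evenly over the words."""
    return [round(k * num_frames / num_words) for k in range(1, num_words)]


def shift_boundaries(scores: Sequence[float], boundaries: Sequence[int],
                     half_window: int) -> List[int]:
    """Move each boundary to the highest scoring frame inside its window.

    The window may reach past neighboring boundaries, as the slides allow.
    Ties are broken in favor of the frame closest to the current boundary
    (an assumption; the slides do not say how ties are handled).
    """
    shifted = []
    for b in boundaries:
        lo = max(0, b - half_window)
        hi = min(len(scores) - 1, b + half_window)
        best = max(range(lo, hi + 1), key=lambda t: (scores[t], -abs(t - b)))
        shifted.append(best)
    return shifted


def boundary_shift(score_fn: Callable[[List[int]], Sequence[float]],
                   num_frames: int, num_words: int,
                   half_window: int = 16, step: int = 2,
                   min_shift: int = 1) -> List[int]:
    """Iterate retrain/search/shift while shrinking the window each pass."""
    boundaries = even_boundaries(num_frames, num_words)
    while half_window > 0:
        # Hypothetical hook: retrain on the current boundaries, score each frame.
        scores = score_fn(boundaries)
        moved = shift_boundaries(scores, boundaries, half_window)
        largest = max((abs(n - o) for n, o in zip(moved, boundaries)), default=0)
        boundaries = sorted(moved)
        if largest < min_shift:   # shifts are negligible: stop early
            break
        half_window -= step       # +/- 16, then +/- 14, ... down to 0
    return boundaries


if __name__ == "__main__":
    # The slide's example: 2,000 frames over 10 words gives 200 frames per word.
    print(even_boundaries(2000, 10)[:3])
```

The scoring hook is kept separate from the search so that any frame-level scorer can be plugged in; in the slides that role is played by the retrained TDNN's boundary outputs.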
Experiment Results
• Results shown relative to hand-labeled samples.

Performance Before Pruning
            Inside Test   Outside Test
Precision   66.88%        56.22%
Recall      67.33%        76.69%
F-Number    67.07%        64.88%

Performance After Pruning
            Inside Test   Outside Test
Precision   66.03%        57.10%
Recall      61.41%        72.16%
F-Number    63.61%        61.71%

Example Utterances
Subject: CK
Utterance: Ok thanks, now I need to find out how long does it need to travel from Elmira to Corning
Segmentation: (okay) (th-) (-anks) (now) (i need) (to) (find) (out how) (long) (does it need) (to) (travel) (f-) (-om) (emira) (to c-) (orning)

Subject: CT
Utterance: May I know how long it takes to travel from Elmira to Corning?
Segmentation: (may i) (know how) (long) (does it) (take) (to tr-) (avel) (from) (el-) (-mira) (to) (corn-) (-ning)

Deletion Errors
• Most prominent in places framed by plosives.
• Algorithm able to detect boundaries at ends of the phrase but not in middle, due to presence of ‘d’ plosives at the ends.

Insertion Errors
• Most prominent in places where a vowel is stretched.

Recommendations for Further Work
• Results presented are early research results, and are promising.
• Future work will combine TDNN with other statistical methods like Expectation Maximization.
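
The slides do not say how the F-Number in the results was computed. A common definition is the balanced F-measure, the harmonic mean of precision and recall, sketched below under that assumption. The function name and the `beta` parameter are illustrative and do not come from the slides. Not every reported entry is reproduced exactly by this formula, so the figures may have been averaged per utterance or weighted differently; that is a guess.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure of precision and recall; beta=1 gives the harmonic mean.

    Inputs may be fractions or percentages; the result uses the same scale.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Outside test before pruning, using the precision and recall reported above.
    print(round(f_measure(56.22, 76.69), 2))  # 64.88
```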