Learning of Word Boundaries in Continuous Speech using Time Delay Neural Networks

Colin Tan
School of Computing,
National University of Singapore.
ctank@comp.nus.edu.sg
Motivations
• Humans are able to automatically segment
words and sounds in speech with little
difficulty.
• The ability to automatically segment words
and phonemes is also useful in training speech
recognition engines.
Principle
• Time-Delay Neural Network
– Input nodes have shift registers that allow the
TDNN to generalize not only between discrete
input-output pairs, but also over time.
– Given reasonably good initial estimates, a
TDNN is able to learn true word boundaries.
– We make use of this property in our work.
Why TDNN?
• Representational simplicity
– Intuitively easy to understand what the
TDNN's inputs and outputs represent.
• Ability to generalize over time
• Hidden Markov Models have been left out
of this work for now.
Time Delay Neural Networks
• Diagram shows a 2input TDNN node.
• Constrained weights
allow generalization
over time.
[Diagram: a 2-input TDNN neuron. Each input line (In1, In2) feeds the
neuron at delays t-2, t-1, t, t+1, and t+2 through constrained (tied)
weights.]
Boundary Shift Algorithm
• Initially:
– The TDNN is trained on a small, manually segmented set of data.
– Given the expected number of words in a new, unseen utterance, the
cepstral frames in the utterance are distributed evenly over the words.
• For example, if there are 2,000 frames and 10 expected words, each word is
allocated 200 frames.
• Convex-Hull and Spectral Variation Function methods may be used to
estimate the number of words in an utterance.
• For our experiments we manually counted the number of words in each
utterance.
Boundary Shift Algorithm
1. The minimally trained TDNN is retrained using both its
original data and the new, unseen data.
2. After retraining, a variable-sized window is placed
around each boundary.
– The window is initially +/- 16 frames.
3. A search is made within the window for the highest-scoring
frame, and the boundary is shifted to that frame.
– This search is allowed to extend past boundaries into
neighboring words.
4. The TDNN is retrained using the new boundaries.
Boundary Shift Algorithm
5. Windows are adjusted by +/- 2 frames (i.e. reduced by a
total of 4 frames), and steps 3 to 5 are repeated.
6. The algorithm ends when boundary shifts are negligible, or
when windows shrink to 0 frames.
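Steps 3 to 6 can be sketched as a search loop. This is a simplified sketch: `score(frame)` stands in for the TDNN's boundary score, the retraining step is reduced to a comment, and boundaries are treated independently.

```python
def boundary_shift(boundaries, score, window=16, shrink=2):
    """One run of the boundary-shift search (steps 3-6).

    Each boundary moves to the highest-scoring frame within
    +/- `window` frames; the window then shrinks by `shrink`
    frames per side and the search repeats until the window
    closes or no boundary moves.
    """
    boundaries = list(boundaries)
    while window > 0:
        moved = False
        for i, b in enumerate(boundaries):
            lo, hi = b - window, b + window
            best = max(range(lo, hi + 1), key=score)
            if best != b:
                boundaries[i] = best
                moved = True
        # (In the full algorithm, the TDNN is retrained here
        # on the shifted boundaries before the next pass.)
        if not moved:
            break
        window -= shrink
    return boundaries
```

A search window that extends past the current boundary is what allows frames to migrate into neighboring words, as step 3 describes.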
Network Pruning
• Limited training data leads to the problem of overfitting.
• Three parameters are used to decide which TDNN
nodes to prune.
– The significance sj(max), which measures how much a particular
node contributes to the final answer. A node with a
small significance value contributes little to the final
answer and can be pruned.
Network Pruning
• Three parameters are used to prune the TDNN:
– The variance j, which measures how much a particular
node changes over all the inputs. A node that changes very
little over all the inputs is not contributing to the learning,
and can be removed.
– Pairwise node distance ji, which measures how node
changes with respect to another. A node that follows another
node closely in value is redundant and can be removed.
Network Pruning
• Thresholds are set for each parameter. Nodes with
parameters falling below these thresholds are pruned.
• Selection of thresholds is critical.
• Pruning is performed after the TDNN has been trained on
the initial set for about 200 cycles.
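The threshold-based selection can be sketched as below. This is an illustration only: the threshold names and values are assumptions, and the slides do not specify how ties between redundant node pairs are broken (here the later-indexed node is dropped).

```python
def prune_nodes(significance, variance, pair_dist, s_min, v_min, d_min):
    """Return indices of hidden nodes to prune (a sketch).

    A node is pruned when its significance or variance falls
    below threshold, or when it tracks another surviving node
    too closely (pairwise distance below d_min)."""
    n = len(significance)
    pruned = set()
    for j in range(n):
        if significance[j] < s_min or variance[j] < v_min:
            pruned.add(j)
    for j in range(n):
        for i in range(j):
            if i not in pruned and j not in pruned and pair_dist[j][i] < d_min:
                pruned.add(j)  # keep node i, drop its redundant twin j
    return sorted(pruned)
```

As the slides note, the choice of `s_min`, `v_min`, and `d_min` is critical: thresholds set too high remove nodes the network still needs.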
Experiments
• TDNN Architecture
– 27 Inputs
• 13 delta-cepstral (dcep) coefficients, 13 delta-delta cepstral (ddcep) coefficients, and power.
– 5 input delays
– 96 Hidden Nodes
• Arbitrarily chosen, to be pruned later.
– 2 Binary Output Nodes
• Represent word start and end boundaries.
Experiments
• Data gathered from 6 speakers
– 3 male, 3 female.
– Solving task similar to CISD Trains Scheduling
Problem (Ferguson 96).
– About 20-30 minutes of speech used to train
TDNN.
– 20 utterances, previously unseen, chosen to
evaluate performance.
Experiment Results
Performance Before Pruning
• Results shown relative to hand-labeled samples.
             Inside Test    Outside Test
Precision:   66.88%         56.22%
Recall:      67.33%         76.69%
F-Number:    67.07%         64.88%
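The slides do not define F-Number, but the harmonic mean of precision and recall reproduces the reported figures to within rounding (e.g. 64.88% for the outside test); a minimal sketch, under that assumption:

```python
def f_number(precision, recall):
    """F-Number as the harmonic mean of precision and recall,
    both given as percentages (assumed definition)."""
    return 2 * precision * recall / (precision + recall)
```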
Experiment Results
Performance After Pruning
             Inside Test    Outside Test
Precision:   66.03%         57.10%
Recall:      61.41%         72.16%
F-Number:    63.61%         61.71%
Example Utterances
Subject: CK
Utterance: Ok thanks, now I need to find out how long does
it need to travel from Elmira to Corning
(okay) (th-) (-anks) (now) (i need) (to) (find) (out how)
(long) (does it need) (to) (travel) (f-) (-om) (emira) (to c-)
(orning)
Example Utterances
Subject: CT
Utterance: May I know how long it takes to travel from
Elmira to Corning?
(may i) (know how) (long) (does it) (take) (to tr-) (avel) (from) (el-) (-mira) (to) (corn-) (-ning)
Deletion Errors
• Most prominent in places framed by plosives.
• The algorithm is able to detect boundaries at the ends of the
phrase but not in the middle, due to the presence of ‘d’ plosives
at the ends.
Insertion Errors
• Most prominent in places where a vowel is stretched.
Recommendations for Further
Work
• The results presented here are early, but
promising.
• Future work will combine TDNN with other
statistical methods like Expectation
Maximization.