Dealing with Unknown Unknowns
(in Speech Recognition)
Hynek Hermansky
Processing speech in multiple parallel processing streams, which attend to different
parts of the signal space and use different strengths of prior top-down knowledge,
is proposed for dealing with unexpected signal distortions and with unexpected
lexical items. Some preliminary results in machine recognition of speech are presented.
There are things we do not know we
don't know.
Donald Rumsfeld
Letter to the Editor, J. Acoust. Soc. Am.
Research field of “mad inventors or untrustworthy engineers”
“Funding artificial intelligence is real stupidity”
“After growing wildly for years, the field of computing appears to be reaching its infancy.”
• supervised the Bell Labs team which built the first transistor
• President’s Science Advisory Committee
• developed the concept of pulse code modulation
• designed and launched the first active communications satellite
…should people continue to work towards speech recognition by
machine? Perhaps it is for people in the field to decide.
Why am I working in this field?
Why did I climb Mt. Everest?
Because it is there!
-Sir Edmund Hillary
Spoken language is one of the most amazing
accomplishments of the human race.
• access to information
• voice interactions with machines
• extracting information from speech data!
Problems faced in machine recognition of speech reveal
basic limitations of all information technology !
We speak in order to hear, in
order to be understood.
-Roman Jakobson
[Diagram: speech communication involves production, perception, cognition, …; recognition draws on both knowledge and data]
Speech recognition
…a problem of maximum likelihood decoding
-Frederick Jelinek
Hidden Markov Model
Stochastic recognition of speech

Ŵ = argmax_W p(x|W) P(W)

Ŵ – estimated speech utterance
p(x|W) – likelihood of acoustic models of speech sounds;
the models are derived by training on very large amounts of speech data
P(W) – prior probability of the speech utterance (language model),
estimated from large amounts of data (typically text)
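The MAP decoding rule above can be illustrated with a toy sketch. Everything here is invented for illustration (the candidate words, their acoustic log-likelihoods, and their language-model priors); a real recognizer searches over utterances with an HMM, not over a three-word table:

```python
import math

# Toy MAP decoding: W_hat = argmax_W p(x|W) P(W), done in the log domain.
# All numbers below are made up for illustration.
acoustic_log_lik = {"hello": -12.0, "yellow": -11.5, "fellow": -13.0}  # log p(x|W)
lm_log_prior = {w: math.log(p) for w, p in
                {"hello": 0.6, "yellow": 0.1, "fellow": 0.3}.items()}  # log P(W)

def map_decode(loglik, logprior):
    # Adding log-likelihood and log-prior multiplies p(x|W) P(W).
    return max(loglik, key=lambda w: loglik[w] + logprior[w])

# The language-model prior outweighs the slightly better
# acoustic score of "yellow".
print(map_decode(acoustic_log_lik, lm_log_prior))
```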
“Unknown unknowns” in machine recognition of speech
• distortions not seen in the training data of the acoustic model
• words that are not expected by the language model
One possible way of dealing with unknown unknowns
Information in speech is coded in many redundant dimensions.
Not all dimensions get corrupted at the same time.
[Diagram: signal → information fusion → decision]

• Parallel information-providing streams, each carrying different redundant dimensions of a given target.
• A strategy for comparing the streams.
• A strategy for selecting “reliable” streams.
Stream formation
• Different perceptual modalities
• Different processing channels within each modality
• Bottom-up and top-down dominated channels

Comparing the streams?
• various correlation (distance) measures

Selecting reliable streams?????
Perceptual Data
Fletcher et al
P(e) = ∏ᵢ P(eᵢ)
Probability of error of recognition of full-band speech is given by a
product of probabilities of errors in subbands
Boothroyd and Nittrouer
Probability of error of recognition in contexts is given by a product of
probabilities of errors of recognition without context and probability of
error in channel which provides information about the context
Final error dominated by the channel with smallest error !
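Fletcher's product-of-errors rule lends itself to a one-line sketch; the per-subband error rates below are made-up numbers, chosen to show how a single accurate subband dominates the result:

```python
# Fletcher's rule: full-band error probability is the product of the
# (assumed independent) per-subband error probabilities.
def fullband_error(subband_errors):
    p = 1.0
    for e in subband_errors:
        p *= e
    return p

# Illustrative per-subband error rates (invented); the product is
# driven down by the single accurate (0.05) subband.
print(fullband_error([0.5, 0.4, 0.05, 0.6]))
```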
Evidence for different
processing strategies
Processing streams
Auditory cortical receptive fields
• Different carrier frequencies
• Different carrier bandwidths
• Different spectral and temporal resolutions
• Different modalities
• Different prior biases

[Figure: auditory cortical receptive fields with different carrier frequencies and different spectral and temporal resolutions (frequency vs. time [s]); from N. Mesgarani]

A large number of parallel processing streams
Evidence for equally powerful bottom-up
and top-down streams ?
From the subjective point of view, there is nothing special
that would differentiate between the top-down and
bottom-up dominated processing streams. All streams
provide information for a decision. When all streams
provide non-conflicting information, all this information is
used for the decision. When the context allows for multiple
interpretations of the sensory input, the bottom-up
processing stream dominates. When the sensory input gets
corrupted by noise, the top-down dominated stream fills in
for the corrupted bottom-up input.
Hermansky 2013
Monitoring Performance
P(e) = ∏ᵢ P(eᵢ)

For an ideal observer detecting a target through two streams with detection
probabilities P₁ and P₂, a miss occurs only when both streams miss:

P_miss = (1 − P₁)(1 − P₂)

For a real observer, false positives and false negatives are possible, so

P_miss_observed ≠ (1 − P₁)(1 − P₂)
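The ideal-observer miss probability is easy to check numerically; the detection probabilities P1 = 0.9 and P2 = 0.8 are invented for illustration:

```python
# Ideal observer with two independent streams: a miss requires both
# streams to miss, so P_miss = (1 - P1)(1 - P2). A real observer also
# makes false positives/negatives, so its observed miss rate deviates
# from this product.
def ideal_miss(p1, p2):
    return (1.0 - p1) * (1.0 - p2)

print(ideal_miss(0.9, 0.8))  # roughly 0.02
```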
Could it be that we know when we know ?
Performance Monitoring in Sensory Perception

[Figure: percentage of “sparse”, “dense”, and “not sure” judgments as a function of picture density (low to high, 0–100 %); human judgment, adapted from Smith et al 2003. Similar data are available for monkeys, dolphins, rats, …]

Machine?
[Diagram: training data → classifier → model of the output; compare models and update]
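One way to emulate the “compare models, update” loop is to measure the divergence between the distribution of classifier outputs on current data and the distribution observed on training data, declaring “not sure” when the divergence is large. The use of KL divergence and the threshold value here are illustrative assumptions, not necessarily what any cited system uses:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # D(p || q) for discrete distributions; eps guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical average output distributions (made-up numbers):
train_model = [0.7, 0.2, 0.1]  # seen on training data
test_model = [0.3, 0.3, 0.4]   # seen on current test data

divergence = kl_divergence(test_model, train_model)
knows = divergence < 0.5  # illustrative "machine knows it knows" threshold
print(divergence, knows)
```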
Knowing when one knows !
[Diagram: testing data → classifier → model of the output, to be compared with the model built on training data]

[Figure: spectrogram (frequency vs. time) and posteriogram (phoneme posteriors vs. time). Up to 1 s of preprocessed data feeds an artificial neural network trained on large amounts of labeled data; fusion of ANN outputs yields phoneme posteriors.]
Fusion of streams of different carrier frequencies
[Hermansky et al 1996, Li et al 2013]
Preliminary results using multi-stream speech
recognition on noisy TIMIT data
• Processing is done in multiple parallel streams
• Signal corruption affects only some streams
• Performance monitor selects N best streams for further processing
[Diagram: speech signal → filterbank → subbands 1–5 → subband ANNs, combined to form 31 processing streams → fusion ANNs → performance monitor selecting the N best streams → average → Viterbi decoder → phoneme sequence]
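The selection step can be sketched as follows; the monitor scores, posterior vectors, and the choice of simple averaging are invented toy details, standing in for the actual fusion used in the experiments:

```python
# Performance-monitor-driven stream selection: rank streams by a
# reliability score, keep the N best, and average their phoneme
# posteriors before decoding. All numbers are illustrative.
def select_and_average(posteriors, scores, n_best):
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = ranked[:n_best]
    dim = len(posteriors[0])
    return [sum(posteriors[i][k] for i in keep) / n_best for k in range(dim)]

streams = [[0.8, 0.1, 0.1],  # stream 0: clean
           [0.2, 0.5, 0.3],  # stream 1: corrupted
           [0.6, 0.2, 0.2]]  # stream 2: clean
monitor_scores = [0.9, 0.1, 0.7]  # hypothetical reliability scores

# The corrupted stream 1 is dropped; streams 0 and 2 are averaged.
print(select_and_average(streams, monitor_scores, n_best=2))
```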
Phoneme recognition error rates on noisy TIMIT data
environment     | conventional | proposed | best by hand
clean           | 31 %         | 28 %     | 25 %
car at 0 dB SNR | 54 %         | 38 %     | 35 %
conventional “deep” net
[Diagram: up to 100 ms of all available frequency components → many processing layers → (transformed) posterior probabilities of speech sounds]

“long, wide and deep” net
[Diagram: up to 1000 ms of signal; high-, mid-, and low-frequency components are each processed by many layers to yield info₁ … infoᵢ … info_N → “smart” fusion → (transformed) posterior probabilities of speech sounds]
Conclusions we would eventually like
to make
• Recognition should be done in parallel processing streams, each attending to a particular aspect of the signal and using different levels of top-down expectations
• Discrepancy among the streams indicates an unexpected signal
• Suppressing corrupted streams can increase robustness to unexpected inputs
Machine Emulation of Human Speech Communication

Fred Jelinek: “Speech recognition …a problem of maximum likelihood decoding”
→ information and communication theory, machine learning, large data, …

Roman Jakobson: “We speak, in order to be heard, in order to be understood”
→ human communication, speech production, perception, neuroscience, cognitive science, …

Gordon Moore: “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year…”
→ tools

John Pierce: “..devise clear, simple, definitive experiments. So a science of speech can grow, certain step by certain step.”
also John Pierce:
(Speech recognition is, so far (1969), a field of) “mad inventors or untrustworthy engineers” (because the machine needs) “intelligence and knowledge of language comparable to those of a native speaker.”
Sounds like a good goal to aim at!
THANKS !
Jont Allen, Hamed Ketabdar, Feipeng Li, Vijay Peddinti, Ehsan Variani, Harish Mallidi, Samuel Thomas, Nima Mesgarani, Misha Pavel