A brief overview of
Speech Recognition
and
Spoken Language
Processing
Advanced NLP
Guest Lecture August 31
Andrew Rosenberg
Speech and NLP
• Communication in Natural Language
• Text:
– Carefully prepared
– Grammatical
– Machine readable
• Typos
• Sometimes OCR or handwriting issues
1
Speech and NLP
• Communication in Natural Language
• Speech:
– Spontaneous
– Less Grammatical
– Machine readable
• often with > 10% word error rate from speech recognition.
2
NLP Tasks
• Parsing
• Name Tagging
• Sentiment Analysis
• Entity Coreference
• Relation Extraction
• Machine Translation
3
Speech Tasks
• Parsing
– Speech isn’t always grammatical
• Name Tagging
– If a name isn’t “in vocabulary” what do you do?
• Sentiment Analysis
– How the words are spoken helps.
• Entity Coreference
• Relation Extraction
• Machine Translation
– How can these handle misrecognition errors?
4
Speech Tasks
• Speech Synthesis
• Text Normalization
• Dialog Management
• Topic Segmentation
• Language Identification
• Speaker Identification and Verification
– Authorship and security
5
The traditional view
Training: Text Documents → Text Processing System
Application: Text Documents → Text Processing System (e.g., a Named Entity Recognizer)
6
The simplest approach
Training: Text Documents → Text Processing System
Application: Transcribed Documents → Text Processing System (e.g., a Named Entity Recognizer)
7
Speech is errorful text
Training: Transcribed Documents → Text Processing System
Application: Transcribed Documents → Text Processing System (e.g., a Named Entity Recognizer)
8
Speech signal can be used
Training: Transcribed Documents (with the underlying speech signal) → Text Processing System
Application: Transcribed Documents (with the underlying speech signal) → Text Processing System (e.g., a Named Entity Recognizer)
9
Hybrid speech signal and text
Training: Transcribed Documents + Text Documents → Text Processing System
Application: Transcribed Documents → Text Processing System (e.g., a Named Entity Recognizer)
10
Speech Recognition
• Standard HMM speech recognition:
– Front End
– Acoustic Model
– Pronunciation Model
– Language Model
– Decoding
11
Speech Recognition
Front End → Acoustic Feature Vectors → Acoustic Model → Phone Likelihoods → Pronunciation Model → Word Likelihoods → Language Model → Word Sequence
12
Speech Recognition
Front End: convert sounds into a sequence of observation vectors.
Acoustic Model: the probability of a set of observations given a phone label.
Pronunciation Model: the probability of a pronunciation given a word.
Language Model: the probability of a sequence of words.
13
Front End
• How do we convert a wave form into a
useful representation?
• We are looking for a vector of numbers
which describe the acoustic content
• Assuming 22 kHz, 16-bit sound, modeling the samples directly is not feasible.
14
Discrete Cosine Transform
• Every wave can be decomposed into
component sine or cosine waves.
• Fast Fourier
Transform is used
to do this efficiently
15
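To make this concrete, here is a minimal NumPy sketch (toy signal, illustrative frequencies) that recovers the two sinusoids a wave was built from:

```python
import numpy as np

sr = 1000                        # assumed sample rate (Hz)
t = np.arange(sr) / sr           # one second of samples
# Toy wave built from two known components: 50 Hz and 120 Hz.
wave = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(wave))          # magnitude spectrum
freqs = np.fft.rfftfreq(len(wave), d=1 / sr)  # frequency of each FFT bin
print(freqs[spectrum.argsort()[-2:]])         # [120. 50.]: the two components
```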
Overlapping frames
• Spectrograms allow for visual inspection of
spectral information.
• We are looking for a compact, numerical
representation
[Figure: successive analysis frames overlapping at a 10 ms hop]
16
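A sketch of the framing step (frame and hop sizes are assumptions; 25 ms windows every 10 ms is a common setting matching the 10 ms hop above):

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25, hop_ms=10):
    """Slice a waveform into overlapping analysis frames."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# e.g. one second at 16 kHz -> 98 frames of 400 samples each
frames = frame_signal(np.zeros(16000), sr=16000)
print(frames.shape)  # (98, 400)
```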
Single Frame of FFT
Australian male /i:/ from “heed”; FFT analysis window: 12.8 ms
http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html
17
Example Spectrogram
18
“Standard” Representation
• Mel Frequency Cepstral Coefficients (MFCC)
Pipeline: PreEmphasis → window → FFT → Mel-Filter Bank → log → FFT⁻¹ → 12 MFCC → Deltas (energy is computed from the windowed signal in parallel)
Per-frame feature vector (39 dimensions): 12 MFCC, 12 Δ MFCC, 12 ΔΔ MFCC, 1 energy, 1 Δ energy, 1 ΔΔ energy
19
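As an illustration, the 39-dimensional frame vectors can be assembled with librosa (its defaults differ in detail from the pipeline above; the file name is hypothetical):

```python
import numpy as np
import librosa  # assumption: using librosa's MFCC implementation

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)  # 12 cepstral coeffs
d1 = librosa.feature.delta(mfcc)                    # 12 delta coeffs
d2 = librosa.feature.delta(mfcc, order=2)           # 12 delta-delta coeffs
energy = librosa.feature.rms(y=y)                   # 1 energy term
de1 = librosa.feature.delta(energy)
de2 = librosa.feature.delta(energy, order=2)

# 12 + 12 + 12 + 1 + 1 + 1 = 39 dimensions per frame
features = np.vstack([mfcc, d1, d2, energy, de1, de2])
print(features.shape)  # (39, n_frames)
```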
Speech Recognition
Front End
Convert sounds into a
sequence of observation
vectors
Language Model
Calculate the probability of
a sequence of words
Acoustic Model
The probability of a set of
observations given a
phone label
Pronunciation Model
The probability of a
pronunciation given a word
20
Language Model
• What is the probability of a sequence of
words?
• Assume you have a vocabulary of V
words.
• How many possible sequences of N words
are there?
21
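For a sense of scale (numbers are illustrative, not from the slides): a V-word vocabulary admits V^N distinct N-word sequences, so even modest settings are astronomically large.

```python
V, N = 20_000, 10              # illustrative vocabulary size and length
print(f"{float(V) ** N:.1e}")  # 1.0e+43 possible 10-word sequences
```

No corpus can cover that, hence the simplification that follows.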
N-gram Language Modeling
• Simplify the calculation.
• Big simplifying assumption: Each word
is only dependent on the previous N-1
words.
22
N-gram Language Modeling
• Same question. Assume a V word
vocabulary, and an N word sequence.
How many “counts” are necessary?
23
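Under the assumption, an N-gram model needs at most V^N counts for a small N (V² for bigrams) rather than counts over whole sequences. A minimal unsmoothed bigram sketch (toy corpus; real systems add smoothing):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()  # toy corpus
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(w, prev):
    """P(w | prev) by maximum likelihood."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("cat", "the"))  # 2/3: "the" occurs 3 times, twice before "cat"
```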
General Language Modeling
• Any probability calculation can be used
here.
• Class-based language models.
• e.g. Recurrent neural networks
24
Speech Recognition
Front End: convert sounds into a sequence of observation vectors.
Acoustic Model: the probability of a set of observations given a phone label.
Pronunciation Model: the probability of a pronunciation given a word.
Language Model: the probability of a sequence of words.
25
Pronunciation Modeling
• Identify the likelihood of a phone sequence
given a word sequence.
• There are many simplifying assumptions in
pronunciation modeling.
1. The pronunciation of each word is independent of the previous and following words.
26
Dictionary as Pronunciation Model
• Assume each word has a single
pronunciation
I AY
CAT K AE T
THE DH AH
HAD H AE D
ABSURD AH B S ER D
YOU Y UW
27
Weighted Dictionary as Pronunciation
Model
• Allow multiple pronunciations and weight
each by their likelihood
I AY .4
I IH .6
THE DH AH .7
THE DH IY .3
YOU Y UH .5
YOU Y UW .5
28
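One way to code the weighted table (a sketch; phone symbols copied from the slide):

```python
import random

# Weighted pronunciation dictionary: word -> (phone sequence, weight) pairs.
PRON = {
    "I":   [(["AY"], 0.4), (["IH"], 0.6)],
    "THE": [(["DH", "AH"], 0.7), (["DH", "IY"], 0.3)],
    "YOU": [(["Y", "UH"], 0.5), (["Y", "UW"], 0.5)],
}

def sample_pronunciation(word):
    """Draw one pronunciation according to the weights."""
    prons, weights = zip(*PRON[word])
    return random.choices(prons, weights=weights, k=1)[0]

print(sample_pronunciation("THE"))  # ['DH', 'AH'] about 70% of the time
```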
Grapheme to Phoneme conversion
• What about words that you have never
seen before?
• What if you don’t think you’ve seen every
possible pronunciation?
• How do you pronounce: “McKayla”? or
“Zoomba”?
• Try to learn the phonetics of the language.
29
Letter to Sound Rules
• Manually written rules that are able to convert one
or more letters to one or more sounds.
• T -> /t/
• H -> /h/
• TH -> /dh/
• E -> /e/
• These rules can get complicated based on the
surrounding context.
– K is silent when word initial and followed by N.
30
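A toy implementation of ordered letter-to-sound rules, including the silent-K rule; the rule set and phone symbols are illustrative only:

```python
import re

# Hand-written rules, tried in order (most specific first).
RULES = [
    (re.compile(r"KN"), "N"),   # K is silent before N (word-initial in these toys)
    (re.compile(r"TH"), "DH"),
    (re.compile(r"T"),  "T"),
    (re.compile(r"H"),  "HH"),
    (re.compile(r"E"),  "EH"),
    (re.compile(r"K"),  "K"),
    (re.compile(r"N"),  "N"),
]

def letter_to_sound(word):
    word, phones = word.upper(), []
    while word:
        for pattern, phone in RULES:
            m = pattern.match(word)      # match at the current position
            if m:
                phones.append(phone)
                word = word[m.end():]    # consume the matched letters
                break
        else:
            word = word[1:]              # no rule: skip this letter
    return phones

print(letter_to_sound("KNEE"))  # ['N', 'EH', 'EH']: crude, but K is silent
```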
Automatic learning of Letter to Sound
rules
• First: Generate an alignment of letters and
sounds
Letters: T  E  X  _  T
Phones:  T  EH K  S  T
(The letter X aligns to two phones, K and S, so an empty letter slot is inserted.)
31
Automatic learning of Letter to Sound
rules
• Second: Try to learn the mapping
automatically.
• Generate “Features” from the letter
sequence
• Use these features to predict sounds
• Almost any machine learning technique
can be used.
– We’ll use decision trees as an example.
32
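A hedged sketch with scikit-learn: featurize the letter context around each “p” and fit a decision tree (training items mirror the slide’s examples; “sil” marks a silent letter):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def context(word, i):
    """Letter-context features around position i ('#' pads word edges)."""
    pad = lambda j: word[j] if 0 <= j < len(word) else "#"
    return {"L2": pad(i - 2), "L1": pad(i - 1),
            "R1": pad(i + 1), "R2": pad(i + 2)}

# (word, index of "p", sound) triples mirroring the slide.
examples = [("loophole", 3, "P"), ("physics", 0, "F"), ("telephone", 4, "F"),
            ("graph", 3, "F"), ("photo", 0, "F"), ("peanut", 0, "P"),
            ("pay", 0, "P"), ("psycho", 0, "sil"),
            ("pterodactyl", 0, "sil"), ("pneumonia", 0, "sil")]

X = [context(w, i) for w, i, _ in examples]
y = [label for _, _, label in examples]

model = make_pipeline(DictVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(X, y)
print(model.predict([context("paris", 0)]))  # expected ['P'], as for pay
```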
Decision Trees example
• Context: L1, L2, p, R1, R2
A decision tree predicting the sound of the letter “p” (P = /p/, F = /f/, ø = silent):
R1 = “h”?
  Yes: L1 = “o”?
    Yes → P (loophole)
    No → F (physics, telephone, graph, photo)
  No: R1 = a consonant?
    Yes → ø (psycho, pterodactyl, pneumonia; “apple”, whose doubled p yields one /p/ and one silent letter, appears here too)
    No → P (peanut, pay)
33
Decision Trees example
• Context: L1, L2, p, R1, R2. Try “PARIS”.
(Same tree as above.) The p in PARIS has R1 = “a”: R1 ≠ “h” and R1 is not a consonant, so the tree predicts P (/p/), as for peanut and pay.
34
Decision Trees example
• Context: L1, L2, p, R1, R2. Now try “GOPHER”.
(Same tree as above.) The p in GOPHER has R1 = “h” and L1 = “o”, so the tree follows the loophole branch and predicts P (/p/), even though the correct sound is /f/: learned context rules can fail on unseen words.
35
Speech Recognition
Front End: convert sounds into a sequence of observation vectors.
Acoustic Model: the probability of a set of observations given a phone label.
Pronunciation Model: the probability of a pronunciation given a word.
Language Model: the probability of a sequence of words.
36
Acoustic Modeling
• Hidden Markov Model.
– Used to model the relationship between two
sequences.
37
Hidden Markov model
States q1 → q2 → q3 (hidden); each state qt emits an observation xt.
• In a Hidden Markov Model the state
sequence is unobserved.
• Only an observation sequence is available
38
Hidden Markov model
States q1 → q2 → q3 (hidden); each state qt emits an observation xt.
• Observations are MFCC vectors
• States are phone labels
• Each state (phone) has an associated GMM
modeling the MFCC likelihood
39
Training acoustic models
• TIMIT
– close, manual phonetic transcription
– 2342 sentences
• Extract MFCC vectors from each frame within
each phone
• For each phone, train a GMM using
Expectation Maximization.
• These GMMs form the acoustic model.
– It is common to use 8 or 16 Gaussian mixture components.
40
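A sketch of this recipe with scikit-learn, whose GaussianMixture is fit by EM; `frames_by_phone` is an assumed mapping from phone labels to arrays of 39-dimensional MFCC frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_acoustic_model(frames_by_phone, n_components=8):
    """One GMM per phone, fit by EM on that phone's MFCC frames.

    frames_by_phone: assumed dict of phone label -> (n_frames, 39) array.
    """
    models = {}
    for phone, frames in frames_by_phone.items():
        models[phone] = GaussianMixture(n_components=n_components).fit(frames)
    return models

def phone_log_likelihoods(models, frame):
    """Log-likelihood of one 39-dim observation under each phone's GMM."""
    return {p: g.score_samples(frame.reshape(1, -1))[0]
            for p, g in models.items()}
```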
Gaussian Mixture Model
41
HMM Topology for Training
• Rather than having one GMM per phone, it is common for acoustic models to represent each phone as a sequence of sub-phone states, each with its own GMM: here /r/ as a left-to-right chain S1 → S2 → S3 → S4 → S5.
42
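The left-to-right topology amounts to a transition matrix where each state can only stay put or move forward; a small sketch with placeholder probabilities:

```python
import numpy as np

n_states = 5                       # S1..S5, as on the slide
A = np.zeros((n_states, n_states))
for s in range(n_states):
    A[s, s] = 0.6                  # self-loop: stay in the current state
    if s + 1 < n_states:
        A[s, s + 1] = 0.4          # advance to the next state
A[-1, -1] = 1.0                    # final state: nowhere left to go
print(A)                           # each row sums to 1: a valid topology
```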
Speech in Natural Language Processing
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS
BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE
AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE
MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF
TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH
WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND
THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY
43
Speech in Natural Language Processing
Also, from the North Station...
(I think the Orange Line runs by there too so you can also catch the
Orange Line... )
And then instead of transferring
(um I- you know, the map is really obvious about this but)
Instead of transferring at Park Street, you can transfer at (uh what’s the
station name) Downtown Crossing and (um) that’ll get you back to the
Red Line just as easily.
44
Spoken Language Processing
Speech Recognition → NLP system (IR, IE, QA, Summarization, Topic Modeling)
45
Spoken Language Processing
ALSO FROM NORTH STATION I THINK THE ORANGE
LINE RUNS BY THERE TOO SO YOU CAN ALSO
CATCH THE ORANGE LINE AND THEN INSTEAD OF
TRANSFERRING UM I YOU KNOW THE MAP IS
REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF
TRANSFERRING AT PARK STREET YOU CAN
TRANSFER AT UH WHAT’S THE STATION NAME
DOWNTOWN CROSSING UM AND THAT’LL GET YOU
BACK TO THE RED LINE JUST AS EASILY
The ASR output above feeds the NLP system (IR, IE, QA, Summarization, Topic Modeling).
46
Dealing with Speech Errors
ALSO FROM NORTH STATION I THINK THE ORANGE
LINE RUNS BY THERE TOO SO YOU CAN ALSO
CATCH THE ORANGE LINE AND THEN INSTEAD OF
TRANSFERRING UM I YOU KNOW THE MAP IS
REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF
TRANSFERRING AT PARK STREET YOU CAN
TRANSFER AT UH WHAT’S THE STATION NAME
DOWNTOWN CROSSING UM AND THAT’LL GET YOU
BACK TO THE RED LINE JUST AS EASILY
The errorful ASR output above demands a robust NLP system (IR, IE, QA, Summarization, Topic Modeling).
47
Automatic Speech Recognition
Assumption
ASR produces a “transcript” of Speech.
ALSO FROM NORTH STATION I THINK THE
ORANGE LINE RUNS BY THERE TOO SO
YOU CAN ALSO CATCH THE ORANGE LINE
AND THEN INSTEAD OF TRANSFERRING
UM I YOU KNOW THE MAP IS REALLY
OBVIOUS ABOUT THIS BUT INSTEAD OF
TRANSFERRING AT PARK STREET YOU
CAN TRANSFER AT UH WHAT’S THE
STATION NAME DOWNTOWN CROSSING
UM AND THAT’LL GET YOU BACK TO THE
RED LINE JUST AS EASILY
48
Automatic Speech Recognition
Assumption
ASR produces a “transcript” of Speech.
Also, from the North Station...
(I think the Orange Line runs by there too so you can
also catch the Orange Line... )
And then instead of transferring
(um I- you know, the map is really obvious about this
but)
Instead of transferring at Park Street, you can
transfer at (uh what’s the station name) Downtown
Crossing and (um) that’ll get you back to the Red
Line just as easily.
“Rich Transcription”
49
Speech as Noisy Text
Speech Recognition (decrease WER) → Robust NLP system (increase robustness: IR, IE, QA, Summarization, Topic Modeling)
50
Other directions for improvement
Prosodic analysis; pass lattices or N-best lists from Speech Recognition to the robust NLP system, rather than a single best transcript.
51
Prosody
• Variation in production properties that leads to changes in intended interpretation:
• Pitch
• Intensity
• Duration, Rhythm, Speaking Rate
• Spectral Emphasis
• Pausing
52
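These properties can be estimated from audio; a sketch using librosa (the file name and F0 range are assumptions):

```python
import numpy as np
import librosa  # assumption: librosa for pitch/intensity extraction

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file

# Pitch track (F0) via probabilistic YIN; range is a rough speech setting.
f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)

# Intensity proxy: root-mean-square energy per frame.
rms = librosa.feature.rms(y=y)[0]

# Simple aggregate prosodic features over the utterance.
print("mean F0:", np.nanmean(f0))              # NaN where unvoiced
print("F0 range:", np.nanmax(f0) - np.nanmin(f0))
print("mean intensity:", rms.mean())
```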
Tasks that can use prosody
• Part of Speech Tagging [Eidelman et al. 2010]
• Parsing [Huang et al. 2010]
• Language Modeling [Su & Jelinek, 2008]
• Pronunciation Modeling [Rosenberg 2012]
• Acoustic Modeling [Chen et al. 2006]
• Emotion Recognition [Lee et al. 2009]
• Topic Segmentation [Rosenberg & Hirschberg, 2006; Rosenberg et al. 2007]
• Speaker Identification/Verification [Leung et al. 2008]
53
Symbolic vs. Direct Modeling
Symbolic: Acoustic Features → Prosodic Analysis → Task-Specific Classifier
Direct: Acoustic Features → Task-Specific Classifier
• Symbolic Modeling
– Modular
– Linguistically Meaningful
– Perceptually Salient
– Dimensionality Reduction
• Direct Modeling
– Appropriate to the Task
– Lower information loss
– General
54
ToBI (Tones and Break Indices)
• Based on Pierrehumbert’s “intonational
phonology”
Silverman et al. 1992
• Prosody is described by high (H) and low (L)
tones that are associated with prosodic events
(pitch accents, phrase accents, and boundary
tones) and break indices which describe the
degree of disjuncture between words.
– ToBI is inherently categorical in its description of
prosody
• ToBI variants exist for at least American English,
German, Japanese, Korean, Portuguese, Greek,
Catalan
55
ToBI Accenting
• Words are labeled as containing a pitch accent or not.
• There are five possible pitch accent types (in SAE): H*, L*, L*+H, L+H*, H+!H*
• High tones can be produced in a compressed pitch range – catathesis, or “downstepping”.
56
ToBI Phrasing
• ToBI describes phrasing as a hierarchy of two
levels.
– Intermediate phrases contain one or more words.
– Intonational phrases contain one or more
intermediate phrases.
• Word boundaries are marked with a degree
of disjuncture, or break index
– Break indices range from 0 to 4
– 3 marks an intermediate phrase boundary
– 4 marks an intonational phrase boundary
Interspeech 2011 Tutorial M1 - More
Than Words Can Say
57
ToBI Phrase Ending Types
• Intermediate Phrase boundaries have associated
Phrase Accents describing the pitch movement
from the last accent to the phrase boundary
– Phrase Accents: H-, !H- or L-
• Intonational phrase boundaries have Boundary
Tones describing the pitch movement immediately
before the boundary
– Boundary Tones: H% or L%
Common phrase-final combinations: L-L%, L-H%, H-H%, H-L%, !H-L%
58
ToBI Example (in Praat)
59
The Standard Corpus-Based Approach
• Identify labeled training data
• Decide what to label – syllables or words
• Extract aggregate acoustic features based on
the labeling region
• Train a supervised classifier
• Evaluate using cross-validation or a held-out test
set.
60
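The whole recipe fits in a few lines; a sketch assuming a word-level feature matrix X and pitch-accent labels y already exist:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X: assumed (n_words, n_features) array of aggregate acoustic features
# per word; y: assumed 0/1 labels marking pitch-accented words.
def evaluate(X, y):
    clf = RandomForestClassifier()             # any supervised classifier works
    scores = cross_val_score(clf, X, y, cv=10) # 10-fold cross-validation
    return scores.mean(), scores.std()
```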
The Standard Corpus-Based Approach
• Identify labeled training data
– Can we use unlabeled data?
• Decide what to label – syllables or words
• Extract aggregate acoustic features based on
the labeling region
• Train a supervised model
• Evaluate using cross-validation or a held-out test
set.
61
The Standard Corpus-Based Approach
• Identify labeled training data
• Decide what to label – syllables or words
– Are these the only options? [Context and Region of analysis]
• Extract aggregate acoustic features based on
the labeling region
• Train a supervised model
• Evaluate using cross-validation or a held-out test
set.
62
The Standard Corpus-Based Approach
• Identify labeled training data
• Decide what to label – syllables or words
• Extract aggregate acoustic features based on
the labeling region
– There are always new features to explore [Shape
Modeling]
• Train a supervised model
• Evaluate using cross-validation or a held-out test
set.
63
The Standard Corpus-Based Approach
• Identify labeled training data
• Decide what to label – syllables or words
• Extract aggregate acoustic features based on
the labeling region
• Train a supervised model
– Unsupervised and Semi-supervised approaches
– Structured ensembles of classifiers
• Evaluate using cross-validation or a held-out test
set.
64
The Standard Corpus-Based Approach
• Identify labeled training data
• Decide what to label – syllables or words
• Extract aggregate acoustic features based on
the labeling region
• Train a supervised model
• Evaluate using cross-validation or a held-out test
set.
– Is this a reasonable approximation of generalization
performance?
65
Processing Speech
• Processing speech is difficult
– There are errors in transcripts.
– It is not grammatical
– The style (genre) of speech is different from
the available (text) training data.
• Processing speech is easy
– Speaker information
– Intention (sarcasm, certainty, emotion, etc.)
– Segmentation
66
Questions & Comments
• What topic was clearest?
– murkiest?
• What was the most interesting?
– least interesting?
• andrew@cs.qc.cuny.edu
• http://speech.cs.qc.cuny.edu
• http://eniac.cs.qc.cuny.edu/andrew
67