This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
CS 479, section 1:
Natural Language Processing
Lecture #16: Speech Recognition
Overview (cont.)
Thanks to Alex Acero (Microsoft Research), Jeff Adams (Nuance), Simon Arnfield
(Sheffield), Dan Klein (UC Berkeley), Mazin Rahim (AT&T Research) for many of the
materials used in this lecture.
Announcements
 Reading Report #6 on Young’s Overview
 Due: now
 Reading Report #7 on M&S 7
 Due: Friday
 Review Questions
 Typed list of 5 questions for Mid-term exam
review
 Due next Wednesday
Objectives
 Continue our overview of an approach to
speech recognition, picking up at acoustic
modeling
 See other examples of the source / channel
(noisy channel) paradigm for modeling
interesting processes
 Apply language models
Recall: Front End
[Figure: source/channel view of recognition: a source emits text w with prior P(w); the noisy channel turns it into speech x = x1 x2 ... xn; the front end (FE) extracts features y = y1 y2 ... yq; and the recognizer (ASR) hypothesizes text w* = w1 w2 ... wm using the channel model P(x | w).]
 We want to predict a sentence w* given a feature vector y = FE(x):
w* = argmax_w P(w) P(y | w)
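As a toy illustration of this decision rule (not of a real decoder), the Python sketch below scores two hypothetical candidate sentences with made-up language-model and acoustic probabilities and picks the argmax:

    # Toy sketch of w* = argmax_w P(w) P(y | w); all probabilities are invented.
    candidates = {
        "recognize speech":   {"lm": 1e-6, "acoustic": 1e-4},
        "wreck a nice beach": {"lm": 1e-9, "acoustic": 2e-4},
    }

    def score(probs):
        # Combine the language-model prior P(w) with the acoustic likelihood P(y | w).
        return probs["lm"] * probs["acoustic"]

    w_star = max(candidates, key=lambda w: score(candidates[w]))
    print(w_star)  # "recognize speech"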
Acoustic Modeling
 Goal:
 Map acoustic feature vectors into distinct linguistic units
 Such as phones, syllables, words, etc.
[Figure: recognizer block diagram: Feature Extraction feeds the Decoder/Search, which draws on the Acoustic Model, Word Lexicon, and Language Model.]
w* = argmax_w P(w) P(y | w)
Acoustic Trajectories
[Figure: acoustic trajectories: feature vectors plotted in a 2-D projection of feature space, with regions labeled by phone (e.g., AA, AE, AY, EE, EH, IH, OH, OW, UH, s, sh, t, k, m, n, ...).]
Acoustic Models:
Neighborhoods are not Points
 How do we describe what points in our “feature space” are
likely to come from a given phoneme?
 It’s clearly more complicated than just identifying a single
point.
 Also, the boundaries are not
“clean”.
 Use the normal distribution:
 Points are likely to lie near
the center.
 We describe the
distribution with the mean
& variance.
 Easy to compute with.
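A minimal Python sketch of this idea, using randomly generated stand-in feature vectors: estimate the mean and per-dimension variance of a phoneme's frames, then score how well new frames fit the resulting (diagonal) Gaussian:

    import numpy as np

    # Stand-in for feature vectors (e.g., MFCC frames) labeled as one phoneme;
    # the data here is random and purely illustrative.
    rng = np.random.default_rng(0)
    frames = rng.normal(loc=[5.0, -2.0], scale=[1.0, 0.5], size=(200, 2))

    mean = frames.mean(axis=0)   # center of the neighborhood
    var = frames.var(axis=0)     # per-dimension (diagonal) variances

    def log_density(x, mean, var):
        # Log of a diagonal-covariance Gaussian density at point x.
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    print(log_density(np.array([5.2, -1.9]), mean, var))  # near the center: high
    print(log_density(np.array([0.0, 3.0]), mean, var))   # far away: much lower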
Acoustic Models:
Neighborhoods are not Points (2)
 Normal distributions in M dimensions are analogous
 A.k.a. “Gaussians”
 Specify the mean point in M dimensions
 Like an M-dimensional “hill” centered around the mean point
 Specify the variances
(as Co-variance matrix)
 Diagonal gives the “widths”
of the distribution
in each direction
 Off-diagonal values describe the
“orientation”
 “Full covariance”
 possibly “tilted”
 “Diagonal covariance”
 not “tilted”
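A sketch contrasting the two, with invented correlated 2-D points standing in for features: the diagonal covariance keeps only the widths, while the full covariance also captures the tilt and therefore fits this data better:

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(1)
    tilt = np.array([[1.0, 0.8], [0.0, 0.6]])   # mixing matrix that correlates the dims
    points = rng.standard_normal((500, 2)) @ tilt.T

    mean = points.mean(axis=0)
    full_cov = np.cov(points, rowvar=False)      # off-diagonals capture the "tilt"
    diag_cov = np.diag(np.diag(full_cov))        # keep only the widths

    full_ll = multivariate_normal(mean, full_cov).logpdf(points).sum()
    diag_ll = multivariate_normal(mean, diag_cov).logpdf(points).sum()
    print(full_ll > diag_ll)  # True: the full covariance fits the tilted cloud better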
AMs: Gaussians don’t really cut it
 Consider the “AY” frames
in our example. How can
we describe these with an
(elliptical) Gaussian?
 A single (diagonal)
Gaussian is too big to be
helpful.
 Full-covariance Gaussians
are hard to train.
 We often use multiple
Gaussians (a.k.a. Gaussian
mixture models)
(1 dimensional) Gaussian Mixture Models
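A tiny numeric sketch of such a density, p(x) = sum_k w_k N(x; mu_k, var_k), with invented weights, means, and variances:

    import numpy as np

    weights = np.array([0.4, 0.6])    # mixture weights (sum to 1)
    means   = np.array([-1.0, 2.0])   # component means
    vars_   = np.array([0.5, 1.5])    # component variances

    def gmm_density(x):
        # Weighted sum of the component Gaussian densities at scalar x.
        comps = weights * np.exp(-0.5 * (x - means) ** 2 / vars_) / np.sqrt(2 * np.pi * vars_)
        return comps.sum()

    for x in (-1.0, 0.5, 2.0):
        print(x, gmm_density(x))   # highest near the component means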
AMs: Phonemes are a path, not a
destination
 Phonemes, like stories, have beginnings, middles, and ends.
 This might be clear if you think of how the “AY” sound moves from a sort of “EH” to an “EE”.
 Even non-diphthongs show these properties.
 We often represent a phoneme with multiple “states”.
 E.g., in our AY model, we might have 4 states.
 And each of these states is modeled by a mixture of Gaussians.
[Figure: the /AY/ trajectory divided into four regions, STATE 1 through STATE 4.]
AMs: Whence & Whither
 It matters where you come from (whence)
and where you are going (whither).
 Phonetic contextual effects
 A way to model this is to use triphones
 I.e. Depend on the previous & following phonemes
 E.g. Our “AY” model should really be a silence-AY-S model
(… or pentaphones: use 2 phonemes before & after)
 So what we really need for our “AY” model is a:
 Mixture of Gaussians
 For each of multiple states
 For each possible set of predecessor & successor
phonemes
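A small sketch of expanding a phone string into such context-dependent (triphone) labels, using the left-center+right naming that appears on the following slides; the "sil" padding and the helper function are illustrative assumptions:

    def to_triphones(phones):
        # Pad with silence and name each phone by its left and right neighbors.
        padded = ["sil"] + phones + ["sil"]
        return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
                for i in range(1, len(padded) - 1)]

    # "ice" ~ AY S  ->  ['sil-AY+S', 'AY-S+sil']
    print(to_triphones(["AY", "S"]))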
Hidden Markov Model (HMM)
 Captures:
 Transitions between hidden states
 Feature emissions as mixtures of Gaussians
 Spectral properties modeled by
a parametric random process
 i.e., a directed graphical model!
 Advantages:
 Powerful statistical method for a wide range of data and conditions
 Highly reliable for recognizing speech
 A collection of HMMs for each:
 sub-word unit type
 extraneous event: cough, um, sneeze, …
 More on HMMs coming up in the course after classification!
Anatomy of an HMM
 HMM for /AY/ in context of preceding silence, followed by /S/
[Figure: a three-state, left-to-right HMM with states sil-AY+S[1], sil-AY+S[2], and sil-AY+S[3]; as labeled, the self-loop probabilities are 0.2, 0.3, 0.2 and the forward-transition probabilities are 0.8, 0.7, 0.8.]
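One plausible reading of this figure as a transition matrix (self-loops 0.2 / 0.3 / 0.2, forward transitions 0.8 / 0.7 / 0.8), plus a sampled state path; the explicit exit state and the sampling code are illustrative additions:

    import numpy as np

    states = ["sil-AY+S[1]", "sil-AY+S[2]", "sil-AY+S[3]", "exit"]
    A = np.array([
        [0.2, 0.8, 0.0, 0.0],   # state 1: stay 0.2, advance 0.8
        [0.0, 0.3, 0.7, 0.0],   # state 2: stay 0.3, advance 0.7
        [0.0, 0.0, 0.2, 0.8],   # state 3: stay 0.2, leave the phone 0.8
        [0.0, 0.0, 0.0, 1.0],   # exit (absorbing)
    ])

    # Sample one state path through the model.
    rng = np.random.default_rng(2)
    path, s = [], 0
    while s < 3:
        path.append(states[s])
        s = rng.choice(4, p=A[s])
    print(path)   # e.g. ['sil-AY+S[1]', 'sil-AY+S[2]', 'sil-AY+S[2]', 'sil-AY+S[3]']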
HMMs as Phone Models
[Figure: the same sil-AY+S model drawn as a phone model: the three states chained left to right with the self-loop and forward-transition probabilities above.]
Words and Phones
w* = argmax_w P(w) P(y | w1, w2, ..., wn)
   = argmax_w P(w) P(y | p1,1 p1,2 p1,3, p2,1 p2,2 ... p2,5, ..., pn,1 pn,2 ... pn,4)
How do we know how to segment words into phones?
Word Lexicon
Goal:
• Map sub-word units into words
• Usual sub-word units are phone(me)s
[Figure: recognizer block diagram with the Word Lexicon highlighted between the Acoustic Model and the Language Model, feeding the Decoder/Search.]
Lexicon: (CMUDict, ARPABET)
Phoneme   Example   Translation
AA        odd       AA D
AE        at        AE T
AH        hut       HH AH T
AO        ought     AO T
AW        cow       K AW
AY        hide      HH AY D
B         be        B IY
CH        cheese    CH IY Z
…
Properties:
• Simple
• Typically knowledge-engineered (not learned – shock!)
[Figure: recognizer block diagram with the Word Lexicon highlighted.]
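A sketch of using such a lexicon in Python, with a few entries taken from the table above; the dictionary literal and the "<unk>" fallback are illustrative stand-ins for loading the real CMUdict file:

    lexicon = {
        "ODD":    ["AA", "D"],
        "HIDE":   ["HH", "AY", "D"],
        "CHEESE": ["CH", "IY", "Z"],
    }

    def pronounce(sentence, lexicon):
        # Map each word to its phone sequence; a real system would fall back to
        # grapheme-to-phoneme guessing for out-of-vocabulary words.
        return [lexicon.get(w.upper(), ["<unk>"]) for w in sentence.split()]

    print(pronounce("hide cheese", lexicon))
    # [['HH', 'AY', 'D'], ['CH', 'IY', 'Z']]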
Decoder
[Figure: the same source/channel diagram: the source (P(w)) emits text w; the noisy channel produces speech x = x1 x2 ... xn; FE extracts features y = y1 y2 ... yq; ASR outputs text w* = w1 w2 ... wm using P(x | w).]
 Predict a sentence w* given a feature vector y = FE(x):
w* = argmax_w P(w) P(y | w)
Decoding as State-Space Search
[Figure: recognizer block diagram: Feature Extraction feeds Pattern Classification (the decoder/search), which uses the Acoustic Model, Word Lexicon, and Language Model.]
Decoding as Search
 Viterbi – Dynamic Programming
 Multi-pass
 A* (“stack decoding”)
 N-best
 …
Viterbi: DP
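A compact sketch of the Viterbi dynamic program over log scores; the toy transition and emission values are invented, and a real decoder searches a much larger graph compiled from the HMMs, lexicon, and language model:

    import numpy as np

    def viterbi(log_A, log_B, log_pi):
        # log_A: [S, S] transition scores, log_B: [T, S] per-frame emission scores,
        # log_pi: [S] initial scores. Returns the highest-scoring state path.
        T, S = log_B.shape
        delta = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        delta[0] = log_pi + log_B[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A    # score of each predecessor
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_B[t]
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]

    log_A = np.log(np.array([[0.7, 0.3], [0.2, 0.8]]))
    log_B = np.log(np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]))
    log_pi = np.log(np.array([0.5, 0.5]))
    print(viterbi(log_A, log_B, log_pi))   # [0, 0, 1]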
Noisy Channel Applications
 Speech recognition (dictation, commands, etc.)
 text → neurons, acoustic signal, transmission → acoustic waveforms → text
 OCR
 text → print, smudge, scan → image → text
 Handwriting recognition
 text → neurons, muscles, ink, smudge, scan → image → text
 Spelling correction
 text → your spelling → mis-spelled text → text
 Machine Translation (?)
 text in target language → translation in head → text in source language → text in target language
Noisy-Channel Models
 OCR
P(text | strokes) ∝ P(text) P(strokes | text)
 Handwriting recognition
P(text | pixels) ∝ P(text) P(pixels | text)
 Spelling Correction
P(text | mis-spelled text) ∝ P(text) P(mis-spelled text | text)
 Translation?
P(english | french) ∝ P(english) P(french | english)
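As a toy illustration, the spelling-correction case can be scored the same way; the candidate list and all probabilities below are made up:

    observed = "teh"
    candidates = {
        # candidate: (P(text) from a language model, P("teh" | candidate) from a channel model)
        "the": (0.05, 0.02),
        "tea": (0.001, 0.01),
        "teh": (1e-7, 0.90),
    }
    best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
    print(best)   # "the"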
What’s Next
 Upcoming lectures:
 Classification / categorization
 Naïve-Bayes models
 Class-conditional language models
Extra
Milestones in Speech Recognition
[Figure: timeline of speech-recognition milestones, with years marked 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, and 2003:
• Small vocabulary, acoustic-phonetics-based (isolated words): filter-bank analysis, time normalization, dynamic programming
• Medium vocabulary, template-based (isolated words, connected digits, continuous speech): pattern recognition, LPC analysis, clustering algorithms, level building
• Large vocabulary, statistical-based (connected words, continuous speech): hidden Markov models, stochastic language modeling
• Large vocabulary; syntax, semantics (continuous speech, speech understanding): stochastic language understanding, finite-state machines, statistical learning
• Very large vocabulary; semantics, multimodal / spoken dialog, multiple modalities: concatenative synthesis, machine learning, mixed-initiative dialog]
Dragon Dictate Progress
 WERR* from Dragon NaturallySpeaking version 7 to version 8 to version 9:
DOMAIN        v7→v8   v8→v9
US English:   27%     23%
UK English:   21%     10%
German:       16%     10%
French:       24%     14%
Dutch:        27%     18%
Italian:      22%     14%
Spanish:      26%     17%
* WERR means relative word error rate reduction on an in-house evaluation set.
Results from Jeff Adams, ca. 2006
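As an illustrative bit of arithmetic (numbers invented), a 27% relative reduction means, for example, a baseline WER of 10.0% dropping to 7.3%:

    wer_old, wer_new = 10.0, 7.3          # hypothetical absolute WERs, in percent
    werr = (wer_old - wer_new) / wer_old  # relative word error rate reduction
    print(f"{werr:.0%}")                  # 27%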
Crazy Speech Marketplace
[Figure: consolidation of the speech-technology marketplace from ca. 1980 to ca. 2004, showing companies such as Philips, Inso, IBM, Articulate, MedRemote, Kurzweil, ScanSoft, L&H, Dragon, Dictaphone, Speechworks, Voice Signal, Tegic, and Nuance merging over time.]
Speech vs. text:
tokens vs. characters
 Speech recognition recognizes a sequence of “tokens”
taken from a discrete & finite set, called the lexicon.
 Informally, tokens correspond to words, but the
correspondence is inexact. In dictation applications, where
we have to worry about converting between speech & text,
we need to sort out a “token philosophy”:
 Do we recognize “forty-two” or “forty two” or “42” or “40 2”?
 Do we recognize “millimeters” or “mms” or “mm”?
 What about common words which can also be names, e.g.
“Brown” and “brown”?
 What about capitalized phrases like “Nuance Communications”
or “The White House” or “Main Street”?
 What multi-word tokens should be in the lexicon, like “of_the”?
 What do we do with complex morphologies or compounding?
Converting between tokens
& text
[Figure: the "token philosophy" sits between TEXT and TOKENS: TOKENIZATION maps text to tokens, and inverse text normalization (ITN) maps tokens back to text, with the LEXICON defining the token set.]
TEXT: Profits rose to $28 million. See fig. 1a on p. 124.
TOKENS: profits rose to twenty eight million dollars .\period see figure one a\a on page one twenty four .\period
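A toy sketch of the ITN direction, rewriting a few spoken tokens into display text; real systems use large rule sets or weighted finite-state transducers, and the specific string rules here are placeholders:

    import re

    def itn(tokens):
        # Join tokens, then apply a couple of toy rewrite rules.
        text = " ".join(tokens)
        text = text.replace("twenty eight million dollars", "$28 million")
        text = text.replace(".\\period", ".")
        text = re.sub(r"\s+([.,])", r"\1", text)   # attach punctuation to the previous word
        return text

    print(itn(["profits", "rose", "to", "twenty", "eight", "million",
               "dollars", ".\\period"]))
    # profits rose to $28 million.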
Three examples (Tokenization)
TEXT
 P.J. O’Rourke said, "Giving money and power to government is like giving whiskey and car keys to teenage boys."
 The 18-speed I bought sold on www.eBay.com for $611.00, including 8.5% sales tax.
 From 1832 until August 15, 1838 they lived at No. 235 Main Street, "opposite the Academy," and from there they could see it all.
TOKENS
 PJ O'Rourke said ,\comma
"\open-quotes giving money and power
to government is like giving whiskey
and car keys to teenage boys
.\period "\close-quotes
 the eighteen speed I bought sold on
www.\WWW_dot eBay .com\dot_com for
six hundred and eleven dollars zero
cents ,\comma including eight
.\point five percent sales tax
.\period
 from one eight three two until the
fifteenth of August eighteen thirty
eight they lived at number two
thirty five Main_Street ,\comma
"\open-quotes opposite the Academy
,\comma "\close-quotes and from
there they could see it all .\period
Missing from speech: punctuation
 When people speak they don’t explicitly indicate
phrase and section boundaries instead listeners rely on
prosody and syntax to know where these boundaries
belong in dictation applications we normally rely on
speakers to speak punctuation explicitly how can we
remove that requirement
 When people speak, they don’t explicitly indicate
phrase and section boundaries.
 Instead, listeners rely on prosody and syntax to know
where these boundaries belong.
 In dictation applications, we normally rely on speakers to
speak punctuation explicitly.
 How can we remove that requirement?
Punctuation Guessing Example
 Punctuation Guessing
 As currently shipping in Dragon
 Targeted towards free, unpunctuated speech
My personal experience with camping has been rather
limited. Having lived overseas in a very urban
situation in which camping in the wilderness is not
really possible. My only chances at camping came
when I returned to the United States. My most
memory, I had two most memorable camping trips both
with my father. My first one was when I was a
preteen, and we went hiking on Bigalow mountain in
Maine, central western Maine. We went hiking for a
day took a trail that leads off of the Appalachian
Trail and goes down to the village of Stratton in
the township of Eustis, just north and west of
Sugarloaf U.S.A., the ski area.