Qualifying Examination (Quals)

Inducing Structure for Perception
a.k.a. Slav’s Split & Merge Hammer
Slav Petrov
Advisors: Dan Klein, Jitendra Malik
Collaborators: L. Barrett, R. Thibaux, A. Faria, A. Pauls, P. Liang, A. Berg
The Main Idea
[Diagram: a complex underlying process produces an observation ("He was right."), which is modeled with a manually specified structure.]

The Main Idea
[Diagram: starting from the manually specified structure, EM yields an automatically refined structure for the same observation.]
Why Structure?
[Figure: the same words scrambled into letter salad are unreadable without structure.]
Structure is important:
The dog and the cat ate the food.
The cat ate the food and the dog.
The dog ate the cat and the food.
Syntactic Ambiguity
Last night I shot an elephant in my pajamas.
Visual Ambiguity
Old or young?
Three Peaks?
[Diagram: Machine Learning, Natural Language Processing, and Computer Vision drawn as three separate peaks.]
No, One Mountain!
[Diagram: the same three fields as a single mountain with a shared base.]
Three Domains
Syntax
Scenes
Speech
Timeline
[Timeline, Now → '07 → '08 → '09:
 Syntax: Bayesian learning, inference, conditional learning
 Speech: learning, synthesis, decoding
 Scenes: learning, inference (TrecVid)
 Summer '07: syntactic MT at ISI]
Syntax
• Split & Merge Learning
• Nonparametric Bayesian Learning
• Coarse-to-Fine Inference
• Generative vs. Conditional Learning
• Syntactic Machine Translation
• Language Modeling
Speech
Scenes
Learning Accurate, Compact, and Interpretable Tree Annotation
Slav Petrov, Leon Barrett, Romain Thibaux, Dan Klein
Motivation (Syntax)
• Task: recover the parse tree of a sentence, e.g. "He was right."
• Why?
  • Information Extraction
  • Syntactic Machine Translation
Treebank Parsing
Treebank → Grammar:
S → NP VP .    1.0
NP → PRP       0.5
NP → DT NN     0.5
…
PRP → She      1.0
DT → the       1.0
…
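
To make the treebank-to-grammar step concrete, here is a minimal sketch in plain Python (trees as nested tuples; this is illustrative, not the Berkeley Parser's code). With a single tree the maximum-likelihood probabilities are trivially 1.0; over a full treebank the same counts yield fractional probabilities like the 0.5s above.

```python
from collections import Counter, defaultdict

# One treebank tree as nested tuples: (label, children...); leaves are strings.
tree = ("S",
        ("NP", ("PRP", "She")),
        ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))),
        (".", "."))

def count_rules(node, counts):
    """Recursively count the CFG productions used in a tree."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        counts[(label, (children[0],))] += 1   # preterminal rule, e.g. PRP -> She
        return
    counts[(label, tuple(c[0] for c in children))] += 1  # e.g. S -> NP VP .
    for child in children:
        count_rules(child, counts)

counts = Counter()
count_rules(tree, counts)

# Maximum-likelihood estimate: P(A -> rhs) = count(A -> rhs) / count(A).
lhs_totals = defaultdict(int)
for (lhs, _), c in counts.items():
    lhs_totals[lhs] += c
for (lhs, rhs), c in sorted(counts.items()):
    print(f"{lhs} -> {' '.join(rhs)}   {c / lhs_totals[lhs]:.2f}")
```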
Non-Independence
Independence assumptions are often too strong.
[Bar charts of NP expansions in three contexts (All NPs, NPs under S, NPs under VP) over the rules NP → NP PP, NP → DT NN, NP → PRP. The distributions differ sharply: NP → PRP is common for NPs under S (subjects, 21%) but rare under VP (4%), while NP → NP PP dominates under VP (23%) compared with NPs overall (11%).]
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar
  • Parent annotation [Johnson ’98]
  • Head lexicalization [Collins ’99, Charniak ’00]
  • Automatic clustering?
Learning Latent Annotations
EM algorithm:
• Brackets are known
• Base categories are known
• Only induce subcategories
[Tree figure: latent subcategories X1…X7 over "He was right ."; forward and backward passes over the tree, just like Forward-Backward for HMMs.]
Inside/Outside Scores
For a node $n$ with children $n_L, n_R$ and annotated rule $A_x \to B_y\,C_z$:
Inside: $P_{\mathrm{IN}}(n, A_x) = \sum_{y,z} P(A_x \to B_y\,C_z)\; P_{\mathrm{IN}}(n_L, B_y)\; P_{\mathrm{IN}}(n_R, C_z)$
Outside: $P_{\mathrm{OUT}}(n_L, B_y) = \sum_{x,z} P_{\mathrm{OUT}}(n, A_x)\; P(A_x \to B_y\,C_z)\; P_{\mathrm{IN}}(n_R, C_z)$
Learning Latent Annotations (Details)
• E-Step: posterior that annotated rule $A_x \to B_y\,C_z$ is used at node $n$ of the observed tree:
  $P(A_x \to B_y\,C_z \text{ at } n \mid w, T) = \frac{P_{\mathrm{OUT}}(n, A_x)\, P(A_x \to B_y\,C_z)\, P_{\mathrm{IN}}(n_L, B_y)\, P_{\mathrm{IN}}(n_R, C_z)}{P(w, T)}$
• M-Step: relative expected frequencies:
  $P(A_x \to B_y\,C_z) = \frac{\mathbb{E}[\#(A_x \to B_y\,C_z)]}{\mathbb{E}[\#(A_x)]}$
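
As a concrete rendering, here is a numpy sketch of the inside recursion and the M-step normalization, assuming the probabilities of $A_x \to B_y\,C_z$ are stored as a 3-D array indexed by subcategories (illustrative, not the actual Berkeley Parser implementation):

```python
import numpy as np

def inside_update(rule, in_B, in_C):
    """Inside score of each parent subcategory A_x at a node, given the
    children's inside scores: sums over child subcategories (y, z),
    exactly like Forward-Backward sums over hidden states.
    rule[x, y, z] = P(A_x -> B_y C_z); in_B[y], in_C[z] = child inside scores."""
    return np.einsum("xyz,y,z->x", rule, in_B, in_C)

def m_step(expected_counts):
    """M-step: turn expected rule counts E[#(A_x -> B_y C_z)] into
    probabilities by normalizing over all rules with parent A_x."""
    totals = expected_counts.sum(axis=(1, 2), keepdims=True)
    return expected_counts / np.maximum(totals, 1e-300)
```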
Overview
[Plot: parsing accuracy (F1, 60-90) against the total number of grammar symbols (50-1650). Accuracy climbs steadily from k=1 through k=2, 4, 8 to k=16 subcategories per symbol, the limit of our computational resources. The rest of the talk adds: hierarchical training, adaptive splitting, parameter smoothing.]
Refinement of the DT tag
[Figure: DT split in one step into DT-1 … DT-4.]
Hierarchical refinement of the DT tag
[Figure: DT split in two, with each subcategory split again, giving a binary hierarchy of subcategories.]
Hierarchical Estimation Results
[Plot: parsing accuracy (F1, 74-90) against the total number of grammar symbols (100-1700); hierarchical training outperforms the flat baseline.]

Model                  F1
Baseline               87.3
Hierarchical Training  88.4
Refinement of the , tag
• Splitting all categories the same amount is wasteful:
[Figure: the , tag gains nothing from its splits.]
The DT tag revisited
[Figure: the full DT hierarchy after several split rounds. Oversplit?]
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, roll back the splits that were least useful
Adaptive Splitting
• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed.
Adaptive Splitting (Details)
• True data likelihood at a node $n$ of a training tree: $P(w, T) = \sum_x P_{\mathrm{IN}}(n, X_x)\, P_{\mathrm{OUT}}(n, X_x)$
• Approximate likelihood with the split at $n$ reversed: merge the two subcategories' scores, $P_{\mathrm{IN}}^{m} = f_1 P_{\mathrm{IN}}(n, X_1) + f_2 P_{\mathrm{IN}}(n, X_2)$ and $P_{\mathrm{OUT}}^{m} = P_{\mathrm{OUT}}(n, X_1) + P_{\mathrm{OUT}}(n, X_2)$, with $f_1, f_2$ the subcategories' relative frequencies, giving
  $P_n(w, T) = P(w, T) - \sum_{x \in \{1,2\}} P_{\mathrm{IN}}(n, X_x)\, P_{\mathrm{OUT}}(n, X_x) + P_{\mathrm{IN}}^{m}\, P_{\mathrm{OUT}}^{m}$
• Approximate loss in likelihood: $\Delta = \sum_n \log P_n(w, T) - \log P(w, T)$
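
A small numpy sketch of this criterion, assuming inside/outside scores have already been collected at every occurrence of the split pair in the training trees (the array names are illustrative):

```python
import numpy as np

def merge_loss(in1, out1, in2, out2, f1, f2, tree_probs):
    """Approximate loss in data log-likelihood from undoing the split
    X -> X_1, X_2, summed over all occurrences of the pair.
    in*/out*  : inside/outside scores of X_1 and X_2 at each occurrence
    f1, f2    : relative frequencies of X_1 and X_2 (f1 + f2 = 1)
    tree_probs: likelihood P(w, T) of the tree each occurrence belongs to."""
    merged_in = f1 * in1 + f2 * in2      # inside score if the split is undone
    merged_out = out1 + out2             # outside scores simply add
    merged = tree_probs - (in1 * out1 + in2 * out2) + merged_in * merged_out
    return np.sum(np.log(merged) - np.log(tree_probs))
```

Splits are then ranked by this loss, and the least useful half are rolled back.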
Adaptive Splitting Results
[Plot: parsing accuracy (F1, 74-90) against the total number of grammar symbols (100-1700); 50% merging reaches the accuracy of hierarchical training with far fewer symbols, and both beat flat training.]

Model             F1
Previous          88.4
With 50% Merging  89.5
Number of Phrasal Subcategories
[Bar chart: subcategories allocated to each phrasal category after adaptive splitting (y-axis 0-40), over LST, ROOT, X, WHADJP, RRC, SBARQ, INTJ, WHADVP, UCP, NAC, FRAG, CONJP, SQ, WHPP, PRT, SINV, NX, PRN, WHNP, QP, SBAR, ADJP, S, ADVP, PP, VP, NP. NP, VP, and PP receive the most subcategories; rare categories such as NAC and X receive almost none.]
Number of Lexical Subcategories
[Bar chart: subcategories allocated to each part-of-speech tag (y-axis 0-70). Open-class tags (NNP, JJ, NNS, NN, and the verb tags VBx) and ambiguous closed-class tags (IN, DT, RB) receive many subcategories; tags such as POS, TO, and "," receive almost none.]
Smoothing
• Heavy splitting can lead to overfitting
• Idea: smoothing allows us to pool statistics across the subcategories of a symbol
Linear Smoothing
$P'(A_x \to \beta) = (1 - \alpha)\, P(A_x \to \beta) + \alpha\, \frac{1}{k} \sum_{x'} P(A_{x'} \to \beta)$
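
In code, the update is a one-line shrink toward the sibling mean (a sketch; the value of alpha is illustrative):

```python
import numpy as np

def linear_smooth(rule_probs, alpha=0.01):
    """Linear smoothing: shrink each subcategory's rule distribution toward
    the mean over all subcategories of the same parent symbol, so rare
    subcategories pool statistics with their siblings.
    rule_probs[x, r] = P(rule r | parent subcategory x)."""
    mean = rule_probs.mean(axis=0, keepdims=True)
    return (1 - alpha) * rule_probs + alpha * mean
```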
Result Overview
[Plot: parsing accuracy (F1, 74-90) against the total number of grammar symbols (100-1100); 50% merging and smoothing > 50% merging > hierarchical training > flat training.]

Model           F1
Previous        89.5
With Smoothing  90.7
Linguistic Candy
• Proper nouns (NNP):
  NNP-14  Oct.   Nov.       Sept.
  NNP-12  John   Robert     James
  NNP-2   J.     E.         L.
  NNP-1   Bush   Noriega    Peters
  NNP-15  New    San        Wall
  NNP-3   York   Francisco  Street
• Personal pronouns (PRP):
  PRP-0   It  He    I
  PRP-1   it  he    they
  PRP-2   it  them  him
Linguistic Candy
• Relative adverbs (RBR):
  RBR-0  further  lower    higher
  RBR-1  more     less     More
  RBR-2  earlier  Earlier  later
• Cardinal numbers (CD):
  CD-7   one      two      Three
  CD-4   1989     1990     1988
  CD-11  million  billion  trillion
  CD-0   1        50       100
  CD-3   1        30       31
  CD-9   78       58       34
Nonparametric PCFGs using Dirichlet Processes
Percy Liang, Slav Petrov, Dan Klein and Michael Jordan
Improved Inference for Unlexicalized Parsing
Slav Petrov and Dan Klein
Coarse-to-Fine Parsing [Goodman ‘97, Charniak & Johnson ‘05]
(Exhaustive parsing with the fully refined grammar: 1621 min.)
[Diagram: Treebank → coarse grammar (NP … VP) → refined grammars (NP-1, NP-17, NP-12, VP-6, VP-31, …) → fully refined grammar (NP-apple, NP-dog, NP-cat, NP-eat, VP-run, …).]
Prune?
• For each chart item X[i,j], compute its posterior probability from the coarse pass; prune it if the posterior falls below a threshold.
• E.g. consider the span 5 to 12:
[Chart figure: coarse items (… QP NP VP …) over the span; only the surviving items are refined in the next pass.]
(Parsing time: 1621 min → 111 min, with no search error.)
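
The pruning test itself is tiny; here is a sketch with illustrative data structures (dicts keyed by (X, i, j)) and an illustrative threshold. In the hierarchical version below, the same test runs after every intermediate pass:

```python
def prune_chart(inside, outside, sent_prob, threshold=1e-5):
    """Keep a chart item X[i, j] only if its posterior probability of
    taking part in a parse clears the threshold.
    inside/outside: dicts mapping (X, i, j) -> scores from the coarser pass."""
    keep = set()
    for item, in_score in inside.items():
        posterior = in_score * outside.get(item, 0.0) / sent_prob
        if posterior >= threshold:     # survives into the finer pass
            keep.add(item)
    return keep
```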
Hierarchical Pruning
Consider again the span 5 to 12:
coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  …
[Each pass prunes the candidate items for the next, finer pass.]
Intermediate Grammars
[Diagram: learning produces a sequence of grammars X-Bar=G0 → G1 → G2 → G3 → G4 → G5 → G6=G by repeated splitting: DT → DT1, DT2 → DT1 … DT4 → DT1 … DT8.]
(Parsing time: 1621 min → 111 min → 35 min, with no search error.)
State Drift (DT tag)
[Figure: as EM re-splits, the DT subcategories drift: a subcategory covering "the, that, this" at one level splits into "That, This, …" and "some, this, these, …" at the next, and the clusters keep shifting. Subcategories of Gi therefore need not correspond to subcategories of Gi+1.]
Projected Grammars
[Diagram: learning yields X-Bar=G0 → G1 → G2 → G3 → G4 → G5 → G6=G; instead of using the intermediate grammars, each coarse level is obtained by projecting the final grammar: $\pi_i(G)$.]
Estimating Projected Grammars
Nonterminals? Easy:
Nonterminals in G: NP0, NP1, VP0, VP1, S0, S1 → (projection $\pi$) → nonterminals in $\pi$(G): NP, VP, S
Estimating Projected Grammars
Rules?

Rules in G:            Rules in $\pi$(G):
S1 → NP1 VP1  0.20     S → NP VP  ???
S1 → NP1 VP2  0.12
S1 → NP2 VP1  0.02
S1 → NP2 VP2  0.03
S2 → NP1 VP1  0.11
S2 → NP1 VP2  0.05
S2 → NP2 VP1  0.08
S2 → NP2 VP2  0.12
[Corazza & Satta ‘06]: weight each refined rule by the expected frequency of its parent subcategory in the grammar's infinite tree distribution (not in the finite treebank), then sum:

Rules in G:            Rules in $\pi$(G):
S1 → NP1 VP1  0.20     S → NP VP  0.56
S1 → NP1 VP2  0.12
…
S2 → NP2 VP2  0.12
Calculating Expectations
• Nonterminals: $c_k(X)$ = expected counts of $X$ up to depth $k$, computed as the fixed point
  $c_{k+1}(Y) = [\![Y = \mathrm{root}]\!] + \sum_X c_k(X) \sum_{X \to \alpha} P(X \to \alpha)\, \#_Y(\alpha)$
  • Converges within 25 iterations (a few seconds)
• Rules: the projected probability of a rule is the expectation-weighted sum of its refined rules, normalized by the expected count of the projected parent.
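
A sketch of the nonterminal fixed point in numpy, with symbols as integer ids (an illustrative encoding; the 25-iteration cap mirrors the convergence claim above):

```python
import numpy as np

def expected_counts(rules, n_symbols, root=0, iters=25):
    """Expected number of occurrences of each nonterminal in a tree drawn
    from the grammar's infinite tree distribution.
    rules: list of (lhs, rhs_symbol_list, prob) with integer symbol ids."""
    # A[x, y] = expected number of direct y-children of an x node
    A = np.zeros((n_symbols, n_symbols))
    for lhs, rhs, p in rules:
        for y in rhs:
            A[lhs, y] += p
    e_root = np.eye(n_symbols)[root]
    c = e_root.copy()                  # c_0: just the root
    for _ in range(iters):             # c_{k+1} = root + A^T c_k
        c = e_root + A.T @ c
    return c
```

Dividing the expectation-weighted sum of refined-rule probabilities by the expected count of the projected parent then gives numbers like the 0.56 above.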
(Parsing time: 1621 min → 111 min → 35 min → 15 min, with no search error.)
Parsing times
[Chart: share of total parsing time spent in each pass: X-Bar=G0 60%, G1 12%, G2 7%, G3 6%, G4 6%, G5 5%, G=G6 4%.]
Bracket Posteriors
[Figures: posterior bracket charts after G0, after G1, the final chart, a movie of all passes, and the best tree; the posteriors sharpen after every pass.]
Parse Selection
[Figure: one projected parse corresponds to many derivations over split symbols; a parse's probability is the sum of its derivations' probabilities.]
Computing the most likely unsplit tree is NP-hard:
• Settle for the best derivation.
• Rerank an n-best list.
• Use an alternative objective function.
(A toy example of why the first option is only approximate follows below.)
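
Concretely, here is that toy example in Python, with made-up probabilities:

```python
from collections import defaultdict

# Each derivation is (projected_tree, probability); the numbers are made up.
derivations = [("T1", 0.20),                 # T1: one strong derivation
               ("T2", 0.15), ("T2", 0.15)]   # T2: its mass is split in two

best_derivation = max(derivations, key=lambda d: d[1])[0]    # "T1"

tree_probs = defaultdict(float)
for tree, p in derivations:
    tree_probs[tree] += p           # P(tree) = sum over its derivations
best_parse = max(tree_probs, key=tree_probs.get)             # "T2" (0.30)

print(best_derivation, best_parse)  # T1 T2 -- the two objectives disagree
```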
Final Results (Efficiency)
• Berkeley Parser:
  • 15 min
  • 91.2 F-score
  • Implemented in Java
• Charniak & Johnson ‘05 Parser:
  • 19 min
  • 90.7 F-score
  • Implemented in C
Final Results (Accuracy)
Parser                                    F1 (≤ 40 words)  F1 (all)
ENG  Charniak&Johnson ‘05 (generative)    90.1             89.6
ENG  This Work                            90.6             90.1
GER  Dubey ‘05                            76.3             -
GER  This Work                            80.8             80.1
CHN  Chiang et al. ‘02                    80.0             76.6
CHN  This Work                            86.3             83.4
Conclusions (Syntax)
• Split & Merge Learning
  • Hierarchical Training
  • Adaptive Splitting
  • Parameter Smoothing
• Hierarchical Coarse-to-Fine Inference
  • Projections
  • Marginalization
• Multi-lingual Unlexicalized Parsing
Generative vs. Discriminative
• Conditional Estimation
  • L-BFGS
  • Iterative Scaling
• Conditional Structure
• Alternative Merging Criterion
How much supervision?
Syntactic Machine Translation
• Collaboration with ISI/USC:
  • Use parse trees
  • Use annotated parse trees
  • Learn split synchronous grammars
Speech
• Split & Merge Learning
• Coarse-to-Fine Decoding
• Combined Generative + Conditional Learning
• Speech Synthesis
Syntax
Scenes
Learning Structured Models for Phone Recognition
Slav Petrov, Adam Pauls, Dan Klein
Motivation (Speech)
Phones:  ae n d y uh k uh d n t k ae r l eh s
Words:   and you couldn’t care less
Traditional Models
[Diagram: a triphone HMM for "dad": Start → #-d-a → d-a-d → a-d-# → End, with decision-tree-clustered triphone states (e.g. d17 = c(#-d-a)), begin-middle-end structure per phone, and mixture-of-Gaussian emissions.]
Model Overview
[Diagram: the traditional hand-specified triphone model vs. our model with automatically split subphone states.]
Differences to Grammars
[Diagram: two side-by-side contrasts between the speech model and the grammar case.]
Refinement of the ih-phone
[Figure: the learned subphone hierarchy for the ih phone.]
Inference
• Coarse-to-Fine
• Variational Approximation
Phone Classification Results
Method                                     Error Rate
GMM Baseline (Sha and Saul, 2006)          26.0 %
HMM Baseline (Gunawardana et al., 2005)    25.1 %
SVM (Clarkson and Moreno, 1999)            22.4 %
Hidden CRF (Gunawardana et al., 2005)      21.7 %
This Paper                                 21.4 %
Large Margin GMM (Sha and Saul, 2006)      21.1 %
Phone Recognition Results
Method                                                    Error Rate
State-Tied Triphone HMM (HTK) (Young and Woodland, 1994)  27.1 %
Gender Dependent Triphone HMM (Lamel and Gauvain, 1993)   27.1 %
This Paper                                                26.1 %
Bayesian Triphone HMM (Ming and Smith, 1998)              25.6 %
Heterogeneous classifiers (Halberstadt and Glass, 1998)   24.4 %
Confusion Matrix
How much supervision?
• Hand-aligned
  • Exact phone boundaries are known
• Automatically aligned
  • Only the sequence of phones is known
Generative + Conditional Learning
• Learn structure generatively
• Estimate Gaussians conditionally
• Collaboration with Fei Sha
Speech Synthesis
• Acoustic phone model:
  • Generative
  • Accurate
  • Models phone-internal structure well
• Use it for speech synthesis!
Large Vocabulary ASR
• ASR System = Acoustic Model + Decoder
• Coarse-to-Fine Decoder:
  • Subphone → Phone
  • Phone → Syllable → Word → Bigram → …
Scenes
• Split & Merge Learning
• Decoding
Syntax
Speech
Motivation (Scenes)
[Images: a seascape segmented and labeled by region: Sky, Water, Grass, Rock.]
Learning
• Oversegment the image
• Extract vertical stripes
• Extract features
• Train HMMs
Inference
• Decode stripes (see the sketch below)
• Enforce horizontal consistency
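
Decoding one stripe is a standard Viterbi pass over its segments; this is a generic sketch (illustrative array layout, not the talk's actual system), with horizontal consistency then enforced across neighboring stripes, e.g. by smoothing the decoded labels:

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Decode one vertical stripe top-to-bottom.
    log_trans[s, s']: log score of label s' appearing below label s
                      (e.g. sky above water is likely, the reverse is not)
    log_emit[t, s]:   log score of stripe segment t under label s
    Returns the best label sequence for the stripe."""
    T, S = log_emit.shape
    delta = np.empty((T, S))
    delta[0] = log_emit[0]             # uniform start distribution assumed
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans + log_emit[t][None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):      # follow backpointers upward
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```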
Alternative Approach
• Conditional Random Fields
• Pro:
  • Vertical and horizontal dependencies are learnt jointly
  • Inference is more natural
• Contra:
  • Computationally more expensive
Timeline
[Timeline, Now → '07 → '08 → '09:
 Syntax: Bayesian learning, inference, conditional learning
 Speech: learning, synthesis, decoding
 Scenes: learning, inference (TrecVid)
 Summer '07: syntactic MT at ISI]
Results so far
• State-of-the-art parser for different languages:
  • Automatically learnt
  • Simple & compact
  • Fast & accurate
  • Available for download
• Phone recognizer:
  • Automatically learnt
  • Competitive performance
  • Good foundation for a speech recognizer
Proposed Deliverables
• Syntax Parser
• Speech Recognizer
• Speech Synthesizer
• Syntactic Translation Machine
• Scene Recognizer
Thank You!
Download