Inducing Structure for Perception
(a.k.a. Slav's split&merge Hammer)
Slav Petrov
Advisors: Dan Klein, Jitendra Malik
Collaborators: L. Barrett, R. Thibaux, A. Faria, A. Pauls, P. Liang, A. Berg

The Main Idea
A complex underlying process produces an observation ("He was right."). Traditionally, we explain the observation with manually specified structure. Instead, we start from the manually specified structure and use EM to produce automatically refined structure.

Why Structure?
[Figure: the words of the sentences below, scrambled into an unordered jumble]
Structure is important:
- The dog and the cat ate the food.
- The cat ate the food and the dog.
- The dog ate the cat and the food.

Syntactic Ambiguity
Last night I shot an elephant in my pajamas.

Visual Ambiguity
Old or young? [classic ambiguous-figure illusion]

Three Peaks?
Machine Learning, Natural Language Processing, Computer Vision.
No, One Mountain! The three fields share the same underlying problem.

Three Domains
Syntax, Scenes, Speech.

Timeline ('07 through '09)
- Syntax: split & merge learning; nonparametric Bayesian learning; coarse-to-fine inference; generative vs. conditional learning; syntactic machine translation (summer at ISI); language modeling
- Speech: learning; decoding; synthesis
- Scenes: learning; inference (TrecVid)

Learning Accurate, Compact, and Interpretable Tree Annotation
Slav Petrov, Leon Barrett, Romain Thibaux, Dan Klein

Motivation (Syntax)
Task: parse sentences such as "He was right." Why? Information extraction and syntactic machine translation, among others.

Treebank Parsing
Read a probabilistic grammar off the treebank, e.g.:
S -> NP VP .   1.0
NP -> PRP      0.5
NP -> DT NN    0.5
...
PRP -> She     1.0
DT -> the      1.0
...

Non-Independence
Independence assumptions are often too strong: the distribution over NP productions depends on where the NP sits.
              All NPs   NPs under S   NPs under VP
NP -> NP PP     11%          9%           23%
NP -> DT NN      9%          6%            7%
NP -> PRP        9%         21%            4%

The Game of Designing a Grammar
Annotation refines base treebank symbols to improve the statistical fit of the grammar:
- Parent annotation [Johnson '98]
- Head lexicalization [Collins '99, Charniak '00]
- Automatic clustering?

Learning Latent Annotations
An EM algorithm, just like Forward-Backward for HMMs:
- Brackets are known
- Base categories are known
- Only induce subcategories
[Figure: a parse of "He was right." with latent subcategories X1 through X7; the forward/backward passes of the HMM case become outside/inside passes over the tree]

Inside/Outside Scores
For an annotated rule $A_x \to B_y\, C_z$ with probability $\beta(A_x \to B_y C_z)$, at a tree node $n$ with children $n_L, n_R$:
Inside: $P_{IN}(n, A_x) = \sum_{y,z} \beta(A_x \to B_y C_z)\, P_{IN}(n_L, B_y)\, P_{IN}(n_R, C_z)$
Outside: $P_{OUT}(n_L, B_y) = \sum_{x,z} \beta(A_x \to B_y C_z)\, P_{OUT}(n, A_x)\, P_{IN}(n_R, C_z)$ (and symmetrically for the right child).

Learning Latent Annotations (Details)
E-Step: compute the posterior probability of each annotated rule at each node,
$P(A_x \to B_y C_z \mid n) \propto P_{OUT}(n, A_x)\, \beta(A_x \to B_y C_z)\, P_{IN}(n_L, B_y)\, P_{IN}(n_R, C_z)$.
M-Step: re-estimate by normalizing the expected counts,
$\beta(A_x \to B_y C_z) := \frac{\#\{A_x \to B_y C_z\}}{\#\{A_x\}}$.
Splitting every category into many subcategories quickly hits the limit of computational resources.

Overview
[Chart: parsing accuracy (F1) vs. total number of grammar symbols, one curve per number of subcategories k = 1, 2, 4, 8, 16; accuracy rises from roughly 60 F1 at k=1 toward 90 F1 at k=16]
This work adds three ingredients:
- Hierarchical Training
- Adaptive Splitting
- Parameter Smoothing

Refinement of the DT tag
Flat: DT splits directly into DT-1, DT-2, DT-3, DT-4.
Hierarchical: DT is repeatedly split in two, yielding a binary hierarchy of subcategories.

Hierarchical Estimation Results
[Chart: parsing accuracy (F1) vs. total number of grammar symbols, flat vs. hierarchical training]
Model                   F1
Baseline                87.3
Hierarchical Training   88.4
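To make the split step concrete, here is a minimal Python sketch (not the Berkeley Parser's actual implementation) of one round of hierarchical splitting, assuming the grammar is stored as a dict mapping binary rules (parent, left, right) to probabilities; the noise level and naming scheme are illustrative, and lexical rules would be split analogously.

```python
import random

def split_grammar(rules, noise=0.01):
    """One split round: each category X becomes X_0 and X_1, so each
    binary rule A -> B C becomes 8 refined rules. Probability mass is
    divided evenly (each of the 4 child combinations gets p/4), then
    jittered slightly so the two new parent subcategories are not
    identical and EM can pull them apart."""
    new_rules = {}
    for (a, b, c), p in rules.items():
        for ai in (0, 1):
            for bi in (0, 1):
                for ci in (0, 1):
                    eps = random.uniform(-noise, noise)
                    rule = (f"{a}_{ai}", f"{b}_{bi}", f"{c}_{ci}")
                    new_rules[rule] = p / 4 * (1 + eps)
    # renormalize so rules with the same parent still sum to one
    totals = {}
    for (a, _, _), p in new_rules.items():
        totals[a] = totals.get(a, 0.0) + p
    return {r: p / totals[r[0]] for r, p in new_rules.items()}
```

EM then re-estimates the refined probabilities from expected counts before the next split round; the adaptive splitting step described next decides which of these splits are worth keeping.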
Refinement of the , tag
Splitting all categories the same amount is wasteful: the comma tag gains nothing from extra subcategories.

The DT tag revisited
Oversplit?

Adaptive Splitting
- Want to split complex categories more.
- Idea: split everything, then roll back the splits which were least useful.
Evaluate the loss in likelihood from removing each split:
loss = (data likelihood with split reversed) / (data likelihood with split)
There is no loss in accuracy when 50% of the splits are reversed.

Adaptive Splitting (Details)
True data likelihood, recoverable at any node $n$ of an observed tree:
$P(w, T) = \sum_x P_{IN}(n, A_x)\, P_{OUT}(n, A_x)$
Approximate likelihood with the split at $n$ reversed, merging $A_1$ and $A_2$ with relative frequencies $p_1, p_2$:
$P_n(w, T) = \big[p_1 P_{IN}(n, A_1) + p_2 P_{IN}(n, A_2)\big]\big[P_{OUT}(n, A_1) + P_{OUT}(n, A_2)\big] + \sum_{x \neq 1,2} P_{IN}(n, A_x)\, P_{OUT}(n, A_x)$
Approximate loss in likelihood from reversing a split $s$:
$\Delta(s) = \prod_{n} \frac{P_n(w, T)}{P(w, T)}$, taken over all nodes where $s$ was used.

Adaptive Splitting Results
[Chart: parsing accuracy (F1) vs. total number of grammar symbols for flat training, hierarchical training, and hierarchical training with 50% merging]
Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories
[Bar chart: subcategories per phrasal category, 0 to 40; NP, VP, and PP receive by far the most, while rare categories such as NAC and X receive almost none. Categories shown: NP, VP, PP, ADVP, S, ADJP, SBAR, QP, WHNP, PRN, NX, SINV, PRT, WHPP, SQ, CONJP, FRAG, NAC, UCP, WHADVP, INTJ, SBARQ, RRC, WHADJP, X, ROOT, LST]

Number of Lexical Subcategories
[Bar chart: subcategories per part-of-speech tag, 0 to 70; NNP, JJ, NNS, and NN receive the most, the verb tags (VBx), IN, RB, and DT many, while closed-class tags such as POS, TO, and "," receive essentially one each]

Smoothing
- Heavy splitting can lead to overfitting.
- Idea: smoothing allows us to pool statistics across the subcategories of a category.
Linear smoothing:
$\hat\beta(A_x \to B_y C_z) = (1 - \alpha)\, \beta(A_x \to B_y C_z) + \alpha\, \frac{1}{n} \sum_{x'} \beta(A_{x'} \to B_y C_z)$

Result Overview
[Chart: parsing accuracy (F1) vs. total number of grammar symbols for flat training, hierarchical training, 50% merging, and 50% merging with smoothing]
Model            F1
Previous         89.5
With Smoothing   90.7

Linguistic Candy
Proper nouns (NNP):
NNP-14   Oct.   Nov.       Sept.
NNP-12   John   Robert     James
NNP-2    J.     E.         L.
NNP-1    Bush   Noriega    Peters
NNP-15   New    San        Wall
NNP-3    York   Francisco  Street
Personal pronouns (PRP):
PRP-0    It     He         I
PRP-1    it     he         they
PRP-2    it     them       him

Relative adverbs (RBR):
RBR-0    further   lower     higher
RBR-1    more      less      More
RBR-2    earlier   Earlier   later
Cardinal numbers (CD):
CD-7     one       two       Three
CD-4     1989      1990      1988
CD-11    million   billion   trillion
CD-0     1         50        100
CD-3     1         30        31
CD-9     78        58        34

Nonparametric PCFGs using Dirichlet Processes
Percy Liang, Slav Petrov, Dan Klein and Michael Jordan

Improved Inference for Unlexicalized Parsing
Slav Petrov and Dan Klein

Coarse-to-Fine Parsing [Goodman '97, Charniak & Johnson '05]
Parsing with the full refined grammar takes 1621 min. Instead, first parse with a coarse grammar read off the treebank (NP, VP, ...), then with the refined grammar (NP-1, NP-12, NP-17, ..., VP-6, VP-31, ..., or lexicalized: NP-apple, NP-dog, NP-cat, VP-run, ...).

Prune? For each chart item X[i,j], compute its posterior probability and discard the item if it falls below a threshold:
$\frac{P_{IN}(X, i, j)\, P_{OUT}(X, i, j)}{P(w)} < \text{threshold}$

For example, consider the span 5 to 12: most coarse items (QP, NP, VP, ...) receive negligible posterior mass, so their refinements are never built in the refined pass. Parsing time drops from 1621 min to 111 min, with no search error.
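Below is a minimal Python sketch of the pruning test, assuming inside and outside scores for the coarse pass have already been computed and stored in dicts keyed by (symbol, i, j); the threshold value is a placeholder, not the parser's tuned setting.

```python
def pruned_chart(inside, outside, sentence_prob, threshold=1e-5):
    """Keep a coarse chart item X[i,j] only if its posterior probability
    P_IN(X,i,j) * P_OUT(X,i,j) / P(w) clears the threshold. Items that
    fail the test are never built, in any refinement, at the next,
    finer level."""
    keep = set()
    for (x, i, j), in_score in inside.items():
        posterior = in_score * outside.get((x, i, j), 0.0) / sentence_prob
        if posterior >= threshold:
            keep.add((x, i, j))
    return keep
```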
Hierarchical Pruning
Consider again the span 5 to 12, but now prune through a whole sequence of increasingly refined grammars:
coarse:          ... QP  NP  VP ...
split in two:    ... QP1 QP2 NP1 NP2 VP1 VP2 ...
split in four:   ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
split in eight:  ...
Parsing time drops further: 1621 min, then 111 min, then 35 min (no search error).

Intermediate Grammars
Hierarchical training already produces a sequence of grammars X-Bar = G0, G1, G2, G3, G4, G5, G6 = G (e.g., DT1; then DT1 DT2; then DT1 DT2 DT3 DT4; up to DT1 through DT8). Can we prune with them?

State Drift (DT tag)
No: the subcategories drift during EM. [Figure: across training rounds, the subcategory containing "the/that/this" repeatedly splits and reassigns words such as "that", "this", "these", and "some" in incompatible ways.] The intermediate grammars therefore do not nest inside the final one.

Projected Grammars
Instead, project the final grammar G down to each level: X-Bar = G0, then $\pi_0(G), \pi_1(G), \ldots, \pi_5(G)$, with G6 = G itself, where $\pi_i$ projects refined symbols onto level $i$.

Estimating Projected Grammars
Nonterminals? Easy: NP1, NP0 -> NP; VP1, VP0 -> VP; S1, S0 -> S.
Rules? Harder:
S1 -> NP1 VP1   0.20
S1 -> NP1 VP2   0.12
S1 -> NP2 VP1   0.02
S1 -> NP2 VP2   0.03
S2 -> NP1 VP1   0.11
S2 -> NP1 VP2   0.05
S2 -> NP2 VP1   0.08
S2 -> NP2 VP2   0.12
What probability should S -> NP VP receive? [Corazza & Satta '06]
Answer: weight each refined rule by how often its parent is expected to occur, not in the finite treebank, but in the infinite tree distribution defined by the grammar. In this example, S -> NP VP receives 0.56.

Calculating Expectations
Nonterminals: compute $c_k(X)$, the expected count of each refined symbol in trees up to depth $k$, by iterating
$c_{k+1}(Y) = [\![Y = \mathrm{ROOT}]\!] + \sum_{X \to \alpha} c_k(X)\, P(X \to \alpha)\, |\alpha|_Y$.
This converges within 25 iterations (a few seconds).
Rules: project by count-weighting and renormalizing,
$P'(A \to B\, C) = \frac{\sum_{x,y,z} c(A_x)\, \beta(A_x \to B_y C_z)}{\sum_x c(A_x)}$,
summing over all refined rules that project onto A -> B C (a code sketch follows).
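The following Python sketch illustrates both steps under the same dict-based grammar representation as before; `project` is a hypothetical function mapping a refined symbol such as "NP_3" to its coarse version "NP", and only binary rules are handled (lexical and unary rules would be projected the same way). It follows the count-weighted recipe above, not the Berkeley Parser's actual code.

```python
def expected_counts(rules, root="ROOT_0", iterations=25):
    """Fixed point for c(X), the expected number of occurrences of each
    refined symbol in a tree drawn from the (infinite) tree distribution
    defined by the grammar. Converges quickly for a proper grammar."""
    counts = {root: 1.0}
    for _ in range(iterations):
        new = {root: 1.0}
        for (a, b, c), p in rules.items():
            for child in (b, c):
                new[child] = new.get(child, 0.0) + counts.get(a, 0.0) * p
        counts = new
    return counts

def project_rules(rules, counts, project):
    """Weight each refined rule by the expected count of its parent,
    pool the mass of all rules with the same projection, and
    renormalize per coarse parent."""
    num, den = {}, {}
    for (a, b, c), p in rules.items():
        w = counts.get(a, 0.0) * p
        coarse = (project(a), project(b), project(c))
        num[coarse] = num.get(coarse, 0.0) + w
        den[coarse[0]] = den.get(coarse[0], 0.0) + w
    return {r: w / den[r[0]] for r, w in num.items()}
```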
With hierarchical coarse-to-fine pruning over the projected grammars, total parsing time drops once more: 1621 min, 111 min, 35 min, 15 min (still no search error).

Parsing times
[Pie chart: share of parsing time per pass: X-Bar = G0 60%, G1 12%, G2 7%, G3 6%, G4 6%, G5 5%, G = G6 4%]

Bracket Posteriors
[Figures: bracket posteriors after the G0 pass, after the G1 pass, the final chart, an animation over all passes, and the best tree]

Parse Selection
Many derivations, differing only in their subcategories, collapse to the same unsplit parse, and computing the most likely unsplit tree is NP-hard. Options:
- Settle for the best derivation.
- Rerank an n-best list.
- Use an alternative objective function.

Final Results (Efficiency)
Berkeley Parser:                15 min, 91.2 F-score (implemented in Java)
Charniak & Johnson '05 Parser:  19 min, 90.7 F-score (implemented in C)

Final Results (Accuracy)
                                              <= 40 words F1   all F1
ENG   Charniak & Johnson '05 (generative)          90.1         89.6
ENG   This Work                                    90.6         90.1
GER   Dubey '05                                    76.3          -
GER   This Work                                    80.8         80.1
CHN   Chiang et al. '02                            80.0         76.6
CHN   This Work                                    86.3         83.4

Conclusions (Syntax)
- Split & merge learning: hierarchical training, adaptive splitting, parameter smoothing.
- Hierarchical coarse-to-fine inference: projections, marginalization.
- Multi-lingual unlexicalized parsing.

Generative vs. Discriminative
- Conditional estimation: L-BFGS, iterative scaling.
- Conditional structure.
- Alternative merging criterion.
- How much supervision?

Syntactic Machine Translation
Collaboration with ISI/USC:
- Use parse trees.
- Use annotated parse trees.
- Learn split synchronous grammars.

Speech
Split & merge learning; coarse-to-fine decoding; combined generative + conditional learning; speech synthesis.

Learning Structured Models for Phone Recognition
Slav Petrov, Adam Pauls, Dan Klein

Motivation (Speech)
Phones:  ae n d y uh k uh d n t k ae r l eh s
Words:   and you couldn't care less

Traditional Models
[Figure: an HMM over context-dependent triphone states, e.g. #-d-a, d-a-d, a-d-#, with decision-tree-clustered states such as a1 = c(d-a-d)]
- Triphones + decision-tree clustering
- Mixtures of Gaussians
- Begin - middle - end structure

Model Overview
Traditional: hand-specified triphone topology. Our model: start from a minimal begin-middle-end HMM per phone and let split & merge refine the state space automatically.
Differences to grammars: [figure comparing the HMM state lattice to the grammar case]

Refinement of the ih-phone
[Figure: the ih phone's states split hierarchically, just as the DT tag did]

Inference
Coarse-to-fine pruning; variational approximation.

Phone Classification Results
Method                                        Error Rate
GMM Baseline (Sha and Saul, 2006)             26.0%
HMM Baseline (Gunawardana et al., 2005)       25.1%
SVM (Clarkson and Moreno, 1999)               22.4%
Hidden CRF (Gunawardana et al., 2005)         21.7%
This Paper                                    21.4%
Large Margin GMM (Sha and Saul, 2006)         21.1%

Phone Recognition Results
Method                                                      Error Rate
State-Tied Triphone HMM (HTK) (Young and Woodland, 1994)    27.1%
Gender Dependent Triphone HMM (Lamel and Gauvain, 1993)     27.1%
This Paper                                                  26.1%
Bayesian Triphone HMM (Ming and Smith, 1998)                25.6%
Heterogeneous Classifiers (Halberstadt and Glass, 1998)     24.4%

Confusion Matrix
[Figure: phone confusion matrix]

How much supervision?
- Hand-aligned: exact phone boundaries are known.
- Automatically-aligned: only the sequence of phones is known.

Generative + Conditional Learning
Learn the structure generatively; estimate the Gaussians conditionally. Collaboration with Fei Sha.

Speech Synthesis
The acoustic phone model is generative, accurate, and models phone-internal structure well. Use it for speech synthesis!

Large Vocabulary ASR
ASR system = acoustic model + decoder. Coarse-to-fine decoder over a hierarchy of levels: subphone, phone, syllable, word, word bigram, ...

Scenes
Split & merge learning for scenes, alongside syntax and speech.

Motivation (Scenes)
[Figure: a seascape with regions labeled Sky, Water, Grass, Rock]

Learning
- Oversegment the image.
- Extract vertical stripes.
- Extract features.
- Train HMMs.
Inference
- Decode stripes (see the backup sketch after the last slide).
- Enforce horizontal consistency.

Alternative Approach
Conditional random fields.
Pro: vertical and horizontal dependencies are learnt jointly; inference is more natural.
Contra: computationally more expensive.

Timeline
[Recap of the '07 through '09 timeline: syntax (learning, inference, conditional learning, syntactic MT at ISI), speech (learning, decoding, synthesis), scenes (learning, inference, TrecVid)]

Results so far
State-of-the-art parser for different languages:
- Automatically learnt
- Simple & compact
- Fast & accurate
- Available for download
Phone recognizer:
- Automatically learnt
- Competitive performance
- Good foundation for a speech recognizer

Proposed Deliverables
- Syntax parser
- Speech recognizer
- Speech synthesizer
- Syntactic machine translation
- Scene recognizer

Thank You!
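Backup: a minimal Python sketch of the proposed stripe-decoding step, assuming per-segment emission log-probabilities have already been computed from the extracted features; the label set and transition scores are illustrative placeholders, not results from the proposal. Horizontal consistency across neighboring stripes would be enforced in a separate pass.

```python
LABELS = ["sky", "water", "grass", "rock"]  # illustrative label set

def decode_stripe(emission_logprobs, trans_logprobs):
    """Viterbi over the segments of one vertical stripe, top to bottom.
    emission_logprobs[t][s] scores segment t under label s;
    trans_logprobs[s1][s2] scores label s1 followed by label s2."""
    n = len(LABELS)
    best = list(emission_logprobs[0])  # best path score ending in each label
    back = []                          # back-pointers for segments 1..T-1
    for t in range(1, len(emission_logprobs)):
        new, ptrs = [], []
        for s2 in range(n):
            scores = [best[s1] + trans_logprobs[s1][s2] for s1 in range(n)]
            s1 = max(range(n), key=lambda i: scores[i])
            new.append(scores[s1] + emission_logprobs[t][s2])
            ptrs.append(s1)
        best = new
        back.append(ptrs)
    # recover the best label sequence by following the back-pointers
    path = [max(range(n), key=lambda s: best[s])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return [LABELS[s] for s in reversed(path)]
```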