Joint Models with Missing Data for Semi-Supervised Learning
Jason Eisner
NAACL Workshop Keynote – June 2009

Outline
1. Why use joint models?
2. Making big joint models tractable: approximate inference and training by loopy belief propagation
3. Open questions: semi-supervised training of joint models

The standard story
- A task maps an input x to an output y via a p(y|x) model.
- Semi-supervised learning: train on many (x, ?) and only a few (x, y).

Some running examples (e.g., in low-resource languages)
- sentence -> parse (with David A. Smith)
- lemma -> morphological paradigm (with Markus Dreyer)

Semi-supervised learning
- Why would knowing p(x) help you learn p(y|x)?
- Because of shared parameters via a joint model, e.g., a noisy channel: p(x,y) = p(y) * p(x|y).
- Estimate p(x,y) so that it has the appropriate marginal p(x); this also affects the conditional distribution p(y|x).

[Figure: a sample of p(x), unlabeled points that visibly fall into clusters]
- For any x, we can now recover the cluster c that probably generated it.
- A few supervised examples may then let us predict y from c.
- E.g., if p(x,y) = ∑c p(x,y,c) = ∑c p(c) p(y|c) p(x|c) (a joint model!), with few parameters.
- The picture is misleading in one respect: we need not assume a distance metric (as in TSVMs, label propagation, etc.). But we do need to choose a model family for p(x,y).
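To spell out why the clustering picture helps, here is the short derivation implied by the mixture model just mentioned; it is a generic worked example, not tied to any particular task on these slides:

    p(x,y)      = \sum_c p(c)\, p(y \mid c)\, p(x \mid c)
    p(x)        = \sum_c p(c)\, p(x \mid c)
    p(y \mid x) = \sum_c p(c \mid x)\, p(y \mid c)

The many unlabeled (x, ?) examples constrain p(c) and p(x|c), and therefore p(c|x); the few labeled (x, y) examples then only need to estimate the small table p(y|c).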
NLP + ML = ?
- x: a structured input (it may be only partly observed, so we may have to infer x, too).
- The p(y|x) model depends on features of <x,y> (sparse features?), or on features of <x,z,y> where z is latent (so we infer z, too).
- y: a structured output (so we already need joint inference just to decode, e.g., dynamic programming).

Each task in a vacuum?
[Diagram: four unrelated tasks, Task1: x1 -> y1 through Task4: x4 -> y4]

Solved tasks help later ones? (e.g., a pipeline)
[Diagram: x -> Task1 -> z1 -> Task2 -> z2 -> Task3 -> z3 -> Task4 -> y]

Feedback?
- What if Task3 isn't solved yet and we have little <z2,z3> training data?
- Impute <z2,z3> given x and y!

A later step benefits from many earlier ones? And conversely?
[Same pipeline diagram, now with information flowing in both directions]

We end up with a Markov Random Field (MRF)
[Factor graph: variables x, z1, z2, z3, y connected by factors Φ1 ... Φ5]

Variable-centric, not task-centric
- p(x, z1, z2, z3, y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)

A familiar MRF example: a Conditional Random Field (CRF) for POS tagging
- The observed input sentence "... find preferred tags ..." is shaded. A possible tagging (an assignment to the remaining variables) is "... v v v ..."; another possible tagging is "... v a n ...".
- A "binary" factor measures the compatibility of two adjacent tags; the model reuses the same parameters at every position:

      next tag:   v    n    a
      tag v       0    2    1
      tag n       2    1    0
      tag a       0    3    1

- A "unary" factor evaluates each tag. Its values depend on the corresponding word (and could be made to depend on the entire observed sentence), so there is a different unary factor at each position:

      find:       v 0.3   n 0.02   a 0
      preferred:  v 0.3   n 0      a 0.1
      tags:       v 0.2   n 0.2    a 0     (can't be an adjective)

- p(v a n) is proportional to the product of all the factors' values on "v a n": ... 1 * 3 * 0.3 * 0.1 * 0.2 ...
- NOTE: this is not just a pipeline of single-tag prediction tasks (which might work acceptably in the well-trained supervised case).
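To make the factor-product arithmetic above concrete, here is a small brute-force sketch that scores taggings of "find preferred tags" with the factor tables shown on these slides; the normalization over all 27 taggings is added here purely for illustration and is not part of the slides.

    import itertools

    # Factor tables from the slides (keys: tag at position i, tag at position i+1).
    TAGS = ["v", "n", "a"]
    binary = {("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
              ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
              ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1}
    unary = {"find":      {"v": 0.3, "n": 0.02, "a": 0.0},
             "preferred": {"v": 0.3, "n": 0.0,  "a": 0.1},
             "tags":      {"v": 0.2, "n": 0.2,  "a": 0.0}}
    words = ["find", "preferred", "tags"]

    def score(tagging):
        """Unnormalized probability: product of all unary and binary factor values."""
        s = 1.0
        for w, t in zip(words, tagging):
            s *= unary[w][t]
        for t1, t2 in zip(tagging, tagging[1:]):
            s *= binary[(t1, t2)]
        return s

    print(score(("v", "a", "n")))        # 0.3 * 1 * 0.1 * 3 * 0.2 = 0.018
    Z = sum(score(t) for t in itertools.product(TAGS, repeat=3))
    print(score(("v", "a", "n")) / Z)    # normalized probability of this tagging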
Task-centered view of the world
[Diagram: the pipeline x -> Task1 -> z1 -> ... -> Task4 -> y again]

Variable-centered view of the world
- The same joint distribution: p(x, z1, z2, z3, y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)

Variable-centric, not task-centric
- Throw in any variables that might help!
- Model and exploit the correlations.
[Word cloud of candidate variables: semantics, lexicon (word types), entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, tokens, sentences, translation, alignment, editing, quotation, discourse context, resources, speech, misspellings/typos, formatting, entanglement, annotation]

Back to our (simpler!) running examples
- sentence -> parse (with David A. Smith)
- lemma -> morphological paradigm (with Markus Dreyer)

Parser projection (with David A. Smith)
- Variables: the sentence and its parse (little direct training data); the sentence's translation and the parse of that translation (much more training data); and a word-to-word alignment between the two sentences.
- Example sentence pair: "Auf diese Frage habe ich leider keine Antwort bekommen" aligned to "I did not unfortunately receive an answer to this question" (plus a NULL token for unaligned words).
- The two parses are not entirely isomorphic, so we need an interesting model: aligned links may be monotonic, but there are also null alignments, head-swapping, and sibling configurations.
- So the cross-lingual factor classifies dependency relations between the two parses (monotonic, head-swapping, siblings, null), plus "none of the above".
- Typical test data: only the sentence is observed (no translation).
- Training data we might have: a small supervised training set (treebank); a moderate treebank in the other language; maybe a few gold alignments; and lots of raw bitext.
- Given bitext, try to impute the other variables. Now we have more constraints on the parse, which should help us train the parser.
- We'll see how belief propagation naturally handles this.
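One way to write down the joint model just sketched. This is a hedged illustration, not the authors' published parameterization; the factor names and their exact arguments are chosen here for readability:

    p(t_1, t_2, a \mid s_1, s_2) \;\propto\;
        \Phi_{\mathrm{tree}}(s_1, t_1)\;
        \Phi_{\mathrm{tree}'}(s_2, t_2)\;
        \Phi_{\mathrm{align}}(s_1, s_2, a)\;
        \Phi_{\mathrm{proj}}(t_1, t_2, a)

Here s_1 is the sentence, t_1 its parse, s_2 the translation, t_2 its parse, and a the word alignment. At training time, marginalizing over t_2 and a lets the well-resourced side constrain t_1, as argued above; at test time, when no translation is observed, the cross-lingual factors can be marginalized out or simply dropped.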
English does help us impute the Chinese parse
- Example:
      中国 在 基本 建设 方面 , 开始 利用 国际 金融 组织 的 贷款 进行 国际性 竞争性 招标 采购
      gloss: China / in / infrastructure / construction / area / , / has begun / to utilize / international / financial / organizations / 's / loans / to implement / international / competitive / bidding / procurement
      POS: N P J N N , V V N N N 's N V J N N N
      "In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement."
- Seeing the noisy output of an English WSJ parser fixes certain Chinese links. The corresponding bad versions, found without seeing the English parse, have complement verbs that swap objects and a subject that attaches to an intervening noun.
- Which does help us train a monolingual Chinese parser.

(Could add a 3rd language ...)
[Diagram: the factor graph grows another copy: translation', its alignment, and the parse of translation']

(Could add world knowledge ...)

(Could add a bilingual dictionary ...)
- Since the dictionary is incomplete, treat it as a partially observed variable.

Dynamic Markov Random Field
[The German-English example factor graph again: sentence, parse, alignment, translation, parse of translation]
- Note: these are structured variables. Each is expanded into a collection of fine-grained variables (words, dependency links, alignment links, ...).
- Thus the number of fine-grained variables and factors varies by example (but all examples share a single finite parameter vector).

Back to our running examples
- sentence -> parse (with David A. Smith)
- lemma -> morphological paradigm (with Markus Dreyer)

Morphological paradigm (with Markus Dreyer)
- A paradigm for a lemma "xyz": the infinitive plus 1st/2nd/3rd person, singular and plural, in present and past.
- Filled in for werfen 'to throw':

               Present    Past
      inf      werfen
      1st Sg   werfe      warf
      2nd Sg   wirfst     warfst
      3rd Sg   wirft      warf
      1st Pl   werfen     warfen
      2nd Pl   werft      warft
      3rd Pl   werfen     warfen

Morphological paradigm as an MRF
- Each factor is a sophisticated weighted FST relating two forms.

Number of observations per form (fine-grained semi-supervision):

               Present    Past
      inf       9,393
      1st Sg      285     1124
      2nd Sg      166        4   (rare!)
      3rd Sg     1410     1124
      1st Pl     1688      673
      2nd Pl     1275        9   (rare!)
      3rd Pl     1688      673

- Some cells are undertrained, and the 2nd-person past cells are downright rare.
- Question: does joint inference help?

gelten 'to hold, to apply'

               Present    Past
      inf      gelten
      1st Sg   gelte      galt
      2nd Sg   giltst     galtst
      3rd Sg   gilt       galt
      1st Pl   gelten     galten
      2nd Pl   geltet     galtet   or: galtest

abbrechen 'to quit'

               Present                       Past
      inf      abbrechen
      1st Sg   abbreche   or: breche ab      abbrach    or: brach ab
      2nd Sg   abbrichst  or: brichst ab     abbrachst  or: brachst ab
      3rd Sg   abbricht   or: bricht ab      abbrach    or: brach ab
      1st Pl   abbrechen  or: brechen ab     abbrachen  or: brachen ab
      2nd Pl   abbrecht   or: brecht ab      abbracht   or: bracht ab
      3rd Pl   abbrechen                     abbrachen

gackern 'to cackle'

               Present    Past
      inf      gackern
      1st Sg   gackere    gackerte
      2nd Sg   gackerst   gackertest
      3rd Sg   gackert    gackerte
      1st Pl   gackern    gackerten
      2nd Pl   gackert    gackertet

werfen 'to throw' (as shown above)

Preliminary results
- Joint inference helps a lot on the rare forms.
- It hurts on the others. Can we fix that?
- (Is it because our joint decoder is approximate? Or because semi-supervised training is hard and we need a better method for it?)
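As a toy illustration of the paradigm MRF above (not the actual Dreyer & Eisner model), the sketch below jointly scores two missing string-valued cells with a stand-in pairwise potential, exp(-edit distance), where the real model uses trained weighted FSTs; the candidate lists and the choice of which cells are connected by factors are made up for the example.

    import math
    from itertools import product

    def edit_distance(a, b):
        """Plain Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def phi(s, t):
        """Stand-in pairwise potential; the real factor is a trained weighted FST."""
        return math.exp(-edit_distance(s, t))

    # Observed cell and hypothetical candidate sets for two missing cells of "werfen".
    inf_form = "werfen"                       # observed: the infinitive
    cands_3sg = ["wirft", "werft", "werfte"]  # hypothetical candidates, 3rd Sg present
    cands_2pl = ["werft", "wirft", "werfet"]  # hypothetical candidates, 2nd Pl present

    # Joint score = product of the factors touching these three cells
    # (inf--3sg, inf--2pl, 3sg--2pl), mirroring p(cells) proportional to a product of Φ's.
    best = max(product(cands_3sg, cands_2pl),
               key=lambda st: phi(inf_form, st[0]) * phi(inf_form, st[1]) * phi(*st))
    print(best)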
Outline (recap: now part 2)
1. Why use joint models in NLP?
2. Making big joint models tractable: approximate inference and training by loopy belief propagation
3. Open questions: semi-supervised training of joint models

Key idea!
- We're using an MRF to coordinate the solutions to several NLP problems.
- Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently over many fine-grained variables (individual words, tags, links).
- Within a factor, use existing fast exact NLP algorithms. These are the "propagators" that compute the outgoing messages.
- This works even though the product of the factors may be intractable, or even undecidable, to work with.

Why we need approximate inference
- MRFs are great for n-way classification (maxent).
- They're also good for predicting sequences ("find preferred tags"), but alas, the forward-backward algorithm only allows n-gram features.
- They're also good for dependency parsing ("find preferred links"), but alas, our combinatorial algorithms only allow single-edge features; more interactions slow them down or introduce NP-hardness.

Great ideas in ML: message passing (adapted from MacKay's 2003 textbook)
- Counting soldiers in a line: there's 1 of me; pass "1 before you", "2 before you", ... forward and "5 behind you", "4 behind you", ... backward.
- Belief: a soldier who hears "2 before you" and "3 behind you" concludes there must be 2 + 1 + 3 = 6 of us; another who hears "1 before you" and "4 behind you" concludes 1 + 1 + 4 = 6 of us. Each soldier only sees its incoming messages.
- The same idea works on a tree: each soldier receives reports from all branches, e.g., "3 here", "7 here" (= 3 + 3 + 1), "11 here" (= 7 + 3 + 1), and concludes: there must be 14 of us.

Great ideas in ML: forward-backward
- In the CRF, message passing = forward-backward: the α and β messages arriving at a tag variable, together with its unary factor, determine the belief at that variable.
[Figure: the "find preferred tags" chain annotated with α and β message vectors over {v, n, a}]
- Extend the CRF to a "skip chain" to capture a non-local factor: more influences on the belief.
- But now the incoming messages are not independent, because the graph has become loopy. Pretend they are independent anyway!
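For reference, the standard sum-product updates that generalize both the soldier counting and the forward-backward α/β messages above (these are textbook equations, not specific to any one model on these slides):

    \mu_{V \to F}(v) \;=\; \prod_{F' \ni V,\, F' \neq F} \mu_{F' \to V}(v)

    \mu_{F \to V}(v) \;=\; \sum_{A} F(v, A) \prod_{V' \in F,\, V' \neq V} \mu_{V' \to F}(A_{V'})
        \qquad \text{($A$ ranges over assignments to $F$'s other variables)}

    b_V(v) \;\propto\; \prod_{F \ni V} \mu_{F \to V}(v)

On a chain CRF the factor-to-variable messages are exactly the forward and backward vectors; on a loopy graph one simply iterates these updates, pretending the incoming messages are independent.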
MRF over string-valued variables!
- In the paradigm MRF, each factor is a sophisticated weighted FST. So what are the messages?
- They are probability distributions over strings, represented by weighted FSAs.
- They are constructed by finite-state operations, and the parameters are trainable using finite-state methods.
- Warning: the FSAs can get larger and larger; we must prune them back using k-best lists or a variational approximation.

Key idea (again)!
- We're using an MRF to coordinate the solutions to several NLP problems.
- Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently over many fine-grained variables (individual words, tags, links).
- Within a factor, use existing fast exact NLP algorithms; these are the "propagators" that compute outgoing messages, even though the product of the factors may be intractable or even undecidable to work with.
- We just saw this for morphology; now let's see it for parsing.

Local factors in a graphical model
- Back to simple variables: we had a CRF for POS tagging ("... v a n ..."); now let's do dependency parsing ("... find preferred links ...").
- Use O(n^2) boolean variables, one for each possible link.
- A possible parse is encoded as an assignment (t/f) to these variables. So is another possible parse. So is an illegal parse (a cycle), or another illegal parse (a word with multiple parents).

Local factors for parsing
- So what factors shall we multiply to define the parse probability?
- Unary factors evaluate each link in isolation, e.g., t 2 / f 1 for one link but t 1 / f 8 for another. As before, the goodness of a link can depend on the entire observed input context, and some other links just aren't as good given this input sentence.
- But what if the best assignment isn't a tree?

Global factors for parsing
- Add a global TREE factor requiring that the links form a legal tree. This is a "hard constraint": the factor's value is either 0 or 1. Over the 6 link variables of the toy example it is a table of 64 entries, e.g., ffffff -> 0, ffffft -> 0, fffftf -> 0, ..., fftfft -> 1 ("we're legal!"), ..., tttttt -> 0.
- Optionally, require the tree to be projective (no crossing links).
- So far, this is equivalent to edge-factored parsing (McDonald et al. 2005).
- Note: McDonald et al. (2005) don't loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we!
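To see the TREE hard constraint and edge-factored scoring in one place, here is a tiny brute-force sketch over a 3-word sentence with made-up link scores. Real parsers never enumerate like this, as the slide notes; the sketch is only for checking small cases.

    from itertools import product

    # Hypothetical edge-factored scores: score[(h, m)] = goodness of the link "h heads m".
    # Word 0 is an artificial root; words 1..3 stand for "find", "preferred", "links".
    n = 3
    score = {(0, 1): 2.0, (0, 2): 0.5, (0, 3): 0.7,
             (1, 2): 1.5, (1, 3): 2.0, (2, 1): 0.8,
             (2, 3): 1.2, (3, 1): 0.3, (3, 2): 1.0}

    def tree_factor(heads):
        """Hard 0/1 TREE constraint: every word has one head and can reach the root."""
        for m in range(1, n + 1):
            seen, h = set(), m
            while h != 0:
                if h in seen:          # cycle detected
                    return 0.0
                seen.add(h)
                h = heads[h]
        return 1.0

    Z, best = 0.0, None
    for choice in product(range(0, n + 1), repeat=n):   # heads[m] = choice[m-1]
        heads = {m: choice[m - 1] for m in range(1, n + 1)}
        if any(heads[m] == m for m in heads):            # no self-loops
            continue
        w = tree_factor(heads)
        for m in heads:
            w *= score[(heads[m], m)]
        Z += w
        if best is None or w > best[0]:
            best = (w, heads)

    print("best parse:", best, "  p(best) =", best[0] / Z)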
(2005) don’t loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we!106 Local factors for parsing So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables grandparent t f t f 1 1 t 1 3 t … find preferred links … 107 Local factors for parsing So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables grandparent no-cross t f t f 1 1 t 1 0.2 t … find preferred links by … 108 Local factors for parsing So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree Second-order effects: factors on 2 variables … this is a “hard constraint”: factor is either 0 or 1 grandparent no-cross coordination with other parse & alignment hidden POS tags siblings subcategorization … find preferred links by … 109 Exactly Finding the Best Parse …find preferred links… but to allow fast dynamic programming or MST parsing, only use single-edge features With arbitrary features, runtime blows up Projective parsing: O(n3) by dynamic programming grandparents grandp. + sibling bigrams POS trigrams sibling pairs (non-adjacent) O(n5) O(n4) O(n3g6) … O(2n) Non-projective: O(n2) by minimum spanning tree NP-hard • any of the above features • soft penalties for crossing links • pretty much anything else! 110 Two great tastes that taste great together You got belief propagation in my dynamic programming! You got dynamic programming in my belief propagation! 111 Loopy Belief Propagation for Parsing Sentence tells word 3, “Please be a verb” Word 3 tells the 3 7 link, “Sorry, then you probably don’t exist” The 3 7 link tells the Tree factor, “You’ll have to find another parent for 7” The tree factor tells the 10 7 link, “You’re on!” The 10 7 link tells 10, “Could you please be a noun?” … … find preferred links … 113 Loopy Belief Propagation for Parsing Higher-order factors (e.g., Grandparent) induce loops … Let’s watch a loop around one triangle … Strong links are suppressing or promoting other links … find preferred links … 114 Loopy Belief Propagation for Parsing Higher-order factors (e.g., Grandparent) induce loops Let’s watch a loop around one triangle … How did we compute outgoing message to green link? “Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?” ? TREE factor … find ffffff 0 ffffft 0 fffftf 0 … … fftfft 1 … … preferred tttttt 0 links … 115 Loopy Belief Propagation for Parsing How did we compute outgoing message to green link? “Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?” TREE factor But this is the outside probability of green link! ffffff 0 ffffft 0 TREE factor computes all outgoing messages at once fffftf 0 (given all incoming messages) … … fftfft 1 … … Projective case: total O(n3) time by inside-outside tttttt 0 Non-projective: total O(n3) time by inverting Kirchhoff ? 
- Belief propagation assumes the incoming messages to TREE are independent. So the outgoing messages can be computed with first-order parsing algorithms (fast, with no grammar constant).

Some interesting connections ...
- Parser stacking (Nivre & McDonald 2008; Martins et al. 2008).
- Global constraints in arc consistency, e.g., the ALLDIFFERENT constraint (Régin 1994).
- The matching constraint in max-product BP, used for computer vision (Duchi et al., 2006); it could also be used for machine translation.
- As far as we know, our parser is the first use of global constraints in sum-product BP, and nearly the first use of BP in natural language processing.

Runtimes for each factor type, per iteration (see paper):

      Factor type        degree   runtime    count    total
      Tree               O(n^2)   O(n^3)     1        O(n^3)
      Proj. Tree         O(n^2)   O(n^3)     1        O(n^3)
      Individual links   1        O(1)       O(n^2)   O(n^2)
      Grandparent        2        O(1)       O(n^3)   O(n^3)
      Sibling pairs      2        O(1)       O(n^3)   O(n^3)
      Sibling bigrams    O(n)     O(n^2)     O(n)     O(n^3)
      NoCross            O(n)     O(n)       O(n^2)   O(n^3)
      Tag                1        O(g)       O(n)     O(n)
      TagLink            3        O(g^2)     O(n^2)   O(n^2)
      TagTrigram         O(n)     O(n g^3)   1        O(n)
      TOTAL                                           O(n^3) per iteration

- The total is additive, not multiplicative!
- Each "global" factor coordinates an unbounded number of variables, so standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.

Dependency accuracy: the extra, higher-order features help! (non-projective parsing)

                                                 Danish   Dutch   English
      Tree + Link                                85.5     87.3    88.6
        + NoCross                                86.1     88.3    89.1
        + Grandparent                            86.1     88.6    89.4
        + ChildSeq                               86.5     88.5    90.1
      Best projective parse, all factors
        (exact, slow)                            86.0     84.5    90.2
        + hill-climbing                          86.1     87.6    90.2

- Starting from the best projective parse, hill-climbing doesn't fix enough edges.

Time vs. projective search error
[Figure: projective search error vs. time, for increasing numbers of BP iterations (up to 140), compared with O(n^4) DP and with O(n^5) DP]
Summary of MRF parsing by BP
- The output probability is defined as a product of local and global factors (a log-linear model). Throw in any factors we want!
- Let the local factors negotiate via "belief propagation". Each factor must be fast, but the factors run independently.
- Each bit of syntactic structure is influenced by the others.
- Some factors need combinatorial algorithms to compute their messages fast, e.g., existing parsing algorithms using dynamic programming (compare reranking or stacking).
- Each iteration takes total time O(n^3), or even O(n^2); see the paper.
- It converges to a pretty good (but approximate) global parse.
- The result: fast parsing for formerly intractable or slow models, and the extra features of these models really do help accuracy.

Outline (recap: now part 3)
1. Why use joint models in NLP?
2. Making big joint models tractable: approximate inference and training by loopy belief propagation
3. Open questions: semi-supervised training of joint models

Training with missing data is hard!
- Semi-supervised learning of HMMs or PCFGs: ouch! A stronger model helps (McClosky et al. 2007; Cohen et al. 2009).
- Merialdo: just stick with the small supervised training set; adding unsupervised data tends to hurt.
- So maybe there is some hope from good models at the factors, and from having lots of factors (i.e., take cues from lots of correlated variables at once; cf. Yarowsky et al.).
- Naïve Bayes would be okay: variables with unknown values can't hurt you, since they have no influence on training or decoding. But they can't help you, either! And the independence assumptions are flaky.
- So I'd like to keep discussing joint models.

Case #1: missing data that you can't impute
[The parser-projection factor graph: sentence, parse, word-to-word alignment, translation, parse of translation]
- Treat it like multi-task learning?
- Shared features between two tasks: parse Chinese vs. parse Chinese with an English translation.
- Or three tasks: parse Chinese with an inferred English gist, vs. parse Chinese with an English translation, vs. parse an English gist derived from English (supervised).

Case #2: missing data you can impute, but maybe badly
[The morphological-paradigm factor graph again; each factor is a sophisticated weighted FST]
- This is where simple cases of EM go wrong.
- We could reduce to case #1 and throw away these variables.
- Or: damp the messages from imputed variables to the extent that you're not confident in them.
- That requires confidence estimation (cf. strapping). Crude versions: confidence depends in a fixed way on time, or on the entropy of the belief at that node, or on the length of the input sentence. But we could train a confidence estimator on supervised data to pay attention to all sorts of things!
- Correspondingly, scale up the features for the related missing-data tasks, since the damped data are "partially missing".
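A minimal sketch of the damping idea above, under one assumed damping scheme (raising each incoming message to a confidence-weighted power, which interpolates toward ignoring it). The confidences and messages are made up, and the slides leave the confidence estimator itself to be learned from supervised data.

    def damp(message, confidence):
        """Flatten a message toward uniform: confidence 1 keeps it, 0 ignores it."""
        flattened = {v: p ** confidence for v, p in message.items()}
        z = sum(flattened.values())
        return {v: p / z for v, p in flattened.items()}

    def belief(prior, messages_with_conf):
        """Combine a local prior with damped incoming messages, then renormalize."""
        b = dict(prior)
        for msg, conf in messages_with_conf:
            d = damp(msg, conf)
            for v in b:
                b[v] *= d[v]
        z = sum(b.values())
        return {v: p / z for v, p in b.items()}

    # Toy example: a binary decision with one trusted neighbor and one imputed,
    # low-confidence neighbor whose vote is mostly damped away.
    prior = {"t": 0.5, "f": 0.5}
    msgs = [({"t": 0.9, "f": 0.1}, 1.0),   # trusted message
            ({"t": 0.1, "f": 0.9}, 0.2)]   # message from a badly imputed variable
    print(belief(prior, msgs))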