Query-Specific Learning and Inference for Probabilistic Graphical Models
Anton Chechetka
Thesis committee: Carlos Guestrin, Eric Xing, J. Andrew Bagnell, Pedro Domingos (University of Washington)
14 June 2011, Carnegie Mellon

Motivation
Fundamental problem: reason accurately about noisy, high-dimensional data with local interactions.

Sensor networks
• noisy: sensors fail, and readings themselves are noisy
• high-dimensional: many sensors, several measurements (temperature, humidity, ...) per sensor
• local interactions: nearby locations have highly correlated readings

Hypertext classification
• noisy: automated text understanding is far from perfect
• high-dimensional: a variable for every webpage
• local interactions: directly linked pages have correlated topics

Image segmentation
• noisy: local information is not enough (camera sensor noise, compression artifacts)
• high-dimensional: a variable for every patch
• local interactions: cows appear next to grass, airplanes next to sky

Probabilistic graphical models
• noisy → probabilistic inference over the query given the evidence: P(Q | E) = P(Q, E) / P(E)
• high-dimensional data with local interactions → a graph that encodes only the direct interactions among many variables

Graphical models semantics
• Factorized distributions: P(X) = (1/Z) ∏_{α ∈ F} f_α(X_α)
• The scopes X_α are small subsets of X, so the representation is compact
• The graph structure records which variables interact directly; small variable sets such as {X3, X4, X5} act as separators

Graphical models workflow
• Learn or construct the structure
• Learn or define the parameters of the factorized distribution P(X) = (1/Z) ∏_α f_α(X_α)
• Run inference to answer queries P(Q | E = E)

Graphical models: fundamental problems
• Learning/constructing structure: NP-complete
• Learning/defining parameters: exp(|X|) in general
• Inference: #P-complete exactly, NP-complete even to approximate
• Errors compound on the way to P(Q | E = E)

Domain-knowledge structures don't help
Structures based on domain knowledge alone (e.g. the web link graph for webpages) do not support tractable inference.

This thesis: general directions
New algorithms for learning and inference in graphical models that make answering queries better, emphasizing the computational aspects of the graph.
• Learn accurate and tractable models: compensate for the reduced expressive power with exact inference and optimal parameters; gain significant speedups.
• Speed up inference via better prioritization of computation: estimate the long-term effect of propagating information through the graph, and use those estimates to prioritize updates.

Thesis contributions
• Learn accurate and tractable models: in the generative setting P(Q, E) [NIPS 2007] and in the discriminative setting P(Q | E) [NIPS 2010]
• Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Generative learning
• Query goal: P(Q | E) = P(Q, E) / P(E); learning goal: P(Q, E)
• Useful when E is not known in advance: sensors fail unpredictably; measurements are expensive (e.g. user time), so we want adaptive evidence selection

Tractable vs. intractable models workflow
• Tractable models: learn a simple tractable structure from domain knowledge + data; optimal parameters and exact inference; the answer P(Q | E = E) is approximate only because of the model's limited expressive power
• Intractable models: construct an intractable structure from domain knowledge, or learn one from data; approximate inference algorithms give no quality guarantees for P(Q | E = E)
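To make the factorized-distribution semantics above concrete, here is a minimal sketch (not from the thesis) of a toy factorized model: factors over small variable subsets, a brute-force partition function, and the resulting normalized probability. All variable names and factor values are made up.

```python
import itertools

# A tiny factorized distribution P(X) = (1/Z) * prod_f f(X_f),
# where every factor touches only a small subset of the variables.
variables = ["X1", "X2", "X3"]          # all binary here
factors = {
    ("X1", "X2"): {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},
    ("X2", "X3"): {(0, 0): 1.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 1.5},
}

def unnormalized(assignment):
    """Product of factor values for a full assignment {variable: value}."""
    p = 1.0
    for scope, table in factors.items():
        p *= table[tuple(assignment[v] for v in scope)]
    return p

# Brute-force partition function Z -- exponential in |X|, which is exactly
# why low-treewidth structure matters for anything beyond toy models.
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in itertools.product([0, 1], repeat=len(variables)))

def prob(assignment):
    return unnormalized(assignment) / Z

print(prob({"X1": 1, "X2": 1, "X3": 1}))
```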
Tractability via low treewidth
• Treewidth: one less than the size of the largest clique in a triangulated graph
• Exact inference (sum-product) is exponential in the treewidth
• Treewidth is NP-complete to compute in general, but low-treewidth graphs are easy to construct
• Convenient representation: junction tree (other tractable model classes exist too)

Junction trees
• Cliques connected by edges labeled with separators (e.g. cliques {X1,X4,X5} and {X1,X2,X5} sharing separator {X1,X5}), satisfying the running intersection property
• Finding the most likely junction tree of a given treewidth > 1 is NP-complete
• We will look for good approximations

Independencies in low-treewidth distributions
• If P(X) factorizes according to a junction tree (C, S), then P(X) = ∏_{C ∈ C} P(X_C) / ∏_{S ∈ S} P(X_S), and the corresponding conditional independencies hold: I(X_α, X_β | S) = 0 (conditional mutual information) for the variable sets X_α, X_β on the two sides of every separator S
• It works in the other direction too: KL(P ‖ P_(C,S)) ≤ Σ_{S ∈ S} I(X_α(S), X_β(S) | S)

Constraint-based structure learning
• Look for junction trees where every separator's conditional mutual information is small (constraint-based structure learning)
• Enumerate all candidate separators S1 = {X1, X2}, S2 = {X1, X3}, S3 = {X1, X4}, ..., Sm = {Xn−1, Xn}
• For each candidate separator, partition the remaining variables into weakly dependent subsets
• Find a junction tree consistent with those partitions

Mutual information complexity
• I(X_α, X_¬α | S) = H(X_α | S) − H(X_α | X_¬α, S), where X_¬α is everything except X_α (a difference of conditional entropies)
• It depends on all assignments to X, so computing it exactly takes exp(|X|) time in general
• Our contribution: a polynomial-time upper bound

Mutual info upper bound: intuition
• Computing I(A, B | C) directly is hard
• Only look at small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ k: there are polynomially many of them, and each takes only polynomial time
• Can we conclude anything about I(A, B | C) from them? In general, no; if a good junction tree exists, yes

Contribution: mutual info upper bound
• Suppose an ε-junction tree of treewidth k for P(A, B, C) exists, i.e. I(X_α, X_β | S) ≤ ε for every separator S
• Theorem: let δ = max I(D, F | C) over D ⊆ A, F ⊆ B with |D ∪ F| ≤ k + 1. Then I(A, B | C) ≤ |A ∪ B ∪ C| (ε + δ)

Mutual info upper bound: complexity
• Direct computation: exp(|A ∪ B ∪ C|) time
• Our upper bound: O(|A ∪ B|^(treewidth+1)) small subsets, each taking exp(|C| + treewidth) time
• With |C| = treewidth, as during structure learning, the total complexity is polynomial in |A ∪ B ∪ C|
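A minimal sketch of the quantity behind the bound above: the plug-in empirical conditional mutual information of small subsets, and the maximum δ over subsets of size at most k + 1. This is illustrative only (no smoothing, no confidence intervals); the function and variable names are my own, not the thesis code.

```python
import numpy as np
from collections import Counter
from itertools import combinations
from math import log

def empirical_cmi(data, A, B, C):
    """Plug-in estimate of I(A; B | C) from samples.

    data: (n_samples, n_vars) integer array; A, B, C: lists of column indices.
    Cost is exponential in |A|+|B|+|C|, so it is only ever called on small subsets.
    """
    def key(row, idx):
        return tuple(row[i] for i in idx)
    n = len(data)
    c_abc, c_ac, c_bc, c_c = Counter(), Counter(), Counter(), Counter()
    for row in data:
        c_abc[key(row, A) + key(row, B) + key(row, C)] += 1
        c_ac[key(row, A) + key(row, C)] += 1
        c_bc[key(row, B) + key(row, C)] += 1
        c_c[key(row, C)] += 1
    mi = 0.0
    for row in data:
        a, b, c = key(row, A), key(row, B), key(row, C)
        mi += (1.0 / n) * log((c_abc[a + b + c] * c_c[c]) /
                              (c_ac[a + c] * c_bc[b + c]))
    return mi

def small_subset_bound(data, A, B, C, k):
    """delta = max over D ⊆ A, F ⊆ B with |D|+|F| <= k+1 of I(D; F | C)."""
    delta = 0.0
    for d_size in range(1, k + 1):
        for f_size in range(1, k + 2 - d_size):
            for D in combinations(A, d_size):
                for F in combinations(B, f_size):
                    delta = max(delta, empirical_cmi(data, list(D), list(F), C))
    return delta

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 6))
print(small_subset_bound(X, A=[0, 1], B=[2, 3], C=[4], k=1))
```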
Guarantees on learned model quality
• Theorem: suppose a strongly connected ε-junction tree of treewidth k for P(X) exists. Then, with probability at least 1 − γ, our algorithm finds a junction tree such that KL(P ‖ P_JT) ≤ (k + 1) |X| (ε + 2δ), using a polynomial number of samples (O(log(|X|/γ) / ε²) per mutual-information estimate) and polynomial time (roughly O(|X|^(2k+3) log(1/γ) / ε²))
• Corollary: strongly connected junction trees are PAC-learnable

Related work
Reference                    Model                          Guarantees     Time
[Bach+Jordan:2002]           tractable                      local          poly(n)
[Chow+Liu:1968]              tree                           global         O(n² log n)
[Meila+Jordan:2001]          tree mixture                   local          O(n² log n)
[Teyssier+Koller:2005]       compact                        local          poly(n)
[Singh+Moore:2005]           all                            global         exp(n)
[Karger+Srebro:2001]         tractable                      const-factor   poly(n)
[Abbeel+al:2006]             compact                        PAC            poly(n)
[Narasimhan+Bilmes:2004]     tractable                      PAC            exp(n)
our work                     tractable                      PAC            poly(n)
[Gogate+al:2010]             tractable with high treewidth  PAC            poly(n)

Results – typical convergence
(plot: test log-likelihood vs. training time) Good results are obtained early on in practice.

Results – log-likelihood
(plot: test log-likelihood across datasets) Baselines: OBS = local search over limited in-degree Bayes nets; Chow-Liu = most likely junction trees of treewidth 1; Karger-Srebro = constant-factor approximation junction trees; our method.

Conclusions
• A tractable upper bound on conditional mutual information
• Graceful quality degradation and PAC-learnability guarantees
• Analysis of when dynamic programming works [in the thesis]
• Dealing with an unknown mutual information threshold [in the thesis]
• Speedups that preserve the guarantees, plus further speedups without guarantees

Thesis contributions (outline)
• Learn accurate and tractable models: generative P(Q, E) [NIPS 2007]; discriminative P(Q | E) [NIPS 2010]
• Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Discriminative learning
• Query goal: P(Q | E) = P(Q, E) / P(E); learning goal: P(Q | E) directly
• Useful when the evidence variables E are always the same: non-adaptive, one-shot observation (image pixels → scene description; document text → topic, named entities)
• Better accuracy than generative models

Discriminative log-linear models
• P(Q | E, w) = (1/Z(E, w)) exp( Σ_α w_α f_α(Q_α, E) ): features f_α encode domain knowledge, weights w are learned from data, Z(E, w) is an evidence-dependent normalization
• No need to sum over all values of E, to model P(E), or to put any structure over E

Model tractability still important
• Observation #1: tractable models are necessary for exact inference and parameter learning in the discriminative setting
• Tractability is determined by the structure over the query variables

Simple local models: motivation
• Locally, the query is almost a simple function of the evidence, Q ≈ f(E)
• Exploiting the evidence values overcomes the expressive-power deficit of simple models
• We will learn local tractable models

Context-specific independence
• Observation #2: use the evidence values at test time to tune the structure of the model; do not commit to a single tractable model

Low-dimensional dependencies in generative structure learning
• Generative structure learning often relies only on low-dimensional marginals
• Junction trees have decomposable scores: LLH(C, S) = Σ_{S ∈ S} H(S) − Σ_{C ∈ C} H(C)
• Low-dimensional independence tests I(A, B | S); small changes to the structure allow quick score recomputation
• Discriminative structure learning, in contrast, needs inference in the full model for every datapoint, even for small changes in structure

Leverage generative learning
Observation #3: generative structure learning algorithms have very useful properties; can we leverage them?

Observations so far
• The discriminative setting has extra information, including evidence values at test time; we want to use it to learn local tractable models
• Good structure learning algorithms exist for the generative setting that only require low-dimensional marginals P(Q_β)
Approach:
1. use local conditionals P(Q_β | E = E) as "fake marginals" to learn local tractable structures (see the Chow-Liu sketch below)
2. learn exact discriminative feature weights
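As a concrete example of the kind of generative structure learner the approach plugs in, here is a minimal Chow-Liu-style sketch: given pairwise scores (mutual informations in the generative case, or the analogous scores computed from local conditionals P(Qi, Qj | E = E) in the evidence-specific case), return the maximum-weight spanning tree. The scores below are made up; a real implementation would estimate them from data.

```python
import numpy as np

def chow_liu_tree(mi):
    """Maximum-weight spanning tree over a symmetric matrix of pairwise scores
    (a hand-rolled Prim's algorithm; O(n^3) here, good enough for a sketch)."""
    n = mi.shape[0]
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or mi[i, j] > mi[best[0], best[1]]):
                    best = (i, j)
        edges.append(best)
        in_tree.add(best[1])
    return edges  # n-1 edges of the maximum-weight spanning tree

# Example with made-up scores for four query variables.
mi = np.array([[0.0, 0.9, 0.1, 0.2],
               [0.9, 0.0, 0.6, 0.1],
               [0.1, 0.6, 0.0, 0.7],
               [0.2, 0.1, 0.7, 0.0]])
print(chow_liu_tree(mi))   # e.g. [(0, 1), (1, 2), (2, 3)]
```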
Evidence-specific CRF overview
Approach: 1. use local conditionals P(Q_β | E = E) as "fake marginals" to learn local tractable structures; 2. learn exact discriminative feature weights.
Pipeline: local conditional density estimators P(Q_β | E) + evidence value E = E → generative structure learning algorithm → tractable structure for E = E; combined with feature weights w → a tractable evidence-specific CRF.

Evidence-specific CRF formalism
• Observation: a feature that is identically zero does not affect the model
• Add extra "structural" parameters u and an evidence-specific structure indicator I_α(E, u) ∈ {0, 1}:
  P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) · I_α(E, u) )
• (Fixed dense model) × (evidence-specific tree "mask") = evidence-specific model, a different tractable structure for every evidence value E = E1, E2, E3, ...

Evidence-specific CRF learning
Learning proceeds in the same order as testing: first the local conditional density estimators P(Q_β | E); for a given evidence value E = E, the generative structure learning algorithm produces a tractable structure for E = E; finally the feature weights w complete the tractable evidence-specific CRF.

Plug in generative structure learning
• In P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) · I_α(E, u) ), the mask I(E, u) encodes the output of the chosen structure learning algorithm
• This directly generalizes generative algorithms:
  generative: pairwise marginals P(Qi, Qj) + Chow-Liu algorithm = optimal tree
  discriminative: pairwise conditionals P(Qi, Qj | E = E) + Chow-Liu algorithm = a good tree for E = E

Evidence-specific CRF learning: structure
• Choose a generative structure learning algorithm A (e.g. Chow-Liu)
• Identify the low-dimensional subsets Q_β that A may need (for Chow-Liu: all pairs (Qi, Qj))
• Replace the original problem (E, Q) with low-dimensional pairwise problems (E, {Q1,Q2}), (E, {Q1,Q3}), (E, {Q3,Q4}), ...
• Estimate the conditionals P̂(Q1, Q2 | E, u12), P̂(Q1, Q3 | E, u13), P̂(Q3, Q4 | E, u34), ...

Estimating low-dimensional conditionals
• Use the same features as the baseline high-treewidth model
• Baseline CRF: P(Q | E, w) = (1/Z(E, w)) exp( Σ_α w_α f_α(Q_α, E) )
• Scope restriction gives the low-dimensional model: P(Q_β | E, u_β) = (1/Z_β(E, u_β)) exp( Σ_{α : Q_α ⊆ Q_β} u_α f_α(Q_α, E) )
• End result: optimal structural parameters u
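A minimal sketch of one such low-dimensional conditional estimator: a log-linear model over the four joint states of a binary query pair, conditioned on evidence features, trained by plain gradient descent on the negative conditional log-likelihood. The feature construction and training loop are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

# One low-dimensional conditional estimator P(Qi, Qj | E, u) over a binary pair.
STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def pair_conditional(u, phi_E):
    """u: (4, d) weights, phi_E: (d,) evidence features -> P over the 4 states."""
    scores = u @ phi_E
    scores -= scores.max()                    # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def nll_gradient(u, phi_E, q_pair):
    """Gradient of -log P(q_pair | E, u) for one datapoint:
    expected features minus observed features."""
    p = pair_conditional(u, phi_E)
    grad = np.outer(p, phi_E)                 # expected features
    grad[STATES.index(q_pair)] -= phi_E       # minus observed features
    return grad

# Toy training loop over (evidence features, observed pair value) examples.
rng = np.random.default_rng(0)
data = [(rng.normal(size=3), STATES[rng.integers(4)]) for _ in range(50)]
u = np.zeros((4, 3))
for _ in range(200):
    for phi_E, q in data:
        u -= 0.05 * nll_gradient(u, phi_E, q)
```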
Evidence-specific CRF learning: weights
• P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) · I_α(E, u) ), with "effective features" f_α · I_α
• The algorithm behind I(E, u) is already chosen and the parameters u are already learned; only the feature weights w remain
• log P(Q | E, w, u) is concave in w, so there is a unique global optimum

Evidence-specific CRF learning: weights (gradient)
• For each datapoint the distribution is tree-structured, so the gradient is exact:
  ∂ log P(Q | E, w, u) / ∂w_α = I_α(E, u) · ( f_α(Q_α, E) − E_{P(Q | E, w, u)}[ f_α(Q_α, E) ] )
• The overall gradient for the dense weight vector is the sum of these tree-masked, exactly computed per-datapoint gradients over E = E1, E2, E3, ...

Results – WebKB
• Task: text + links → webpage topic
• Baselines: ignore links (SVM), standard dense CRF (RMN), max-margin model (M3N); our work: ESS-CRF
• (plots: prediction error for SVM, RMN, ESS-CRF, M3N; training time for RMN, ESS-CRF, M3N)

Image segmentation – accuracy
• Task: local segment features + neighboring segments → type of object
• Baselines: ignore links (logistic regression), standard dense CRF; our work: ESS-CRF
• (plot: accuracy for logistic regression, dense CRF, ESS-CRF)

Image segmentation – time
• (plots, log scale: train time and test time for logistic regression, dense CRF, ESS-CRF)

Conclusions
• Using evidence values to tune the low-treewidth model structure compensates for the reduced expressive power
• Order-of-magnitude speedup at test time (sometimes at train time too)
• A general framework for plugging in existing generative structure learners
• Straightforward relational extension [in the thesis]

Thesis contributions (outline)
• Learn accurate and tractable models: generative P(Q, E) [NIPS 2007]; discriminative P(Q | E) [NIPS 2010]
• Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Why high-treewidth models?
• Sometimes a dense model expresses laws of nature (e.g. protein folding)
• Max-margin parameters don't (yet?) work well with evidence-specific structures

Query-specific inference problem
• Pairwise model P(X) ∝ ∏_{(i,j) ∈ E} f_ij(X_i, X_j); a few query variables, some evidence, and many variables that are not interesting in themselves
• Goal: use information about the query to speed up convergence of belief propagation for the query marginals

(Loopy) belief propagation
• Pass messages along edges
• Variable belief: P̃^(t)(x_i) ∝ ∏_{j : (i,j) ∈ E} m^(t)_{j→i}(x_i)
• Update rule: m^(t+1)_{j→i}(x_i) = Σ_{x_j} f_ij(x_i, x_j) ∏_{k : (k,j) ∈ E, k ≠ i} m^(t)_{k→j}(x_j)
• Result: all single-variable beliefs

(Loopy) belief propagation: scheduling
• Message dependencies are local, so there is freedom in scheduling the updates
• Round-robin schedule: fix a message order and apply updates in that order until convergence

Dynamic update prioritization
• A fixed update sequence is not the best option: large message changes are informative updates, small changes are wasted computation
• Dynamic update scheduling can speed up convergence: Tree-Reweighted BP [Wainwright et al., AISTATS 2003]; Residual BP [Elidan et al., UAI 2006] applies the largest change first

Residual BP [Elidan et al., UAI 2006]
• Update rule: m^(NEW)_{j→i}(x_i) = Σ_{x_j} f_ij(x_i, x_j) ∏_{k ≠ i} m^(OLD)_{k→j}(x_j)
• Pick the edge with the largest residual, max_{(j→i)} ‖m^(NEW)_{j→i} − m^(OLD)_{j→i}‖, and set m^(OLD)_{j→i} ← m^(NEW)_{j→i}
• More effort goes to the difficult parts of the model, but the query plays no role

Why edge importance weights?
• Residual BP updates the edge with the larger residual even when it has no influence on the query: wasted computation
• We would rather update an edge with a smaller residual that will influence the query in the future
• Residual BP maximizes the immediate residual reduction; our work maximizes an approximation of the eventual effect on P(query)
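For reference, a small sketch of the residual BP scheduling idea on a made-up pairwise model: recompute messages and always apply the one with the largest residual. A real implementation would keep residuals in a priority queue rather than rescanning all edges each step; the factor tables here are invented.

```python
import numpy as np

factors = {            # f_ij(x_i, x_j) as 2x2 tables over binary variables
    (0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
    (1, 2): np.array([[1.0, 3.0], [3.0, 1.0]]),
    (0, 2): np.array([[1.5, 1.0], [1.0, 1.5]]),
}
edges = [(i, j) for (i, j) in factors] + [(j, i) for (i, j) in factors]
msgs = {e: np.ones(2) / 2 for e in edges}

def table(j, i):
    """Factor table reindexed as [x_j, x_i]."""
    return factors[(j, i)] if (j, i) in factors else factors[(i, j)].T

def new_message(j, i):
    """m_{j->i}(x_i) = sum_{x_j} f(x_j, x_i) * prod_{k != i} m_{k->j}(x_j)."""
    incoming = np.ones(2)
    for (k, t) in edges:
        if t == j and k != i:
            incoming *= msgs[(k, t)]
    m = table(j, i).T @ incoming
    return m / m.sum()

# Residual BP: apply the largest-residual update first, until residuals vanish.
for _ in range(200):
    residuals = {e: np.abs(new_message(*e) - msgs[e]).max() for e in edges}
    j, i = max(residuals, key=residuals.get)
    if residuals[(j, i)] < 1e-6:
        break
    msgs[(j, i)] = new_message(j, i)

belief0 = np.ones(2)
for (k, t) in edges:
    if t == 0:
        belief0 *= msgs[(k, t)]
print(belief0 / belief0.sum())     # approximate marginal of variable 0
```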
Query-specific BP
• The update rule is unchanged; the only change is the priority: pick the edge maximizing ‖m^(NEW)_{j→i} − m^(OLD)_{j→i}‖ · A_{j→i}, where A_{j→i} is the edge importance
• Rest of the talk: defining and computing the edge importance

Edge importance: base case
• We want to approximate the eventual effect of an update on P(Q)
• Base case: an edge directly connected to the query. The change in the query belief is bounded by the change in the message, ‖P^(NEW)(Q) − P^(OLD)(Q)‖ ≤ ‖m^(NEW)_{j→i} − m^(OLD)_{j→i}‖, and the bound is tight, so A_{j→i} = 1

Edge importance: one step away
• For an edge one step away from the query: ‖ΔP(Q)‖ ≤ ‖Δm_{j→i}‖ ≤ sup ‖∂m_{j→i} / ∂m_{r→j}‖ · ‖Δm_{r→j}‖, the supremum taken over the values of all other messages
• So A_{r→j} = sup ‖∂m_{j→i} / ∂m_{r→j}‖, the message importance, computable in closed form from the factor f_ij alone [Mooij, Kappen 2007]

Edge importance: general case
• Generalizing directly via sup ‖∂P(Q) / ∂m_{s→h}‖ is expensive to compute and the bound may be infinite
• Instead, define the sensitivity of a path as the product of per-edge supremums, e.g. sup ‖∂m_{h→r}/∂m_{s→h}‖ · sup ‖∂m_{r→j}/∂m_{h→r}‖ · sup ‖∂m_{j→i}/∂m_{r→j}‖: the maximal impact along that path

Edge importance: general case (continued)
• A_{s→h} = max over all paths from h to the query of sensitivity(path)
• There are a lot of paths in a graph; trying every one is intractable

Efficient edge importance computation
• Sensitivity always decreases as the path grows, because every per-edge supremum is at most 1
• Therefore Dijkstra's (shortest-path) algorithm efficiently finds the max-sensitivity paths for every edge: the sensitivity decomposes into individual edge contributions

Query-specific BP (summary)
• Run Dijkstra's algorithm starting at the query to get the edge weights A_{j→i} = max over paths from i to the query of sensitivity(path)
• Pick the edge with the largest weighted residual ‖m^(NEW)_{j→i} − m^(OLD)_{j→i}‖ · A_{j→i} and update it
• More effort goes to the parts of the model that are both difficult and relevant; the prioritization takes into account not only the graphical structure but also the strength of the dependencies

Experiments – single query
• Easy model (sparse connectivity, weak interactions) and hard model (dense connectivity, strong interactions); our work vs. standard residual BP
• (plots: query-marginal error vs. time) Faster convergence, but the long initialization is still a problem

Anytime query-specific BP
• Query-specific BP: run Dijkstra's algorithm to completion, then start BP updates
• Anytime QSBP: interleave Dijkstra's expansions with BP updates; the BP update sequence stays the same!
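A sketch of the edge-importance computation described above: treat per-edge sensitivities (each in (0, 1]) as multiplicative path weights and run a Dijkstra-style best-first search from the query; the importance of edge j→i is then the best product of sensitivities along any path from i to the query. The sensitivity values below are made up; in the thesis they come from closed-form bounds on ∂m_out/∂m_in derived from the factors [Mooij, Kappen 2007].

```python
import heapq

sensitivity = {          # sup ||d m_out / d m_in|| for each directed edge (made up)
    ("r", "q"): 0.9, ("h", "r"): 0.8, ("s", "h"): 0.7,
    ("u", "s"): 0.6, ("h", "q"): 0.5,
}
QUERY = "q"

def edge_importances(sensitivity, query):
    # Importance of node n = best product of sensitivities along a path n -> query.
    # Since every sensitivity is <= 1, path values only shrink, so a max-product
    # Dijkstra expansion from the query is exact and efficient.
    node_imp = {query: 1.0}
    heap = [(-1.0, query)]
    incoming = {}
    for (j, i), s in sensitivity.items():
        incoming.setdefault(i, []).append((j, s))     # edges pointing into i
    while heap:
        neg_imp, i = heapq.heappop(heap)
        if -neg_imp < node_imp.get(i, 0.0):
            continue                                  # stale queue entry
        for (j, s) in incoming.get(i, []):
            cand = -neg_imp * s
            if cand > node_imp.get(j, 0.0):
                node_imp[j] = cand
                heapq.heappush(heap, (-cand, j))
    # Importance of edge j -> i is the importance of its head node i.
    return {(j, i): node_imp.get(i, 0.0) for (j, i) in sensitivity}

print(edge_importances(sensitivity, QUERY))
```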
Experiments – anytime QSBP
• Easy model (sparse connectivity, weak interactions) and hard model (dense connectivity, strong interactions); our work, our work + anytime, standard residual BP
• (plots: query-marginal error vs. time) Much shorter initialization

Experiments – multiquery
• Same easy and hard models; our work, our work + anytime, standard residual BP
• (plots: error vs. time across multiple queries)

Conclusions
• Weighting edges is a simple and effective way to improve prioritization
• We introduce a principled notion of edge importance based on both the structure and the parameters of the model
• Robust speedups in the query-specific setting: computation is not spent on nuisance variables unless it is needed for the query marginal
• Deferring BP initialization has a large impact

Thesis contributions (outline)
• Learn accurate and tractable models: generative P(Q, E) [NIPS 2007]; discriminative P(Q | E) [NIPS 2010]
• Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Future work
• More practical junction tree learning: SAT solvers to construct the structure, pruning heuristics, ...
• Evidence-specific learning: trade efficiency for accuracy; max-margin evidence-specific models; theory for evidence-specific structures
• Inference: beyond query-specific, better prioritization in general; beyond BP, query-specific Gibbs sampling?

Thesis conclusions
• Graphical models are a regularization technique for high-dimensional distributions
• Representation-based structure is well understood (conditional independencies), but structured computation is currently a "consequence" of representation, with major issues in tractability and approximation quality
• The logical next step is structured computation as a primary basis of regularization
• This thesis: computation-centric approaches have better efficiency and do not sacrifice accuracy

Thank you!
Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf

Mutual info upper bound: quality
• The bound: suppose an ε-junction tree exists and δ is the largest mutual information over small subsets; then I(A, B | C) ≤ |A ∪ B ∪ C| (ε + δ)
• No need to know the ε-junction tree, only that it exists; there is no required connection between C and the junction tree separators; C can be of any size, unrelated to the junction tree treewidth
• The bound is loose only when there is no hope of learning a good junction tree anyway

Typical graphical models workflow
• A reasonable but intractable structure from domain knowledge; learn/define the parameters
• The graph is primarily a representation tool; approximate inference algorithms give no quality guarantees for the answer P(Q | E = e)
Contributions – tractable models
• Generative setting [NIPS 2007]: a polynomial-time conditional mutual information upper bound; the first PAC-learning result for strongly connected junction trees; graceful degradation guarantees; speedup heuristics
• Discriminative setting [NIPS 2010]: a general framework for learning CRF structure that depends on the evidence values at test time; extensions to the relational setting; empirically, order-of-magnitude speedups with the same accuracy as high-treewidth models

Contributions – faster inference
• Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
• A framework of importance-weighted residual belief propagation, with a principled measure of the eventual impact of an edge update on the query belief
• Prioritize updates by their importance for the query instead of their absolute magnitude
• An anytime modification defers much of the initialization: initial inference results are available much sooner, convergence is often much faster, and the fixed points are the same as for the full model

Future work
Two main bottlenecks:
• Constructing junction trees given the mutual information values, especially with non-uniform treewidth and dependence strength: large-sample learnability guarantees for non-uniform treewidth; non-uniform treewidth as small-sample regularization; constraint satisfaction and SAT solvers; relaxing the strong connectivity requirement?
• Evaluating mutual information: we must look at 2k + 1 variables instead of k + 1, a large penalty; branch on features instead of sets of variables? [Gogate+al:2010]
• Speedups without guarantees: local search, greedy separator construction, ...

Log-linear parameter learning
• Conditional log-likelihood: LLH(D | w) = Σ_{(Q,E) ∈ D} log P(Q | E, w)
• Convex optimization: unique global maximum
• Gradient = observed features minus expected features: ∂ log P(Q | E, w) / ∂w_α = f_α(Q_α, E) − E_{P(Q | E, w)}[ f_α(Q_α, E) ]
• Needs inference for every evidence value E at every setting of w

Log-linear parameter learning (summary)
                      Tractable model          Intractable model
Generative (E = ∅)    closed form              approximate gradient-based (no guarantees)
Discriminative        exact gradient-based     approximate gradient-based (no guarantees)
• Tractable → intractable is the complexity "phase transition"
• Generative → discriminative is a "manageable" slowdown proportional to the number of datapoints: generative learning needs inference once per weights update, discriminative learning needs inference for every datapoint (Q, E) per weights update
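A minimal sketch of the conditional log-likelihood gradient above (observed features minus expected features), with the expectation computed by brute-force enumeration over the query, which is exactly what tractable structure makes feasible at scale. The feature function is a made-up example over two binary query variables.

```python
import numpy as np
from itertools import product

def feats(Q, E):
    """Hypothetical feature vector for two binary query variables and real evidence."""
    return np.array([Q[0] * E[0], Q[1] * E[1], float(Q[0] == Q[1])])

def cll_gradient(w, Q, E):
    """grad of log P(Q | E, w) for P(Q | E, w) ∝ exp(w · feats(Q, E)):
    observed features minus expected features under the model."""
    observed = feats(Q, E)
    scores = {q: w @ feats(q, E) for q in product([0, 1], repeat=len(Q))}
    m = max(scores.values())                         # log-sum-exp stabilization
    Z = sum(np.exp(s - m) for s in scores.values())
    expected = sum(np.exp(scores[q] - m) / Z * feats(q, E) for q in scores)
    return observed - expected

w = np.zeros(3)
print(cll_gradient(w, Q=(1, 0), E=(0.5, -1.0)))
```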
Plug in generative structure learning (details)
• In P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) · I_α(E, u) ), the mask I(E, u) encodes the output of the chosen structure learning algorithm
• Fixing the algorithm guarantees structures with the desired properties (e.g. treewidth): Chow-Liu for optimal trees; our thin junction tree learning from part 1; Karger-Srebro for high-quality low-diameter junction trees; local search; etc.
• Everywhere the algorithm expects P(Q_β), substitute the approximate conditionals P(Q_β | E = E, u)

Evidence-specific CRF learning: weights (details)
• The algorithm behind I(E, u) is known and u is already learned, so only w remains
• The evidence-specific structure I(E = E, u) can be found for every training datapoint (Q, E), and the structure it induces is always tractable, so the optimal w is learned exactly
• Tree-structured gradient: ∂ log P(Q | E, w, u) / ∂w_α = I_α(E, u) · ( f_α(Q_α, E) − E_{P(Q | E, w, u)}[ f_α(Q_α, E) ] )

Relational evidence-specific CRF
• Relational models: templated features + shared weights
• Example relation: webpage LinksTo webpage; learn a single weight w_LINK and copy it to every grounding

Relational evidence-specific CRF (structure training)
• Every grounding is a separate datapoint for structure training: use the propositional approach with shared weights
• The grounded model over x1, ..., x5 yields training datasets for the "structural" parameters u over all grounded pairs x1x2, x1x3, ..., x4x5

Future work
• Faster learning: pseudolikelihood is really fast, so we need to compete
• Larger treewidth: trade time for accuracy; theory on learning the "structural parameters" u
• Max-margin learning: inference is a basic step in max-margin learning too, so tractable models are useful beyond log-likelihood; optimizing the feature weights w given local trees is straightforward, but optimizing the structural parameters u for max-margin is hard; what is the right objective?
• Almost-tractable structures and other tractable model classes: make sure loops don't hurt too much

Query versus nuisance variables
• We may actually care about only a few variables: the topics of the webpages on the first page of search results; smart heating control: is anybody going to be at home in the next hour?; does the patient need immediate doctor attention?
• But the model may need many other variables to be accurate enough; we don't care about them per se, but we must look at them to get the query right
• Both query and nuisance variables are unobserved, and standard inference algorithms don't see a difference; we speed up inference by focusing on the query and looking at nuisance variables only to the extent needed to answer it

Our contributions
• Use weighted residuals to prioritize updates
• Define message weights that reflect the importance of each message to the query, and compute those importance weights efficiently
• Experiments: faster convergence on large relational models

Interleaving
• Dijkstra's algorithm expands the highest-weight edges first, so the importance A of any edge not yet expanded is at most the importance of the edge just expanded
• If M bounds the residual ‖m^(NEW)_{j→i} − m^(OLD)_{j→i}‖ over all edges, then the priority of every unexpanded edge is at most M times the importance of the last expanded edge
• As long as the best weighted residual among the already-expanded edges exceeds that bound, there is no need to expand further at this point

Deferring BP initialization
• Observation: Dijkstra's algorithm expands the most important edges first
• Do we really need to look at every low-importance edge before applying BP updates? No! We can use upper bounds on the priority instead.
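A tiny sketch of the relational "template plus shared weight" construction from the relational evidence-specific CRF slides above: one LinksTo template weight copied onto every grounded pair, with each grounding also serving as a separate datapoint for the structural parameters u. Page names and the weight value are made up.

```python
# One relational template (LinksTo) with a single shared weight, grounded out.
links = [("page_a", "page_b"), ("page_b", "page_c"), ("page_a", "page_c")]
w_LINK = 0.7   # one shared weight, learned once for the whole template

# Grounding: each linked pair becomes a pairwise factor carrying the same weight;
# in the evidence-specific setting each grounding is also a separate training
# example for the structural parameters u.
grounded_factors = [{"scope": (src, dst), "template": "LinksTo", "weight": w_LINK}
                    for (src, dst) in links]
for f in grounded_factors:
    print(f)
```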
Upper bounds in the priority queue
• Observation: for edges low in the priority queue, an upper bound on the priority is enough; the exact priority is only needed for the top element

Priority upper bound for not-yet-seen edges
• Expand several edges with Dijkstra's algorithm; for an expanded edge, (residual) × (importance weight) is the exact priority
• For all other edges, priority(e) = residual(e) × weight(e) ≤ ‖factor(e)‖ × weight(e′) for an already-expanded edge e′: a component-wise upper bound obtained without ever looking at the edge

Interleaving BP and Dijkstra's
• Start Dijkstra's expansion of the full model from the query
• While the exact priority of the best expanded edge exceeds the upper bound on the unexpanded edges, apply BP updates; when it falls below the bound, expand more edges with Dijkstra's
• The run alternates Dijkstra, BP updates, Dijkstra, BP updates, ...
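Finally, a toy sketch of that interleaving loop: BP updates run on already-expanded edges as long as their best exact weighted residual beats an upper bound on everything not yet expanded; otherwise the next-most-important edge is expanded. All numbers (residuals, importances, factor norms) are stand-ins, and the "BP update" is a dummy; only the control flow mirrors the scheme above.

```python
# Toy anytime interleaving of Dijkstra expansion and BP updates.
edges = ["e1", "e2", "e3", "e4", "e5"]
importance = {"e1": 1.0, "e2": 0.8, "e3": 0.5, "e4": 0.2, "e5": 0.1}   # Dijkstra order
factor_norm = {e: 1.0 for e in edges}          # residual(e) <= ||factor(e)||
residual = {e: 1.0 for e in edges}             # pretend every message starts off wrong

to_expand = sorted(edges, key=lambda e: -importance[e])   # what Dijkstra would produce lazily
expanded = []
last_importance = 1.0

def bp_update(e):
    residual[e] *= 0.5                         # stand-in for an actual message update

for step in range(30):
    best = max(expanded, key=lambda e: residual[e] * importance[e], default=None)
    # Upper bound on any unexpanded priority: max factor norm times the
    # importance of the last expanded edge (importances only decrease).
    bound = max((factor_norm[e] for e in to_expand), default=0.0) * last_importance
    if best is not None and residual[best] * importance[best] >= bound:
        bp_update(best)                        # BP on the expanded, query-relevant part
    elif to_expand:
        e = to_expand.pop(0)                   # expand the next-most-important edge
        expanded.append(e)
        last_importance = importance[e]
    elif best is not None:
        bp_update(best)
    else:
        break
```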