Introduction to Graphical Models for Data Mining Arindam Banerjee banerjee@cs.umn.edu Dept of Computer Science & Engineering University of Minnesota, Twin Cities 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining July 25, 2010 Introduction • Graphical Models – Brief Overview • Part I: Tree Structured Graphical Models – Exact Inference • Part II: Mixed Membership Models – Latent Dirichlet Allocation – Generalizations, Applications • Part III: Graphical Models for Matrix Analysis – Probabilistic Matrix Factorization – Probabilistic Co-clustering – Stochastic Block Models Graphical Models 2 Graphical Models: What and Why • Statistical Data Analysis – Build diagnostic/predictive models from data – Uncertainty quantification based on (minimal) assumptions • The I.I.D. assumption – Data is independently and identically distributed – Example: Words in a doc drawn i.i.d. from the dictionary • Graphical models – Assume (graphical) dependencies between (random) variables – Closer to reality, domain knowledge can be captured – Learning/inference is much more difficult Graphical Models 3 Flavors of Graphical Models • Basic nomenclature – Node = random variable, may be observed/hidden – Edge = statistical dependency • Two popular flavors: ‘Directed’ and ‘Undirected’ • Directed Graphs – A directed graph between random variables, causal dependencies – Example: Bayesian networks, Hidden Markov Models – Joint distribution is a product of P(child|parents) • Undirected Graphs – An undirected graph between random variables – Example: Markov/Conditional random fields – Joint distribution in terms of potential functions Graphical Models 4 Bayesian Networks [Figure: example network over X1,…,X5] • Joint distribution in terms of P(X|Parents(X)) Graphical Models 5 Example I: Burglary Network This and several other examples are from the Russell-Norvig AI book Graphical Models 6 Computing Probabilities of Events • Probability of any event can be computed: P(B,E,A,J,M) = P(B) P(E|B)
P(A|B,E) P(J|B,E,A) P(M|B,E,A,J) = P(B) P(E) P(A|B,E) P(J|A) P(M|A) • Example: P(b,¬e,a, ¬j,m) = P(b) P(¬e)P(a|b,¬e) P(¬j|a) P(m|a) Graphical Models 7 Example II: Rain Network Graphical Models 8 Example III: “Car Won’t Start” Diagnosis Graphical Models 9 Inference • Some variables in the Bayes net are observed – the evidence/data, e.g., John has not called, Mary has called • Inference – How to compute value/probability of other variables – Example: What is the probability of Burglary, i.e., P(b|¬j,m) Graphical Models 10 Inference Algorithms • Graphs without loops: Tree-structured Graphs – Efficient exact inference algorithms are possible – Sum-product algorithm, and its special cases • Belief propagation in Bayes nets • Forward-Backward algorithm in Hidden Markov Models (HMMs) • Graphs with loops – Junction tree algorithms • Convert into a graph without loops • May lead to exponentially large graph – Sum-product/message passing algorithm, ‘disregarding loops’ • Active research topic, correct convergence ‘not guaranteed’ • Works well in practice – Approximate inference Graphical Models 11 Approximate Inference • Variational Inference – Deterministic approximation – Approximate complex true distribution over latent variables – Replace with family of simple/tractable distributions • Use the best approximation in the family – Examples: Mean-field, Bethe, Kikuchi, Expectation Propagation • Stochastic Inference – Simple sampling approaches – Markov Chain Monte Carlo methods (MCMC) • Powerful family of methods – Gibbs sampling • Useful special case of MCMC methods Graphical Models 12 Part I: Tree Structured Graphical Models • The Inference Problem • Factor Graphs and the Sum-Product Algorithm • Example: Hidden Markov Models • Generalizations Graphical Models 13 The Inference Problem Graphical Models 14 Complexity of Naïve Inference Graphical Models 15 Bayes Nets to Factor Graphs Graphical Models 16 Factor Graphs: Product of Local Functions Graphical Models 17 Marginalize 
Product of Functions (MPF) • Marginalize product of functions • Computing marginal functions • The “not-sum” notation Graphical Models 18 MPF using Distributive Law • We focus on two examples: g1(x1) and g3(x3) • Main Idea: Distributive law ab + ac = a(b+c) • For g1(x1), we have • For g3(x3), we have Graphical Models 19 Computing Single Marginals • Main Idea: – Target node becomes the root – Pass messages from leaves up to the root Graphical Models 20 Message Passing Compute product of descendants Compute product of descendants with f Then do not-sum over part Graphical Models 21 Example: Computing g1(x1) Graphical Models 22 Example: Computing g3(x3) Graphical Models 23 Hidden Markov Models (HMMs) Latent variables: z0,z1,…,zt-1,zt,zt+1,…,zT Observed variables: x1,…,xt-1,xt,xt+1,…,xT Inference Problems: 1. Compute p(x1:T) 2. Compute p(zt|x1:T) 3. Find maxz1:T p(z1:T|x1:T) Similar problem for chain-structured Conditional Random Fields (CRFs) Graphical Models 24 The Sum-Product Algorithm • To compute gi(xi), form a tree rooted at xi • Starting from the leaves, apply the following two rules – Product Rule: At a variable node, take the product of descendants – Sum-product Rule: At a factor node, take the product of f with descendants; then perform not-sum over the parent node • To compute all marginals – Can be done one at a time; repeated computations, not efficient – Simultaneous message passing following the sum-product algorithm – Examples: Belief Propagation, Forward-Backward algorithm, etc. Graphical Models 25 Sum-Product Updates Graphical Models 26 Sum-Product Updates Graphical Models 27 Example: Step 1 Graphical Models 28 Example: Step 2 Graphical Models 29 Example: Step 3 Graphical Models 30 Example: Step 4 Graphical Models 31 Example: Step 5 Graphical Models 32 Example: Termination Graphical Models 33 HMMs Revisited Latent variables: z0,z1,…,zt-1,zt,zt+1,…,zT Observed variables: x1,…,xt-1,xt,xt+1,…,xT Inference Problem: 1. Compute p(x1:T) 2. 
Compute p(zt|x1:T) The sum-product algorithm here is known as the ‘forward-backward’ algorithm; the analogous computation for linear-Gaussian chains is smoothing in Kalman filtering Graphical Models 34 Distributive Law on Semi-Rings • Idea can be applied to any commutative semi-ring • Semi-ring 101 – Two operations (+,×): Associative, Commutative, Identity – Distributive law: a×b + a×c = a×(b+c) • Belief Propagation in Bayes nets • MAP inference in HMMs • Max-product algorithm • Alternative to Viterbi Decoding • Kalman Filtering • Error Correcting Codes • Turbo Codes • … Graphical Models 35 Message Passing in General Graphs • Tree structured graphs – Message passing is guaranteed to give correct solutions – Examples: HMMs, Kalman Filters • General Graphs – Active research topic • Progress has been made in the past 10 years – Message passing • May not converge • May converge to a ‘local minimum’ of the ‘Bethe variational free energy’ – New approaches to convergent and correct message passing • Applications – TrueSkill: Ranking system for Xbox Live – Turbo Codes: 3G/4G phones, satellite communication, WiMAX, Mars orbiter Graphical Models 36 Part II: Mixed Membership Models • Mixture Models vs Mixed Membership Models • Latent Dirichlet Allocation • Inference – Mean-Field and Collapsed Variational Inference – MCMC/Gibbs Sampling • Applications • Generalizations Graphical Models 37 Background: Plate Diagrams [Figure: a plate with replication count 3 compactly represents nodes b1, b2, b3 sharing the same parent a] Compact representation of large Bayesian networks Graphical Models 38 Model 1: Independent Features [Figure: d independent Gaussian features per data point; d=3, n=1] Graphical Models 39 Model 2: Naïve Bayes (Mixture Models) [Figure: plate diagram with cluster prior π, per-point cluster indicator z, features x, and component parameters θ, over n points and k clusters] Graphical Models 40 Naïve Bayes Model [Figure: all features of a data point are drawn from the single Gaussian component selected by z] Graphical Models 41 Naïve Bayes Model [Figure: a second data point, again with all features from one component] Graphical Models 42 Model 3: Mixed Membership Model [Figure: plate diagram with Dirichlet parameter α, per-point mixing weights π, per-feature indicators z, features x, and component parameters θ] Graphical Models 43 Mixed Membership Models [Figure: features of one data point drawn from different Gaussian components] Graphical Models 44 Mixed Membership Models [Figure: a second data point with its own mix of components] Graphical Models 45 Mixture Model vs Mixed Membership Model [Figure: a mixture model gives each point a single component membership; a mixed membership model gives each point a multi-component mixed membership] Graphical Models 46 Latent Dirichlet Allocation (LDA) distribution over topics for each document: p(d) ~ Dirichlet(α) Dirichlet priors distribution over words for each
topic: b(j), j = 1,…,K topic assignment for each word: zi ~ Discrete(p(d)) word generated from assigned topic: xi ~ Discrete(b(zi)) [plates over Nd words and D documents] Graphical Models 47 [Figure: building up the LDA plate diagram — words x1, x2, x3 drawn from topics b given assignments z; then p ~ Dirichlet(α) with per-word topics z1, z2, z3 ~ Discrete(p); then multiple documents, each with its own p and word-level assignments] Graphical Models 48 LDA Generative Model document1 document2 Graphical Models 49 LDA Generative Model document1 document2 Graphical Models 50 Learning: Inference and Estimation Graphical Models 51 Variational Inference Graphical Models 52 Variational EM for LDA Graphical Models 53 E-step: Variational Distribution and Updates Graphical Models 54 M-step: Parameter Estimation Graphical Models 55 Results: Topics Inferred Graphical Models 56 Results: Perplexity Comparison Graphical Models 57 Aviation Safety Reports (NASA) Graphical Models 60 Results: NASA Reports I Arrival Departure: runway approach departure altitude turn tower air traffic control heading taxi way Passenger: flight passenger attendant flight seat medical captain attendants lavatory told police Maintenance Work: maintenance engine mel zzz aircraft installed check inspection fuel Graphical Models 61 Results: NASA Reports II Medical Emergency: medical passenger doctor attendant oxygen emergency paramedics flight nurse aed Wheel Maintenance: tire wheel assembly nut spacer main axle bolt missing tires Weather Condition: knots turbulence aircraft degrees ice winds wind speed air speed conditions Departure: departure sid dme altitude climbing mean sea level heading procedure turn degree Graphical Models 62 Two-Dimensional Visualization for Reports The pilot flies an owner's airplane with the owner as a passenger, and loses contact with the center during the flight. During a skydiving run, a jet approaches at the same altitude, but an accident is avoided.
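The LDA generative process above — per-document topic proportions p(d) ~ Dirichlet(α), per-word topic assignments zi ~ Discrete(p(d)), words xi ~ Discrete(b(zi)) — can be sketched in a few lines. A minimal sketch: the vocabulary size, topic matrix b, and hyperparameter α below are made-up toy values, not the ones used in the tutorial's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 2, 5          # number of topics, vocabulary size (toy values)
alpha = 0.5          # symmetric Dirichlet hyperparameter (toy value)
# Toy topic-word distributions b[j]; each row sums to 1.
b = np.array([[0.5, 0.3, 0.1, 0.05, 0.05],
              [0.05, 0.05, 0.1, 0.3, 0.5]])

def generate_document(n_words):
    """Sample one document: p ~ Dirichlet(alpha), then for each word
    z_i ~ Discrete(p) and x_i ~ Discrete(b[z_i])."""
    p = rng.dirichlet(alpha * np.ones(K))               # topic proportions for this doc
    z = rng.choice(K, size=n_words, p=p)                # topic assignment per word
    x = np.array([rng.choice(V, p=b[zi]) for zi in z])  # word ids
    return p, z, x

p, z, x = generate_document(20)
```

Running the sampler repeatedly produces documents whose word mixtures reflect each document's own topic proportions, which is exactly the mixed-membership behavior contrasted with the naïve Bayes mixture model above.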
Red: Flight Crew Blue: Passenger Green: Maintenance Graphical Models 63 Two-Dimensional Visualization for Reports The altimeter has a problem, but the pilot overcomes the difficulty during the flight. During acceleration, a flap retraction issue happens; the pilot then returns to base and lands, and the mechanic finds the problem. Red: Flight Crew Blue: Passenger Green: Maintenance Graphical Models 64 Two-Dimensional Visualization for Reports The captain has a medical emergency. The pilot has a landing gear problem; the maintenance crew joins the radio conversation to help. Red: Flight Crew Blue: Passenger Green: Maintenance Graphical Models 65 Mixed Membership of Reports Flight Crew: 0.7039 Passenger: 0.0009 Maintenance: 0.2953 Flight Crew: 0.2563 Passenger: 0.6599 Maintenance: 0.0837 Flight Crew: 0.1405 Passenger: 0.0663 Maintenance: 0.7932 Flight Crew: 0.0013 Passenger: 0.0013 Maintenance: 0.9973 Red: Flight Crew Blue: Passenger Green: Maintenance Graphical Models 66 Smoothed Latent Dirichlet Allocation distribution over topics for each document: p(d) ~ Dirichlet(α) Dirichlet priors distribution over words for each topic: φ(j) ~ Dirichlet(b) topic assignment for each word: zi ~ Discrete(p(d)) word generated from assigned topic: xi ~ Discrete(φ(zi)) [plates over T topics, Nd words, D documents] Graphical Models 67 Stochastic Inference using Markov Chains • Powerful family of approximate inference methods – Markov Chain Monte Carlo, Gibbs Sampling • The basic idea – Need to marginalize over a complex latent variable distribution: p(x|q) = ∫z p(x,z|q) dz, and posterior expectations take the form Ez~p(z|x,q)[f(z)] – Draw ‘independent’ samples from p(z|x,q) – Compute a sample-based average instead of the full integral • Main Issue: How to draw samples?
– Difficult to directly draw samples from p(z|x,q) – Construct a Markov chain whose stationary distribution is p(z|x,q) – Run the chain till ‘convergence’ – Obtain samples from p(z|x,q) Graphical Models 68 The Metropolis-Hastings Algorithm Graphical Models 69 The Metropolis-Hastings Algorithm (Contd) Graphical Models 70 The Gibbs Sampler Graphical Models 71 Collapsed Gibbs Sampling for LDA Graphical Models 72 Collapsed Variational Inference for LDA Graphical Models 73 Collapsed Variational Inference for LDA Graphical Models 74 Results: Comparison of Inference Methods Graphical Models 75 Results: Comparison of Inference Methods Graphical Models 76 Generalizations • Generalized Topic Models – Correlated Topic Models – Dynamic Topic Models, Topics over Time – Dynamic Topics with birth/death • Mixed membership models over non-text data, applications – Mixed membership naïve-Bayes – Discriminative models for classification – Cluster Ensembles • Nonparametric Priors – Dirichlet Process priors: Infer number of topics – Hierarchical Dirichlet processes: Infer hierarchical structures – Several other priors: Pachinko allocation, Gaussian Processes, IBP, etc. Graphical Models 77 CTM Results Graphical Models 78 DTM Results Graphical Models 79 DTM Results II Graphical Models 80 Mixed Membership Naïve Bayes • For each data point, – Choose π ~ Dirichlet(α) • For each of the observed features fn: – Choose a class zn ~ Discrete(π) – Choose a feature value xn from p(xn|zn,fn,Θ), which could be Gaussian, Poisson, Bernoulli… Graphical Models 81 MMNB vs NB: Perplexity Surfaces [Figure: perplexity surfaces for NB and MMNB on training and test data] • MMNB typically achieves a lower perplexity than NB • On the test set, NB shows overfitting, but MMNB is stable and robust.
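The collapsed Gibbs sampler for LDA referenced above resamples each word's topic from its conditional given all other assignments, using only count tables (the topic proportions and topic-word distributions are integrated out). A minimal sketch on a made-up toy corpus; the conditional P(zi=k|rest) ∝ (n_dk+α)(n_kw+β)/(n_k+Vβ) is the standard collapsed update, and all corpus values and hyperparameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids over a vocabulary of size V.
docs = [[0, 0, 1, 2], [0, 1, 1, 3], [3, 4, 4, 2], [2, 3, 4, 4]]
K, V = 2, 5
alpha, beta = 0.5, 0.1   # Dirichlet hyperparameters (made-up values)

# Count tables: n_dk[d,k] topics per doc, n_kw[k,w] words per topic, n_k[k] totals.
n_dk = np.zeros((len(docs), K))
n_kw = np.zeros((K, V))
n_k = np.zeros(K)
z = []  # current topic assignment for every word
for d, doc in enumerate(docs):
    zd = rng.integers(K, size=len(doc))
    z.append(zd)
    for w, k in zip(doc, zd):
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

for sweep in range(200):          # Gibbs sweeps over every word position
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]           # remove the current assignment from the counts
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # p(z_i = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k           # add the new assignment back
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

# Posterior mean estimate of the topic-word distributions from the final sample.
phi = (n_kw + beta) / (n_k[:, None] + V * beta)
```

After the sweeps, `phi` plays the role of the smoothed topic-word distributions; averaging over several well-spaced samples, rather than using a single final state, is the usual practice.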
Graphical Models 82 Discriminative Mixed Membership Models Graphical Models 83 Results: DLDA for text classification Generally, Fast DLDA has higher accuracy on most of the datasets Graphical Models 84 Topics from DLDA cabin flight ice aircraft flight descent hours aircraft gate smoke pressurization time flight ramp cabin emergency crew wing wing passenger flight day captain taxi aircraft aircraft duty icing stop captain pressure rest engine ground cockpit oxygen trip anti parking attendant atc zzz time area smell masks minutes maintenance line emergency Graphical Models 85 Cluster Ensembles • Combining multiple base clusterings of a dataset (base clustering 1, base clustering 2, base clustering 3) • Robust and stable • Distributed and scalable • Knowledge reuse, privacy preserving Graphical Models 86 Problem Formulation • Input & Output: data points and base clusterings in, a consensus clustering out Graphical Models 87 Results: State-of-the-art vs Bayesian Ensembles Graphical Models 88 Part III: Graphical Models for Matrix Analysis • Probabilistic Matrix Factorizations • Probabilistic Co-clustering • Stochastic Block Structures Graphical Models 89 Matrix Factorization • Singular value decomposition ≈ • Problems – Large matrices, with millions of rows/columns • SVD can be rather slow – Sparse matrices, most entries are missing • Traditional approaches cannot handle missing entries Graphical Models 90 Matrix Factorization: “Funk SVD” X̂ij = uiTvj, error = (Xij − X̂ij)² = (Xij − uiTvj)² • Model X ∈ Rn×m as UVT where – U ∈ Rn×k, V ∈ Rm×k – Alternately optimize U and V Graphical Models 91 Matrix Factorization (Contd) X̂ij = uiTvj, error = (Xij − X̂ij)² • Gradient descent updates uik(t+1) = uik(t) + η (Xij − X̂ij) vjk(t) vjk(t+1) = vjk(t) + η (Xij − X̂ij) uik(t) Graphical Models 92 Probabilistic Matrix Factorization (PMF) ui ~ N(0, σu²I) vj ~ N(0, σv²I) Xij ~ N(uiTvj, σ²) Inference using gradient descent
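The gradient updates on the "Funk SVD" slides above can be run directly over the observed entries; adding an L2 shrinkage term corresponds to the zero-mean Gaussian priors of PMF. A minimal sketch: the toy ratings, rank, learning rate, and penalty below are illustrative choices, not values from the tutorial's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed entries of a sparse ratings matrix X: (row i, col j, value).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n, m, k = 3, 3, 2
eta, lam = 0.05, 0.02    # learning rate; L2 penalty ~ Gaussian priors in PMF

U = 0.1 * rng.standard_normal((n, k))
V = 0.1 * rng.standard_normal((m, k))

def sse():
    """Squared error over the observed entries only."""
    return sum((x - U[i] @ V[j]) ** 2 for i, j, x in ratings)

before = sse()
for epoch in range(200):
    for i, j, x in ratings:
        err = x - U[i] @ V[j]                    # X_ij - u_i^T v_j
        ui = U[i].copy()                         # use the pre-update u_i for v_j's step
        U[i] += eta * (err * V[j] - lam * U[i])  # u_ik += eta * err * v_jk (+ shrinkage)
        V[j] += eta * (err * ui - lam * V[j])    # v_jk += eta * err * u_ik (+ shrinkage)
after = sse()
```

Because only observed entries enter the loss, missing entries are handled for free, which is the advantage over plain SVD noted on the slides.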
Graphical Models 93 Bayesian Probabilistic Matrix Factorization µu ~ N(µ0, Λu), Λu ~ W(ν0, W0) µv ~ N(µ0, Λv), Λv ~ W(ν0, W0) ui ~ N(µu, Λu) vj ~ N(µv, Λv) Xij ~ N(uiTvj, σ²) (Gaussian factor priors with Wishart hyperpriors) Inference using MCMC Graphical Models 94 Results: PMF on the Netflix Dataset Graphical Models 95 Results: PMF on the Netflix Dataset Graphical Models 96 Results: Bayesian PMF on Netflix Graphical Models 97 Results: Bayesian PMF on Netflix Graphical Models 98 Results: Bayesian PMF on Netflix Graphical Models 99 Co-clustering: Gene Expression Analysis [Figure: original vs co-clustered expression matrix] Graphical Models 100 Co-clustering and Matrix Approximation Graphical Models 101 Probabilistic Co-clustering [Figure: row clusters and column clusters jointly partition the matrix into co-clusters] Graphical Models 102 Probabilistic Co-clustering Graphical Models 103 Generative Process • Assume a mixed membership for each row and column • Assume a Gaussian for each co-cluster 1. Pick row/column clusters 2. Generate each entry of the matrix Graphical Models 104 Reduction to Mixture Models [Figure: fixing one side reduces co-clustering to a mixture model] Graphical Models 105 Reduction to Mixture Models [Figure: a second reduction example] Graphical Models 106 Generative Process • Assume a mixed membership for each row and column • Assume a Gaussian for each co-cluster 1. Pick row/column clusters 2.
Generate each entry of the matrix Graphical Models 107 Bayesian Co-clustering (BCC) • A Dirichlet distribution over all possible mixed memberships Graphical Models 108 Bayesian Co-clustering (BCC) Graphical Models 109 Learning: Inference and Estimation • Learning – Estimate model parameters (α1, α2, θ) – Infer ‘mixed memberships’ of individual rows and columns • Expectation Maximization • Issues – Posterior probability cannot be obtained in closed form – Parameter estimation cannot be done directly • Approach: Approximate inference – Variational Inference – Collapsed Gibbs Sampling, Collapsed Variational Inference Graphical Models 110 Variational EM • Introduce a variational distribution to approximate the true posterior • Use Jensen’s inequality to get a tractable lower bound • Maximize the lower bound w.r.t. the variational parameters – Alternatively, minimize the KL divergence between the variational distribution and the true posterior • Maximize the lower bound w.r.t. the model parameters Graphical Models 111 Variational Distribution • One variational distribution for each row and for each column Graphical Models 112 Collapsed Inference • Latent distribution can be exactly marginalized over (π1, π2) – Obtain p(X,z1,z2|α1,α2,β) in closed form – Analysis assumes discrete/categorical entries – Can be generalized to exponential family distributions • Collapsed Gibbs Sampling – Conditional distribution of (z1uv, z2uv) in closed form: P(z1uv=i, z2uv=j | X, z1−uv, z2−uv, α1, α2, β) – Sample states, run the sampler till convergence • Collapsed Variational Bayes – Variational distribution q(z1,z2|γ) = ∏u,v q(z1uv, z2uv|γuv) – Gaussian and Taylor approximation to obtain updates for γuv Graphical Models 113 Residual Bayesian Co-clustering (RBC) • (z1,z2) determines the distribution • (m1,m2): row/column means • Users/movies may have bias • (bm1,bm2): row/column bias Graphical Models 114 Results: Datasets • Movielens: Movie recommendation data – 100,000 ratings (1-5) for 1682 movies from 943 users (6.3%) – Binarize: 0 (1-3), 1 (4-5).
– Discrete (original), Bernoulli (binary), Real (z-scored) • Foodmart: Transaction data – 164,558 sales records for 7803 customers and 1559 products (1.35%) – Binarize: 0 (less than median), 1 (higher than median) – Poisson (original), Bernoulli (binary), Real (z-scored) • Jester: Joke rating data – 100,000 ratings (-10.00, +10.00) for 100 jokes from 1000 users (100%) – Binarize: 0 (lower than 0), 1 (higher than 0) – Gaussian (original), Bernoulli (binary), Real (z-scored) Graphical Models 115 Perplexity Comparison with 10 Clusters On Binary Data — Training Set (MMNB / BCC / LDA): Jester 1.7883 / 1.8186 / 98.3742; Movielens 1.6994 / 1.9831 / 439.6361; Foodmart 1.8691 / 1.9545 / 1461.7463. Test Set (MMNB / BCC / LDA): Jester 4.0237 / 2.5498 / 98.9964; Movielens 3.9320 / 2.8620 / 1557.0032; Foodmart 6.4751 / 2.1143 / 6542.9920. On Original Data — Training Set (MMNB / BCC): Jester 15.4620 / 18.2495; Movielens 3.1495 / 0.8068; Foodmart 4.5901 / 4.5938. Test Set (MMNB / BCC): Jester 39.9395 / 24.8239; Movielens 38.2377 / 1.0265; Foodmart 4.6681 / 4.5964. Graphical Models 116 Co-embedding: Users Graphical Models 117 Co-embedding: Movies Graphical Models 118 RBC vs. other co-clustering algorithms • RBC and RBC-FF perform better than BCC • RBC and RBC-FF are also the best among the others (Jester) Graphical Models 119 RBC vs. other co-clustering algorithms (Movielens, Foodmart) Graphical Models 120 RBC vs. SVD, NNMF, and CORR • RBC and RBC-FF are competitive with the other algorithms (Jester) Graphical Models 121 RBC vs. SVD, NNMF and CORR (Movielens, Foodmart) Graphical Models 122 SVD vs.
Parallel RBC Parallel RBC scales well to large matrices Graphical Models 123 Inference Methods: VB, CVB, Gibbs Graphical Models 124 Mixed Membership Stochastic Block Models • Network data analysis – Relational View: Rows and Columns are the same entity – Example: Social networks, Biological networks – Graph View: (Binary) adjacency matrix • Model Graphical Models 125 MMB Graphical Model Graphical Models 126 Variational Inference • Variational lower bound • Fully factorized variational distribution • Variational EM – E-step: Update variational parameters (γ, φ) – M-step: Update model parameters (α, B) Graphical Models 127 Results: Inferring Communities Original friendship matrix Friendships inferred from the posterior, respectively based on thresholding πpTBπq and φpTBφq Graphical Models 128 Results: Protein Interaction Analysis “Ground truth”: MIPS collection of protein interactions (yellow diamond) Comparison with other models based on protein interactions and microarray expression analysis Graphical Models 129 Non-parametric Bayes Dirichlet Process Mixtures Gaussian Processes Hierarchical Dirichlet Processes Chinese Restaurant Processes Pitman-Yor Processes Mondrian Processes Indian Buffet Processes Graphical Models 130 References: Graphical Models • S. Russell & P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 2009. • D. Koller & N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009. • C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007. • D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2010. • M. I. Jordan (Ed), Learning in Graphical Models, MIT Press, 1998. • S. L. Lauritzen, Graphical Models, Oxford University Press, 1996. • J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988. Graphical Models 131 References: Inference • F. R. Kschischang, B. J. Frey, and H.-A.
Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, vol. 47, no. 2, 498–519, 2001. • S. M. Aji and R. J. McEliece, “The generalized distributive law,” IEEE Transactions on Information Theory, 46, 325–343, 2000. • M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1-2, 1-305, December 2008. • C. Andrieu, N. De Freitas, A. Doucet, M. I. Jordan, “An Introduction to MCMC for Machine Learning,” Machine Learning, 50, 5-43, 2003. • J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free energy approximations and generalized belief propagation algorithms,” IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2282–2312, 2005. Graphical Models 132 References: Mixed-Membership Models • S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the Society for Information Science, 41(6):391–407, 1990. • T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, 42(1):177–196, 2001. • D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research (JMLR), 3:993–1022, 2003. • T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, 101(Suppl 1):5228–5235, 2004. • Y. W. Teh, D. Newman, and M. Welling, “A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation,” Neural Information Processing Systems (NIPS), 2007. • A. Asuncion, P. Smyth, M. Welling, Y. W. Teh, “On Smoothing and Inference for Topic Models,” Uncertainty in Artificial Intelligence (UAI), 2009. • H. Shan, A. Banerjee, and N. Oza, “Discriminative Mixed-membership Models,” IEEE Conference on Data Mining (ICDM), 2009. Graphical Models 133 References: Matrix Factorization • S.
Funk, “Netflix update: Try this at home,” http://sifter.org/~simon/journal/20061211.html • R. Salakhutdinov and A. Mnih, “Probabilistic matrix factorization,” Neural Information Processing Systems (NIPS), 2008. • R. Salakhutdinov and A. Mnih, “Bayesian probabilistic matrix factorization using Markov chain Monte Carlo,” International Conference on Machine Learning (ICML), 2008. • I. Porteous, A. Asuncion, and M. Welling, “Bayesian matrix factorization with side information and Dirichlet process mixtures,” Conference on Artificial Intelligence (AAAI), 2010. • I. Sutskever, R. Salakhutdinov, and J. Tenenbaum, “Modelling relational data using Bayesian clustered tensor factorization,” Neural Information Processing Systems (NIPS), 2009. • A. Singh and G. Gordon, “A Bayesian matrix factorization model for relational data,” Uncertainty in Artificial Intelligence (UAI), 2010. Graphical Models 134 References: Co-clustering, Block Structures • A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, D. Modha, “A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation,” Journal of Machine Learning Research (JMLR), 2007. • M. M. Shafiei and E. E. Milios, “Latent Dirichlet Co-Clustering,” IEEE Conference on Data Mining (ICDM), 2006. • H. Shan and A. Banerjee, “Bayesian co-clustering,” IEEE International Conference on Data Mining (ICDM), 2008. • P. Wang, C. Domeniconi, and K. B. Laskey, “Latent Dirichlet Bayesian Co-Clustering,” European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2009. • H. Shan and A. Banerjee, “Residual Bayesian Co-clustering for Matrix Approximation,” SIAM International Conference on Data Mining (SDM), 2010. • T. A. B. Snijders and K. Nowicki, “Estimation and prediction for stochastic blockmodels for graphs with latent block structure,” Journal of Classification, 14:75–100, 1997. • E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P.
Xing, “Mixed-membership stochastic blockmodels,” Journal of Machine Learning Research (JMLR), 9:1981–2014, 2008. Graphical Models 135 Acknowledgements Hanhuai Shan Amrudin Agovic Graphical Models 136