Learning from Data with Graphical Models
Padhraic Smyth
Information and Computer Science, University of California, Irvine
www.datalab.uci.edu

Outline
• Graphical models
  – A general language for specification and computation with probabilistic models
• Learning graphical models from data
  – EM algorithm, Bayesian learning, etc.
• Applications
  – Learning topics from documents
  – Discovering clusters of curves
  – Other work…

The Data Revolution
• Technological advances in the past 20 years
  – Sensors
  – Storage
  – Computational power
  – Databases and indexing
  – Machine learning
• Science and engineering are increasingly data driven

Examples of Digital Data Sets
• The Web: > 4.3 billion pages
  – Link graph
  – Text on Web pages
• Genomic data
  – Sequence = 3 billion base pairs (human)
  – 25k genes: expression, location, networks…
• Sloan Digital Sky Survey
  – 15 terabytes
  – ~500 million sky objects
• Earth sciences
  – NASA MODIS satellite
  – Entire Earth at 15m to 250m resolution every day
  – 37 spectral bands

Questions of Interest
• Can we detect significant change?
• Identify, classify, and catalog certain phenomena?
• Segment/cluster pixels into land-cover types?
• Measure and predict seasonal variation?
• And so on…
Previously, scientists did much of this work by hand; now, automated algorithms are essential.

Learning from Data
• Uncertainty abounds…
  – Measurement error
  – Unobserved phenomena
  – Model and parameter uncertainty
  – Forecasting and prediction
• Probability is the "language of uncertainty"
  – Significant shift in computer science towards probabilistic models in recent years

Preliminaries
• X = x : a statement about the world (true or false)
• P(X = x) : degree of belief that the world is in state X = x
• Set of variables U = {X1, X2, …, XN}
• Joint probability distribution: P(U) = P(X1, X2, …, XN)
• P(U) is a table of K x K x … x K = K^N numbers (if every variable takes K values)

Conditional Probabilities
• Many problems of interest involve computing conditional probabilities
  – Prediction: P(XN+1 | x1, …, xN)
  – Classification/diagnosis/detection: arg max_y { P(Y = y | x1, …, xN) }
  – Learning: P(θ | x1, …, xN)
• Note: computing a conditional such as P(XN+1 | X1 = x) by brute-force summation has time complexity O(K^N)

Two Problems
• Problem 1: Computational complexity
  – Inference computations scale as O(K^N)
• Problem 2: Model specification
  – To specify P(U) we need a table of K^N numbers
  – Where do these numbers come from?
• (A small sketch of the brute-force computation follows below.)
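To make the K^N blow-up concrete, here is a minimal sketch, not from the original slides: it stores a full joint table over N variables and computes a conditional by brute-force marginalization. The toy table, variable names, and query are illustrative assumptions; the point is simply that the sum touches every one of the K^N entries.

```python
import itertools
import numpy as np

K, N = 2, 10                      # K values per variable, N variables
rng = np.random.default_rng(0)

# A full joint table P(X1,...,XN): K^N numbers (the model-specification problem).
joint = rng.random(K ** N).reshape((K,) * N)
joint /= joint.sum()

def conditional(joint, query_var, evidence):
    """P(X_query | evidence) by brute-force summation over all other variables."""
    probs = np.zeros(K)
    for assignment in itertools.product(range(K), repeat=joint.ndim):  # K^N terms
        if all(assignment[v] == val for v, val in evidence.items()):
            probs[assignment[query_var]] += joint[assignment]
    return probs / probs.sum()

# e.g. P(X9 | X0 = 1): the loop visits all K^N assignments of the table.
print(conditional(joint, query_var=9, evidence={0: 1}))
```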
A Very Brief History…
• Problems recognized by the 1970s and 80s in artificial intelligence (AI)
  – E.g., early AI work on constructing diagnostic reasoning systems
    • e.g., medical diagnosis with N = 100 different symptoms
• 1970s/80s solutions? Invent new formalisms:
  – Certainty factors
  – Fuzzy logic
  – Non-monotonic logic
• 1985: "In defense of probability", P. Cheeseman, IJCAI 1985
• 1990s: marriage of statistics and computer science

Two Key Ideas
• Problem 1: Computational complexity
  – Idea: represent the dependency structure as a graph and exploit sparseness in computation
• Problem 2: Model specification
  – Idea: learn models from data using statistical learning principles

"…probability theory is more fundamentally concerned with the structure of reasoning and causation than with numbers."
— Glenn Shafer and Judea Pearl, Introduction to Readings in Uncertain Reasoning, Morgan Kaufmann, 1990

Graphical Models
• Dependency structure encoded by an acyclic directed graph
  – Node <-> random variable
  – Edges encode dependencies
    • Absence of an edge -> conditional independence
  – Directed and undirected versions exist
• Why is this useful?
  – A language for communication
  – A language for computation
• Origins
  – Wright, 1920s
  – 1988: Spiegelhalter and Lauritzen in statistics; Pearl in computer science
  – Also known as Bayesian networks, belief networks, causal networks, etc.

Examples of 3-Way Graphical Models
• Conditionally independent effects (A -> B, A -> C):
  p(A,B,C) = p(B|A) p(C|A) p(A)
  B and C are conditionally independent given A
• Independent causes (A -> C <- B):
  p(A,B,C) = p(C|A,B) p(A) p(B)
• Markov dependence (A -> B -> C):
  p(A,B,C) = p(C|B) p(B|A) p(A)

Directed Graphical Models
• Example: A -> C <- B gives p(A,B,C) = p(C|A,B) p(A) p(B)
• In general,
  p(X1, X2, …, XN) = ∏i p(Xi | parents(Xi))

Example
[Figure: a directed graph over seven variables A, B, C, D, E, F, G.]
• Say we want to compute p(a | c, g)
• Direct calculation: p(a|c,g) = Σb,d,e,f p(a,b,d,e,f | c,g)
  – Complexity of the sum is O(K^4)
• Reordering the sums (pushing each sum inside the products):
  Σb p(a|b) Σd p(b|d,c) Σe p(d|e) Σf p(e,f|g)
  – Σf p(e,f|g) = p(e|g)
  – Σe p(d|e) p(e|g) = p(d|g)
  – Σd p(b|d,c) p(d|g) = p(b|c,g)
  – Σb p(a|b) p(b|c,g) = p(a|c,g)
• Complexity is O(K), compared to O(K^4) for the direct calculation
• (A toy implementation of this reordering idea is sketched below.)
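As an illustration of the reordering idea, here is a minimal sketch under stated assumptions: the conditional probability tables are random placeholders (not from the slides), and the factors correspond to the terms p(a|b), p(b|d,c), p(d|e), p(e,f|g) used above with the evidence c, g already clamped. It compares the brute-force four-variable sum with the elimination ordering.

```python
import numpy as np

K = 3
rng = np.random.default_rng(1)

def random_cpt(shape, axis=0):
    """Random table normalized along `axis` so it behaves like a conditional distribution."""
    t = rng.random(shape)
    return t / t.sum(axis=axis, keepdims=True)

# Factors of p(a, b, d, e, f | c, g) for fixed evidence values of c and g.
p_a_b  = random_cpt((K, K))               # p(a | b),        indexed [a, b]
p_b_dc = random_cpt((K, K))               # p(b | d, c=c*),  indexed [b, d]
p_d_e  = random_cpt((K, K))               # p(d | e),        indexed [d, e]
p_ef_g = random_cpt((K, K), axis=(0, 1))  # p(e, f | g=g*),  indexed [e, f]

# Brute force: sum over b, d, e, f jointly -- O(K^4) terms.
brute = np.einsum('ab,bd,de,ef->a', p_a_b, p_b_dc, p_d_e, p_ef_g)

# Reordered (variable elimination): eliminate f, then e, then d, then b.
p_e  = p_ef_g.sum(axis=1)     # Σf p(e, f | g)       = p(e | g)
p_d  = p_d_e @ p_e            # Σe p(d | e) p(e | g)  = p(d | g)
p_b  = p_b_dc @ p_d           # Σd p(b | d, c) p(d | g) = p(b | c, g)
elim = p_a_b @ p_b            # Σb p(a | b) p(b | c, g) = p(a | c, g)

print(np.allclose(brute / brute.sum(), elim / elim.sum()))  # same answer, far less work
```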
A More General Algorithm
• Message Passing (MP) algorithm
  – Pearl, 1988; Lauritzen and Spiegelhalter, 1988
  – Declare one node (any node) to be the root
  – Schedule two phases of message-passing:
    • nodes pass messages up to the root
    • messages are then distributed back out to the leaves
  – In time O(N), we can compute any probability of interest

[Figures: sketch of the MP algorithm in action on a tree-structured graph, showing messages numbered 1–4 passed inward to the root and then back out to the leaves.]

Complexity of the MP Algorithm
• Efficient: complexity scales as O(N K^m)
  – N = number of variables
  – K = arity of the variables
  – m = maximum number of parents for any node
• Compare to O(K^N) for the brute-force method

Graphs with "Loops"
• The message-passing algorithm does not work when there are multiple paths between two nodes
• General approach: "cluster" variables together to convert the graph into a tree

Junction Tree
• E.g., merging B and E of the earlier seven-variable graph into a single compound node (B, E) yields a tree
• Good news: we can run the MP algorithm on this tree
• Bad news: complexity is now O(K^2)

Probability Calculations on Graphs
• The structure of the graph reveals
  – The computational strategy: sparser graphs -> faster computation
  – The dependency relations
• Automated computation of conditional probabilities
  – i.e., a fully general algorithm for arbitrary graphs
  – exact vs. approximate answers
• Extensions
  – For continuous variables: replace sums with integrals
  – For identification of the most likely values: replace sums with max operators
  – Conditional probability tables can be approximated

Hidden or Latent Variables
• In many applications there are two sets of variables:
  – Variables whose values we can directly measure
  – Variables that are "hidden" and cannot be measured
• Examples:
  – Speech recognition
    • Observed: acoustic voice signal
    • Hidden: label of the word spoken
  – Face tracking in images
    • Observed: pixel intensities
    • Hidden: position of the face in the image
  – Text modeling
    • Observed: counts of words in a document
    • Hidden: topics that the document is about

Mixture Models
• S: hidden discrete variable; Y: observed variable(s); S -> Y

A Graphical Model for Clustering
• S: hidden discrete (cluster) variable
• Y1, …, Yj, …, Yd: observed variables, assumed conditionally independent given S
• Clusters = p(Y1, …, Yd | S = s)
• Clustering = learning these probability distributions from data

Mixtures of Markov Chains
• S is a hidden cluster variable; Y1 -> Y2 -> Y3 -> … -> YN is an observed sequence
• We can model sets of sequences as coming from K different Markov chains
• Provides a useful method for clustering sequences
• Cadez, Heckerman, Meek, Smyth, 2003
  – Used to cluster 1 million Web user navigation patterns
  – The algorithm is part of SQL Server 2005
• (A small sketch of cluster assignment under such a mixture appears below.)
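To make the sequence-clustering idea concrete, here is a minimal sketch under stated assumptions: the mixing weights, initial-state, and transition probabilities are random placeholders, and this is not the Cadez et al. implementation. It scores a variable-length sequence of discrete states under each of K Markov chains and returns the posterior probability of each cluster.

```python
import numpy as np

K_clusters, n_states = 3, 4
rng = np.random.default_rng(2)

# Mixture of Markov chains: mixing weights plus per-cluster initial and transition probabilities.
weights = np.full(K_clusters, 1.0 / K_clusters)
init = rng.dirichlet(np.ones(n_states), size=K_clusters)               # init[k, s]      = p(Y1 = s | S = k)
trans = rng.dirichlet(np.ones(n_states), size=(K_clusters, n_states))  # trans[k, s, s'] = p(Yt+1 = s' | Yt = s, S = k)

def log_lik(seq, k):
    """log p(y1,...,yN | S = k) under the k-th Markov chain."""
    ll = np.log(init[k, seq[0]])
    for prev, curr in zip(seq[:-1], seq[1:]):
        ll += np.log(trans[k, prev, curr])
    return ll

def cluster_posterior(seq):
    """p(S = k | sequence): the quantity a mixture-based clustering assigns to each cluster."""
    log_joint = np.array([np.log(weights[k]) + log_lik(seq, k) for k in range(K_clusters)])
    log_joint -= log_joint.max()          # subtract the max for numerical stability
    post = np.exp(log_joint)
    return post / post.sum()

# Example: a short navigation-like sequence of discrete states.
print(cluster_posterior([0, 2, 2, 1, 3, 3]))
```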
Hidden Markov Model (HMM)
• Observed: Y1, Y2, Y3, …, YN
• Hidden: S1 -> S2 -> S3 -> … -> SN, with each St -> Yt
• Two key conditional independence assumptions
• Widely used in speech recognition, protein models, error-correcting codes, …
• Comments:
  – Inference about S given Y is O(N)
  – If S is continuous, we have a Kalman filter

Generalized HMMs
• The HMM extended with additional nodes I1, I2, …, In, one per time step

Learning from Data
• A probabilistic model connects to real-world data in both directions:
  – P(Data | Parameters): from model to data
  – P(Parameters | Data): from data to model (learning)

Parameters and Data
• Model parameters θ, with observations y1, y2, y3, y4 as children
• Data = {y1, …, y4}; data observations are assumed IID here

Plate Notation
• Instead of drawing y1, y2, …, yn explicitly, draw a single node yi inside a plate labeled i = 1:n, with θ outside the plate

Maximum Likelihood
• Model parameters θ; Data = {y1, …, yn}
• Likelihood(θ) = p(Data | θ) = ∏i p(yi | θ)
• Maximum likelihood: θML = arg max { Likelihood(θ) }

Being Bayesian
• Add a hyperparameter node α, with Prior(θ) = p(θ | α)
• Bayesian learning: p(θ | evidence) = p(θ | data, prior)

Learning = Inference in a Graph
• θ is unknown
• Learning = the process of computing p(θ | y's, prior)
• Information "flows" from the observed data nodes to the θ node

Example: Gaussian Model
• Parameters μ and σ with hyperparameters α and β; observations yi, i = 1:n
• Note: priors and parameters are assumed independent here

Example: Bayesian Regression
• Parameters θ and σ with hyperparameters α and β; inputs xi, outputs yi, i = 1:n
• Model: yi = f(xi; θ) + ε, with ε ~ N(0, σ²)
• p(yi | xi) = N(f(xi; θ), σ²)

Learning with Hidden Variables
• Hidden variables Si, observed yi, parameters θ, i = 1:n
• Finding the θ that maximizes L(θ) is difficult -> use iterative optimization algorithms

Learning with Hidden Variables: the EM Algorithm
• Guess at some initial parameters θ0
• E-step (inference in a graph)
  – For each case, and each unknown variable, compute p(S | known data, θ0)
• M-step (optimization)
  – Maximize L(θ) using p(S | …)
  – This yields new parameter estimates θ1
• This is the EM algorithm
  – Guaranteed to converge to a (local) maximum of L(θ)
• In the mixture-model graph (θ, hidden Si, observed yi, i = 1:n), the E-step and M-step alternate: the E-step sends information from the data and current parameters to the hidden Si nodes, and the M-step sends it back to θ

ANEMIA PATIENTS AND CONTROLS
[Scatter plot: red blood cell volume vs. red blood cell hemoglobin concentration for anemia patients and controls. Data from Prof. Christine McLaren, Dept. of Epidemiology, UC Irvine.]

Mixture Model
• Hidden cluster variable Ci with two hidden clusters, C = 1 and C = 2; observations yi, i = 1:n
• p(y | C = k) is Gaussian
• (A minimal EM sketch for a two-component Gaussian mixture of this kind follows below.)
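Here is a minimal EM sketch for a two-component mixture of 2-d Gaussians of the kind fitted to the red blood cell data. The synthetic points below are a stand-in; the McLaren data set itself is not reproduced here, and the initialization and iteration count are illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)

# Synthetic stand-in for the 2-d (volume, hemoglobin) measurements: two overlapping groups.
X = np.vstack([rng.multivariate_normal([3.5, 4.1], 0.005 * np.eye(2), 150),
               rng.multivariate_normal([3.7, 3.9], 0.005 * np.eye(2), 150)])

K = 2
weights = np.full(K, 1.0 / K)
means = X[rng.choice(len(X), K, replace=False)]   # initialize at randomly chosen data points
covs = np.array([np.cov(X.T)] * K)

for iteration in range(25):
    # E-step: responsibility of each component for each point, p(C = k | y_i, theta).
    dens = np.column_stack([weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
                            for k in range(K)])
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means, and covariances from the responsibilities.
    Nk = resp.sum(axis=0)
    weights = Nk / len(X)
    means = (resp.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - means[k]
        covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]

print("mixing weights:", weights)
print("component means:", means)
```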
[Figures: the model for y is a mixture of two 2-d Gaussians; the fitted mixture is shown on the red blood cell data (volume vs. hemoglobin concentration) at EM iterations 1, 3, 5, 10, 15, and 25.]

ANEMIA DATA WITH LABELS
[Scatter plot: the same red blood cell data with the true labels shown — a control group and an anemia group.]

Application 1: Probabilistic Topic Modeling in Text Documents
• Collaborators: Mark Steyvers, UC Irvine; Michal Rosen-Zvi, UC Irvine; Tom Griffiths, Stanford/MIT
• The slides on author-topic modeling are in a separate PowerPoint file (with "author-topic" in the title)

Application 2: Modeling Sets of Curves
• Collaborators: Scott Gaffney, Dasha Chudova, UC Irvine; Andy Robertson, Suzana Camargo, Columbia University

Graphical Models for Curves
• Data = {(y1, t1), …, (yT, tT)}
• y = f(t; θ), e.g., y = at² + bt + c with θ = {a, b, c}
• Adding noise: y ~ Gaussian density with mean f(t; θ) and variance σ², for each of the T points

Example
[Figure: observed points y plotted against t, with the underlying curve f(t; θ) hidden.]

Sets of Curves
[Figure: time-course gene expression data — yeast cell-cycle data, Spellman et al. (1998); normalized log-ratio of intensity vs. time in 7-minute increments.]

Clustering "Non-Vector" Data
• Challenges with the data…
  – Curves may be of different "lengths", "sizes", etc.
  – Not easily representable in vector spaces
  – Distance is not naturally defined a priori
• Possible approaches
  – "Convert" into a fixed-dimensional vector space and apply standard vector clustering
    • but this loses information
  – Use hierarchical clustering
    • but it is O(N^2) and requires a distance measure
  – Probabilistic clustering with mixtures
    • define a generative mixture model for the data
    • learn distance and clustering simultaneously

Graphical Models for Sets of Curves
• Parameters θ and σ; inputs t, outputs y; T points per curve, N curves
• Each curve: P(yi | ti, θ) = a product of Gaussians

Curve-Specific Transformations
• Add a per-curve shift variable αi alongside θ, t, σ, y (T points per curve, N curves)
• e.g., E[yi] = at² + bt + c + αi, with θ = {a, b, c, α1, …, αN}
• Note: we can learn the function parameters and the shifts simultaneously with EM

Learning Shapes and Shifts
• Data = smoothed growth-acceleration data from teenagers
• EM used to learn a spline model plus a time-shift for each curve
[Figures: the original data, and the data after learning the shifts.]

Clustering: Mixtures of Curves
• Add a hidden cluster variable c for each curve, alongside θ, t, α, σ, y (T points per curve, N curves)
• Each set of trajectory points comes from one of K models
• The model for group k is a Gaussian curve model
• The marginal probability of a trajectory is a mixture model
• (A minimal sketch of this kind of curve clustering follows below.)
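Here is a minimal sketch of mixture-based curve clustering under stated simplifying assumptions: polynomial cluster means, a shared known noise variance, and no per-curve shifts. The synthetic trajectories and all parameter names are illustrative; this is not the Gaffney–Smyth implementation. Each whole curve gets a responsibility for each cluster, which handles variable-length trajectories directly.

```python
import numpy as np

rng = np.random.default_rng(4)
K, degree, sigma = 2, 2, 0.3

# Synthetic stand-in for a set of variable-length trajectories: y(t) = polynomial + noise.
true_coefs = [np.array([0.5, -1.0, 0.2]), np.array([-0.4, 1.2, -0.1])]  # highest power first
curves = []
for i in range(40):
    t = np.linspace(0, 4, rng.integers(8, 15))            # curves have different lengths
    y = np.polyval(true_coefs[i % K], t) + sigma * rng.normal(size=len(t))
    curves.append((t, y))

def design(t):
    return np.vander(t, degree + 1)                       # columns: t^2, t, 1

def log_lik(t, y, coef):
    """log P(y | t, theta): a sum of Gaussian log-densities around the cluster's curve."""
    resid = y - design(t) @ coef
    return -0.5 * np.sum(resid ** 2) / sigma ** 2 - 0.5 * len(y) * np.log(2 * np.pi * sigma ** 2)

# EM over cluster regression coefficients and mixing weights.
coefs = [rng.normal(size=degree + 1) for _ in range(K)]
weights = np.full(K, 1.0 / K)
for _ in range(50):
    # E-step: responsibility of each cluster for each whole curve.
    ll = np.array([[np.log(weights[k]) + log_lik(t, y, coefs[k]) for k in range(K)]
                   for (t, y) in curves])
    resp = np.exp(ll - ll.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: weighted least squares for each cluster's curve, plus new mixing weights.
    for k in range(K):
        Xw = np.vstack([np.sqrt(resp[i, k]) * design(t) for i, (t, y) in enumerate(curves)])
        yw = np.concatenate([np.sqrt(resp[i, k]) * y for i, (t, y) in enumerate(curves)])
        coefs[k], *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    weights = resp.mean(axis=0)

print("cluster assignments:", resp.argmax(axis=1))
```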
Clustering Methodology
• Mixtures of polynomials and splines
  – Model the data as mixtures of noisy regression models
  – 2-d (x, y) position as a function of time
  – Use the model as a first-order approximation for clustering
• Compared to vector-based clustering, this…
  – provides a quantitative (e.g., predictive) model
  – can handle
    • variable-length trajectories
    • missing measurements
    • a background/outlier process
    • coupling of other "features" (e.g., intensity)

Winter Storm Tracks
• Highly damaging weather
• Important water source
• Climate-change implications

Data
• Sea-level pressure on a global grid
• Four times a day (every 6 hours) over 30 years
• Blue indicates low pressure (in the figure)

Clusters of Trajectories
[Figures: clusters of storm trajectories and the corresponding cluster shapes for Pacific cyclones.]

TROPICAL CYCLONES
[Figure: tropical cyclone tracks, western North Pacific, 1983–2002.]

Hierarchical Bayesian Models
• Plate model with per-curve parameters θi (plus cluster variable c, shift α, noise σ, inputs t, outputs y; T points per curve, N curves)
• Each curve is allowed to have its own parameters -> "clustering in parameter space"

Summary
• Graphical models
  – A representation language for complex probabilistic models
  – Provide a systematic framework for
    • model description
    • probabilistic computation
  – A general framework for learning from data
• Applications
  – Author-topic models for text data
  – Mixtures of regressions for sets of curves
  – … many more
• Extensions
  – Hierarchical Bayesian models
  – Spatio-temporal models
  – …

Further Information
• Papers online at www.datalab.uci.edu
  – Graphical models
    • Smyth, Heckerman, Jordan, 1997, Neural Computation
  – Topic models
    • Steyvers et al., ACM SIGKDD 2004
    • Rosen-Zvi et al., UAI 2004
  – Curve clustering
    • Gaffney and Smyth, NIPS 2004
    • Scott Gaffney, PhD thesis, UCI 2004
• Author-Topic Browser: www.datalab.uci.edu/author-topic
  – Java demo of the online browser
  – additional tables and results