Introduction to Probabilistic Graphical Models
Eran Segal, Weizmann Institute

Logistics
- Instructor: Eran Segal (eran.segal@weizmann.ac.il, room 149)
- Teaching assistant: Ohad Manor (ohad.manor@weizmann.ac.il, room 125)
- Course information: http://www.weizmann.ac.il/math/pgm
- Course book: "Bayesian Networks and Beyond", Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)

Course Structure
- One weekly meeting: Sun, 9am-11am
- Homework assignments: 2 weeks to complete each; 40% of final grade
- Final exam: 3-hour class exam, date will be announced; 60% of final grade

Probabilistic Graphical Models
- A tool for representing complex systems and performing sophisticated reasoning tasks
- Fundamental notion: modularity. Complex systems are built by combining simpler parts.

Why have a model?
- Compact and modular representation of complex systems
- Ability to execute complex reasoning patterns
- Make predictions
- Generalize from particular problems

Probabilistic Graphical Models
- Increasingly important in machine learning
- Many classical probabilistic problems in statistics, information theory, pattern recognition, and statistical mechanics are special cases of the formalism
- Graphical models provide a common framework
- Advantage: specialized techniques developed in one field can be transferred between research communities

Representation: Graphs
- Intuitive data structure for modeling highly interacting sets of variables
- Explicit model of modularity
- A data structure that allows the design of efficient general-purpose algorithms

Reasoning: Probability Theory
- Well-understood framework for modeling uncertainty:
  - Partial knowledge of the state of the world
  - Noisy observations
  - Phenomena not covered by our model
  - Inherent stochasticity
- Clear semantics
- Can be learned from data

Probabilistic Reasoning
In this course we will learn:
- The semantics of probabilistic graphical models (PGMs): Bayesian networks and Markov networks
- Answering queries in a PGM ("inference")
- Learning PGMs from data ("learning")
- Modeling temporal processes with PGMs, with Hidden Markov Models (HMMs) as a special case

Course Outline
Week  Topic                                            Reading
1     Introduction, Bayesian network representation    1-3
2     Bayesian network representation cont.            1-3
3     Local probability models                         5
4     Undirected graphical models                      4
5     Exact inference                                  9,10
6     Exact inference cont.                            9,10
7     Approximate inference                            12
8     Approximate inference cont.                      12
9     Learning: Parameters                             16,17
10    Learning: Parameters cont.                       16,17
11    Learning: Structure                              18
12    Partially observed data                          19
13    Learning undirected graphical models             20
14    Template models                                  6
15    Dynamic Bayesian networks                        15

A Simple Example
- We want to model whether our neighbor will inform us of the alarm being set off
- The alarm can be set off if there is a burglary or if there is an earthquake
- Whether our neighbor calls depends on whether the alarm is set off

A Simple Example
- Variables: Earthquake (E), Burglary (B), Alarm (A), NeighborCalls (N)
- The full joint distribution over the four binary variables:
E  B  A  N   Prob.
F  F  F  F   0.01
F  F  F  T   0.04
F  F  T  F   0.05
F  F  T  T   0.01
F  T  F  F   0.02
F  T  F  T   0.07
F  T  T  F   0.2
F  T  T  T   0.1
T  F  F  F   0.01
T  F  F  T   0.07
T  F  T  F   0.13
T  F  T  T   0.04
T  T  F  F   0.06
T  T  F  T   0.05
T  T  T  F   0.1
T  T  T  T   0.05

- 2^4 - 1 = 15 independent parameters

A Simple Example
- The same domain, now represented as a Bayesian network: Earthquake and Burglary are parents of Alarm, and Alarm is the parent of NeighborCalls

P(B):  B=F 0.9, B=T 0.1
P(E):  E=F 0.7, E=T 0.3

P(A | E, B):
E  B    A=F   A=T
F  F    0.99  0.01
F  T    0.1   0.9
T  F    0.3   0.7
T  T    0.01  0.99

P(N | A):
A    N=F  N=T
F    0.9  0.1
T    0.2  0.8

- 1 + 1 + 4 + 2 = 8 independent parameters

Example Bayesian Network
- The "Alarm" network for monitoring intensive care patients
- 37 variables, 509 parameters (the full joint would require 2^37)
- [Figure: the 37-node Alarm network; nodes include INTUBATION, VENTLUNG, SAO2, ARTCO2, CATECHOL, HYPOVOLEMIA, LVFAILURE, CO, BP, HR, and others]
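As a concrete check that a small set of CPDs really defines a full joint distribution, here is a minimal Python sketch (an illustration added to these notes, not part of the course material; it uses the CPD numbers from the tables of the simple example above) that rebuilds a 16-entry joint from the 8 CPD parameters and answers a simple query by brute-force enumeration.

```python
from itertools import product

# CPDs of the simple alarm example (numbers from the tables above)
P_B = {False: 0.9, True: 0.1}        # P(Burglary)
P_E = {False: 0.7, True: 0.3}        # P(Earthquake)
P_A_true = {                         # P(Alarm = T | Earthquake, Burglary)
    (False, False): 0.01, (False, True): 0.9,
    (True, False): 0.7,   (True, True): 0.99,
}
P_N_true = {False: 0.1, True: 0.8}   # P(NeighborCalls = T | Alarm)

def joint(e, b, a, n):
    """P(E, B, A, N) = P(E) P(B) P(A | E, B) P(N | A)."""
    p_a = P_A_true[(e, b)] if a else 1 - P_A_true[(e, b)]
    p_n = P_N_true[a] if n else 1 - P_N_true[a]
    return P_E[e] * P_B[b] * p_a * p_n

# The 8 CPD parameters determine all 2^4 = 16 joint entries
states = list(product([False, True], repeat=4))
assert abs(sum(joint(*s) for s in states) - 1.0) < 1e-9

# A query answered by enumerating the joint: P(B = T | N = T)
num = sum(joint(e, b, a, n) for e, b, a, n in states if b and n)
den = sum(joint(e, b, a, n) for e, b, a, n in states if n)
print("P(Burglary | NeighborCalls) =", num / den)
```

Brute-force enumeration only works because this network is tiny; the inference lectures later in the course cover algorithms that exploit the graph structure instead.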
Application: Clustering Users
- Input: the TV shows that each user watches
- Output: TV show "clusters"
- Assumption: shows watched by the same users are similar

Class 1: Power Rangers, Animaniacs, X-Men, Tazmania, Spider-Man
Class 2: Young and the Restless, Bold and the Beautiful, As the World Turns, Price is Right, CBS Evening News
Class 3: Tonight Show, Conan O'Brien, NBC Nightly News, Later with Kinnear, Seinfeld
Class 4: 60 Minutes, NBC Nightly News, CBS Evening News, Murder She Wrote, Matlock
Class 5: Seinfeld, Friends, Mad About You, ER, Frasier

App.: Recommendation Systems
- Given user preferences, suggest recommendations
- Example: Amazon.com
- Input: movie preferences of many users
- Solution: model correlations between movie features
  - Users that like comedy often like drama
  - Users that like action often do not like cartoons
  - Users that like Robert De Niro films often like Al Pacino films
- Given a user's preferences, we can predict the probability that new movies match them

Diagnostic Systems
- Diagnostic indexing for the Home Health site at Microsoft: enter symptoms, and the system recommends multimedia content
- Online troubleshooters

App.: Finding Regulatory Networks
- Model: P(Level | Module, Regulators)
- Which module does gene "g" belong to?
- The expression level of each gene in an experiment is a function of the expression of its module's regulators (e.g., HAP4, CMK1, BMH1, GIC2) in that experiment
- [Figure: model with Gene, Module, Experiment, Regulator1-3, and expression Level variables]

App.: Finding Regulatory Networks
- [Figure: a fragment of the inferred regulatory network. Legend: regulator (transcription factor), regulator (signaling molecule), module (number), enriched cis-regulatory motif, experimentally tested regulator, regulation supported in literature, inferred regulation. Module groups include energy and cAMP signaling, DNA and RNA processing, and amino acid metabolism; regulators include Hap4, Msn4, Xbp1, Gat1, Cmk1, Tpk1, Tpk2, Gac1, Yap6, and Bmh1.]

Prerequisites
- Probability theory: conditional probabilities, joint distributions, random variables
- Information theory
- Function optimization
- Graph theory
- Computational complexity

Probability Theory
A probability distribution P over (Ω, S) is a mapping from events in S such that:
- P(α) ≥ 0 for all α ∈ S
- P(Ω) = 1
- If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)

- Conditional probability: P(α | β) = P(α ∩ β) / P(β)
- Chain rule: P(α ∩ β) = P(α) P(β | α)
- Bayes rule: P(α | β) = P(β | α) P(α) / P(β)
- Conditional independence: α is independent of β given γ if P(α | β ∩ γ) = P(α | γ)

Random Variables & Notation
- Random variable: a function from Ω to a value; categorical / ordinal / continuous
- Val(X): the set of possible values of RV X
- Upper case letters denote RVs (e.g., X, Y, Z)
- Upper case bold letters denote sets of RVs (e.g., X, Y)
- Lower case letters denote RV values (e.g., x, y, z)
- Lower case bold letters denote RV set values (e.g., x)
- Values of a categorical RV with |Val(X)| = k: x1, x2, …, xk
- Marginal distribution over X: P(X)
- Conditional independence: X is independent of Y given Z if for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z): P(X=x | Y=y, Z=z) = P(X=x | Z=z), written (X ⊥ Y | Z) in P

Expectation
- Discrete RVs: E_P[X] = Σ_{x ∈ Val(X)} x P(x)
- Continuous RVs: E_P[X] = ∫ x p(x) dx
- Linearity of expectation:
  E_P[X+Y] = Σ_{x,y} (x+y) P(x,y)
           = Σ_{x,y} x P(x,y) + Σ_{x,y} y P(x,y)
           = Σ_x x Σ_y P(x,y) + Σ_y y Σ_x P(x,y)
           = Σ_x x P(x) + Σ_y y P(y)
           = E_P[X] + E_P[Y]
- Expectation of products (when X ⊥ Y in P):
  E_P[XY] = Σ_{x,y} xy P(x,y)
          = Σ_{x,y} xy P(x) P(y)          (independence assumption)
          = (Σ_x x P(x)) (Σ_y y P(y))
          = E_P[X] E_P[Y]

Variance
- Variance of an RV:
  Var_P[X] = E_P[(X - E_P[X])^2]
           = E_P[X^2 - 2X E_P[X] + E_P[X]^2]
           = E_P[X^2] - 2 E_P[X] E_P[X] + E_P[X]^2
           = E_P[X^2] - E_P[X]^2
- If X and Y are independent: Var[X+Y] = Var[X] + Var[Y]
- Var[aX + b] = a^2 Var[X]

Information Theory
- Entropy: H_P(X) = -Σ_{x ∈ Val(X)} P(x) log P(x)
  - We use log base 2 to interpret entropy as bits of information
  - The entropy of X is a lower bound on the average number of bits needed to encode values of X
  - 0 ≤ H_P(X) ≤ log |Val(X)| for any distribution P(X)
- Conditional entropy: H_P(X | Y) = H_P(X,Y) - H_P(Y) = -Σ_{x,y} P(x,y) log P(x | y)
  - Information only helps: H_P(X | Y) ≤ H_P(X)
- Mutual information: I_P(X;Y) = H_P(X) - H_P(X | Y) = Σ_{x,y} P(x,y) log [P(x | y) / P(x)]
  - 0 ≤ I_P(X;Y) ≤ H_P(X)
  - Symmetry: I_P(X;Y) = I_P(Y;X)
  - I_P(X;Y) = 0 iff X and Y are independent
- Chain rule of entropies: H_P(X1,…,Xn) = H_P(X1) + H_P(X2 | X1) + … + H_P(Xn | X1,…,Xn-1)

Distances Between Distributions
- Relative entropy: D(P(X) || Q(X)) = Σ_x P(x) log [P(x) / Q(x)]
  - D(P || Q) ≥ 0, and D(P || Q) = 0 iff P = Q
  - Not a distance metric (not symmetric, no triangle inequality)
- L1 distance: ||P - Q||_1 = Σ_x |P(x) - Q(x)|
- L2 distance: ||P - Q||_2 = sqrt( Σ_x (P(x) - Q(x))^2 )
- L∞ distance: ||P - Q||_∞ = max_x |P(x) - Q(x)|
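To make these quantities concrete, the following minimal Python sketch (an illustration added to these notes, not part of the course material; the joint distribution used is an arbitrary made-up example) computes entropy, conditional entropy, mutual information, and relative entropy for discrete distributions given as dictionaries, and checks the identities stated above.

```python
from math import log2

def entropy(p):
    """H(X) = -sum_x P(x) log2 P(x), for a distribution given as {value: prob}."""
    return -sum(px * log2(px) for px in p.values() if px > 0)

def marginals(pxy):
    """Marginal distributions P(X) and P(Y) of a joint given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def mutual_information(pxy):
    """I(X;Y) = sum_{x,y} P(x,y) log2( P(x,y) / (P(x) P(y)) )."""
    px, py = marginals(pxy)
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log2( P(x) / Q(x) ); assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

# Arbitrary illustrative joint over two binary variables
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px, py = marginals(pxy)

h_x = entropy(px)
h_x_given_y = entropy(pxy) - entropy(py)           # H(X|Y) = H(X,Y) - H(Y)
print(h_x, h_x_given_y)                            # information only helps: H(X|Y) <= H(X)
print(mutual_information(pxy), h_x - h_x_given_y)  # I(X;Y) = H(X) - H(X|Y)
print(kl_divergence(px, {0: 0.5, 1: 0.5}))         # D(P||Q) >= 0, = 0 iff P = Q
```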
Optimization Theory
- Find values θ1,…,θn such that f(θ1,…,θn) = max_{θ'1,…,θ'n} f(θ'1,…,θ'n)
- Optimization strategies:
  - Solve the gradient analytically and verify a local maximum
  - Gradient search: guess initial values and improve iteratively
    - Gradient ascent
    - Line search
    - Conjugate gradient
- Lagrange multipliers: solve a maximization problem with constraints C_j: c_j(θ) = 0, j = 1,…,n, by maximizing L(θ, λ) = f(θ) + Σ_{j=1}^n λ_j c_j(θ)

Graph Theory
- Undirected graph
- Directed graph
- Complete graph (every two nodes are connected)
- Acyclic graph
- Partially directed acyclic graph (PDAG)
- Induced graph
- Sub-graph
- Graph algorithms: shortest path from node X1 to all other nodes (BFS)

Representing Joint Distributions
- Random variables: X1,…,Xn
- P is a joint distribution over X1,…,Xn
- If X1,…,Xn are binary, we need 2^n parameters to describe P
- Can we represent P more compactly?
- Key: exploit independence properties

Independent Random Variables
- Two variables X and Y are independent if P(X=x | Y=y) = P(X=x) for all values x, y
  - Equivalently, knowing Y does not change predictions of X
- If X and Y are independent then P(X, Y) = P(X | Y) P(Y) = P(X) P(Y)
- If X1,…,Xn are independent then P(X1,…,Xn) = P(X1)…P(Xn)
  - O(n) parameters
  - All 2^n probabilities are implicitly defined
  - But: cannot represent many types of distributions

Conditional Independence
- X and Y are conditionally independent given Z if P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all values x, y, z
- Equivalently, if we know Z, then knowing Y does not change predictions of X
- Notation: Ind(X;Y | Z) or (X ⊥ Y | Z)

Conditional Parameterization
- S = score on test, Val(S) = {s0, s1}
- I = intelligence, Val(I) = {i0, i1}

P(I,S):
I   S    P(I,S)
i0  s0   0.665
i0  s1   0.035
i1  s0   0.06
i1  s1   0.24

P(I):
i0: 0.7, i1: 0.3

P(S | I):
I     s0    s1
i0    0.95  0.05
i1    0.2   0.8

- Joint parameterization: 3 parameters; conditional parameterization: 3 parameters
- Alternative parameterization: P(S) and P(I | S)

Conditional Parameterization
- S = score on test, Val(S) = {s0, s1}
- I = intelligence, Val(I) = {i0, i1}
- G = grade, Val(G) = {g0, g1, g2}
- Assume that G and S are independent given I
- Joint parameterization: 2·2·3 = 12 entries, i.e. 12 - 1 = 11 independent parameters
- Conditional parameterization: P(I,S,G) = P(I) P(S | I) P(G | I,S) = P(I) P(S | I) P(G | I)
  - P(I): 1 independent parameter
  - P(S | I): 2·1 = 2 independent parameters
  - P(G | I): 2·2 = 4 independent parameters
  - Total: 7 independent parameters

Naïve Bayes Model
- Class variable C, Val(C) = {c1,…,ck}
- Evidence variables X1,…,Xn
- Naïve Bayes assumption: the evidence variables are conditionally independent given C
  P(C, X1,…,Xn) = P(C) Π_{i=1}^n P(Xi | C)
- Applications in medical diagnosis, text classification
- Used as a classifier:
  P(C=c1 | x1,…,xn) / P(C=c2 | x1,…,xn) = [P(C=c1) / P(C=c2)] Π_i [P(xi | C=c1) / P(xi | C=c2)]
- Problem: double counting of correlated evidence

Bayesian Network (Informal)
- Directed acyclic graph G
  - Nodes represent random variables
  - Edges represent direct influences between random variables
- Local probability models
- [Figures: Example 1: I → S; Example 2: a small network over I, S, G; Naïve Bayes: C → X1, X2, …, Xn]

Bayesian Network (Informal)
- Represents a joint distribution
  - Specifies the probability P(X=x)
  - Specifies the probability P(X=x | E=e)
- Allows for reasoning patterns
  - Prediction (e.g., intelligent → high scores)
  - Explanation (e.g., low score → not intelligent)
  - Explaining away (different causes for the same effect interact)
- [Figure: Example 2 network over I, S, G]

Bayesian Network Structure
- Directed acyclic graph G whose nodes X1,…,Xn represent random variables
- G encodes the local Markov assumptions: Xi is independent of its non-descendants given its parents
- Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))
- [Figure: an example DAG over nodes A-G illustrating, e.g., (E ⊥ {A, C, D, F} | B)]
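The local Markov assumptions can be read off the graph mechanically. The short Python sketch below (an illustration added to these notes, not part of the course material) does this for the simple alarm network from earlier: given each node's parents, it computes descendants and prints the statement (X ⊥ NonDesc(X) \ Pa(X) | Pa(X)) for every node where it is non-trivial.

```python
# Parents of each node in the simple alarm network used earlier in these notes
parents = {
    'Earthquake': [],
    'Burglary': [],
    'Alarm': ['Earthquake', 'Burglary'],
    'NeighborCalls': ['Alarm'],
}

def descendants(node):
    """All nodes reachable from `node` by following edges forward."""
    children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}
    stack, found = list(children[node]), set()
    while stack:
        d = stack.pop()
        if d not in found:
            found.add(d)
            stack.extend(children[d])
    return found

for x in parents:
    non_desc = set(parents) - {x} - descendants(x)
    others = sorted(non_desc - set(parents[x]))
    if others:  # skip nodes whose non-descendants are all parents
        print(f"({x} ⊥ {others} | {sorted(parents[x])})")
```

For example, it reports (NeighborCalls ⊥ {Burglary, Earthquake} | Alarm): once we know whether the alarm went off, learning about the burglary or the earthquake does not change the probability of the neighbor calling.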
Independency Mappings (I-Maps)
- Let P be a distribution over X
- Let I(P) be the set of independencies (X ⊥ Y | Z) that hold in P
- A Bayesian network structure G is an I-map (independency mapping) of P if I(G) ⊆ I(P)

Example: two candidate structures over I and S
- G1: I and S with no edge, I(G1) = {I ⊥ S}
- G2: I → S, I(G2) = ∅

P1(I,S):
I   S    P1(I,S)
i0  s0   0.25
i0  s1   0.25
i1  s0   0.25
i1  s1   0.25
I(P1) = {I ⊥ S}

P2(I,S):
I   S    P2(I,S)
i0  s0   0.4
i0  s1   0.3
i1  s0   0.2
i1  s1   0.1
I(P2) = ∅

- Both G1 and G2 are I-maps of P1; only G2 is an I-map of P2

Factorization Theorem
- If G is an I-map of P, then P(X1,…,Xn) = Π_{i=1}^n P(Xi | Pa(Xi))
- Proof: wlog X1,…,Xn is an ordering consistent with G
  - By the chain rule: P(X1,…,Xn) = Π_i P(Xi | X1,…,Xi-1)
  - From the ordering: Pa(Xi) ⊆ {X1,…,Xi-1} and {X1,…,Xi-1} \ Pa(Xi) ⊆ NonDesc(Xi)
  - Since G is an I-map, (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P)
  - Therefore P(Xi | X1,…,Xi-1) = P(Xi | Pa(Xi))

Factorization Implies I-Map
- If P(X1,…,Xn) = Π_{i=1}^n P(Xi | Pa(Xi)), then G is an I-map of P
- Proof: need to show (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), i.e., that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
  - wlog X1,…,Xn is an ordering consistent with G in which NonDesc(Xi) = {X1,…,Xi-1}
  - Marginalizing the descendants of Xi out of the factorization leaves Π_{k=1}^i P(Xk | Pa(Xk)), so
    P(Xi | NonDesc(Xi)) = P(Xi, NonDesc(Xi)) / P(NonDesc(Xi))
                        = [Π_{k=1}^i P(Xk | Pa(Xk))] / [Π_{k=1}^{i-1} P(Xk | Pa(Xk))]
                        = P(Xi | Pa(Xi))

Bayesian Network Definition
- A Bayesian network is a pair (G, P)
  - P factorizes over G
  - P is specified as a set of CPDs associated with G's nodes

Parameters
- Full joint distribution: 2^n
- Bayesian network with bounded in-degree k: n·2^k

Bayesian Network Design
- Variable considerations
  - Clarity test: can an omniscient being determine its value?
  - Hidden variables?
  - Irrelevant variables
- Structure considerations
  - Causal order of variables
  - Which independencies (approximately) hold?
- Probability considerations
  - Zero probabilities
  - Orders of magnitude
  - Relative values
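The parameter counts quoted in the Parameters slide above (2^n for the full joint versus roughly n·2^k for a network with in-degree at most k) can be computed directly from a graph. Here is a minimal Python sketch (an illustration added to these notes, not part of the course material; it assumes the variable cardinalities are given) that compares the two counts.

```python
def bn_parameters(parents, card):
    """Independent parameters of a Bayesian network:
    sum over nodes X of (|Val(X)| - 1) * prod of |Val(parent)| over Pa(X)."""
    total = 0
    for x, ps in parents.items():
        rows = 1
        for p in ps:
            rows *= card[p]
        total += (card[x] - 1) * rows
    return total

def joint_parameters(card):
    """Independent parameters of the full joint: prod over i of |Val(Xi)|, minus 1."""
    size = 1
    for c in card.values():
        size *= c
    return size - 1

# The simple alarm network from earlier in these notes: all variables binary
parents = {'E': [], 'B': [], 'A': ['E', 'B'], 'N': ['A']}
card = {x: 2 for x in parents}
print(bn_parameters(parents, card), joint_parameters(card))    # 8 vs 15

# A chain X1 -> X2 -> ... -> X20 (in-degree k = 1): linear vs exponential growth
chain = {f'X{i}': ([f'X{i-1}'] if i > 1 else []) for i in range(1, 21)}
card20 = {x: 2 for x in chain}
print(bn_parameters(chain, card20), joint_parameters(card20))  # 39 vs 1,048,575
```

For the four-variable alarm example this reproduces the 8-versus-15 comparison above, while for the 20-variable chain the factored form needs 39 parameters against 2^20 - 1 = 1,048,575 for the full joint.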