COMP 538 (Reasoning and Decision under Uncertainty)
Introduction to Bayesian Networks
Nevin L. Zhang

Bayesian Networks

Bayesian networks
- Are networks of random variables.
- Are a marriage between probability theory and graph theory.
- Represent conditional independence:
  – A random variable is directly related to only a few neighboring variables.
  – It is independent of all other variables given the neighboring variables.
- Facilitate the application of probability theory to many problems in AI, applied mathematics, statistics, and engineering that
  – Are complex, and
  – Involve uncertainty.

Probability Theory & Uncertainty in AI

Bayesian networks were developed in the AI community as a tool for building intelligent reasoning systems, in particular expert systems. We next give a brief historical account of their development.

Prior to 1980, intelligent reasoning systems were based on symbolic logic. To tackle uncertainty, numerical tags were attached to if-then rules. Sometimes the numbers were interpreted probabilistically (MYCIN (Buchanan et al. 1984), PROSPECTOR). The probabilistic interpretation was not justified, because the numbers were not manipulated according to the principles of probability theory.

Rule-based systems
- The uncertainty associated with a rule summarizes its exceptions.
  – Consider the rule: If the ground is wet, then it rained.
  – Exceptions: the sprinkler was on, a water truck leaked, a water pipe burst, ...
  – In general, there are too many exceptions to explicate. They are summarized by a weight: If the ground is wet, then it rained (0.8).
- Application of a rule "if A then B":
  – If you see A in the knowledge base, then conclude B,
  – (Locality) regardless of other things in the knowledge base,
  – (Detachment) regardless of how A was derived.

Problem with locality (Pearl 1988)
- Rule 1: If the ground is wet, then it rained (0.8).
- If we see wet ground, we conclude that it rained with 80% probability.
- But what if, somewhere in the knowledge base, there is the sentence "Sprinkler on last night"? (See the sketch at the end of this section.)

Problem with detachment (Pearl 1988)
- Rule 2: If the sprinkler was on (last night), then the ground is wet (this morning).
- We know: the sprinkler was on.
- Using Rule 2: the ground is wet.
- Using Rule 1 (detachment here): it rained.

Detachment also implies that there is no way to determine whether two pieces of information originate from the same source or from two independent sources (Pearl 1988, Henrion 1986).
- Analogy: Is Shanghai Disney larger than Hong Kong Disney? [Diagram: a Shanghai businessman's statement feeds both a TV report and a newspaper report, which together shape my belief; the two reports are not independent evidence.]

Rule-based systems can operate safely only in tree-structured networks, and they can perform either diagnosis or prediction, but not both (Shafer and Pearl 1990, Introduction to Chapter 5).

Classical logic does not suffer from these problems because truth values characterize the logical formulae themselves rather than summarize exceptions.
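To make the locality problem concrete, here is a minimal sketch in Python. The specific numbers (the priors on Rain and Sprinkler, and the conditional table for WetGround) are hypothetical, chosen only for illustration; the slides specify only the 0.8 rule weight. Conditioning on wet ground alone raises the probability of rain well above its prior, but additionally conditioning on "sprinkler on" drives it back down, which is exactly the context a local rule ignores.

```python
from itertools import product

# Hypothetical parameters (not from the slides).
P_rain = 0.2
P_sprinkler = 0.3
P_wet = {  # P(WetGround = True | Rain, Sprinkler)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.9,
    (False, False): 0.01,
}

def joint(r, s, w):
    """P(Rain=r, Sprinkler=s, WetGround=w) via the chain rule."""
    pr = P_rain if r else 1 - P_rain
    ps = P_sprinkler if s else 1 - P_sprinkler
    pw = P_wet[(r, s)] if w else 1 - P_wet[(r, s)]
    return pr * ps * pw

def prob_rain(evidence):
    """P(Rain = True | evidence) by brute-force summation over the joint."""
    num = den = 0.0
    for r, s, w in product([True, False], repeat=3):
        world = {"rain": r, "sprinkler": s, "wet": w}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(r, s, w)
            den += p
            if r:
                num += p
    return num / den

print(prob_rain({"wet": True}))                     # ~0.46, up from the 0.2 prior
print(prob_rain({"wet": True, "sprinkler": True}))  # ~0.22, explained away
```

The drop from roughly 0.46 to roughly 0.22 is the "explaining away" effect: the sprinkler accounts for the wet ground, so the rule's 0.8 weight cannot be applied regardless of what else is in the knowledge base.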
Model-based systems
- The uncertainty measure is not on individual logical formulae but on sets of possible worlds. In the case of probability models, there is a probability distribution over the sample space, the Cartesian product of the state spaces of all the random variables; in other words, a joint probability distribution over all the random variables.
- Probability theory is a well-known and well-understood framework for uncertainty, with clear semantics. It provides principled answers for:
  – Combining evidence,
  – Predictive and diagnostic reasoning,
  – Belief update.

Difficulties in applying probability theory
- Complexity of model construction and complexity of problem solving: both are exponential in the problem size, i.e., the number of variables.
- Example: patients in a hospital are described by several attributes:
  – Background: age, gender, history of diseases, ...
  – Symptoms: fever, blood pressure, headache, ...
  – Diseases: pneumonia, heart attack, ...
- A joint probability distribution must assign a number to each combination of values of these attributes, so the model size is exponential:
  – 20 binary attributes already require 2^20 ≈ 10^6 numbers.
  – Real applications usually involve hundreds of attributes.
- This is one of the reasons why probability theory did not play a significant role in AI reasoning systems before 1980.

The breakthrough came in the early 1980s (Pearl 1986, 1988; Howard & Matheson 1984).
- In a joint probability distribution, every variable is, in theory, directly related to all other variables.
- Pearl and others realized:
  – It is often reasonable to make the (sometimes simplifying) assumption that each variable is directly related to only a few other variables.
  – This leads to modularity: a complex model and its associated calculations can be split into small, manageable pieces.

Example: Africa Visit (Lauritzen & Spiegelhalter 1988, modified)
- Variables:
  – Patient complaint: Dyspnea (D).
  – Q&A and examination: Visit-to-Africa (A), Smoking (S), X-Ray (X).
  – Diagnoses: Lung cancer (L), Tuberculosis (T), Bronchitis (B).
- Assuming all variables are binary, the joint probability model P(A, S, T, L, B, X, D) has 2^7 - 1 = 127 free parameters.

Reasonable assumptions:
- X is directly influenced by T and L; conditioned on T and L, it is independent of all other variables.
- D is directly influenced by T, L, and B; conditioned on T, L, and B, it is independent of all other variables.
- A directly influences T.
- Smoking directly influences L and B.
- Break up the model P(A, S, T, L, B, X, D) into P(A), P(S), P(T|A), P(L|S), P(B|S), P(TorL|T, L), P(X|TorL), P(D|TorL, B), where TorL stands for "T or L".
- Total number of parameters: 1 + 1 + 2 + 2 + 2 + 4 + 2 + 4 = 18.

Modularity (conditional independence)
- Simplifies model construction:
  – 18 parameters instead of 127,
  – and the savings are more drastic in real-world applications. (A parameter-counting sketch follows below.)
- Model construction, inference, and model learning become possible for realistic applications:
  – Before: exponential in the problem size, the total number of variables.
  – Now: exponential in the number of neighbors (more precisely, the size of the largest clique).
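Here is a minimal sketch of the parameter counting above. The parent sets come from the slides' factorization; the code itself is only an illustration. For a binary variable V, the conditional table P(V | parents) needs one free number per configuration of the parents, i.e., 2^|parents|, whereas an unfactored joint over n binary variables needs 2^n - 1 numbers.

```python
# Structure of the (modified) Africa-visit network from the slides.
# "TorL" is the derived T-or-L variable.
parents = {
    "A": [], "S": [],
    "T": ["A"], "L": ["S"], "B": ["S"],
    "TorL": ["T", "L"],
    "X": ["TorL"],
    "D": ["TorL", "B"],
}

# Each binary variable contributes 2^(number of parents) free parameters.
factored = sum(2 ** len(ps) for ps in parents.values())

# The unfactored joint over the 7 original variables (TorL is derived from T and L).
unfactored = 2 ** 7 - 1

print(f"factored model: {factored} parameters")    # 1+1+2+2+2+4+2+4 = 18
print(f"full joint:     {unfactored} parameters")  # 2^7 - 1 = 127
```

The same computation explains the complexity claim: inference cost is driven not by the total number of variables but by the largest number of variables that must be considered together (the largest clique).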
After the breakthrough
- 1980s:
  – Representation (graphical representation of conditional independence),
  – Inference (polytree propagation, clique tree propagation).
- 1990s:
  – Inference (variable elimination, search, MCMC, variational methods, special local structures, ...),
  – Learning (parameter learning, structure learning, incomplete data, latent variables, ...),
  – Sensitivity analysis, temporal models, causal models, ...

Impact of Bayesian Networks

From non-existence to prominence
- Prior to 1980, probability theory had essentially no role in AI.
- 1980-1990: the breakthrough and much research activity.
  – However, by 1990, "there is still no consensus on the theoretical and practical role of probability in AI" (Shafer and Pearl 1990, Introduction).
- 2002: the role of Bayesian networks in AI had become so prominent that the first invited talk at AAAI-2002 was entitled "Probabilistic AI" (M. Jordan).

Bayesian networks are now a major topic in influential textbooks on
- AI (Russell & Norvig 1995, Artificial Intelligence: A Modern Approach),
- Machine learning (Mitchell 1997, Machine Learning).
They are also discussed in textbooks on
- Data mining (Hand et al. 2001, Principles of Data Mining),
- Pattern recognition (Duda et al. 2000, Pattern Classification).

Impact beyond AI
- In statistics, Bayesian networks are
  – Viewed as a kind of statistical model, just like regression models,
  – Called graphical models,
  – Used for multivariate data analysis.
- A side note on Bayesian networks vs. neural networks:
  – Bayesian networks model the data-generation process and are more interpretable.
  – Neural networks are motivated by biological processes and are less interpretable.

Bayesian networks provide a uniform framework for viewing a variety of models in statistics and engineering:
- Hidden Markov models, mixture models, latent class models, Kalman filters, factor analysis, and Ising models.
Books have been written that
- Use Bayesian networks to explain algorithms in digital communication, in particular data compression and channel coding (Frey 1998, Graphical Models for Machine Learning and Digital Communication),
- Draw connections between Bayesian networks and contemporary cognitive psychology (Glymour 2001, The Mind's Arrows: Bayes Nets and Graphical Causal Models in Psychology).

Applications: too many to survey. Some random samples:
- Medical diagnostic systems,
- Real-time weapons scheduling,
- Jet-engine fault diagnosis,
- Intel processor fault diagnosis (Intel),
- Generator monitoring expert system (General Electric),
- Software troubleshooting (Microsoft Office Assistant, Windows 98 print troubleshooting),
- Space shuttle engine monitoring (Vista project),
- Biological sequence analysis and classification.

Contents of This Course

This course is designed for graduate students in science and engineering.
Our objective is to give an in-depth coverage of what we deem the core concepts, ideas, and results of Bayesian networks:
- The concept and semantics of Bayesian networks,
- Representational issues: what can and cannot be represented,
- Inference: how to answer queries efficiently,
- Learning: how to adapt Bayesian networks based on data and learn them from data,
- Special models: hidden Markov models, latent class models,
- Bayesian networks for classification and cluster analysis.