Graphical Causal Models
Clark Glymour
Carnegie Mellon University
Florida Institute for Human and Machine Cognition

Outline
Part I: Goals and the Miracle of d-separation
Part II: Statistical/Machine Learning Search and Discovery Methods for Causal Relations
Part III: A Bevy of Causal Analysis Problems

I. Brains, Trains, and Automobiles: Cognitive Neuroscience as Reverse Auto Mechanics
Idea: Like autos, like trains, like computers, brains have parts.
The parts influence one another to produce a behavior.
The parts can have roles in multiple behaviors.
Big parts have littler parts.

I. Goals of the Automobile Hypothesis
Overall goals:
Identify the parts critical to behaviors of interest.
Figure out how they influence one another, in what timing sequences.
Imaging goals:
Identify relatively BIG parts (ROIs).
Figure out how they influence one another, with what timing sequences, in producing behaviors of interest.

I. Goal: From Data to Mechanisms
[Figure: multivariate time series recorded from regions A, B, C, D, and the target output: causal relations among neurally localized variables X, Y, Z, W.]

I. Graphical Causal Models: the Abstract Structure of Influences
[Figure: the braking mechanism as a directed graph, from Push brake through Fluid level in master cylinder, Fluid in caliper, Fluid in wheel cylinder, Friction of pads against rotor, and Friction of shoe against wheel, to Vehicle deceleration.]
This system is deterministic (we hope).

I. Philosophical Objections
Objection: "Cause" is a vague, metaphysical notion. "Probability" has a mathematical structure; "causation" does not.
Response: Compare "probability." See Spirtes, et al., Causation, Prediction and Search, 1993, 2000; Pearl, Causality, 2000. Listen to Pearl's lecture this afternoon.
Objection: The real causes are at the synaptic level, so talk of ROIs as causes is nonsense. "…for many this rhetoric represents a category error…because causal [sic] is an attribute of the state equation." (Friston, et al., 2007, 602.)
Response: So, do you think "smoking causes cancer" is nonsense? "Human activities cause global temperature increases" is nonsense? "Turning the ignition key causes the car to start" is nonsense?

I. The Abstract Structure of Influences
This system is not deterministic.
Linear causal models (SEMs) specify a directed graphical structure:
MedFG(b) := a CING(b) + e1
STG(b) := b CING(b) + e2
IPL(b) := c STG(b) + d CING(b) + e3
with e1, e2, e3 jointly independent.
But so does any functional form of the influences:
MedFG(b) := f(CING(b)) + e1
STG(b) := g(CING(b)) + e2
IPL(b) := h(STG(b), CING(b)) + e3
with e1, e2, e3 jointly independent.
S. Hanson, et al., 2008. [Figure: ROIs Middle Occipital Gyrus (mog), Inferior Parietal Lobule (ipl), Middle Frontal Gyrus (mfg), and Inferior Frontal Gyrus (ifg).]

I. So What?
1. The directed graph codes the conditional independence relations implied by the model: MedFG(b) || {STG(b), IPL(b)} | CING(b).
2. (Almost) all of our tests of models are tests of implications of their conditional independence claims.
So what is the code?

I. d-separation Is the Code! (J. Pearl, 1988)
[Figure: a directed graph with the chain X -> Y -> Z -> W, a common child R of X and W, and S a descendant of R.]
X || {Z, W} | Y
X || W | Z
NOT X || W | R
NOT X || W | S
NOT X || W | {Y, Z, R}
NOT X || W | {Y, Z, S}
Conditioning on a variable on a directed path between X and W blocks the association produced by that path.
Conditioning on a common descendant of X and W (or on a descendant of such a variable) creates a path that produces an association between X and W.
What about systems with cycles? d-separation characterizes conditional independence relations in all such linear systems (P. Spirtes, 1996).
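To make the code concrete, here is a minimal sketch in Python (numpy only; the coefficient values 0.8, 0.5, 0.7, 0.3 stand in for the a, b, c, d above) that simulates the Hanson et al. linear SEM and checks the implied conditional independence by partial correlation, which in the linear-Gaussian case vanishes exactly when the corresponding conditional independence holds:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# CING(b) is exogenous; e1, e2, e3 are jointly independent Gaussian noise.
cing = rng.normal(size=n)
medfg = 0.8 * cing + rng.normal(size=n)            # MedFG(b) := a CING(b) + e1
stg = 0.5 * cing + rng.normal(size=n)              # STG(b)   := b CING(b) + e2
ipl = 0.7 * stg + 0.3 * cing + rng.normal(size=n)  # IPL(b)   := c STG(b) + d CING(b) + e3

def partial_corr(x, y, given):
    # Correlate the residuals of x and y after linearly regressing out `given`.
    g = np.column_stack([np.ones(len(given)), given])
    rx = x - g @ np.linalg.lstsq(g, x, rcond=None)[0]
    ry = y - g @ np.linalg.lstsq(g, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# d-separation predicts MedFG(b) || {STG(b), IPL(b)} | CING(b) ...
print(partial_corr(medfg, stg, cing))   # ~ 0
print(partial_corr(medfg, ipl, cing))   # ~ 0
# ... but not unconditional independence, since CING(b) is a common cause:
print(np.corrcoef(medfg, stg)[0, 1])    # clearly nonzero

This near-zero-versus-nonzero pattern is exactly what the constraint-based searches described in Part II test for.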
I. How To Determine If Variables A and Z Are Independent Conditional on a Set Q of Variables
1. Consider each sequence p of edge-adjacent variables (each edge in either direction), without self-intersections, terminating in A and Z.
2. A collider on p is a variable N on p such that the variables M, O adjacent to N on p each have edges directed into N: M -> N <- O.
3. A sequence (path) p creates a dependency between A and Z conditional on Q if and only if:
   (a) no non-collider on p is in Q, and
   (b) every collider on p is in Q or has a descendant in Q (a directed path from the collider to a member of Q).
A and Z are independent conditional on Q (d-separated by Q) just in case no path creates such a dependency.

II. So, What Can We Do With It?
Exploit d-separation, in conjunction with distribution assumptions, to estimate graphical causal structure from sample data.
Understand when data analysis and measurement methods distort conditional independence relations in target systems.
Wrong conditional independence relations => wrong d-separation relations => wrong causal structure.

II. Simple Illustration (PC)
Truth: X -> Y <- Z, Y -> W.
Consequences: X || Z; {X, Z} || W | Y.
Method: [Figure: the successive stages of the PC search, from the complete undirected graph over X, Y, Z, W to the oriented pattern.]
Spirtes, Glymour, & Scheines (1993). Causation, Prediction, & Search, Springer Lecture Notes in Statistics.

II. Bayesian Search: Greedy Equivalence Search (GES)
1. Start with the empty graph.
2. Add or change the edge that most increases fit.
3. Iterate.
[Figure: the true graph over X, Y, Z, W, data sampled from it, and the model with highest posterior probability returned by the search.]
Chickering and Meek, Uncertainty in Artificial Intelligence Proceedings, 2003.

II. With Unknown, Unrecorded Confounders: FCI
[Figure: a true graph over X, Y, Z, W with an unrecorded common cause, the data, and the FCI output.]
FCI is a consistent estimator under i.i.d. sampling (Spirtes, et al., Causation, Prediction and Search), but in other cases it is often uninformative.

II. Overlapping Databases: ION
Truth: a single causal graph over W, X, Y, Z, R, S.
[Figure: two databases, D1 and D2, each recording an overlapping subset of the variables.]
In this example the ION algorithm recovers the full graph! But in other cases it often generates a number of alternative models.
Danks, Tillman and Glymour, NIPS, 2008.

II. Time Series (Structural VAR)
Basic idea: PC- or GES-style search on "relative" time-slices.
Example: an additive, nonlinear model of climate teleconnections (5 ocean indices; a 563-month series).
Chu & Glymour, 2008, Journal of Machine Learning Research.

II. Discovering Latent Variables
Truth: latent variables T1, T2, T3 with measured indicators M1 through M12.
Cluster the M's using a heuristic or Build Pure Clusters (Silva, et al., JMLR, 2006), then apply GES to the clusters.
Applicable to time series?
[Figure: the true latent-variable graph and the recovered clustering.]

II. Limits of PC and GES
[Figure: X -> Y -> Z predicts the same independencies as X <- Y <- Z and X <- Y -> Z; all of these are d-separation equivalent.]
With i.i.d. samples and correct distribution families, PC and GES give correct information almost surely in the large-sample limit, assuming no unrecorded common causes.
Works with "random effects" for linear models.
But does not give all the information we want: often cannot determine the directions of influences!
Can post-process with an exhaustive test of all orientations (a heuristic).
Adjacencies are more reliable than directions of edges.

II. Breaking Down d-separation Equivalence: LiNGAM
Truth: X -> Y, X -> Z, Y -> Z.
Linear equations (reduced):
X := e_X
Y := a_X X + e_Y
Z := b_X X + b_Y Y + e_Z
The disturbance terms must be non-Gaussian.
Discoverable by LiNGAM (ICA + algebra)!
Shimizu, et al. (2006), Journal of Machine Learning Research.
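A toy version of the LiNGAM idea, not Shimizu et al.'s ICA-based algorithm: with non-Gaussian disturbances, only the true causal direction leaves a regression residual independent of the regressor. The sketch below (Python, numpy only; variable names and coefficients are illustrative) uses the correlation of squared values as a crude dependence score, since OLS forces the plain correlation to zero in both directions:

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# True model X -> Y, with uniform (hence non-Gaussian) disturbances.
x = rng.uniform(-1, 1, size=n)
y = 2.0 * x + rng.uniform(-1, 1, size=n)

def residual(target, regressor):
    # OLS residual of target regressed on regressor, with an intercept.
    A = np.column_stack([np.ones(len(regressor)), regressor])
    return target - A @ np.linalg.lstsq(A, target, rcond=None)[0]

def dependence(u, v):
    # Crude nonlinear dependence score: |corr(u^2, v^2)|, near zero when
    # u and v are independent. Plain corr(u, v) is zero in BOTH directions
    # by construction of OLS, so it cannot pick out the causal direction.
    return abs(np.corrcoef(u**2, v**2)[0, 1])

print(dependence(x, residual(y, x)))   # ~ 0: residual independent, X -> Y fits
print(dependence(y, residual(x, y)))   # clearly > 0: Y -> X is rejected

With Gaussian disturbances both scores would be near zero and the direction would remain undetermined, which is the d-separation-equivalence limit the slide describes LiNGAM breaking.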
II. Feedback Systems
Truth: a cyclic (feedback) system over X, Y, Z, W.
Two methods:
1. Modified LiNGAM: Lacerda, Spirtes, & Hoyer (2008). Discovering cyclic causal models by independent component analysis. UAI.
2. Conditional independencies: Richardson & Spirtes (1999). Discovery of linear cyclic models.
[Figure: the true cyclic graph and the outputs of the two methods.]

II. Missed Opportunities?
None of the machine learning/statistical methods in Part II have been used with imaging data. Instead:
Trial-and-error guessing and data fitting.
Regression.
Granger causality for time series.
Exhaustive testing of all linear models.
How come? Unfamiliarity. And the machine learning/statistical methods respect what it is possible to learn (in the large-sample limit), which is often less than researchers want to conclude.

III. Simple Possible Errors
Pooling data from different subjects: If X and Y are independent in population P1 and in population P2, but have different probability distributions in the two populations, then X and Y are usually not independent in P1 ∪ P2 (G. Yule, 1904).
Pooling data from different time points in fMRI series: If the series is not stationary, data are being pooled as above. One can remove trends, but that doesn't guarantee stationarity.

III. Eliminating Opportunities
Removing autocorrelation by regression interferes with discovering feedback between variables.
Data manipulations that tend to make variables Gaussian, such as spatial smoothing and defining variables by principal components or averages over ROIs, eliminate or reduce the possibility of taking advantage of LiNGAM algorithms.

III. Simple Limitations
Testing all models (e.g., with the LISREL chi-square) is a consistent search method for linear, Gaussian models (folk theorem). But it is not feasible except for very small numbers of variables: for 8 variables there are 3^28 = 22,876,792,454,961 directed graphs (each of the 28 variable pairs is unconnected or connected by an edge in one of two directions).

III. Not So Simple Possible Errors: Variables Defined on ROIs as Proxies for Latent Variables
[Figure: a latent chain X -> Y -> Z, with A, B, C measured proxies of X, Y, Z respectively.]
X is independent of Z conditional on Y. But unless B is a perfect measure of Y, A is not independent of C conditional on B. So if A, B, and C are taken as "proxies" for X, Y, and Z, a regression of C on A and B will find, correctly, that X has an indirect influence on Z through Y, but also, incorrectly, that X has in addition a direct influence on Z not through Y.

III. Not So Obvious Errors: Regression
Regression comes in lots of forms: linear, polynomial, logistic, etc. All have the following features:
A prior separation of variables into an outcome, Y, and a set S of possible causes of Y: A, B, C, etc.
The regression estimate of the influence of A on Y is a measure of the association of A and Y conditional on all the other variables in S.
Regression for causal effects always attempts to estimate the direct (relative to the other variables in S) influence of A on Y.

III. Regression to Estimate Causal Influence
• Let V = {X, Y, T}, where
- Y: the measured outcome;
- X = {X1, X2, …, Xn}: the measured regressors;
- T = {T1, …, Tk}: latent common causes of pairs in X ∪ {Y}.
• Let the true causal model over V be a structural equation model in which each V in V is a linear combination of its direct causes and independent, Gaussian noise.

III. Regression to Estimate Causal Influence
Consider the regression equation:
Y = β0 + β1 X1 + β2 X2 + … + βn Xn
Let the OLS regression estimate β̂i be the estimated causal influence of Xi on Y. That is, hypothetically holding X \ {Xi} experimentally constant, β̂i is an estimate of the change in E(Y) that would result from an intervention that changes Xi by 1 unit.
Let the real causal influence of Xi on Y be bi.
When is the OLS estimate β̂i a consistent estimate of bi?
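Before the answer, a minimal simulation of the first failure mode discussed on the next slide (Python, numpy only; L, the coefficients, and the sample size are illustrative): an unrecorded common cause of Xi and Y makes the OLS coefficient converge to something other than the true influence bi.

import numpy as np

rng = np.random.default_rng(2)
n = 500_000

b = 1.0                           # the real causal influence of Xi on Y
L = rng.normal(size=n)            # unrecorded common cause
Xi = 0.9 * L + rng.normal(size=n)
Y = b * Xi + 0.9 * L + rng.normal(size=n)

# OLS of Y on Xi leaves the back path Xi <- L -> Y open, so the estimate
# converges to b + cov(0.9 L, Xi) / var(Xi) = 1.0 + 0.81 / 1.81, not to b.
A = np.column_stack([np.ones(n), Xi])
beta_hat = np.linalg.lstsq(A, Y, rcond=None)[0][1]
print(beta_hat)                   # ~ 1.45, a biased estimate of b = 1.0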
III. Regression Will Be "Inconsistent" When
1. There is an unrecorded common cause L of Xi and Y: Xi <- L -> Y.
If X and Y are the only measured variables, PC, GES, and FCI cannot determine whether the influence is from X to Y, or from an unmeasured common cause, or both. LiNGAM can, if the disturbances are non-Gaussian.

III. Regression Will Be "Inconsistent" When
2. Cause and effect are confused: Xi <- Y.
"…one region, with a long haemodynamic latency, could cause a neuronal response in another that was expressed, haemodynamically, before the source." (Friston, et al., 2007, 602.) LiNGAM does not make this error.
3. And that error can lead to others: [Figure: a graph over Xi, Xk, Y.] Regression concludes Xk is a cause of Y.
FCI, etc., do not make these errors.

III. Bad Regression Example 1
[Figure: a true model over regressors X1, X2, X3, latent confounders T1, T2, and outcome Y, beside the multiple regression result, which is wrong about the coefficients of X2 and X3.]
PC, GES, and FCI get these kinds of cases right.

III. Regression Consistency
• If Xi is d-separated from Y conditional on X \ {Xi} in the true graph after removing the edge Xi -> Y, and X contains no descendant of Y, then β̂i is a consistent estimate of bi.

III. Granger Causality
Idea: a stationary time series X is a Granger cause of Y iff {…, Xt-1} together with {…, Yt-1} predicts Yt better than {…, Yt-1} alone.
Obvious generalizations:
Non-Gaussian time series.
Multiple time series, essentially a time-series version of multiple regression: X is a Granger cause of Y iff Yt is not independent of …, Xt-1 conditional on the covariates …, Zt-1.
Less obvious generalizations:
Non-linear time series (finding conditional independence tests is touchy).
C. Granger, Econometrica, 1969.

GC All Over the Place
Goebel, R., Roebroeck, A., Kim, D., and Formisano, E. (2003). Investigating directed cortical interactions in time-resolved fMRI data using vector autoregressive modeling and Granger causality mapping. Magnetic Resonance Imaging, 21: 1251-1261.
Chen, Y., Bressler, S.L., Knuth, K.H., Truccolo, W.A., and Ding, M.Z. (2006). Stochastic modeling of neurobiological time series: power, coherence, Granger causality, and separation of evoked responses from ongoing activity. Chaos, 16: 026113.
Brovelli, A., Ding, M.Z., Ledberg, A., Chen, Y.H., Nakamura, R., and Bressler, S.L. (2004). Beta oscillations in a large-scale sensorimotor cortical network: directional influences revealed by Granger causality. Proc. Natl. Acad. Sci. U.S.A., 101: 9849-9854.
Deshpande, G., Hu, X., Stilla, R., and Sathian, K. (2008). Effective connectivity during haptic perception: a study using Granger causality analysis of functional magnetic resonance imaging data. NeuroImage, 40: 1807-1814.

III. Problems with GC
• fMRI series with multiple conditions are not stationary. (This may not always be serious.)
• GC can produce causal errors when there is measurement error or unmeasured confounding series.
Open research problem: find a consistent method to identify unrecorded common causes of time series, akin to Silva, et al., JMLR, 2006, for equilibrium data; Glymour and Spirtes, J. of Econometrics, 1988.

III. If Xt records an event occurring later than Yt+1, X may be mistakenly taken to be a cause of Y. (Friston, 2007, again.)
• This is a problem for regression.
• It is not a problem if PC, FCI, GES, or LiNGAM are used in estimating the "structural VAR," because they do not require a separation of variables into outcome and potential cause, or a time ordering of variables.

III. Granger Causality and Mechanisms
Neural signals occur faster than the fMRI sampling rate: what is going on in between?
[Figure: the variables unfolded over fast time steps X1…X4, Y1…Y4, Z1…Z4, W1…W4, with the intermediate steps unobserved; the Granger causes estimated at the sampled rate include spurious edges among W, X, Y, Z.]

III. Analysis of Residuals
Regress on the first time-slice X1, Y1, Z1, W1, and apply PC, etc., to the residuals.
[Figure: the same fast-time-scale graph with unobserved intermediate steps.]
Swanson and Granger, JASA; Demiralp and Hoover (2003), Oxford Bulletin of Economics and Statistics.
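A minimal simulation of the sampling-rate worry above (Python, numpy only; the chain structure and coefficients are illustrative): at the fast time scale X drives Y and Y drives Z, but a lag-one regression on the series observed at every second step finds a "direct" X-to-Z dependence that the true mechanism lacks, and misses the true X-to-Y link entirely.

import numpy as np

rng = np.random.default_rng(3)
T = 200_000

# Fast ("neural") time scale: X -> Y -> Z, each link with a one-step lag.
x = rng.normal(size=T)
y = np.zeros(T)
z = np.zeros(T)
y[1:] = 0.8 * x[:-1] + 0.5 * rng.normal(size=T - 1)
z[1:] = 0.8 * y[:-1] + 0.5 * rng.normal(size=T - 1)

# Observe only every second step, mimicking a slow sampling rate.
xs, ys, zs = x[::2], y[::2], z[::2]

def lag1_coef(target, source):
    # OLS coefficient of target[t] on source[t-1] at the observed scale.
    A = np.column_stack([np.ones(len(source) - 1), source[:-1]])
    return np.linalg.lstsq(A, target[1:], rcond=None)[0][1]

print(lag1_coef(zs, xs))   # ~ 0.64: a "direct" X -> Z edge the mechanism lacks
print(lag1_coef(ys, xs))   # ~ 0: the true X -> Y link leaves no trace, since Y
                           # at even times depends on X at odd, unobserved times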
Conclusion
Causal inference from imaging data is about as hard as it gets.
Conventional statistical procedures are radically insufficient tools.
There are lots of unused, potentially relevant, principled tools in the machine learning literature.
Measurement methods and data transformations can alter the probability distributions in destructive ways.
Graphical causal models are the best available tool for thinking about the statistical constraints that causal hypotheses imply.

Things There Aren't:
Magic Wands
Pixie Dust

If You Forget Everything Else in This Talk, Remember This:
P. Spirtes, et al., Causation, Prediction and Search, Springer Lecture Notes in Statistics, 1993; 2nd edition, MIT Press, 2000.
J. Pearl, Causality, Oxford, 2000.
The Uncertainty in Artificial Intelligence annual conference proceedings.
The Journal of Machine Learning Research.
Peter Spirtes' webpage.
Judea Pearl's webpage.
The TETRAD webpage.