Graphical Causal Models
Clark Glymour
Carnegie Mellon University
Florida Institute for Human and Machine Cognition
1
Outline

Part I: Goals and the Miracle of d-separation
Part II: Statistical/Machine Learning Search and Discovery Methods for Causal Relations
Part III: A Bevy of Causal Analysis Problems
2
I. Brains, Trains, and Automobiles: Cognitive Neuroscience as Reverse Auto Mechanics

Idea: Like autos, like trains, like computers, brains have parts.
The parts influence one another to produce a behavior.
The parts can have roles in multiple behaviors.
Big parts have littler parts.
3
I. Goals of the Automobile Hypothesis

Overall goals:
Identify the parts critical to behaviors of interest.
Figure out how they influence one another, in what timing sequences.

Imaging goals:
Identify relatively BIG parts (ROIs).
Figure out how they influence one another, with what timing sequences, in producing behaviors of interest.
4
I. Goal: From Data to Mechanisms

[Figure: multivariate time series (A, B, C, D) mapped to causal relations among neurally localized variables (X, Y, Z, W)]
5
I. Graphical Causal Models: the Abstract Structure of Influences

[Diagram of a brake mechanism: Push brake -> Fluid level in master cylinder -> Fluid in caliper and Fluid in wheel cylinder -> Friction of pads against rotor and Friction of shoe against wheel -> Vehicle deceleration]

This system is deterministic (we hope).
6
I. Philosophical Objections

Objection: “Cause” is a vague, metaphysical notion.
Response: Compare “probability.”

Objection: “Probability” has a mathematical structure. “Causation” does not.
Response: See Spirtes, et al., Causation, Prediction and Search, 1993, 2000; Pearl, Causality, 2000. Listen to Pearl’s lecture this afternoon.

Objection: The real causes are at the synaptic level, so talk of ROIs as causes is nonsense. “…for many this rhetoric represents a category error…because causal [sic] is an attribute of the state equation.” (Friston, et al., 2007, 602.)
Response: So, do you think “smoking causes cancer” is nonsense? “Human activities cause global temperature increases” is nonsense? “Turning the ignition key causes the car to start” is nonsense?
7
I. The Abstract Structure of Influences

This system is not deterministic.

Linear causal models (SEMs) specify a directed graphical structure:
MedFGl(b) := a CING(b) + e1
STG(b) := b CING(b) + e2
IPL(b) := c STG(b) + d CING(b) + e3
e1, e2, e3 jointly independent

But so does any functional form of the influences:
MedFGl(b) := f(CING(b)) + e1
STG(b) := g(CING(b)) + e2
IPL(b) := h(STG(b), CING(b)) + e3
e1, e2, e3 jointly independent

[Figure from S. Hanson, et al., 2008: Middle Occipital Gyrus (mog), Inferior Parietal Lobule (ipl), Middle Frontal Gyrus (mfg), and Inferior Frontal Gyrus (ifg)]
8
I. So What?

1. The directed graph codes the conditional independence relations implied by the model:
MedFGl(b) || {STG(b), IPL(b)} | CING(b)
2. (Almost) all of our tests of models are tests of implications of their conditional independence claims.

So what is the code?
9
I. d-separation Is the Code!

[Diagram: directed graph over X, Y, Z, W, with additional variables R and S]

X || {Z, W} | Y
X || W | Z
NOT X || W | R
NOT X || W | S
NOT X || W | {Y, Z, R}
NOT X || W | {Y, Z, S}

(J. Pearl, 1988)

Conditioning on a variable in a directed path between X and W blocks the association produced by that path. Conditioning on a variable that is a common descendant of X and W creates a path that produces an association between X and W.

What about systems with cycles? d-separation characterizes conditional independence relations in all such linear systems. (P. Spirtes, 1996)
10
I. How To Determine If Variables A and Z Are Independent Conditional on a Set Q of Variables

1. Consider each sequence p of edge-adjacent variables (each edge taken in either direction), without self-intersections, terminating in A and Z.
2. A collider on p is a variable N on p such that the variables M, O adjacent to it on p each have edges directed into N: M -> N <- O.
3. Sequence (path) p creates a dependency between A and Z conditional on Q if and only if:
   a. No non-collider on p is in Q, and
   b. Every collider on p is in Q or has a descendant in Q (a directed path from the collider to a member of Q).
11
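A minimal executable sketch of this test, assuming a graph given as a dict mapping each node to its set of children; the brute-force path enumeration and the example graph are illustrative only.

```python
def all_paths(adj, a, z, path=None):
    """Enumerate paths (node sequences without repeats) between a and z,
    traversing edges in either direction. adj maps node -> set of children."""
    path = path or [a]
    if a == z:
        yield list(path)
        return
    neighbors = adj.get(a, set()) | {u for u, ch in adj.items() if a in ch}
    for n in neighbors:
        if n not in path:
            yield from all_paths(adj, n, z, path + [n])

def descendants(adj, v):
    """All nodes reachable from v by a directed path."""
    out, stack = set(), [v]
    while stack:
        for child in adj.get(stack.pop(), set()):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def d_connected(adj, a, z, q):
    """True iff some path between a and z creates a dependency given set q."""
    for p in all_paths(adj, a, z):
        active = True
        for m, n, o in zip(p, p[1:], p[2:]):
            is_collider = n in adj.get(m, set()) and n in adj.get(o, set())
            if is_collider:
                if n not in q and not (descendants(adj, n) & q):
                    active = False  # collider neither in q nor has descendant in q
                    break
            elif n in q:
                active = False      # non-collider in q blocks the path
                break
        if active:
            return True
    return False

# Example: collider X -> Y <- Z, with Y -> S.
adj = {'X': {'Y'}, 'Z': {'Y'}, 'Y': {'S'}}
print(d_connected(adj, 'X', 'Z', set()))   # False: unconditioned collider blocks
print(d_connected(adj, 'X', 'Z', {'S'}))   # True: conditioning on Y's descendant
```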
II. So, What Can We Do With It?

Exploit d-separation in conjunction with distribution assumptions to estimate graphical causal structure from sample data.
Understand when data analysis and measurement methods distort conditional independence relations in target systems.

Wrong conditional independence relations => wrong d-separation relations => wrong causal structure.
12
II. Simple Illustration (PC)

Truth: [graph: X -> Y <- Z, Y -> W]
Consequences: X || Z and {X, Z} || W | Y

Method: [diagrams: begin with the complete undirected graph over X, Y, Z, W; remove the edges the independence facts license; orient what remains]

Spirtes, Glymour, & Scheines (1993). Causation, Prediction, & Search, Springer Lecture Notes in Statistics.
13
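A minimal sketch of the PC adjacency (skeleton) phase under linear-Gaussian assumptions, using Fisher-z partial-correlation tests; the threshold and the simulated dataset are illustrative only.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def independent(data, i, j, cond, alpha=0.05):
    """Fisher-z test of zero partial correlation between columns i and j
    given the columns in cond; True means 'judged independent'."""
    sub = data[:, [i, j] + list(cond)]
    prec = np.linalg.pinv(np.corrcoef(sub, rowvar=False))
    r = np.clip(-prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1]), -0.9999, 0.9999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z))) > alpha

def pc_skeleton(data, alpha=0.05):
    """Adjacency phase of PC: start complete, remove an edge whenever its
    endpoints test independent given some subset of current neighbors."""
    n = data.shape[1]
    adj = {i: set(range(n)) - {i} for i in range(n)}
    depth = 0
    while any(len(adj[i]) - 1 >= depth for i in adj):
        for i in range(n):
            for j in list(adj[i]):
                for cond in combinations(adj[i] - {j}, depth):
                    if independent(data, i, j, cond, alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
        depth += 1
    return adj

# Data from the slide's truth: X -> Y <- Z, Y -> W.
rng = np.random.default_rng(0)
X, Z = rng.normal(size=(2, 2000))
Y = X + Z + rng.normal(size=2000)
W = Y + rng.normal(size=2000)
print(pc_skeleton(np.column_stack([X, Y, Z, W])))
# Expected skeleton: {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}
```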
II. Bayesian Search: Greedy Equivalence Search (GES)

1. Start with the empty graph.
2. Add or change the edge that most increases fit.
3. Iterate.

Truth: [graph over X, Y, Z, W]
Data -> [model with highest posterior probability]

Chickering and Meek, Uncertainty in Artificial Intelligence Proceedings, 2003.
14
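A toy forward search in the spirit of GES, greedily adding the single directed edge that most improves a BIC score for a linear-Gaussian model. Real GES searches over equivalence classes and has a backward phase, which this sketch omits.

```python
import numpy as np
from itertools import permutations

def bic_node(data, child, parents):
    """BIC-style local score of one node given its parents (linear-Gaussian)."""
    n = len(data)
    X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    resid = data[:, child] - X @ np.linalg.lstsq(X, data[:, child], rcond=None)[0]
    return -n * np.log(resid.var() + 1e-12) - np.log(n) * X.shape[1]

def creates_cycle(parents, i, j):
    """Would adding i -> j close a directed cycle (is i reachable from j)?"""
    stack, seen = [j], set()
    while stack:
        u = stack.pop()
        if u == i:
            return True
        seen.add(u)
        stack += [v for v in parents if u in parents[v] and v not in seen]
    return False

def greedy_forward(data):
    """Greedily add the directed edge that most improves the total score."""
    d = data.shape[1]
    parents = {i: [] for i in range(d)}
    while True:
        best, best_gain = None, 1e-9
        for i, j in permutations(range(d), 2):
            if i in parents[j] or creates_cycle(parents, i, j):
                continue
            gain = bic_node(data, j, parents[j] + [i]) - bic_node(data, j, parents[j])
            if gain > best_gain:
                best, best_gain = (i, j), gain
        if best is None:
            return parents
        parents[best[1]].append(best[0])

# Same truth as before: X -> Y <- Z, Y -> W.
rng = np.random.default_rng(0)
X, Z = rng.normal(size=(2, 5000))
Y = X + Z + rng.normal(size=5000)
W = Y + rng.normal(size=5000)
print(greedy_forward(np.column_stack([X, Y, Z, W])))
# Expected parent sets: {0: [], 1: [0, 2], 2: [], 3: [1]}
```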
II. With Unknown, Unrecorded Confounders: FCI

Truth: [graph over X, Y, Z, W with an unrecorded variable]
Data -> FCI output: [partial ancestral graph over X, Y, Z, W]

A consistent estimator under i.i.d. sampling, but in other cases it is often uninformative.

Spirtes, et al., Causation, Prediction and Search.
15
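None of these searches needs to be hand-rolled. A usage sketch, assuming the open-source causal-learn package (pip install causal-learn); the import paths are recalled from its documentation and should be checked against the current release, and the data file is hypothetical.

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc
from causallearn.search.ConstraintBased.FCI import fci
from causallearn.search.ScoreBased.GES import ges

data = np.loadtxt("roi_timeseries.csv", delimiter=",")  # hypothetical file

cg = pc(data)         # CPDAG; assumes no unrecorded confounders
g, edges = fci(data)  # PAG; tolerates unrecorded confounders
record = ges(data)    # score-based search; record['G'] holds the CPDAG
```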
II. Overlapping Databases: ION

Truth: [graph over W, X, Y, Z, R, S]
D1, D2: two datasets, each recording an overlapping subset of the variables [sample data tables omitted]

In this case the ION algorithm recovers the full graph! But in other cases it often generates a number of alternative models.

Danks, Tillman and Glymour, NIPS, 2008.
16
II. Time Series (Structural VAR)

Basic idea: PC or GES style search on “relative” time-slices.

Example: additive, non-linear model of climate teleconnections (5 ocean indices; 563-month series).

Chu & Glymour, 2008, Journal of Machine Learning Research.
17
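A minimal sketch of the time-slice idea: stack lagged copies of the series into one data matrix and hand it to an equilibrium-style search (e.g., the pc_skeleton sketch above). The lag count and variable names are illustrative.

```python
import numpy as np

def lagged_matrix(series, lags=2):
    """series: (T, d) array. Returns rows [x_t, x_{t-1}, ..., x_{t-lags}],
    so a search can treat the time-slices as ordinary variables."""
    T, d = series.shape
    cols = [series[lags - k : T - k] for k in range(lags + 1)]
    return np.column_stack(cols)   # shape (T - lags, d * (lags + 1))

# Usage: search the lagged matrix, then keep only edges pointing from
# earlier slices into the current one (time order orients them), e.g.:
# skel = pc_skeleton(lagged_matrix(ocean_indices, lags=2))
```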
II. Discovering Latent Variables

Truth: [latent variables T1, T2, T3, each with measured indicators among M1–M12]

Cluster the M’s using a heuristic or Build Pure Clusters (Silva, et al., JMLR, 2006), then apply GES to the latents.

Output: [recovered structure over T1, T2, T3 with pure indicator clusters such as {M1, M2, M3}, {M5, M6}, {M9, M10, M11, M12}]

Applicable to time series?
18
II. Limits of PC and GES

X -> Y -> Z predicts the same independencies as X <- Y <- Z and X <- Y -> Z: all of these graphs are d-separation equivalent.

With i.i.d. samples and correct distribution families, PC and GES give correct information almost surely in the large sample limit, assuming no unrecorded common causes.
The methods work with “random effects” for linear models.
But they do not give all the information we want: often they cannot determine the directions of influences!
One can post-process with an exhaustive test of all orientations (a heuristic).
Adjacencies are more reliable than directions of edges.
19
II. Breaking Down d-separation Equivalence: LiNGAM

[Graph over X, Y, Z]

Linear equations (reduced):
X = eX
Y = aX X + eY
Z = bX X + bY Y + eZ

The disturbance terms must be non-Gaussian. Then the structure is discoverable by LiNGAM (ICA + algebra)!

Shimizu, et al. (2006), Journal of Machine Learning Research.
20
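A minimal two-variable sketch of why non-Gaussianity breaks the tie: under X -> Y with non-Gaussian disturbances, the residual of regressing Y on X is independent of X, while the reverse regression's residual is not. The correlation-of-squares dependence measure below is a crude stand-in for what LiNGAM-style methods use (it is zero in both directions in the Gaussian case); data are assumed zero-mean, so no intercept is fit.

```python
import numpy as np

def dependence(a, b):
    """Dependence between regressor and residual, measured as the
    correlation of their squares; zero in population under independence."""
    return abs(np.corrcoef(a ** 2, b ** 2)[0, 1])

def pairwise_direction(x, y):
    """Pick the direction whose regression residual looks independent of
    the regressor (identifiable only with non-Gaussian disturbances)."""
    c = np.cov(x, y)
    r_y_on_x = y - (c[0, 1] / c[0, 0]) * x   # residual of y ~ x
    r_x_on_y = x - (c[0, 1] / c[1, 1]) * y   # residual of x ~ y
    return 'x -> y' if dependence(x, r_y_on_x) < dependence(y, r_x_on_y) else 'y -> x'

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50000)             # non-Gaussian disturbance
y = 0.8 * x + rng.uniform(-1, 1, 50000)   # true model: x -> y
print(pairwise_direction(x, y))            # expected: 'x -> y'
```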
II. Feedback Systems

Truth: [cyclic graph over X, Y, Z, W]

Two methods:
Modified LiNGAM: Lacerda, Spirtes, & Hoyer (2008). Discovering cyclic causal models by independent component analysis. UAI.
Conditional independencies: Richardson & Spirtes (1999). Discovery of linear cyclic models.

[Output graphs over X, Y, Z, W]
21
II. Missed Opportunities?

None of the machine learning/statistical methods in Part II have been used with imaging data. Instead:
Trial and error guessing and data fitting
Regression
Granger causality for time series
Exhaustive testing of all linear models

How come?
Unfamiliarity.
The machine learning/statistical methods respect what it is possible to learn (in the large sample limit), which is often less than researchers want to conclude.
22
III. Simple Possible Errors

Pooling data from different subjects:
If X and Y are independent in population P1 and in population P2, but have different probability distributions in the two populations, then X and Y are usually not independent in P1 ∪ P2. (G. Yule, 1904.)

Pooling data from different time points in fMRI series:
If the series is not stationary, data are being pooled as above.
One can remove trends, but that doesn’t guarantee stationarity.
23
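A small simulation of Yule's point, under assumed distributions: X and Y are independent within each population, with different means, yet strongly correlated in the pooled sample.

```python
import numpy as np

rng = np.random.default_rng(0)
# Population 1: X, Y independent, both centered at 0.
x1, y1 = rng.normal(0, 1, 5000), rng.normal(0, 1, 5000)
# Population 2: X, Y independent, both centered at 3.
x2, y2 = rng.normal(3, 1, 5000), rng.normal(3, 1, 5000)

print(np.corrcoef(x1, y1)[0, 1])   # ~0 within P1
print(np.corrcoef(x2, y2)[0, 1])   # ~0 within P2

pooled_x = np.concatenate([x1, x2])
pooled_y = np.concatenate([y1, y2])
print(np.corrcoef(pooled_x, pooled_y)[0, 1])   # ~0.7: spurious dependence
```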
III. Eliminating Opportunities

Removing autocorrelation by regression interferes with discovering feedback between variables.
Data manipulations that tend to make variables Gaussian (spatial smoothing; variables defined by principal components or averages over ROIs) eliminate or reduce the possibility of taking advantage of LiNGAM algorithms.
24
III. Simple Limitations

Testing all models (e.g., with LISREL chi-square) is a consistent search method for linear, Gaussian models (folk theorem).
But it is not feasible except for very small numbers of variables; e.g., for 8 variables there are
3^28 = 22,876,792,454,961
directed graphs.
25
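The count comes from the 28 = 8·7/2 variable pairs, each of which can carry no edge, a left-pointing edge, or a right-pointing edge; a one-line check:

```python
n = 8
print(3 ** (n * (n - 1) // 2))   # 22876792454961 directed graphs (no 2-cycles)
```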
III. Not So Simple Possible Errors: Variables Defined on ROIs as Proxies for Latent Variables

[Diagram: latent chain X -> Y -> Z, with A, B, C measured proxies of X, Y, Z respectively]

X is independent of Z conditional on Y.
But unless B is a perfect measure of Y, A is not independent of C conditional on B.
So if A, B, and C are taken as “proxies” for X, Y and Z, a regression of C on A and B will find, correctly, that X has an indirect influence on Z, through Y, but also, incorrectly, that X has in addition a direct influence on Z not through Y.
26
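A quick simulation of the proxy failure, under an assumed linear-Gaussian structure: the regression coefficient on A stays clearly nonzero even though X has no direct effect on Z.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100000
X = rng.normal(size=n)
Y = X + rng.normal(size=n)           # X -> Y
Z = Y + rng.normal(size=n)           # Y -> Z (no direct X -> Z edge)
A, C = X, Z                          # perfect proxies for X and Z
B = Y + 0.8 * rng.normal(size=n)     # noisy proxy for Y

# OLS of C on A and B: the coefficient on A would be 0 if B screened off A.
design = np.column_stack([np.ones(n), A, B])
coef = np.linalg.lstsq(design, C, rcond=None)[0]
print(coef)   # coefficient on A is clearly nonzero: a spurious "direct effect"
```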
III. Not So Obvious Errors: Regression

Lots of forms: linear, polynomial, logistic, etc.
All have the following features:
Prior separation of variables into an outcome, Y, and a set S of possible causes, A, B, C, etc., of Y.
The regression estimate of the influence of A on Y is a measure of the association of A and Y conditional on all other variables in S.
Regression for causal effects always attempts to estimate the direct (relative to other variables in S) influence of A on Y.
27
III. Regression to Estimate Causal Influence

Let V = {X, Y, T}, where
- Y: measured outcome
- X = {X1, X2, …, Xn}: measured regressors
- T = {T1, …, Tk}: latent common causes of pairs in X ∪ {Y}

Let the true causal model over V be a Structural Equation Model in which each V ∈ V is a linear combination of its direct causes and independent, Gaussian noise.
28
III. Regression to Estimate Causal Influence

Consider the regression equation:
Y = β0 + β1 X1 + β2 X2 + … + βn Xn
Let the OLS regression estimate β̂i be the estimated causal influence of Xi on Y.
That is, hypothetically holding X\Xi experimentally constant, β̂i is an estimate of the change in E(Y) that would result from an intervention that changes Xi by 1 unit.
Let the real causal influence of Xi on Y be bi.
When is the OLS estimate β̂i a consistent estimate of bi?
29
III. Regression Will Be “Inconsistent” When:

1. There is an unrecorded common cause L of Y and Xi:
[Diagram: Xi <- L -> Y, with Xi -> Y]

If X, Y are the only measured variables, PC, GES and FCI cannot determine whether the influence is from X to Y or from an unmeasured common cause, or both. LiNGAM can, if the disturbances are non-Gaussian.
30
Regression Will Be “Inconsistent” When:

2. Cause and effect are confused: Xi <- Y
“…one region, with a long haemodynamic latency, could cause a neuronal response in another that was expressed, haemodynamically, before the source.” (Friston, et al., 2007, 602.) LiNGAM does not make this error.

3. And that error can lead to others: [diagram over Xi, Xk, Y]
Regression concludes Xk is a cause of Y. FCI, etc. do not make these errors.
31
Bad Regression Example

True model: [diagram: regressors X1, X2, X3 and latent common causes T1, T2 of the outcome Y]

Multiple regression result:
β1 ≠ 0 (correct)
β2 ≠ 0 (wrong)
β3 ≠ 0 (wrong)

PC, GES, FCI get these kinds of cases right.
32
Regression Consistency

If
- Xi is d-separated from Y conditional on X\Xi in the true graph after removing Xi -> Y, and
- X contains no descendant of Y,
then β̂i is a consistent estimate of bi.
33
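A simulation of case 1 above, under assumed coefficients: with a latent common cause L, the OLS coefficient converges to something other than the true effect b = 1.0.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200000
L = rng.normal(size=n)                          # unrecorded common cause
Xi = L + rng.normal(size=n)
Y = 1.0 * Xi + 2.0 * L + rng.normal(size=n)     # true effect of Xi on Y is 1.0

design = np.column_stack([np.ones(n), Xi])
beta = np.linalg.lstsq(design, Y, rcond=None)[0]
print(beta[1])   # ~2.0, not 1.0: OLS absorbs the confounded path through L
```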
III. Granger Causality

Idea: for stationary time series, X is a Granger cause of Y iff {…, Xt-1; …, Yt-1} predicts Yt better than does {…, Yt-1}.

Obvious generalizations:
Non-Gaussian time series.
Multiple time series: essentially the time series version of multiple regression. X is a Granger cause of Y iff Yt is not independent of …, Xt-1 conditional on covariates …, Zt-1.

Less obvious generalizations:
Non-linear time series (finding conditional independence tests is touchy).

C. Granger, Econometrica, 1969.
34
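A minimal lag-1 version of the idea, assuming linear models: compare the residual variance of Y predicted from its own past against prediction that also uses X's past. A formal Granger test would compare the two with an F statistic.

```python
import numpy as np

def granger_lag1(x, y):
    """Residual sums of squares for y_t ~ y_{t-1} vs y_t ~ y_{t-1} + x_{t-1}."""
    yt, ylag, xlag = y[1:], y[:-1], x[:-1]
    def rss(design):
        beta = np.linalg.lstsq(design, yt, rcond=None)[0]
        return float(((yt - design @ beta) ** 2).sum())
    ones = np.ones_like(yt)
    restricted = rss(np.column_stack([ones, ylag]))
    full = rss(np.column_stack([ones, ylag, xlag]))
    return restricted, full

rng = np.random.default_rng(4)
x = rng.normal(size=2001)
y = np.zeros(2001)
for t in range(1, 2001):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()
print(granger_lag1(x, y))  # restricted RSS clearly larger: X Granger-causes Y
```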
GC All Over the Place

Goebel, R., Roebroeck, A., Kim, D., and Formisano, E. (2003). Investigating directed cortical interactions in time-resolved fMRI data using vector autoregressive modeling and Granger causality mapping. Magnetic Resonance Imaging, 21: 1251-1261.
Chen, Y., Bressler, S.L., Knuth, K.H., Truccolo, W.A., Ding, M.Z. (2006). Stochastic modeling of neurobiological time series: power, coherence, Granger causality, and separation of evoked responses from ongoing activity. Chaos, 16: 026113.
Brovelli, A., Ding, M.Z., Ledberg, A., Chen, Y.H., Nakamura, R., Bressler, S.L. (2004). Beta oscillations in a large-scale sensorimotor cortical network: directional influences revealed by Granger causality. Proc. Natl. Acad. Sci. U.S.A., 101: 9849-9854.
Deshpande, G., Hu, X., Stilla, R., and Sathian, K. (2008). Effective connectivity during haptic perception: A study using Granger causality analysis of functional magnetic resonance imaging data. NeuroImage, 40: 1807-1814.
35
III. Problems with GC

fMRI series with multiple conditions are not stationary. (May not always be serious.)
GC can produce causal errors when there is measurement error or unmeasured confounding series.
Open research problem: find a consistent method to identify unrecorded common causes of time series, akin to Silva, et al., JMLR, 2006 for equilibrium data; Glymour and Spirtes, J. of Econometrics, 1988.
36
III. If Xt records an event occurring later than Yt+1, X may be mistakenly taken to be a cause of Y. (Friston, 2007, again.)

This is a problem for regression.
It is not a problem if PC, FCI, GES or LiNGAM are used in estimating the “structural VAR,” because they do not require a separation of variables into outcome and potential cause, or a time ordering of variables.
37
III. Granger Causality and Mechanisms

Neural signals occur faster than the fMRI sampling rate; what is going on in between?

[Diagram: unobserved fast dynamics among X1…X4, Y1…Y4, Z1…Z4, W1…W4 between scans; the Granger causes among the observed W, X, Y, Z include spurious edges]
38
III. Analysis of Residuals

Regress on X1, Y1, Z1, W1; then apply PC, etc. to the residuals.

[Diagram: the same unobserved fast time scale as before; searching over the residuals removes the spurious edges]

Swanson and Granger, JASA; Demiralp and Hoover (2003), Oxford Bulletin of Economics and Statistics.
39
Conclusion

Causal inference from imaging data is about as hard as it gets;
Conventional statistical procedures are radically insufficient tools;
Lots of unused, potentially relevant, principled tools sit in the machine learning literature;
Measurement methods and data transformations can alter the probability distributions in destructive ways;
Graphical causal models are the best available tool for thinking about the statistical constraints that causal hypotheses imply.
40
Things There Aren’t:
Magic Wands
Pixie Dust
41
If You Forget Everything Else in This Talk, Remember This:

P. Spirtes, et al., Causation, Prediction and Search, Springer Lecture Notes in Statistics; 2nd edition, MIT Press, 2000.
J. Pearl, Causality, Oxford, 2000.
Uncertainty in Artificial Intelligence annual conference proceedings.
Journal of Machine Learning Research.
Peter Spirtes’ webpage.
Judea Pearl’s web page.
The TETRAD webpage.
42