Course files
http://www.andrew.cmu.edu/~ddanks/NASSLLI/

PRINCIPLES UNDERLYING CAUSAL SEARCH ALGORITHMS
Fundamental problem

As we have all heard many times…
“Correlation is not causation!”
Fundamental problem

Why is this slogan correct?
• Causal hypotheses make implicit claims about the effects of intervening on (manipulating) one or more variables
• Hypotheses about association or correlation make no such claims
• Correlation or probabilistic dependence can be produced in many ways
Fundamental problem

Some of the possible reasons why X and Y might be associated are:
• Sheer chance
• X causes Y
• Y causes X
• Some third variable Z influences X and Y
• The value of X (or a cause of X) and the value of Y (or a cause of Y) can be causes/reasons for whether an individual is in the sample (sample selection bias)
Fundamental problem

Fundamental problem of causal search:
• For any particular set of data, there are often many different causal structures that could have produced that data
• The Causation → Association map is many → one
Fundamental problem

Okay, so what can we do about this?
• Use the data to figure out as much as possible (though it usually won’t be everything)
  • Requires developing search procedures
• And then try to narrow the possibilities
  • Use other knowledge (e.g., time order, interventions)
  • Get better / different data (e.g., run an experiment)
Always remember…
Even if we cannot discover the whole truth,
we might be able to find some of the truth!
Markov equivalence

Formally, we say that:
• Two causal graphs are members of the same Markov Equivalence Class iff they imply the exact same (un)conditional independence relations among the observed variables
  • By the Markov and Faithfulness assumptions
• Remember that d-separation gives a purely graphical criterion for determining all of the (un)conditional independencies
Markov equivalence

The “Fundamental Problem of Causal Inference” can be restated as:
• For some sets of independence relations, the Markov equivalence class is not a singleton

Markov equivalence classes give a precise characterization of what can be inferred from independencies alone
Markov equivalence

Examples:
[Figure: for several independence patterns over X, Y, Z (e.g., X ⊥ {Y, Z}; X ⊥ Y | Z; X ⊥ Y), the Markov-equivalent graphs that imply exactly those independencies]
Markov equivalence

Two more examples:
• Are these graphs Markov equivalent? [Figure: two graphs over X, Y, Z]
• Are these two graphs? [Figure: two more graphs over X, Y, Z]
Shared structure

What is shared by all of the graphs in a Markov equivalence class?
• Same “skeleton”
  • I.e., they all have the same adjacency relations
• Same “unshielded colliders”
  • I.e., X → Y ← Z with no edge between X and Z
• Sometimes, other edges have the same direction
• In these last two cases, we can infer that the true graph contains the shared directed edges.
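As a concrete illustration (not from the slides), here is a minimal Python sketch that checks Markov equivalence of two DAGs, each given as a set of directed edges, by comparing their skeletons and unshielded colliders:

```python
# Minimal sketch: two DAGs are Markov equivalent iff they have the same
# skeleton and the same unshielded colliders.

def skeleton(edges):
    """Undirected adjacencies: one frozenset {X, Y} per directed edge X -> Y."""
    return {frozenset(e) for e in edges}

def unshielded_colliders(edges):
    """Pairs ({X, Y}, Z) such that X -> Z <- Y and X, Y are not adjacent."""
    skel = skeleton(edges)
    return {(frozenset((a, b)), z) for (a, z) in edges for (b, z2) in edges
            if z == z2 and a != b and frozenset((a, b)) not in skel}

def markov_equivalent(g1, g2):
    return (skeleton(g1) == skeleton(g2)
            and unshielded_colliders(g1) == unshielded_colliders(g2))

# The chain X -> Z -> Y and the reversed chain are equivalent...
print(markov_equivalent({("X", "Z"), ("Z", "Y")}, {("Y", "Z"), ("Z", "X")}))  # True
# ...but the chain and the collider X -> Z <- Y are not.
print(markov_equivalent({("X", "Z"), ("Z", "Y")}, {("X", "Z"), ("Y", "Z")}))  # False
```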
Shared structure as patterns

Since every Markov equivalent graph has the same adjacencies, we can represent the whole class using a pattern
• A pattern is itself a graph, but the edges represent edges in other graphs
Shared structure as patterns

A pattern can have directed and undirected edges
• It represents all graphs that can be created by adding arrowheads to the undirected edges without creating either: (i) a cycle; or (ii) an unshielded collider

Let’s try some examples…
Shared structure as patterns
The pattern Nitrogen — PlantGrowth — Bees represents:
• Nitrogen → PlantGrowth → Bees
• Nitrogen ← PlantGrowth → Bees
• Nitrogen ← PlantGrowth ← Bees
Shared structure as patterns
The pattern Nitrogen → PlantGrowth ← Bees represents only:
• Nitrogen → PlantGrowth ← Bees
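To make the reading rule concrete, here is a small Python sketch (standard library only; the function names are mine, the variables follow the slide example) that enumerates the DAGs a pattern represents by trying every orientation of its undirected edges and discarding orientations that create a cycle or a new unshielded collider:

```python
# Sketch: enumerate the DAGs represented by a pattern (directed edges plus
# undirected edges), keeping every orientation of the undirected edges that
# adds no cycle and no new unshielded collider.
from itertools import product

def has_cycle(edges):
    """DFS check for a directed cycle in a set of edges (tail, head)."""
    nodes = {v for e in edges for v in e}
    children = {v: [y for (x, y) in edges if x == v] for v in nodes}
    def visit(v, stack, done):
        if v in stack:
            return True
        if v in done:
            return False
        stack.add(v)
        found = any(visit(w, stack, done) for w in children[v])
        stack.remove(v)
        done.add(v)
        return found
    return any(visit(v, set(), set()) for v in nodes)

def unshielded_colliders(edges):
    skel = {frozenset(e) for e in edges}
    return {(frozenset((a, b)), z) for (a, z) in edges for (b, z2) in edges
            if z == z2 and a != b and frozenset((a, b)) not in skel}

def pattern_members(directed, undirected):
    base = unshielded_colliders(directed)
    members = []
    for choice in product((0, 1), repeat=len(undirected)):
        oriented = {(x, y) if c == 0 else (y, x)
                    for (x, y), c in zip(undirected, choice)}
        dag = set(directed) | oriented
        if not has_cycle(dag) and unshielded_colliders(dag) == base:
            members.append(dag)
    return members

# Pattern Nitrogen — PlantGrowth — Bees: two chains and a fork are kept,
# but not the collider Nitrogen -> PlantGrowth <- Bees.
for dag in pattern_members(set(), [("Nitrogen", "PlantGrowth"), ("PlantGrowth", "Bees")]):
    print(sorted(dag))
```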
Formal problem of search

Given some dataset D, find:
• The Markov equivalence class, represented as a pattern P, that predicts exactly the independence relations found in the data

More colloquially, find the causal graphs that could have produced data like this
Hard to find a pattern


“Gee, how hard could this be? Just test all of the associations, find the Markov equivalence class, then write down the pattern for it. Voila! We’re doing causal learning!”

Big problem: the number of independencies to test grows exponentially in the number of variables:
• 2 variables ⇒ 1 test
• 3 variables ⇒ 6 tests
• 4 variables ⇒ 24 tests
• 5 variables ⇒ 80 tests
• 6 variables ⇒ 240 tests
• and so on…
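These counts correspond to one test of the form X ⊥ Y | S for each pair of variables and each conditioning set S drawn from the remaining n − 2 variables, i.e. C(n, 2) · 2^(n−2) tests; a quick check in Python:

```python
from math import comb

# One independence test X ⊥ Y | S per unordered pair {X, Y} and per subset S
# of the remaining n - 2 variables.
for n in range(2, 7):
    print(n, "variables =>", comb(n, 2) * 2 ** (n - 2), "tests")
# 2 => 1, 3 => 6, 4 => 24, 5 => 80, 6 => 240
```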
General features of causal search

• Huge model and parameter spaces
  • Even when we (necessarily) use prior information about the family of probability distributions
• Relevant statistics must be rapidly computed
• But substantive knowledge about the domain may restrict the space of alternative models
  • Time order of variables
  • Required cause/effect relationships
  • Existence or non-existence of latent variables
Three schemata for search

• Bayesian / score-based
  • Find the graph(s) with highest P(graph | data)
• Constraint-based
  • Find the graph(s) that predict exactly the observed associations and independencies
• Combined
  • Get “close” with constraint-based, and then find the best graph using score-based
Bayesian / score-based

Informally:
• Give each model an initial score using “prior beliefs”
• Update each score based on the likelihood of the data if the model were true
• Output the highest-scoring model

Formally:
• Specify P(M, v) for all models M and possible parameter values v of M
• For any data D, P(D | M, v) can easily be calculated
• P(M | D) ∝ ∫ P(D | M, v) P(M, v) dv
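The integral over parameter values is rarely computed exactly. One common, computable stand-in (an illustration here, not a method the slides prescribe) is a BIC-style score, which approximates the log marginal likelihood; below is a minimal sketch for linear-Gaussian models on hypothetical toy data:

```python
import numpy as np

def bic_score(dag, data):
    """BIC-style score: for each variable, the Gaussian log-likelihood of a
    linear regression on its parents, minus (log n)/2 per fitted parameter."""
    n = len(next(iter(data.values())))
    score = 0.0
    for var, parents in dag.items():
        y = np.asarray(data[var], dtype=float)
        X = np.column_stack([data[p] for p in parents] + [np.ones(n)])
        resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        sigma2 = max(resid @ resid / n, 1e-12)
        score += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        score += -0.5 * np.log(n) * X.shape[1]
    return score

# Hypothetical data generated from X -> Y, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = {"X": x, "Y": 2 * x + rng.normal(size=500)}
print(bic_score({"X": [], "Y": ["X"]}, data))  # scores higher
print(bic_score({"X": [], "Y": []}, data))     # scores lower (independence model)
```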

Bayesian / score-based

In practice, this strategy is completely computationally intractable
• There are too many graphs to check them all

So, we use a greedy search strategy
• Start with an initial graph
• Iteratively compare the current graph’s score (∝ posterior probability) with that of each 1- or 2-step modification of that graph
  • By edge addition, deletion, or reversal
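A minimal sketch of that greedy loop in Python (standard library only): `score` stands in for any graph score, such as the Bayesian or BIC-style scores above, and the toy score at the bottom is purely illustrative.

```python
from itertools import permutations

def has_cycle(edges):
    """DFS check for a directed cycle."""
    nodes = {v for e in edges for v in e}
    children = {v: [y for (x, y) in edges if x == v] for v in nodes}
    def visit(v, stack, done):
        if v in stack:
            return True
        if v in done:
            return False
        stack.add(v)
        found = any(visit(w, stack, done) for w in children[v])
        stack.remove(v)
        done.add(v)
        return found
    return any(visit(v, set(), set()) for v in nodes)

def neighbors(dag, variables):
    """Every graph one edge addition, deletion, or reversal away from dag."""
    for x, y in permutations(variables, 2):
        if (x, y) in dag:
            yield dag - {(x, y)}                   # deletion
            yield (dag - {(x, y)}) | {(y, x)}      # reversal
        elif (y, x) not in dag:
            yield dag | {(x, y)}                   # addition

def greedy_search(variables, score, start=frozenset()):
    current, current_score = frozenset(start), score(start)
    while True:
        best, best_score = current, current_score
        for cand in map(frozenset, neighbors(current, variables)):
            if not has_cycle(cand) and score(cand) > best_score:
                best, best_score = cand, score(cand)
        if best == current:                        # local maximum: stop
            return current
        current, current_score = best, best_score

# Toy score that rewards matching a hypothetical target edge set; a real search
# would score each graph against data.
target = {("X", "Z"), ("Y", "Z")}
toy_score = lambda g: -len(set(g) ^ target)
print(sorted(greedy_search(["X", "Y", "Z"], toy_score)))  # [('X', 'Z'), ('Y', 'Z')]
```

With real data, `score` would be computed from the data, and the 2-step modifications mentioned on the slide could be added to `neighbors` in the same way.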
Bayesian / score-based

Problem #1: Local maxima
• Often, greedy searches get stuck

Solution:
• Greedy search over Markov equivalence classes, rather than graphs (Meek)
  • Has a proof of correctness and convergence (Chickering)
  • But it gets to the right answer slowly
Bayesian / score-based

Problem #2: Unobserved variables
• Huge number of graphs
• Huge number of different parameterizations
• No fast, general way to compute likelihoods from latent variable models

Partial solution:
• Focus on a small, “plausible” set of models for which we can compute scores
Constraint-based

Implementation of the earlier idea
• “Build” the Markov equivalence class that predicts the pattern of association actually found in the data
• Compatible with a variety of statistical techniques
  • Note that we might have to introduce a latent variable to explain the pattern of statistics
• Important constraints on search:
  • Minimize the number of statistical tests
  • Minimize the size of the conditioning sets
(Why?)
Constraint-based

Algorithm step #1: Discover the adjacencies
 Create
the complete graph with undirected edges
 Test all pairs X, Y for unconditional independence
 Remove
 Test
all adjacent X, Y for independence given single N
 Remove
 Test
…
X—Y edge if they are independent
X—Y edge if they are independent
adjacent pairs given two neighbors
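A simplified Python sketch of this adjacency phase: `indep(x, y, cond)` stands in for a statistical independence test, and the oracle below hard-codes the independencies of the worked example that follows (X ⊥ Y, X ⊥ W | Z, Y ⊥ W | Z).

```python
from itertools import combinations

def find_skeleton(variables, indep):
    """Simplified adjacency search: start complete, remove X - Y whenever X and
    Y are independent given some subset of X's current neighbors, growing the
    conditioning-set size k one step at a time."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    k = 0
    while any(len(adj[v]) - 1 >= k for v in variables):
        for x, y in combinations(variables, 2):
            if y not in adj[x]:
                continue
            for cond in combinations(adj[x] - {y}, k):
                if indep(x, y, set(cond)):
                    adj[x].discard(y)
                    adj[y].discard(x)
                    sepset[frozenset((x, y))] = set(cond)
                    break
        k += 1
    return adj, sepset

# Oracle for the worked example: the only independencies are
# X ⊥ Y, X ⊥ W | Z, and Y ⊥ W | Z.
independencies = {frozenset(("X", "Y")): frozenset(),
                  frozenset(("X", "W")): frozenset({"Z"}),
                  frozenset(("Y", "W")): frozenset({"Z"})}
oracle = lambda x, y, cond: independencies.get(frozenset((x, y))) == frozenset(cond)

adj, sepset = find_skeleton(["X", "Y", "Z", "W"], oracle)
print({v: sorted(adj[v]) for v in adj})   # only X - Z, Y - Z, Z - W remain
print(sepset)
```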
Constraint-based

Algorithm step #2: (Try to) Orient edges
• “Unshielded triple”: X — C — Y, but X, Y not adjacent
• If X & Y are independent given a set S containing C, then C must be a non-collider
  • Since we have to condition on it to achieve d-separation
• If X & Y are independent given a set S not containing C, then C must be a collider
  • Since the path is not active when not conditioning on C
• And then do further orientations to ensure acyclicity and nodes being non-colliders
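A matching sketch of the collider-orientation rule, run on the adjacencies and separating sets that the adjacency phase finds for the worked example below (the hard-coded inputs are assumptions that mirror that example):

```python
from itertools import combinations

# Output of the adjacency phase for the worked example (hard-coded here so
# the sketch is self-contained).
adjacencies = {"X": {"Z"}, "Y": {"Z"}, "Z": {"X", "Y", "W"}, "W": {"Z"}}
sepset = {frozenset(("X", "Y")): set(),
          frozenset(("X", "W")): {"Z"},
          frozenset(("Y", "W")): {"Z"}}

def orient_colliders(adjacencies, sepset):
    """For each unshielded triple X — C — Y, orient X -> C <- Y exactly when
    C is not in the set that separated X and Y."""
    arrows = set()
    for c in adjacencies:
        for x, y in combinations(sorted(adjacencies[c]), 2):
            if y not in adjacencies[x] and c not in sepset.get(frozenset((x, y)), {c}):
                arrows.add((x, c))
                arrows.add((y, c))
    return arrows

print(orient_colliders(adjacencies, sepset))
# {('X', 'Z'), ('Y', 'Z')}: Z is a collider between X and Y. Since Z is a
# non-collider between X and W, a further rule would then orient Z -> W.
```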
Constraint-based example


Variables are {X, Y, Z, W}
Only independencies are:
• X ⊥ Y
• X ⊥ W | Z
• Y ⊥ W | Z
Constraint-based example

Step 1: Form the complete graph using undirected edges
[Figure: the complete undirected graph over X, Y, Z, W]
Constraint-based example

Step 2: For each pair of variables, remove the edge between them if they’re unconditionally independent
X ⊥ Y ⇒ [Figure: the X — Y edge is removed]
Constraint-based example

Step 3: For each adjacent pair, remove the edge if they’re independent conditional on some variable adjacent to one of them
{X, Y} ⊥ W | Z ⇒ [Figure: the X — W and Y — W edges are removed]
Constraint-based example

Step 4: Continue removing edges, checking independence conditional on 2 (or 3, or 4, or …) variables
[Figure: no further edges are removed; the skeleton X — Z, Y — Z, Z — W remains]
Constraint-based example

Step 5: Orientation
• For X — Z — Y, since X ⊥ Y without conditioning on Z, make Z a collider
• Since Z is a non-collider between X and W, though, we must orient Z — W away from Z
[Figure: the oriented graph X → Z ← Y, Z → W]
Constraint-based output


Searches that allow for latent variables can also have edges of the form X o→ Y
This indicates one of three possibilities:
• X → Y
• At least one unobserved common cause of X and Y
• Both of these
Interventions to the rescue?

Interventions helped us solve an earlier equivalence class problem
• Randomization meant that: Treatment-Effect association ⇒ T → E

Interventions alter equivalence classes, but don’t make them all into singletons
• The fundamental problem of search remains
[Figure: the candidate graphs over {X, Y, Z}, grouped into equivalence classes before the X-intervention and regrouped after the X-intervention]
Search with interventions

Search with interventions is the same as search with observations, except:
• We adjust the graphs in the search space to account for the intervention

For multiple experiments, we search for graphs in every output equivalence class
• More complicated than this in the real world due to sampling variation
Example

Observation: Y ⊥ Z | X ⇒ [Figure: the three Markov-equivalent graphs Y ← X → Z, Y → X → Z, and Y ← X ← Z]

Intervention on X: Y ⊥ {X, Z} ⇒ [Figure: the graphs consistent with the post-intervention independence]

Only possible graph: Y → X → Z
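A small Python sketch of the reasoning in this example: start from the three observationally equivalent candidates, model the intervention by deleting the edges into X, and keep the graphs in which Y is then d-separated from both X and Z (d-separation tested via the moral ancestral graph):

```python
from itertools import combinations

def ancestors(nodes, edges):
    """All ancestors of `nodes` (including the nodes themselves)."""
    anc = set(nodes)
    while True:
        new = {x for (x, y) in edges if y in anc} - anc
        if not new:
            return anc
        anc |= new

def d_separated(a, b, cond, edges):
    """a ⊥ b | cond in the DAG `edges`: restrict to ancestors of {a, b} ∪ cond,
    marry parents of common children, drop directions, and check that every
    path from a to b passes through cond."""
    keep = ancestors({a, b} | set(cond), edges)
    sub = {(x, y) for (x, y) in edges if x in keep and y in keep}
    undirected = {frozenset(e) for e in sub}
    for (p1, c1), (p2, c2) in combinations(sub, 2):
        if c1 == c2:
            undirected.add(frozenset((p1, p2)))
    frontier, seen = {a}, {a}
    while frontier:
        frontier = {w for v in frontier for e in undirected if v in e
                    for w in e if w not in seen and w not in cond}
        if b in frontier:
            return False
        seen |= frontier
    return True

# The three graphs that are Markov equivalent under observation alone.
candidates = {
    "Y <- X -> Z": {("X", "Y"), ("X", "Z")},
    "Y -> X -> Z": {("Y", "X"), ("X", "Z")},
    "Y <- X <- Z": {("Z", "X"), ("X", "Y")},
}

for name, dag in candidates.items():
    manipulated = {(x, y) for (x, y) in dag if y != "X"}   # cut edges into X
    consistent = (d_separated("Y", "X", set(), manipulated)
                  and d_separated("Y", "Z", set(), manipulated))
    print(name, "consistent with Y ⊥ {X, Z} after intervening on X:", consistent)
# Only "Y -> X -> Z" survives both the observation and the intervention.
```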
Looking ahead…

Have:
• Basic formal representation for causation
• Fundamental causal asymmetry (of intervention)
• Inference & reasoning methods
• Search & causal discovery principles

Need:
• Search & causal discovery methods that work in the real world