CAUSAL SEARCH IN THE REAL WORLD
A menu of topics

Some real-world challenges:
 Convergence & error bounds
 Sample selection bias
 Simpson’s paradox

Some real-world successes:
 Learning based on more than just independence
 Learning about latents & their structure
Short-run causal search

Bayes net learning algorithms can give the wrong answer if the data fail to reflect the “true” associations and independencies
 Of course, this is a problem for all inference: we might just be really unlucky
 Note: This is not (really) the problem of unrepresentative samples (e.g., black swans)
Convergence in search

In search, we would like to bound our possible error as we acquire data
 I.e., we want search procedures that have uniform convergence

Without uniform convergence,
 Cannot set confidence intervals for inference
 Not every Bayesian, regardless of priors over hypotheses, agrees on probable bounds, no matter how loose
Pointwise convergence


Assume hypothesis H is true
Then:
 For any standard of “closeness” to H, and
 For any standard of “successful refutation,”
 For every hypothesis that is not “close” to H, there is a sample size for which that hypothesis is refuted
Uniform convergence


Assume hypothesis H is true
Then:
 For any standard of “closeness” to H, and
 For any standard of “successful refutation,”
 There is a sample size such that for all hypotheses H* that are not “close” to H, H* is refuted at that sample size
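The only difference between the two notions is the order of the quantifiers. Schematically (with δ standing for the closeness standard and α for the refutation standard; these symbols are introduced here for illustration, not taken from the slides):

```latex
% Pointwise: the sufficient sample size may depend on the alternative H*
\forall \delta, \alpha \;\; \forall H^{*} \text{ not } \delta\text{-close to } H \;\; \exists n :\;
  H^{*} \text{ is refuted at level } \alpha \text{ by samples of size } \geq n

% Uniform: a single sample size works for every alternative simultaneously
\forall \delta, \alpha \;\; \exists n \;\; \forall H^{*} \text{ not } \delta\text{-close to } H :\;
  H^{*} \text{ is refuted at level } \alpha \text{ by samples of size } \geq n
```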
Two theorems about convergence

There are procedures that, for every model, pointwise converge to the Markov equivalence class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993)

There is no procedure that, for every model, uniformly converges to the Markov equivalence class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)
Two theorems about convergence


What if we didn’t care about “small” causes?
 ε-Faithfulness: If X & Y are d-connected given S, then |ρ_XY.S| > ε
 Every association predicted by d-connection is at least ε in magnitude
 For any ε, standard constraint-based algorithms are uniformly convergent given ε-Faithfulness
 So we have error bounds, confidence intervals, etc.
Sample selection bias

Sometimes, a variable of interest is a cause of whether people get in the sample
 E.g., measuring various skills or knowledge in college students
 Or measuring joblessness by a phone survey during the middle of the day

Simple problem: You might get a skewed picture of the population
Sample selection bias

If two variables matter, then we have the graph: Factor A → Sample ← Factor B
 Sample = 1 for everyone we measure
 That is equivalent to conditioning on Sample
 ⇒ Induces an association between A and B! (see the simulation sketch below)
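A minimal simulation sketch of this effect (the variable names and the logistic selection rule are illustrative assumptions, not from the slides): A and B are independent in the population, but conditioning on being sampled induces an association between them.

```python
# Sketch: A and B are independent, but both raise the chance of being
# sampled. Restricting attention to the sampled units (conditioning on
# "Sample") induces a spurious association between A and B.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

A = rng.normal(size=n)
B = rng.normal(size=n)                     # independent of A by construction
p_sample = 1 / (1 + np.exp(-(A + B)))      # both A and B cause selection
sampled = rng.random(n) < p_sample

print(f"corr(A, B), full population: {np.corrcoef(A, B)[0, 1]: .3f}")  # ~ 0
print(f"corr(A, B), sampled only:    "
      f"{np.corrcoef(A[sampled], B[sampled])[0, 1]: .3f}")             # clearly negative
```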
Simpson’s Paradox

Consider the following data:

Men
            Treated   Untreated
  Alive         3        20
  Dead          3        24
  P(A | T) = 0.5,   P(A | U) = 0.45…

Women
            Treated   Untreated
  Alive        16         3
  Dead         25         6
  P(A | T) = 0.39,  P(A | U) = 0.333

Treatment is superior in both groups!
Simpson’s Paradox

Consider the following data:

Pooled
            Treated   Untreated
  Alive        19        23
  Dead         28        30
  P(A | T) = 0.404,  P(A | U) = 0.434

In the “full” population, you’re better off not being Treated!
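The reversal can be checked directly from the counts on the slides (a quick sketch):

```python
# Recompute the slides' conditional probabilities from the raw counts.
# counts[group][treatment] = (alive, dead)
counts = {
    "men":   {"treated": (3, 3),   "untreated": (20, 24)},
    "women": {"treated": (16, 25), "untreated": (3, 6)},
}

def p_alive(alive, dead):
    return alive / (alive + dead)

# Within each group, treatment looks better...
for group, by_tx in counts.items():
    for tx, (alive, dead) in by_tx.items():
        print(f"{group:6s} {tx:9s} P(Alive) = {p_alive(alive, dead):.3f}")

# ...but pooling the groups flips the comparison (Simpson's paradox).
for tx in ("treated", "untreated"):
    alive = sum(counts[g][tx][0] for g in counts)
    dead = sum(counts[g][tx][1] for g in counts)
    print(f"pooled {tx:9s} P(Alive) = {p_alive(alive, dead):.3f}")
```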
Simpson’s Paradox

Berkeley Graduate Admissions case
More than independence

Independence & association can reveal only the Markov equivalence class
 But our data contain more statistical information!
 Algorithms that exploit this additional info can sometimes learn more (including unique graphs)
 Example: LiNGAM algorithm for non-Gaussian data
Non-Gaussian data


Assume linearity & independent non-Gaussian noise
Linear causal DAG functions are:
  D = BD + ε
 where B is permutable to lower triangular (because the graph is acyclic)
Non-Gaussian data


Assume linearity & independent non-Gaussian noise
Linear causal DAG functions are:
  D = Aε
 where A = (I – B)⁻¹
Non-Gaussian data


Assume linearity & independent non-Gaussian noise
Linear causal DAG functions are:
  D = Aε
 where A = (I – B)⁻¹
 ICA is an efficient estimator for A
 ⇒ Efficient causal search that reveals direction!
 C ⟶ E iff non-zero entry in A
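A small numerical illustration (the chain and its coefficients are chosen here for the example, not taken from the slides): for X1 → X2 → X3, B is strictly lower triangular in the causal order, and the non-zero entries of A = (I – B)⁻¹ pick out exactly the cause-to-effect (possibly indirect) pairs.

```python
# Sketch: B holds the direct linear effects for the chain X1 -> X2 -> X3.
# Under the causal order, B is strictly lower triangular, and
# A = (I - B)^{-1} has a non-zero (i, j) entry exactly when X_j is a
# (possibly indirect) cause of X_i.
import numpy as np

B = np.array([
    [0.0, 0.0, 0.0],   # X1 has no parents
    [0.7, 0.0, 0.0],   # X2 <- 0.7 * X1
    [0.0, 0.5, 0.0],   # X3 <- 0.5 * X2
])

A = np.linalg.inv(np.eye(3) - B)
print(A)
# prints (approximately):
# [[1.   0.   0.  ]
#  [0.7  1.   0.  ]
#  [0.35 0.5  1.  ]]
# The 0.35 entry = 0.7 * 0.5 reflects the indirect X1 -> X3 effect.
```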
Non-Gaussian data

Why can we learn the directions in this case?
[Scatter plots of A vs. B under Gaussian noise and under uniform noise]
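A minimal sketch of why the non-Gaussian case is identifiable (a pairwise check with a crude dependence score, not the full ICA-based LiNGAM procedure; names and the score are illustrative assumptions): with uniform noise, only the true direction leaves residuals independent of the regressor.

```python
# Sketch: for a linear model C -> E with non-Gaussian (uniform) noise,
# regressing in the correct direction leaves residuals independent of the
# regressor; regressing in the wrong direction does not. With Gaussian
# noise, both directions would pass this check, so the direction could not
# be identified from the joint distribution alone.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

C = rng.uniform(-1, 1, n)
E = 0.8 * C + rng.uniform(-1, 1, n)          # true model: C -> E

def residual(y, x):
    """Residual of the least-squares regression of y on x."""
    return y - (np.cov(x, y, bias=True)[0, 1] / np.var(x)) * x

def dep_score(u, v):
    """Crude nonlinear dependence check: correlation of squared values."""
    return abs(np.corrcoef(u ** 2, v ** 2)[0, 1])

print(f"C -> E model: {dep_score(C, residual(E, C)):.3f}")   # near 0
print(f"E -> C model: {dep_score(E, residual(C, E)):.3f}")   # clearly > 0
```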
Non-Gaussian data

Case study: European electricity cost
Learning about latents

Sometimes, our real interest…
 is in variables that are only indirectly observed
 or observed by their effects
 or unknown altogether but influencing things behind the scenes

[Example graph with latent factors (General IQ, Sociability, Other factors) and observed measures (Test score, Math skills, Reading level, Size of social network)]
Factor analysis


Assume linear equations
Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error
Factor analysis
If we have one factor, then we find coefficients to minimize error in:
  F_i = a_i + b_i U
 where U is the unobserved variable (with fixed mean and variance)

Two factors ⇒ Minimize error in:
  F_i = a_i + b_i,1 U_1 + b_i,2 U_2
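A minimal one-factor sketch on simulated data (the loadings and the scikit-learn call are illustrative assumptions, not from the slides): simulate four features driven by a single latent, fit a one-factor model, and compare the estimated coefficients to the true b_i.

```python
# Sketch: simulate F_i = b_i * U + error for one latent U, then estimate
# the loadings b_i with a one-factor model.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 50_000

U = rng.normal(size=n)                       # unobserved factor (mean 0, var 1)
true_b = np.array([0.9, 0.7, 0.5, 0.3])      # loadings chosen for the example
noise = rng.normal(scale=0.4, size=(n, 4))
F = U[:, None] * true_b + noise              # observed features (a_i = 0 here)

fa = FactorAnalysis(n_components=1).fit(F)
print("estimated loadings (up to sign):", fa.components_.ravel().round(2))
print("true loadings:                  ", true_b)
```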
Factor analysis


Decision about exactly how many factors to use is typically based on some “simplicity vs. fit” tradeoff
Also, the interpretation of the unobserved factors must be provided by the scientist
 The data do not dictate the meaning of the unobserved factors (though it can sometimes be “obvious”)
Factor analysis as graph search

One-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:
  [Graph: U → F1, U → F2, …, U → Fn]
Factor analysis as graph search

Two-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:
  [Graph: U1 and U2 each pointing into F1, F2, …, Fn]
Better methods for latents

Two different types of algorithms:
 1. Determine which observed variables are caused by shared latents
     BPC, FOFC, FTFC, …
 2. Determine the causal structure among the latents
     MIMBuild

Note: need additional parametric assumptions
 Usually linearity, but can do it with weaker info
Discovering latents

Key idea: For many parameterizations, association between X & Y can be decomposed
 Linearity ⇒ can use patterns in the precise associations to discover the number of latents
 Using the ranks of different sub-matrices of the covariance matrix
Discovering latents
[Graph: a single latent U with observed indicators A, B, C, D]
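For the one-latent graph above, the constraint pattern is concrete: every covariance factors as cov(X, Y) = b_X·b_Y·var(U), so products such as cov(A,B)·cov(C,D) and cov(A,C)·cov(B,D) must agree (vanishing tetrads). A small simulated check (loadings chosen here for illustration):

```python
# Sketch: with a single latent U driving A, B, C, D, the "tetrad"
# differences vanish, e.g. cov(A,B)*cov(C,D) - cov(A,C)*cov(B,D) = 0.
# Additional latent structure breaks some of these constraints, which is
# what rank-based latent discovery exploits.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

U = rng.normal(size=n)
loadings = {"A": 1.0, "B": 0.8, "C": 0.6, "D": 0.9}   # illustrative values
X = {v: lam * U + rng.normal(scale=0.5, size=n) for v, lam in loadings.items()}

def c(x, y):
    return np.cov(X[x], X[y])[0, 1]

t1 = c("A", "B") * c("C", "D") - c("A", "C") * c("B", "D")
t2 = c("A", "B") * c("C", "D") - c("A", "D") * c("B", "C")
print(f"tetrad differences: {t1:.4f}, {t2:.4f}")      # both near 0
```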
Discovering latents
[Graph: observed A, B, C, D with a second latent L in addition to U]
Discovering latents


Many instantiations of this type of search for
different parametric knowledge, # of observed
variables (⇒ # of discoverable latents), etc.
And once we have one of these “clean” models, can
use “traditional” search algorithms (with modifications) to
learn structure between the latents
Other Algorithms

 CCD: Learn DCG (with non-obvious semantics)
 ION: Learn global features from overlapping local sets (including between not co-measured variables)
 SAT-solver: Learn causal structure (possibly cyclic, possibly with latents) from arbitrary combinations of observational & experimental constraints
 LoSST: Learn causal structure while that structure potentially changes over time
 And lots of other ongoing research!
Tetrad project

http://www.phil.cmu.edu/projects/tetrad/current.html
Download