Ockham’s Razor in Causal Discovery: A New Explanation

Kevin T. Kelly
Conor Mayo-Wilson
Department of Philosophy
Joint Program in Logic and Computation
Carnegie Mellon University
www.hss.cmu.edu/philosophy/faculty-kelly.php
I. Prediction vs. Policy
Predictive Links
Correlation or co-dependency allows one to predict Y from X.
Lung cancer
Ash trays
Linked to
Lung cancer!
Ash trays
scientist policy maker
Policy
Policy
manipulates X to achieve a change in Y.
Lung cancer
Ash trays
Linked to
Lung cancer!
Ash trays
Prohibit
ash trays!
Policy
manipulates X to achieve a change in Y.
Lung cancer
Policy
We failed!
Ash trays
Correlation is not Causation
Manipulation of X can destroy the correlation of X with Y.
Lung cancer
We failed!
Ash trays
Standard Remedy
Lung cancer
Randomized
controlled study
That’s what happens
if you carry out the
policy.
Ash trays
Infeasibility
Expense
Morality
IQ
Let me force a
few thousand children
to eat lead.
Lead
Infeasibility
Expense
Morality
IQ
Just joking!
Lead
Ironic Alliance
Ha! You will never prove that
lead affects IQ…
IQ
industry
Lead
Ironic Alliance
IQ
And you can’t throw my people
out of work on a mere whim.
Lead
Ironic Alliance
IQ
So I will keep on polluting, which will
never settle the matter because it is not
a randomized trial.
Lead
II. Causes From Correlations
Causal Discovery
Patterns of conditional correlation can imply unambiguous causal conclusions (Pearl, Spirtes, Glymour, Scheines, etc.)
Protein A
Protein C
Protein B
Cancer protein
Eliminate protein C!
Basic Idea


Causation is a directed, acyclic network over
variables.
What makes a network causal is a relation of
compatibility between networks and joint
probability distributions.
X
Y
Z
G
compatibility
Z
Y
p
X
Compatibility
Joint distribution p is compatible with directed,
acyclic network G iff:
Causal Markov Condition: each variable X is
independent of its non-effects given its immediate
causes.
V
V Y Z
Faithfulness Condition: every conditional
independence relation that holds in p is a
consequence of the Causal Markov Cond.
X
W
Common Cause
•B yields info about C (Faithfulness);
•B yields no further info about C given A (Markov).
A
A
B
B
C
C
Causal Chain
•B yields info about C (Faithfulness);
•B yields no further info about C given A (Markov).
B
B
A
A
C
C
Common Effect
•B yields no info about C (Markov);
•B yields extra info about C given A (Faithfulness).
B
C
B
A
C
A
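The three patterns above can be checked numerically. The following is a minimal sketch, assuming linear relations with Gaussian noise and using partial correlation as a stand-in for "information given A"; the variable names, sample size, and seed are illustrative choices, not taken from the slides.

```python
import math
import random

def corr(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after linearly controlling for zs."""
    rxy, rxz, ryz = corr(xs, ys), corr(xs, zs), corr(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Causal chain B -> A -> C (assumed unit-variance Gaussian noise)
b1 = [rng.gauss(0, 1) for _ in range(n)]
a1 = [b + rng.gauss(0, 1) for b in b1]
c1 = [a + rng.gauss(0, 1) for a in a1]
chain_marginal = corr(b1, c1)            # clearly non-zero
chain_given_a = partial_corr(b1, c1, a1) # close to zero

# Common effect B -> A <- C
b2 = [rng.gauss(0, 1) for _ in range(n)]
c2 = [rng.gauss(0, 1) for _ in range(n)]
a2 = [b + c + rng.gauss(0, 1) for b, c in zip(b2, c2)]
collider_marginal = corr(b2, c2)            # close to zero
collider_given_a = partial_corr(b2, c2, a2) # clearly non-zero
```

In the chain, B and C are correlated until A is controlled for; in the common-effect network the pattern is reversed, which is what makes it distinctive.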
Distinguishability
indistinguishable
B
distinctive
C
A
A
C
A
B
C
B
B
C
A
Immediate Connections
•There is an immediate causal connection between
X and Y iff
X is dependent on Y given every subset of
variables not containing X and Y (Spirtes, Glymour and
Scheines)
Z
X
W
Y
Some conditioning
set breaks dependency
X
Y
No intermediate
conditioning set
breaks dependency
Recovery of Skeleton
•Apply the preceding condition to recover every non-oriented immediate causal connection.
X
Y
Y
Z
truth
X
Y
Y
Z
skeleton
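The skeleton condition can be sketched directly, assuming the true independence relations are handed over by an oracle. The function names and the oracle for the chain X → Y → Z are hypothetical illustrations; the exhaustive search over subsets is exponential, so this is only meant for tiny examples.

```python
from itertools import combinations

def skeleton(variables, independent):
    """Return the undirected edges {x, y} such that x and y remain
    dependent given every subset of the other variables."""
    edges = set()
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        separated = any(
            independent(x, y, frozenset(s))
            for k in range(len(others) + 1)
            for s in combinations(others, k)
        )
        if not separated:
            edges.add(frozenset((x, y)))
    return edges

# Hypothetical oracle for the chain X -> Y -> Z:
# the only independence is X independent of Z given {Y}.
FACTS = {(frozenset(("X", "Z")), frozenset(("Y",)))}

def chain_oracle(x, y, given):
    """True if x and y are independent given the set `given`."""
    return (frozenset((x, y)), given) in FACTS

edges = skeleton(["X", "Y", "Z"], chain_oracle)
```

Here `edges` contains X-Y and Y-Z but not X-Z, because conditioning on Y breaks the dependency between X and Z.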
Orientation of Skeleton
•Look for the distinctive pattern of common
effects.
Common effect
X
Y
Y
Z
truth
X
Y
Y
Z
Orientation of Skeleton
•Look for the distinctive pattern of common
effects.
•Draw all deductive consequences of these
orientations.
Common effect
X
Y
Y
Z
truth
X
Y
Y
Z
Y is not common
effect of ZY
So orientation
must be downward
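The search for common effects can be sketched as follows, assuming the skeleton and a separating set for each non-adjacent pair are already known. The function name and the two example inputs are hypothetical; the rule used is that the middle variable of X - Y - Z is a common effect when it is absent from the set that separates X and Z.

```python
from itertools import combinations

def common_effects(edges, sepset):
    """Find triples (x, y, z) with x - y - z in the skeleton, x and z not
    adjacent, and y absent from the set that separates x and z."""
    nodes = sorted({v for e in edges for v in e})

    def adjacent(a, b):
        return frozenset((a, b)) in edges

    found = []
    for y in nodes:
        neighbours = [v for v in nodes if v != y and adjacent(v, y)]
        for x, z in combinations(neighbours, 2):
            if not adjacent(x, z) and y not in sepset[frozenset((x, z))]:
                found.append((x, y, z))
    return found

edges = {frozenset(("X", "Y")), frozenset(("Y", "Z"))}

# Common effect X -> Y <- Z: X and Z are separated by the empty set.
collider_case = common_effects(edges, {frozenset(("X", "Z")): frozenset()})

# Chain X -> Y -> Z: X and Z are separated only by {Y}.
chain_case = common_effects(edges, {frozenset(("X", "Z")): frozenset({"Y"})})
```

Only the first input yields an oriented triple; the chain leaves the skeleton unoriented, matching the indistinguishable cases on the Distinguishability slide.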
Causation from Correlation
The following network is causally unambiguous if all variables are observed.
Protein A
Protein C
Protein B
Cancer protein
Causation from Correlation
The red arrow is also immune to latent confounding causes.
Protein A
Protein C
Protein B
Cancer protein
Brave New World for Policy
Experimental
(confounder-proof) conclusions
from correlational data!
Protein A
Protein C
Protein B
Cancer protein
Eliminate protein C!
III. The Catch
Metaphysics vs. Inference


The above results all assume that the true
statistical independence relations for p are given.
But they must be inferred from finite samples.
Sample
Inferred
statistical
dependencies
Causal
conclusions
Problem of Induction

Independence is indistinguishable from
sufficiently small dependence at sample size n.
data
dependence
independence
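To make the sample-size point concrete, here is a minimal sketch that assumes a two-sided Fisher z-test for a correlation, which is a swapped-in textbook test rather than anything specified in the slides. The function names and the 1.96 critical value (roughly a 5% level) are assumptions, and the observed correlation is treated as equal to the true one for illustration.

```python
import math

def smallest_detectable_correlation(n, z_crit=1.96):
    """Smallest |r| that a two-sided Fisher z-test rejects at sample size n.
    The test rejects when |atanh(r)| * sqrt(n - 3) exceeds z_crit."""
    return math.tanh(z_crit / math.sqrt(n - 3))

def looks_independent(r, n, z_crit=1.96):
    """True if a correlation of size r is not flagged at sample size n."""
    return abs(r) < smallest_detectable_correlation(n, z_crit)

weak = 0.01  # a weak but real dependence
at_small_n = looks_independent(weak, 1_000)    # not yet detectable
at_large_n = looks_independent(weak, 100_000)  # now detectable
```

The same weak dependence passes for independence at the smaller sample size and is detected at the larger one, so a conclusion drawn early can be overturned later.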
Bridging the Inductive Gap


Assume conditional independence until the data
show otherwise.
Ockham’s razor: assume no more causal
complexity than necessary.
Inferential Instability


No guarantee that small dependencies will not be
detected later.
Can have spectacular impact on prior causal
conclusions.
Current Policy Analysis
Protein A
Protein C
Cancer protein
Protein B
Eliminate protein C!
As Sample Size Increases…
Protein A
weak
Protein C
Cancer protein
Protein B
Protein D
Rescind that order!
As Sample Size Increases Again…
Protein A
weak
Protein B
Protein E
weak
Protein C
Cancer protein
weak
Protein D
Eliminate protein C again!
As Sample Size Increases Again…
Protein A
weak
Protein E
weak
Protein C
Cancer protein
weak
Protein B
Etc.
Protein D
Eliminate protein C again!
Typical Applications

Linear Causal Case: each variable X is a linear
function of its parents and a normally
distributed hidden variable called an “error
term”. The error terms are mutually
independent.

Discrete Multinomial Case: each variable X
takes on a finite range of values.
An Optimistic Concession

No unobserved latent confounding causes
Genetics
Smoking
Cancer
Causal Flipping Theorem

No matter what a consistent causal discovery
procedure has seen so far, there exists a pair G, p
satisfying the above assumptions so that the
current sample is arbitrarily likely in p and the
procedure produces arbitrarily many opposite
conclusions in p about an arbitrary causal arrow in
G as sample size increases.
oops
I meant
oops
I meant
oops
I meant
Causal Flipping Theorem


Every consistent causal inference method is
covered.
Therefore, multiple instability is an intrinsic
feature of the causal discovery problem.
oops
I meant
oops
I meant
oops
I meant
The Crooked Course
"Living in the midst of ignorance and considering
themselves intelligent and enlightened, the
senseless people go round and round, following
crooked courses, just like the blind led by the
blind." Katha Upanishad, I. ii. 5.
Extremist Reaction

Since causal discovery cannot lead straight to
the truth, it is not justified.
I must remain silent.
Therefore, I win.
Moderate Reaction

Many explanations have been offered to make
sense of the here-today-gone-tomorrow nature
of medical wisdom — what we are advised with
confidence one year is reversed the next — but
the simplest one is that it is the natural rhythm
of science.

(Do We Really Know What Makes us Healthy?, NY Times
Magazine, Sept. 16, 2007).
Skepticism Inverted




Unavoidable retractions are justified because
they are unavoidable.
Avoidable retractions are not justified because
they are avoidable.
So the best possible methods for causal
discovery are those that minimize causal
retractions.
The best possible means for finding the truth
are justified.
Larger Proposal

The same holds for Ockham’s razor in general
when the aim is to find the true theory.
IV. Ockham’s Razor
Which Theory is Right?
???
Ockham Says:
Choose the
Simplest!
But Why?
Gotcha!
Puzzle

An indicator must be sensitive to what it
indicates.
simple
Puzzle

An indicator must be sensitive to what it
indicates.
complex
Puzzle

But Ockham’s razor always points at
simplicity.
simple
Puzzle

But Ockham’s razor always points at
simplicity.
complex
Puzzle

How can a broken compass help you find
something unless you already know where it
is?
complex
Standard Accounts
1. Prior Simplicity Bias
Bayes, BIC, MDL, MML, etc.
2. Risk Minimization
SRM, AIC, cross-validation, etc.
1. Bayesian Account



Ockham’s razor is a feature of one’s
personal prior belief state.
Short run: no objective connection with
finding the truth (flipping theorem applies).
Long run: converges to the truth, but other
prior biases would also lead to
convergence.
2. Risk Minimization Acct.



Risk minimization is about prediction
rather than truth.
Urges using a false causal theory rather
than the known true theory for predictive
purposes.
Therefore, not suited to exact science or to
practical policy applications.
V. A New Foundation for
Ockham’s Razor
Connections to the Truth

Short-run Reliability


Too strong to be feasible
when theory matters.
Long-run Convergence

Too weak to single out
Ockham’s razor
Simple
Complex
Simple
Complex
Middle Path

Short-run Reliability


“Straightest” convergence


Too strong to be feasible
when theory matters.
Simple
Simple
Complex
Complex
Just right?
Long-run Convergence

Too weak to single out
Ockham’s razor
Simple
Complex
Empirical Problems


Set K of infinite input sequences.
Partition of K into alternative theories.
K
T1
T2
T3
Empirical Methods

Map finite input sequences to theories or to “?”.
T3
K
T1
e
T2
T3
Method Choice
Output history
T1
T2
T3
e1
e2
e3
Input history
e4
At each stage, scientist
can choose a new
method (agreeing with
past theory choices).
Aim: Converge to the Truth
T3 ? T2 ? T1 T1 T1 T1 T1 T1 T1
K
T1
T2
T3
...
Retraction

Choosing T and then not choosing T next
T
T’
?
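Counting retractions in an output history can be sketched as below. The function name is hypothetical, and the history is a finite prefix of the sequence shown on the "Aim: Converge to the Truth" slide; moving from "?" to a theory is not counted, since no theory was chosen beforehand.

```python
def retraction_times(outputs):
    """Stages at which a previously chosen theory is dropped.
    '?' means no theory is chosen, so '?' -> T is not a retraction."""
    return [
        i for i in range(1, len(outputs))
        if outputs[i - 1] != "?" and outputs[i] != outputs[i - 1]
    ]

history = ["T3", "?", "T2", "?", "T1", "T1", "T1"]
times = retraction_times(history)  # T3 dropped at stage 1, T2 at stage 3
```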
Aim: Eliminate Needless Retractions
Truth
Aim: Eliminate Needless Retractions
Truth
Aim: Eliminate Needless Delays to
Retractions
theory
Aim: Eliminate Needless Delays to
Retractions
application
application
application
application
application
corollary
theory
application
application
corollary
application
corollary
Why Timed Retractions?
Retraction minimization =
generalized significance level.
Retraction time minimization =
generalized power.
Easy Retraction Time Comparisons
Method 1
Method 2
T1
T1
T2
T2
T2
T2
T4
T4
T4
...
T1
T1
T2
T2
T2
T3
T3
T4
T4
...
at least as many
at least as late
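One reading of "at least as many, at least as late" can be sketched as a pairing test: every retraction of the better method is matched, in order, with a distinct retraction of the other method that happens at least as late. The function names are hypothetical and the two output sequences are the ones listed on this slide.

```python
def retraction_times(outputs):
    """Stages at which the output changes away from a chosen theory."""
    return [
        i for i in range(1, len(outputs))
        if outputs[i - 1] != "?" and outputs[i] != outputs[i - 1]
    ]

def no_worse(times_a, times_b):
    """True if each retraction in a can be paired, in order, with a distinct
    retraction in b that occurs at least as late (greedy matching)."""
    j = 0
    for t in times_a:
        while j < len(times_b) and times_b[j] < t:
            j += 1
        if j == len(times_b):
            return False
        j += 1
    return True

method_1 = ["T1", "T1", "T2", "T2", "T2", "T2", "T4", "T4", "T4"]
method_2 = ["T1", "T1", "T2", "T2", "T2", "T3", "T3", "T4", "T4"]

t1 = retraction_times(method_1)  # [2, 6]
t2 = retraction_times(method_2)  # [2, 5, 7]
```

With these inputs `no_worse(t1, t2)` holds and `no_worse(t2, t1)` does not: Method 2 retracts at least as many times, and its matched retractions come at least as late.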
Worst-case Retraction Time Bounds
(1, 2, ∞)
...
...
T1
T2
T3
T3
T3
T3
T4
...
T1
T2
T3
T3
T3
T4
T4
...
T1
T2
T3
T3
T4
T4
T4
...
T1
T2
T3
T4
T4
T4
T4
...
Output sequences
Curve Fitting

Data = open intervals around Y at rational
values of X.
Curve Fitting

No effects:
Curve Fitting

First-order effect:
Curve Fitting

Second-order effect:
Ockham
There yet?
Maybe.
Cubic
Linear
Constant
Quadratic
Ockham Violation
There yet?
Maybe.
Cubic
Linear
Constant
Quadratic
Ockham Violation
I know you’re coming!
Cubic
Linear
Constant
Quadratic
Ockham Violation
Maybe.
Cubic
Linear
Constant
Quadratic
Ockham Violation
!!!
Hmm, it’s quite nice here…
Cubic
Linear
Constant
Quadratic
Ockham Violation
You’re back!
Learned your lesson?
Cubic
Linear
Constant
Quadratic
Violator’s Path
See, you shouldn’t run ahead
Even if you are right!
Cubic
Linear
Constant
Quadratic
Ockham Path
Cubic
Linear
Constant
Quadratic
More General Argument Required

Cover case in which demon has
branching paths (causal discovery)
More General Argument Required

Cover case in which scientist lags
behind (using time as a cost)
Come on!
Empirical Effects
Empirical Effects
Empirical Effects
May take arbitrarily long to discover
But can’t be taken back
Empirical Theories

True theory determined by which effects appear.
Empirical Complexity
More complex
Background Constraints
More complex
Background Constraints
More complex
Ockham’s Razor

Don’t select a theory unless it is uniquely
simplest in light of experience.
Weak Ockham’s Razor

Don’t select a theory unless it is among the
simplest in light of experience.
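The contrast between an Ockham method and a violator can be sketched in the effects setting, assuming a theory is identified with the set of effects it posits. The function names, the `patience` parameter, and the input stream are hypothetical; the violator leaps to an effect it has not seen and gives up after a few stages so that it can still converge if the effect never appears.

```python
def retractions(outputs):
    """Number of stages at which the chosen theory changes."""
    return sum(1 for prev, cur in zip(outputs, outputs[1:]) if prev != cur)

def ockham(stream):
    """Always choose the theory positing exactly the effects seen so far."""
    seen, outputs = frozenset(), []
    for new_effects in stream:
        seen = seen | new_effects
        outputs.append(seen)
    return outputs

def violator(stream, guess, patience):
    """Posit the extra effects in `guess` for the first `patience` stages,
    then fall back to the effects actually seen."""
    seen, outputs = frozenset(), []
    for stage, new_effects in enumerate(stream, start=1):
        seen = seen | new_effects
        outputs.append(seen | guess if stage <= patience else seen)
    return outputs

none, a = frozenset(), frozenset({"a"})
stream = [none, none, none, a, none]  # effect "a" shows up only at stage 4

ockham_count = retractions(ockham(stream))                 # none -> a
violator_count = retractions(violator(stream, a, 2))       # a -> none -> a
```

The violator is right about the effect all along and still retracts twice, while the Ockham method retracts once, as in the "Violator’s Path" slide.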
Stalwartness

Don’t retract your answer while it is uniquely
simplest
Timed Retraction Bounds

r(M, e, n) = the least timed retraction bound
covering the total timed retractions of M along
input streams of complexity n that extend e
M
...
Empirical Complexity
0
1
2
3
...
Efficiency of Method M at e
M converges to the truth no matter what;
For each convergent M’ that agrees with M up to the end of e, and for each n:
r(M, e, n) ≤ r(M’, e, n)
M
M’
...
Empirical Complexity
0
1
2
3
...
M is Beaten at e

There exists convergent M’ that agrees with M up to the end of e, such that:
For each n, r(M, e, n) ≥ r(M’, e, n);
There exists n such that r(M, e, n) > r(M’, e, n).
M
M’
...
Empirical Complexity
0
1
2
3
...
Ockham Efficiency Theorem

Let M be a solution. The following are
equivalent:
M is always strongly Ockham and stalwart;
 M is always efficient;
 M is never weakly beaten.

Example: Causal Inference

Effects are conditional statistical dependence
relations.
X dep Y | {Z}, {W}, {Z,W}
...
Y dep Z | {X}, {W}, {X,W}
...
X dep Z | {Y},
{Y,W}
Causal Discovery = Ockham’s Razor
X
Y
Z
W
Ockham’s Razor
X
Y
X dep Y | {Z}, {W}, {Z,W}
Z
W
Causal Discovery = Ockham’s Razor
X
Y
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y},
{Y,W}
Z
W
Causal Discovery = Ockham’s Razor
X
Y
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z
W
Causal Discovery = Ockham’s Razor
X
Y
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z dep W| {X}, {Y}, {X,Y}
Y dep W|
{Z}, {X,Z}
Z
W
Causal Discovery = Ockham’s Razor
X
Y
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z dep W| {X}, {Y}, {X,Y}
Y dep W| {X}, {Z}, {X,Z}
Z
W
VI. Simplicity Defined
Approach
Empirical complexity reflects nested problems of induction posed by the problem.
Hence, simplicity is problem-relative but topologically invariant.
Empirical Problems


Set K of infinite input sequences.
Partition Q of K into alternative theories.
K
T1
T2
T3
Simplicity Concepts

A simplicity concept for (K, Q) is just a well-founded order < on a partition S of K with
ascending chains of order type not exceeding
omega such that:
1. Each element of S is included in some answer
in Q.
2. Each downward union in (S, <) is closed;
3. Incomparable sets share no boundary point.
4. Each element of S is included in the boundary
of its successor.
Empirical Complexity Defined




Let K|e denote the set of all possibilities
compatible with observations e.
Let (S, <) be a simplicity concept for (K|e,
Q).
Define c(w, e) = the length of the longest <
path to the cell of S that contains w.
Define c(T, e) = the least c(w, e) such that T
is true in w.
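The two definitions can be sketched on a finite order, assuming the cells of S and their immediately simpler neighbours are given explicitly for a fixed e. The dictionary `SIMPLER`, the cell names (borrowed from the curve-fitting slides), and the function names are hypothetical.

```python
from functools import lru_cache

# Hypothetical simplicity order on K|e: each cell maps to the cells
# immediately simpler than it.
SIMPLER = {
    "constant": [],
    "linear": ["constant"],
    "quadratic": ["linear"],
    "cubic": ["quadratic"],
}

@lru_cache(maxsize=None)
def cell_complexity(cell):
    """c(w, e) for any w in `cell`: length of the longest < path up to it."""
    below = SIMPLER[cell]
    return 0 if not below else 1 + max(cell_complexity(b) for b in below)

def theory_complexity(cells):
    """c(T, e): the least complexity among the cells in which T is true."""
    return min(cell_complexity(c) for c in cells)

cubic_c = cell_complexity("cubic")                   # longest path has length 3
odd_degree_c = theory_complexity(["linear", "cubic"])  # least over its cells
```

On this chain the complexity of a polynomial cell equals its degree.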
Applications



Polynomial laws: complexity = degree
Conservation laws: complexity = particle
types – conserved quantities.
Causal networks: complexity = number of
logically independent conditional
dependencies entailed by faithfulness.
General Ockham Efficiency
Theorem

Let M be a solution. The following are
equivalent:
M is always strongly Ockham and stalwart;
 M is always efficient;
 M is never beaten.

Conclusions



Causal truths are necessary for counterfactual
predictions.
Ockham’s razor is necessary for staying on the
straightest path to the true theory but does not
point at the true theory.
No evasions or circles are required.
Future Directions





Extension of unique efficiency theorem to
stochastic model selection.
Latent variables as Ockham conclusions.
Degrees of retraction.
Pooling of marginal Ockham conclusions.
Retraction efficiency assessment of MDL, SRM.
Suggested Reading



"Ockham’s Razor, Truth, and Information", in
Handbook of the Philosophy of Information, J. van Benthem
and P. Adriaans, eds., to appear.
"Ockham’s Razor, Empirical Complexity, and
Truth-finding Efficiency", Theoretical Computer Science,
383: 270-289, 2007.
Both available as pre-prints at:
www.hss.cmu.edu/philosophy/faculty-kelly.php
1. Prior Simplicity Bias
The simple theory is more
plausible now because it was
more plausible yesterday.
More Subtle Version
Simple
data are a miracle in the complex
theory but not in the simple theory.
Regularity: retrograde motion of Venus at solar conjunction
Has to be!
P
C
However…

e would not be a miracle given P(q);
Why not this?
P
C
The Real Miracle
Ignorance about model:
p(C) ≈ p(P);
+ Ignorance about parameter setting:
p(P(q) | P) ≈ p(P(q’) | P).
= Knowledge about C vs. P(q):
p(P(q)) << p(C).
Lead into gold.
Perpetual motion.
Free lunch.
CP
q
q
q
q
q
q
q
q
Sounds good!
Standard Paradox of Indifference
Ignorance of red vs. not-red
+ Ignorance over not-red:
= Knowledge about red vs. white.
Knognorance =
All the privileges of knowledge
With none of the responsibilities
Sounds good!
q
q
The Ellsberg Paradox
1/3
?
?
Human Preference
1/3
?
a
a
c
?
>
bb
<
b
c
Human View
1/3
?
?
knowledge
a
ignorance
>
ignorance
a
c
bb
knowledge
<
b
c
Bayesian “Rationality”
1/3
?
?
knognorance
a
knognorance
>
knognorance
a
c
>
bb
knognorance
b
c
In Any Event
The coherentist foundations of Bayesianism have
nothing to do with short-run truth-conduciveness.
Not so loud!
Bayesian Convergence

Too-simple theories get shot down…
Updated
opinion
Theories
Complexity
Bayesian Convergence

Plausibility is transferred to the next-simplest
theory…
Updated
opinion
Plink!
Blam!
Complexity
Theories
Bayesian Convergence

The true theory is never shot down.
Updated
opinion
Zing!
Blam!
Complexity
Theories
Convergence

But alternative strategies also converge:
 Any theory choice in the short run is compatible
with convergence in the long run.
Summary of Bayesian Approach

Prior-based explanations of Ockham’s razor are
circular and based on a faulty model of ignorance.

Convergence-based explanations of Ockham’s
razor fail to single out Ockham’s razor.
2. Risk Minimization

Ockham’s razor minimizes expected distance
of empirical estimates from the true value.
Truth
Unconstrained Estimates

are centered on truth but spread around it.
Pop!
Pop!
Pop!
Pop!
Unconstrained
aim
Constrained Estimates

Off-center but less spread.
Truth
Clamped aim
Constrained Estimates


Off-center but less spread
Overall improvement in expected distance
from truth…
Pop!
Pop!
Pop!
Pop!
Truth
Clamped aim
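The trade-off can be worked through with squared-error loss, which is an assumption made here for illustration. The sketch compares the sample mean (centered on the truth, spread around it) with the extreme constrained case of an estimate clamped to a fixed value; the function names and numbers are hypothetical.

```python
def mse_unconstrained(sigma, n):
    """Sample mean of n draws: no bias, variance sigma^2 / n."""
    return sigma ** 2 / n

def mse_clamped(mu, clamp=0.0):
    """Estimate fixed at `clamp`: no spread, squared bias (mu - clamp)^2."""
    return (mu - clamp) ** 2

mu, sigma = 0.1, 1.0  # true value is small but not zero

small_sample = (mse_clamped(mu), mse_unconstrained(sigma, 25))
large_sample = (mse_clamped(mu), mse_unconstrained(sigma, 400))
```

At n = 25 the clamped estimate has the lower expected squared distance from the truth even though "the value is 0" is false; at n = 400 the unconstrained estimate wins.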
Doesn’t Find True Theory

The theory that minimizes estimation risk can be
quite false…
Four eyes!
Clamped aim
Makes Sense
…when loss of an answer is similar in nearby
distributions.
Close is
good
enough!
Loss
p
Similarity
But Not When Truth Matters
…i.e., when loss of an answer is discontinuous with
similarity.
Loss
Close is no cigar!
p
Similarity