Truth-conduciveness Without Reliability: A Non-Theological Explanation of Ockham’s Razor

Kevin T. Kelly
Department of Philosophy
Carnegie Mellon University
www.cmu.edu
I. The Puzzle

Which theory is true? Ockham says: choose the simplest! But why?
Puzzle

An indicator must be sensitive to what it indicates.
Puzzle

But Ockham's razor always points at simplicity.
Puzzle

If a broken compass is known to point North, then we already know where North is. But then who needs the compass?
Proposed Answers

1. Evasive
2. Circular
3. Magical
A. Evasions
Virtues

Simple theories have virtues:
- Testable
- Unified
- Explanatory
- Symmetrical
- Bold
- Compress data

But to assume that the truth has these virtues is
wishful thinking. [van Fraassen]
Convergence

At least a simplicity bias doesn't prevent convergence to the truth.
Convergence

Convergence allows for any theory choice whatever in the short run, so this is not an argument for Ockham's razor now.
Overfitting

Empirical estimates based on complex models have greater expected distance from the truth.
Overfitting

...even if the simple theory is known to be false.
C. Circles
Prior Probability

Assign high prior probability to simple theories. "Simplicity is plausible now because it was yesterday."
Miracle Argument

e would not be a miracle given C;
e would be a miracle given P.

However…

e would not be a miracle given P(θ). Why not this?
The Real Miracle

Ignorance about model:
  p(C) ≈ p(P);
+ ignorance about parameter setting:
  p(P(θ) | P) ≈ p(P(θ′) | P);
= knowledge about C vs. P(θ):
  p(P(θ)) << p(C).

Is it knognorance or ignoredge?
The Ellsberg Paradox

One gamble has a known chance 1/3 of winning; the chances of the other two are unknown (each might be > 1/3 or < 1/3). Human betting preferences strictly favor the known 1/3 gamble over each unknown gamble.

Human View

Knowledge: 1/3. Ignorance: ?, ?. Human betting preferences separate the two.

Bayesian View

Ignoredge: 1/3, 1/3, 1/3. All three gambles look the same.
Moral

Even in the most mundane contexts, when Bayesians offer to replace our ignorance with ignoredge, we vote with our feet.
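The "ignoredge" point can be made concrete with a toy computation (the uniform grid prior is my illustrative assumption, not from the talk): a Bayesian who smooths an unknown chance with a symmetric prior assigns the ambiguous gamble exactly the same expected chance as the known one.

```python
from fractions import Fraction

# Ellsberg-style setup: one gamble has a known winning chance of 1/3.
# Another gamble's chance p is unknown, somewhere in [0, 2/3].
# A Bayesian replaces ignorance about p with a prior; a symmetric
# (uniform) prior over an evenly spaced grid stands in for "ignoredge".
grid = [Fraction(k, 12) for k in range(0, 9)]  # p in {0, 1/12, ..., 2/3}
expected_p = sum(grid) / len(grid)

known_chance = Fraction(1, 3)
print(expected_p == known_chance)  # True: the two bets become indistinguishable
```

So in expectation the Bayesian treats the ambiguous bet exactly like the known 1/3 bet, while human betting preferences strictly rank the known gamble first.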
Probable Tracking

1. If the simple theory S were true, then the data would probably be simple, so Ockham's razor would probably believe S.
2. If the simple theory S were false, then the complex alternative theory C would be true, so the data would probably be complex, so you would probably believe C rather than S.
Probable Tracking

Given that you use Ockham's razor:

p(B(S) | S) = p(e_S | S) = 1.
p(not-B(S) | not-S) = 1 − p(e_S | C) = 1.

Probable Tracking

Given that you use Ockham's razor:

p(B(C) | C) = 1 = probability that the data look simple given C.
p(B(C) | not-C) = 0 = probability that the data look simple given alternative theory P.
B. Magic
Magic

Simplicity informs via hidden causes (Simple → B(Simple)):
- Leibniz, evolution
- Kant
- Ouija board
Magic

Simpler to explain Ockham's razor without hidden causes.
Reductio of Naturalism (Koons 2000)

- Suppose that the crucial probabilities p(T_θ | T) in the Bayesian miracle argument are natural chances, so that Ockham's razor really is reliable.
- Suppose that T is the fundamental theory of natural chance, so that T_θ determines the true p_θ for some choice of θ.
- But if p_t(T_θ) is defined at all, it should be 1 if t = θ and 0 otherwise.
- So natural science can only produce fundamental knowledge of natural chance if there are non-natural chances.
Diagnosis

Indication or tracking: too strong.
- Circles, evasions, or magic required.

Convergence: too weak.
- Doesn't single out simplicity.
“Straightest” convergence

Just right?
II. Straightest Convergence
Empirical Problems

- Set K of infinite input sequences.
- Partition of K into alternative theories T1, T2, T3, …
Empirical Methods

Map finite input sequences to theories or to "?".
Method Choice

At each stage, the scientist can choose a new method (agreeing with past theory choices).
Aim: Converge to the Truth

T3, ?, T2, ?, T1, T1, T1, T1, T1, T1, T1, …
Retraction

Choosing T and then not choosing T next.
Aim: Eliminate Needless Retractions

Aim: Eliminate Needless Delays to Retractions

(Diagram: a theory's late retraction delays the retraction of its applications and corollaries.)
Easy Retraction Time Comparisons

Method 1: T1 T1 T2 T2 T2 T2 T4 T4 T4 ...
Method 2: T1 T1 T2 T2 T2 T3 T3 T4 T4 ...

Method 2's retractions are at least as many, and at least as late, as Method 1's.
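The comparison can be checked mechanically. A minimal sketch (function names are mine, not the talk's):

```python
def retraction_times(outputs):
    """Stages at which a method drops a previously chosen theory."""
    times, prev = [], None
    for i, t in enumerate(outputs):
        if prev is not None and prev != "?" and t != prev:
            times.append(i)
        prev = t
    return times

def as_many_and_as_late(times1, times2):
    """True if times2 has at least as many retractions as times1 and each
    retraction in times1 is matched by one in times2 at least as late
    (matching the latest retractions to each other)."""
    if len(times2) < len(times1):
        return False
    tail = times2[len(times2) - len(times1):]
    return all(t2 >= t1 for t1, t2 in zip(times1, tail))

m1 = ["T1", "T1", "T2", "T2", "T2", "T2", "T4", "T4", "T4"]
m2 = ["T1", "T1", "T2", "T2", "T2", "T3", "T3", "T4", "T4"]
print(retraction_times(m1))  # [2, 6]
print(retraction_times(m2))  # [2, 5, 7]
print(as_many_and_as_late(retraction_times(m1), retraction_times(m2)))  # True
```

On the two output histories from the slide, Method 2 retracts three times against Method 1's two, and each of Method 1's retractions is matched by a retraction of Method 2 that is at least as late.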
Worst-case Retraction Time Bounds

Bound: (1, 2, ∞)

Output sequences:
T1 T2 T3 T3 T3 T3 T4 ...
T1 T2 T3 T3 T3 T4 T4 ...
T1 T2 T3 T3 T4 T4 T4 ...
T1 T2 T3 T4 T4 T4 T4 ...
II. Ockham Without Circles, Evasions, or Magic
Curve Fitting

Data = open intervals around Y at rational values of X.

No effects; first-order effect; second-order effect.
Empirical Effects

May take arbitrarily long to discover.
Empirical Theories

True theory determined by which effects appear.
Empirical Complexity
More complex
Background Constraints
Ockham's Razor

Don't select a theory unless it is uniquely simplest in light of experience.

Weak Ockham's Razor

Don't select a theory unless it is among the simplest in light of experience.
Stalwartness

Don't retract your answer while it is uniquely simplest.
Uniform Problems

All paths of accumulating effects starting at a level have the same length.
Timed Retraction Bounds

r(M, e, n) = the least timed retraction bound covering the total timed retractions of M along input streams of complexity n that extend e.
Efficiency of Method M at e

- M converges to the truth no matter what;
- for each convergent M′ that agrees with M up to the end of e, and for each n:
  r(M, e, n) ≤ r(M′, e, n).
M is Strongly Beaten at e

There exists convergent M′ that agrees with M up to the end of e, such that for each n, r(M, e, n) > r(M′, e, n).
M is Weakly Beaten at e

There exists convergent M′ that agrees with M up to the end of e, such that:
- for each n, r(M, e, n) ≥ r(M′, e, n);
- for some n, r(M, e, n) > r(M′, e, n).
Idea

- No matter what convergent M has done in the past, nature can force M to produce each answer down an arbitrary effect path, arbitrarily often.
- Nature can also force violators of Ockham's razor or stalwartness either into an extra retraction or a late retraction in each complexity class.
Ockham Violation with Retraction

An Ockham violation incurs an extra retraction in each complexity class.

Ockham Violation without Retraction

An Ockham violation incurs a late retraction in each complexity class.
Uniform Ockham Efficiency Theorem

Let M be a solution to a uniform problem. The following are equivalent:
- M is strongly Ockham and stalwart at e;
- M is efficient at e;
- M is not strongly beaten at e.
Idea

Similar, but if convergent M already violates strong Ockham's razor by favoring an answer T at the root of a longer path, sticking with T may reduce retractions in complexity classes reached only along the longer path.
Violation Favoring Shorter Path

Non-uniform problem: late or extra retraction in each complexity class.

Violation Favoring Longer Path without Retraction

Non-uniform problem: extra retraction in each complexity class!

But at First Violation…

Non-uniform problem: breaks even in each class, but loses in class 0 when the truth is red.
Ockham Efficiency Theorem

Let M be a solution. The following are equivalent:
- M is always strongly Ockham and stalwart;
- M is always efficient;
- M is never weakly beaten.
Application: Causal Inference

- Causal graph theory: more partial correlations → more causes.
- Idealized data = list of conditional dependencies discovered so far.
- Anomaly = the addition of a conditional dependency to the list.
Causal Path Rule

X, Y are dependent conditional on a set S of variables not containing X, Y iff X, Y are connected by at least one path in which:
- no non-collider is in S, and
- each collider has a descendant in S.

[Pearl, SGS]
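For small graphs the rule can be checked directly by enumerating paths in the undirected skeleton. An illustrative sketch (not the PC/SGS search itself; all names are mine):

```python
def descendants(dag, v):
    # dag maps each variable to the set of its children
    seen, stack = set(), [v]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def skeleton_paths(dag, x, y):
    # all simple paths between x and y, ignoring edge direction
    nbrs = {}
    for u, children in dag.items():
        for c in children:
            nbrs.setdefault(u, set()).add(c)
            nbrs.setdefault(c, set()).add(u)
    found = []
    def walk(path):
        if path[-1] == y:
            found.append(path)
            return
        for v in nbrs.get(path[-1], ()):
            if v not in path:
                walk(path + [v])
    walk([x])
    return found

def dependent(dag, x, y, s):
    """Causal path rule: X, Y dependent given S iff some path connects them
    on which no non-collider is in S and every collider is in S or has a
    descendant in S."""
    for path in skeleton_paths(dag, x, y):
        blocked = False
        for i in range(1, len(path) - 1):
            a, b, c = path[i - 1], path[i], path[i + 1]
            is_collider = b in dag.get(a, set()) and b in dag.get(c, set())
            if is_collider and b not in s and not (descendants(dag, b) & s):
                blocked = True
            elif not is_collider and b in s:
                blocked = True
        if not blocked:
            return True
    return False

chain = {"X": {"Y"}, "Y": {"Z"}}    # X -> Y -> Z
collide = {"X": {"Y"}, "Z": {"Y"}}  # X -> Y <- Z
print(dependent(chain, "X", "Z", set()))     # True
print(dependent(chain, "X", "Z", {"Y"}))     # False
print(dependent(collide, "X", "Z", set()))   # False
print(dependent(collide, "X", "Z", {"Y"}))   # True
```

The chain and collider examples show the rule's signature behavior: conditioning on a non-collider blocks a path, while conditioning on a collider (or its descendant) opens one.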
Forcible Sequence of Models

Variables X, Y, Z, W:

X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z dep W | {X}, {Y}, {X,Y}
Y dep W | {X}, {Z}, {X,Z}
Policy Prediction

- Consistent policy estimator can be forced into retractions.
- "Failure of uniform consistency."
- No non-trivial confidence interval.

[Robins, Wasserman, Zhang]
Moral

- Not true model vs. prediction.
- Issue: actual vs. counterfactual model selection and prediction.
- In counterfactual prediction, the form of the model matters and retractions are unavoidable.
IV. Simplicity

Aim

- General definition of simplicity.
- Prove the Ockham efficiency theorem for the general definition.

Approach

- Empirical complexity reflects nested problems of induction posed by the problem.
- Hence, simplicity is problem-relative.
Empirical Problems

- Set K of infinite input sequences.
- Partition of K into alternative theories T1, T2, T3, …
Grove Systems

A sphere system for K is just a downward-nested sequence of subsets of K starting with K. Think of successive differences as levels of increasing empirical complexity in K.
Answer-preserving Grove Systems

No answer is split across levels; refine the offending answer if necessary.
Data-driven Grove Systems

- Each answer is decidable given a complexity level.
- Each upward union of levels is verifiable.
Grove System Update

Update by restriction.
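Update by restriction is just levelwise intersection. A toy sketch, encoding levels as finite sets (the encoding and function names are my own):

```python
def update(grove, event):
    """Restrict each level of a downward-nested Grove system (a list of
    sets, broadest level first) to the data event; empty levels drop out."""
    restricted = [level & event for level in grove]
    return [level for level in restricted if level]

def complexity(grove, world):
    # empirical complexity of a world = deepest level still containing it
    return max(i for i, level in enumerate(grove) if world in level)

K = {0, 1, 2, 3, 4, 5}             # toy worlds
grove = [K, {2, 3, 4, 5}, {4, 5}]  # nested levels 0, 1, 2
print(complexity(grove, 3))        # 1
print(update(grove, {0, 2, 4}) == [{0, 2, 4}, {2, 4}, {4}])  # True
```

Restriction preserves the downward nesting, so the updated system is again a Grove system over the surviving worlds.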
Forcible Grove Systems

At each stage, the data presented by a world at a level are compatible with the next level up (if there is a next level).

Forcible Path

A forcible restriction of a Grove system.

Forcible Path to Top

A forcible restriction of a Grove system that intersects every level.
Simplicity Concept

A data-driven, answer-preserving Grove system for which each restriction to a possible data event has a forcible path to the top.

Uniform Simplicity Concepts

If a data event intersects a level, it intersects each higher level.
Uniform Ockham Efficiency Theorem

Let M be a solution to a uniform problem. The following are equivalent:
- M is strongly Ockham and stalwart at e;
- M is efficient at e;
- M is not strongly beaten at e.
Ockham Efficiency Theorem

Let M be a solution. The following are equivalent:
- M is always strongly Ockham and stalwart;
- M is always efficient;
- M is never weakly beaten.
V. Stochastic Ockham

Mixed Strategies

Require that the strategy converge in chance to the true model (chance of producing the true model at parameter θ, as a function of sample size).

Retractions in Chance

- Total drop in chance of producing an arbitrary answer as sample size increases.
- Retraction in signal, not actual retractions due to noise.
Ockham Efficiency

- Bound retractions in chance by easy comparisons of time and magnitude.
- Ockham efficiency still follows.

Retractions in chance: (0, 0, .5, 0, 0, 0, .5, 0, 0, …)
Classification Problems

- Points from the plane sampled IID, labeled with half-plane membership; the edge of the half-plane is some polynomial. What is its degree?
- The uniform Ockham efficiency theorem applies. [Cosma Shalizi]
Model Selection Problems

- Random variables.
- IID sampling.
- Joint distribution continuously parametrized.
- Partition over parameter space.
- Each partition cell is a "model".
- Method maps sample sequences to models.
Two Dimensional Example

- Assume: independent bivariate normal distribution of unit variance.
- Question: how many components of the joint mean are zero?
- Intuition: more nonzeros = more complex.
- Puzzle: How does it help to favor simplicity in less-than-simplest worlds?
A Standard Model Selection Method

Bayes Information Criterion (BIC):

BIC(M, sample) = −log(max prob that M can assign to sample) + (1/2) × (model complexity) × log(sample size).

BIC method: choose M with least BIC score.
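A minimal sketch of the BIC comparison for a one-parameter toy case (the data values are illustrative, and the model setup is a one-dimensional stand-in for the bivariate example):

```python
import math

def bic(neg_max_loglik, n, k):
    # BIC(M, sample) = -log(max prob M can assign to sample)
    #                + (1/2) * (model complexity k) * log(sample size n)
    return neg_max_loglik + 0.5 * k * math.log(n)

def neg_loglik_normal(data, mu):
    # -log likelihood of the data under N(mu, 1)
    return sum(0.5 * math.log(2 * math.pi) + 0.5 * (x - mu) ** 2 for x in data)

data = [0.1, -0.2, 0.05, 0.15, -0.05]  # a small sample near zero
n = len(data)
xbar = sum(data) / n

bic_simple = bic(neg_loglik_normal(data, 0.0), n, 0)    # model: mu = 0
bic_complex = bic(neg_loglik_normal(data, xbar), n, 1)  # model: mu free (MLE = xbar)
print(bic_simple < bic_complex)  # True: BIC picks the simple model here
```

With a near-zero sample mean, the complex model's tiny gain in fit is swamped by its log(n)/2 complexity penalty, so BIC selects the simple model.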
Official BIC Property

- In the limit, minimizing BIC finds a model with maximal conditional probability when the prior probability is flat over models and fairly flat over parameters within a model.
- But it is also mind-change-efficient.
Toy Problem

- Truth is bivariate normal of known covariance.
- Count the non-zero components of the mean vector.

Pure Method

Acceptance zones for the different answers in sample mean space (simple vs. complex).
Performance in Simplest World

m = (0, 0). (Plots: 95% acceptance zones in sample mean space at n = 2, 100, 4,000,000, and 20,000,000; the sample mean stays in the simple zone as the zones shrink.)

Retractions = 0.
Performance in Complex World

m = (.05, .005). (Plots: 95% acceptance zones in sample mean space.)

n = 2: Retractions = 0.
n = 100: Retractions = 0.
n = 30,000: Retractions = 1.
n = 4,000,000 (!): Retractions = 2.
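The retraction pattern can be reproduced with an idealized, noise-free sketch of the pure method (the zone radius 1.96/√n is my assumption, standing in for the shrinking 95% acceptance zones):

```python
import math

def count_detected_effects(mean, n, z=1.96):
    """A mean component counts as a detected effect at sample size n when
    it lies outside a zone of radius z/sqrt(n) around zero (an idealized,
    noise-free stand-in for the 95% acceptance zones)."""
    radius = z / math.sqrt(n)
    return sum(1 for m in mean if abs(m) > radius)

m = (0.05, 0.005)  # the complex world from the slides
history = [count_detected_effects(m, n) for n in (2, 100, 30_000, 4_000_000)]
print(history)  # [0, 0, 1, 2]
```

The answer climbs 0 → 1 → 2 nonzero components as the zones shrink past each effect size: two retractions on the way to the truth, exactly as in the plots.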
Causal Inference from Stochastic Data

Suppose that the true linear causal model over standard normal variables X, Y, Z, W has edge coefficients .998, .99, −.99, and .1.
Causal Inference from Stochastic Data

[Scheines, Mayo-Wilson, and Fancsali]

- Sample size 40: in 9 out of 10 samples, the PC algorithm outputs a graph over X, Y, Z, W other than the truth.
- Sample size 100,000: in 9 out of 10 samples, PC outputs the truth.
Deterministic Sub-problems

Membership degree = 1 at w; membership degree = 0 elsewhere.

Worst-case cost at w = sup_w′ mem(w, w′) × cost(w′).
Worst-case cost = sup_w (worst-case cost at w).
Statistical Sub-problems

Membership(p, p′) = 1 − r(p, p′).

Worst-case cost at p = sup_p′ mem(p, p′) × cost(p′).
Worst-case cost = sup_p (worst-case cost at p).
Future Direction

- α-consistency: converge to production of the true answer with chance > 1 − α.
- Compare worst-case timed bounds on retractions in chance of α-consistent methods over each complexity class.
- Generalized power: minimizing retraction time forces simple acceptance zones to be powerful.
- Generalized significance: minimizing retractions forces the simple zone to be size α.
- The balance depends on α.
V. Conclusion: Ockham's Razor

- Necessary for staying on the straightest path to the truth.
- Does not point at or indicate the truth.
- Works without circles, evasions, or magic.
- Such a theory is motivated in counterfactual inference and estimation.
Further Reading

- (with C. Glymour) "Why Probability Does Not Capture the Logic of Scientific Justification", in C. Hitchcock, ed., Contemporary Debates in the Philosophy of Science, Oxford: Blackwell, 2004.
- "Justification as Truth-finding Efficiency: How Ockham's Razor Works", Minds and Machines 14: 2004, pp. 485-505.
- "Ockham's Razor, Efficiency, and the Unending Game of Science", forthcoming in proceedings, Foundations of the Formal Sciences 2004: Infinite Game Theory, Springer, under review.
- "How Simplicity Helps You Find the Truth Without Pointing at It", forthcoming in V. Harazinov, M. Friend, and N. Goethe, eds., Philosophy of Mathematics and Induction, Dordrecht: Springer.
- "Ockham's Razor, Empirical Complexity, and Truth-finding Efficiency", forthcoming, Theoretical Computer Science.
- "Learning, Simplicity, Truth, and Misinformation", forthcoming in Van Benthem, J. and Adriaans, P., eds., Philosophy of Information.
II. Navigation Without a Compass

Asking for Directions

"Where's …?"
"Turn around. The freeway ramp is on the left."

Helpful Advice / Best Route / Best Route to Any Goal

Disregarding Advice is Bad: extra U-turn.

…so fixed advice can help you reach a hidden goal without circles, evasions, or magic.
"There is no difference whatsoever in It. He goes from death to death, who sees difference, as it were, in It." [Brihadaranyaka 4.4.19-20]

"Living in the midst of ignorance and considering themselves intelligent and enlightened, the senseless people go round and round, following crooked courses, just like the blind led by the blind." [Katha Upanishad I.ii.5]
Academic

"If there weren't an apple on the table, I wouldn't be a brain in a vat, so I wouldn't see one."