Ockham’s Razor:
What it is,
What it isn’t,
How it works, and
How it doesn’t
Kevin T. Kelly
Department of Philosophy
Carnegie Mellon University
www.cmu.edu
Further Reading
“Efficient Convergence Implies Ockham's Razor”, Proceedings of the 2002 International Workshop on Computational Models of Scientific Reasoning and Applications, Las Vegas, USA, June 24-27, 2002.
(with C. Glymour) “Why Probability Does Not Capture the Logic of Scientific Justification”, in C. Hitchcock, ed., Contemporary Debates in the Philosophy of Science, Oxford: Blackwell, 2004.
“Justification as Truth-finding Efficiency: How Ockham's Razor Works”, Minds and Machines 14, 2004, pp. 485-505.
“Learning, Simplicity, Truth, and Misinformation”, The Philosophy of Information, under review.
“Ockham's Razor, Efficiency, and the Infinite Game of Science”, proceedings, Foundations of the Formal Sciences 2004: Infinite Game Theory, Springer, under review.
Which Theory to Choose?
Compatible with the data: T1, T2, T3, T4, T5. Which one?
Use Ockham’s Razor
T1, T2, T3, ordered from simple to complex: choose the simplest.
Dilemma
If you know the truth is simple,
then you don’t need Ockham.
Dilemma
If you don’t know the truth is simple,
then how could a fixed simplicity bias help you if the
truth is complex?
Puzzle
A fixed bias is like a broken thermometer.
How could it possibly help you find unknown truth?
Cold!
I. Ockham Apologists
Wishful Thinking
Simple theories are nice if true:
testability, unity, best explanation, aesthetic appeal, data compression.
So is believing that you are the emperor.
Overfitting
Maximum likelihood estimates based on overly complex theories can have greater predictive error (AIC, cross-validation, etc.).
The same is true even if you know the true model is complex.
But this doesn't converge to the true model, and it depends on random data.
“The truth is complex.” - God
“Thanks, but a simpler model still has lower predictive error.”
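A minimal simulation sketch of this point (not from the talk): the cubic "true" curve, its coefficients, the noise level, and the sample sizes below are assumptions chosen for illustration. The qualitative outcome is that the overly complex fit predicts worse even though the truth really is complex.

```python
# Sketch: a simpler fitted model can predict better than a complex one,
# even when the true curve really is complex (here, a cubic).
import numpy as np

rng = np.random.default_rng(0)

def true_curve(x):
    # Assumed "complex" truth: a cubic with small higher-order terms.
    return 1.0 + 0.5 * x + 0.05 * x**2 + 0.01 * x**3

def predictive_mse(degree, n_train=15, n_test=500, noise=1.0):
    x_tr = rng.uniform(-3, 3, n_train)
    y_tr = true_curve(x_tr) + rng.normal(0.0, noise, n_train)
    coefs = np.polyfit(x_tr, y_tr, degree)        # maximum-likelihood polynomial fit
    x_te = rng.uniform(-3, 3, n_test)
    return float(np.mean((np.polyval(coefs, x_te) - true_curve(x_te)) ** 2))

for degree in (1, 3, 9):
    avg = np.mean([predictive_mse(degree) for _ in range(300)])
    print(f"degree {degree}: average predictive MSE = {avg:.3f}")
# Typical outcome: the degree-9 fit does worst, and the "too simple" degree-1
# fit is competitive with the true degree-3 form at this sample size.
```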
Ignorance = Knowledge
Messy worlds are legion,
Tidy worlds are few.
That is why the tidy worlds
Are those most likely true. (Carnap)
[Figure: uniform probabilities 1/3, 1/3, 1/3 over the alternatives become 2/6, 2/6, 1/6, 1/6 when the possibilities are counted differently.]
Depends on Notation
But mess depends on coding,
which Goodman noticed, too.
The picture is inverted if
we translate green to grue.
[Figure: the same 2/6, 2/6, 1/6, 1/6 assignment, now inverted. Notation indicates truth?]
Same for Algorithmic Complexity
Goodman's problem works against every fixed simplicity ranking (independent of the processes by which data are generated and coded prior to learning).
Extra problem: any pairwise ranking of theories can be reversed by choosing an alternative computer language.
So how could simplicity help us find the true theory?
Notation indicates truth?
Just Beg the Question
Assign high prior probability to simple theories. Why should you?
A preference for complexity has the same “explanation”:
You presume simplicity; therefore you should presume simplicity!
Miracle Argument
Simple data would be a miracle if a complex theory were true (Bayes, BIC, Putnam).
Begs the question: “fairness” between theories → bias against complex worlds.
Two Can Play That Game
“Fairness” between worlds → bias against the simple theory.
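A toy Bayesian calculation (my illustration, not from the talk) of how the "fairness" choice drives the miracle argument. Assumed setup: the simple theory S has a single world that predicts the simple data d exactly, while the complex theory C has k worlds of which exactly one fits d.

```python
# Sketch: the "miracle argument" turns on where the prior is declared "fair".
def posterior_of_simple(prior_S, k):
    """Posterior of the simple theory S after seeing the simple data d."""
    p_d_given_S = 1.0          # S's single world predicts d exactly
    p_d_given_C = 1.0 / k      # only one of C's k worlds fits d
    prior_C = 1.0 - prior_S
    evidence = prior_S * p_d_given_S + prior_C * p_d_given_C
    return prior_S * p_d_given_S / evidence

k = 10
# "Fairness" between theories (P(S) = P(C) = 1/2): simple data look miraculous
# under C, so S is strongly confirmed.
print(posterior_of_simple(0.5, k))            # about 0.91
# "Fairness" between worlds (P(S) = 1/(k+1)): the same data leave S and C even.
print(posterior_of_simple(1.0 / (k + 1), k))  # exactly 0.5
```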
Convergence
At least a simplicity bias doesn't prevent convergence to the truth (MDL, BIC, Bayes, SGS, etc.).
But neither do other biases.
One may as well recommend flat tires, since they can be fixed.
Does Ockham Have No Frock?
Ash Heap of History
Philosopher's stone,
Perpetual motion,
Free lunch,
...
Ockham’s Razor???
II. How Ockham Helps You Find the Truth
What is Guidance?
Indication or tracking: too strong; a fixed bias can't indicate anything.
Convergence: too weak; true of other biases as well.
“Straightest” convergence: just right?
A True Story
[Map: driving from Niagara Falls toward Pittsburgh, past Clarion, something goes wrong.]
Ask directions!
“Where's …”
What Does She Say?
Turn around. The
freeway ramp is on
the left.
You Have Better Ideas
Phooey! The Sun was on the right!
Stay the Course! (Ahem…)
Don't Flip-flop!
Then Again… (@@$##!)
One Good Flip Can Save a Lot of Flop
The U-Turn
[Map: the U-turn, then on to Pittsburgh.]
“Told ya!”
Your Route
Pittsburgh, with a needless U-turn.
The Best Route
Straight on to Pittsburgh. “Told ya!”
The Best Route Anywhere from There
Pittsburgh, NY, DC: “Told ya!”
The Freeway to the Truth
(Destinations: Simple, Complex.)
Fixed advice for all destinations.
Disregarding it entails an extra course reversal…
…even if the advice points away from the goal! “Told ya!”
Counting Marbles
May come at any time…
Ockham's Razor
If you answer, answer with the current count.
Analogy
Marbles = detectable “effects”.
Late appearance = difficulty of detection.
Count = model (e.g., causal graph).
Appearance times = free parameters.
Analogy
U-turn = model revision (with content loss).
Highway = revision-efficient truth-finding method.
The U-turn Argument
Suppose you converge to the truth but violate Ockham's razor along the way: two marbles have appeared, yet you answer 3.
Where is that extra marble, anyway? It's not coming, is it?
If you never say 2, you'll never converge to the truth…. That's it. You should have listened to Ockham. [You retract to 2; then a third marble appears.]
Oops! Well, no method is infallible!
If you never say 3, you'll never converge to the truth…. Embarrassing to be back at that old theory, eh?
And so forth…
[Your outputs run 3, 2, 3, 4, 5, 6, 7, …]
The Score
You: 3, 2, 3, 4, 5, 6, 7 in the subproblem.
Ockham: 2, 3, 4, 5, 6, 7 in the subproblem.
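A small simulation sketch of the scorekeeping above. The data stream, the particular violator, and the demon's timing are assumptions made for illustration; the point is only that the needless leap costs an extra U-turn.

```python
# Sketch: counting revisions in the marble-counting game.
def revisions(answers):
    """Number of times the announced answer changes."""
    return sum(1 for a, b in zip(answers, answers[1:]) if a != b)

def ockham(counts):
    return counts[-1]                      # always answer with the current count

def violator(counts):
    # Assumed violator: leaps to 3 while only 2 marbles have been seen,
    # then gives up and retracts once the extra marble fails to show.
    return 3 if counts[-1] == 2 and len(counts) < 5 else counts[-1]

def run(method, stream):
    return [method(stream[:t + 1]) for t in range(len(stream))]

# Demon's play: show 2 marbles, stall until the violator retracts to 2,
# then reveal marbles 3 through 7.
stream = [0, 1, 2, 2, 2, 3, 4, 5, 6, 7]
for name, method in (("violator", violator), ("ockham", ockham)):
    outputs = run(method, stream)
    print(name, outputs, "revisions:", revisions(outputs))
# The violator ends with one more revision than Ockham on this stream:
# the extra U-turn of the argument above.
```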
Ockham is Necessary
If you converge to the truth and you violate Ockham's razor, then some convergent method beats your worst-case revision bound in each answer in the subproblem entered at the time of the violation.
Ockham is Sufficient
If you converge to the truth and you never violate Ockham's razor, then you achieve the worst-case revision bound of each convergent solution in each answer in each subproblem.
Efficiency
Efficiency = achievement of the best worst-case revision bound in each answer in each subproblem.
Ockham Efficiency Theorem
Among the convergent methods: Ockham = Efficient!
“Mixed” Strategies
Mixed strategy = the chance of an output depends only on actual experience.
Convergence in probability = the chance of producing the true answer approaches 1 in the limit.
Efficiency = achievement of the best worst-case expected revision bound in each answer in each subproblem.
Ockham Efficiency Theorem
Among the mixed methods that converge in probability: Ockham = Efficient!
Dominance and “Support”
Every convergent method is weakly dominated in revisions by a clone who says “?” until stage n.
1. Convergence: must leap eventually.
2. Efficiency: only leap to the simplest.
3. Dominance: could always wait longer. (Can't wait forever!)
III. Ockham on Steroids
Ockham Wish List
A general definition of Ockham's razor.
Compare revisions even when they are not bounded within answers.
Prove the theorem for arbitrary empirical problems.
Empirical Problems
Problem = a partition of a topological space.
Potential answers = the partition cells.
Evidence = open (verifiable) propositions.
Example: Symmetry
Example: Parameter Freeing
Euclidean topology.
Say which parameters are zero.
Evidence = an open neighborhood.
Curve fitting: the answers are (a1 = 0, a2 = 0), (a1 > 0, a2 = 0), (a1 = 0, a2 > 0), (a1 > 0, a2 > 0).
The Players
Scientist: produces an answer in response to the current evidence.
Demon: chooses evidence in response to the scientist's choices.
Winning
The scientist wins by default if the demon doesn't present an infinite nested sequence of basic open sets whose intersection is a singleton; else by merit if the scientist eventually always produces the true answer for the world selected by the demon's choices.
Comparing Revisions
One answer sequence maps into another iff there is an order- and answer-preserving map from the first to the second (“?” is wild).
Then the revisions of the first are as good as those of the second.
Comparing Revisions
The revisions of the first are strictly better if, in addition, the latter doesn't map back into the former.
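A sketch of the sequence comparison just defined, under one natural reading that I am assuming: a sequence maps into another iff it embeds into it in order, preserving answers, with “?” matching anything.

```python
# Sketch: order- and answer-preserving maps between answer sequences, '?' wild.
def matches(a, b):
    return a == "?" or b == "?" or a == b

def maps_into(s, t):
    """Greedy check that s embeds into t in order, entry by entry."""
    j = 0
    for a in s:
        while j < len(t) and not matches(a, t[j]):
            j += 1
        if j == len(t):
            return False
        j += 1
    return True

def as_good_as(s, t):
    return maps_into(s, t)                 # s's revisions are as good as t's

def strictly_better(s, t):
    return maps_into(s, t) and not maps_into(t, s)

# Usage: the Ockham trace 2, 3, 4 is strictly better than the U-turn trace 3, 2, 3, 4.
print(strictly_better(["2", "3", "4"], ["3", "2", "3", "4"]))   # True
```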
Comparing Methods
F is as good as G iff each output sequence of F is as good as some output sequence of G.
Comparing Methods
F is better than G iff F is as good as G and G is not as good as F.
Comparing Methods
F is strongly better than G iff each output sequence of F is strictly better than an output sequence of G, but no output sequence of G is as good as any of F.
Terminology
Efficient solution: as good as any solution in any subproblem.
What Simplicity Isn't
Syntactic length, data compression (MDL), computational ease, social “entrenchment” (Goodman): only by accident!!
Symptoms: number of free parameters (BIC, AIC), Euclidean dimensionality.
What Simplicity Is
Simpler theories are compatible with deeper problems of induction.
[Figure: the worst demon vs. a smaller demon.]
Problem of Induction
No true information entails the true answer.
This happens at answer boundaries.
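A tiny numeric illustration (my assumption: the parameter-freeing example above, with the answer determined by which parameters are nonzero): every open neighborhood of a world satisfying the simple answer also contains worlds from the complex answers, so no true open evidence entails the simple answer.

```python
# Sketch: the simple answer lies on the boundary of the complex answers.
def answer(a1, a2):
    """Which parameters are nonzero in this world."""
    return (a1 > 0.0, a2 > 0.0)

simple_world = (0.0, 0.0)
for radius in (1e-1, 1e-3, 1e-6):
    # A world inside every piece of open evidence of this precision about
    # simple_world, yet belonging to the most complex answer:
    nearby_world = (radius / 2, radius / 2)
    print(radius, answer(*simple_world), answer(*nearby_world))
# However precise the (open) evidence, it never entails a1 = 0 and a2 = 0:
# the simple answer faces the deeper problem of induction.
```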
Demonic Paths
A demonic path from w is a sequence of
alternating answers that a demon can force
an arbitrary convergent method through
starting from w.
0…1…2…3…4
Simplicity Defined
The A-sequences are the demonic sequences beginning with answer A.
A is as simple as B iff each B-sequence is as good as some A-sequence.
Example: (3) ≤ (2, 3); (3, 4) ≤ (2, 3, 4); (3, 4, 5) ≤ (2, 3, 4, 5); …
So 2 is simpler than 3!
Ockham Answer
An answer as simple as any other answer.
In the counting problem: the current number of observed particles.
Example (with 2 particles observed): (n) ≤ (2, …, n); (n, n+1) ≤ (2, …, n, n+1); (n, n+1, n+2) ≤ (2, …, n, n+1, n+2); …
So 2 is Ockham!
Ockham Lemma
A is Ockham iff for every demonic path p, (A * p) ≤ some demonic sequence.
Demon: “I can force you through 2, but not through 3, 2. So 3 isn't Ockham.”
Ockham Answer
E.g., only the simplest curve compatible with the data is Ockham.
[Figure: the (a1, a2) parameter space, with a demonic sequence and non-demonic sequences.]
General Efficiency Theorem
If the topology is metrizable and separable and the question is countable, then Ockham = Efficient.
Proof: uses Martin's Borel Determinacy theorem.
Stacked Problems
There is an Ockham answer at every stage.
...
1
3
2
1
0
Non-Ockham ⇒ Strongly Worse
If the problem is a stacked countable partition over a restricted Polish space, then each Ockham solution is strongly better than each non-Ockham solution in the subproblem entered at the time of the violation.
Simplicity ≠ Low Dimension
Suppose God says the true parameter value is rational.
Topological dimension and integration theory dissolve. Does Ockham?
The proposed account survives in the preserved limit-point structure.
IV. Ockham and Symmetry
Respect for Symmetry
If several simplest alternatives are available, don't break the symmetry.
Count the marbles of each color. You hear the first marble but don't see it. Why red rather than green?
Before the noise, (0, 0) is Ockham. After the noise, no answer is Ockham.
[Figure: demonic vs. non-demonic sequences over the answers (0, 0), (1, 0), (0, 1).]
Goodman's Riddle
Count oneicles: a oneicle is a particle at any stage but one, when it is a non-particle.
The oneicle translation is an auto-homeomorphism that does not preserve the problem.
The unique Ockham answer is the current oneicle count.
It contradicts the unique Ockham answer in particle counting.
Supersymmetry
Say when each particle appears.
This refines the counting problem.
Every auto-homeomorphism preserves the problem.
No answer is Ockham. No solution is Ockham. No method is efficient.
Dual Supersymmetry
Say only whether the particle count is even or odd.
This coarsens the counting problem.
The particle/oneicle auto-homeomorphism preserves the problem.
Every answer is Ockham. Every solution is Ockham. Every solution is efficient.
Broken Symmetry
Count the evens, or just report “odd”.
This coarsens the counting problem and refines the even/odd problem.
There is a unique Ockham answer at each stage.
Exactly the Ockham solutions are efficient.
Simplicity Under Refinement
Supersymmetry (no answer is Ockham): time of particle appearance.
Broken symmetry (unique Ockham answer): particle counting, oneicle counting, twoicle counting; particle counting or odd particles, oneicle counting or odd oneicles, twoicle counting or odd twoicles.
Dual supersymmetry (both answers are Ockham): even/odd.
Proposed Theory is Right
Objective efficiency is grounded in problems.
Symmetries in the preceding problems would wash out stronger simplicity distinctions.
Hence, such distinctions would amount to mere conventions (like coordinate axes) that couldn't have anything to do with objective efficiency.
Furthermore…
If Ockham's razor is forced to choose in the supersymmetrical problems, then either:
following Ockham's razor increases revisions in some counting problems, or
Ockham's razor leads to contradictions as a problem is coarsened or refined.
V. Conclusion
What Ockham's Razor Is
“Only output Ockham answers.”
Ockham answer = a topological invariant of the empirical problem addressed.
What it Isn't
A preference for brevity, computational ease, entrenchment, past success, Kolmogorov complexity, dimensionality, etc.
How it Works
Ockham's razor is necessary for minimizing revisions prior to convergence to the truth.
How it Doesn't
No possible method could: point at the truth; indicate the truth; bound the probability of error; or bound the number of future revisions.
Spooky Ockham
Science without support or safety nets.
VI. Stochastic Ockham
“Mixed” Strategies
Mixed strategy = the chance of an output depends only on the actual experience e:
P_e(M = H at n) = P_{e|n}(M = H at n).
Stochastic Case
Ockham = at each stage, you produce a non-Ockham answer with probability 0.
Efficiency = achievement of the best worst-case expected revision bound in each answer in each subproblem, over all methods that converge to the truth in probability.
Stochastic Efficiency Theorem
Among the stochastic methods that converge in probability: Ockham = Efficient!
Stochastic Methods
Your chance of producing an answer is a function of the observations made so far.
[Figure: the answer is drawn from an urn selected in light of the observations.]
Stochastic U-turn Argument
Suppose you converge in probability to the truth but produce a non-Ockham answer (3 when the count is 2) with probability r > 0.
Choose a small ε > 0 and consider answer 4.
By convergence in probability to the truth, the demon can wait until you produce 2 with probability > 1 - ε/3, then 3 with probability > 1 - ε/3, then 4 with probability > 1 - ε/3.
Since ε can be chosen arbitrarily small: sup probability of ≥ 3 revisions ≥ r, and sup probability of ≥ 2 revisions = 1.
So sup expected revisions ≥ 2 + 3r, but for Ockham it is 2 in the subproblem.
VII. Statistical Inference
(Beta Version)
The Statistical Puzzle of Simplicity
Assume: a normal distribution, σ = 1, μ ≥ 0.
Question: μ = 0 or μ > 0?
Intuition: μ = 0 is simpler than μ > 0.
Analogy
Marbles = potentially small “effects”.
Time = sample size.
Simplicity = fewer free parameters tied to potential “effects”.
Counting = freeing parameters in a model.
U-turn “in Probability”
Convergence in probability: the chance of producing the true model goes to unity.
Retraction in probability: the chance of producing a model drops from above r > .5 to below 1 - r.
[Figure: as sample size grows, the chance of producing one model falls from above r to below 1 - r while the chance of producing the alternative model rises toward 1.]
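A simulation sketch of a retraction in probability, using the standard one-sided z-test as the model selector. The true mean, significance level, and sample sizes below are assumptions for illustration.

```python
# Sketch: retraction "in probability" for testing mu = 0 against mu > 0
# when the truth is a small positive mu.
import math, random

random.seed(1)
mu_true, trials = 0.05, 2000
z_crit = 1.645                 # approximate upper 5% point of the standard normal

def prob_choose_simple(n):
    """Estimated chance that the z-test retains mu = 0 at sample size n."""
    keep = 0
    for _ in range(trials):
        xbar = random.gauss(mu_true, 1.0 / math.sqrt(n))   # sample mean
        if xbar * math.sqrt(n) <= z_crit:                  # fail to reject mu = 0
            keep += 1
    return keep / trials

for n in (10, 100, 1000, 10000):
    print(n, prob_choose_simple(n))
# At small n the simple model mu = 0 is probably chosen; as n grows that chance
# collapses and mu > 0 takes over: one retraction in probability, as in the
# figure sketched above.
```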
Suppose You (Probably) Choose a Model More Complex than the Truth
[Figure: true mean μ = 0, but with probability > r the sample mean lands in the zone for choosing μ > 0. Revision counter: 0.]
Eventually You Retract to the Truth (In Probability)
[Figure: with probability > r the sample mean lands in the zone for choosing μ = 0. Revision counter: 1.]
So You (Probably) Output an Overly Simple Model Nearby
[Figure: true mean μ > 0 but nearby; with probability > r the sample mean lands in the zone for choosing μ = 0. Revision counter: 1.]
Eventually You Retract to the Truth (In Probability)
[Figure: with probability > r the sample mean lands in the zone for choosing μ > 0. Revision counter: 2.]
But Standard (Ockham) Testing Practice Requires Just One Retraction!
[Figure: true mean μ = 0; zone for choosing μ = 0. Revision counter: 0.]
In the Simplest World, No Retractions
[Figure: true mean μ = 0; with probability > r the sample mean stays in the zone for choosing μ = 0. Revision counter: 0.]
In Remaining Worlds, at Most One Retraction
[Figure: true mean μ > 0; as the sample grows, the zone for choosing μ > 0 captures the sample mean. Revision counter: 0, then 1.]
So Ockham Beats All Violators
Ockham: at most one revision.
Violator: at least two revisions in the worst case.
Summary
Standard practice is to “test” the point hypothesis rather than the composite alternative.
This amounts to favoring the “simple” hypothesis a priori.
It also minimizes revisions in probability!
Two-Dimensional Example
Assume: an independent bivariate normal distribution of unit variance.
Question: how many components of the joint mean are zero?
Intuition: more nonzero components = more complex.
Puzzle: how does it help to favor simplicity in less-than-simplest worlds?
A Real Model Selection Method
Bayes Information Criterion (BIC):
BIC(M, sample) = -log(max probability that M can assign to the sample) + log(sample size) × model complexity × ½.
BIC method: choose the M with the least BIC score.
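A sketch of the BIC score as just defined, applied to the two-dimensional example; the simulated sample and the true mean (.05, .005) are assumptions for illustration.

```python
# Sketch: BIC for the bivariate unit-variance normal mean problem.
# Models = which components of the mean are free (the rest are fixed at zero).
import math, random

random.seed(0)
n, true_mean = 1000, (0.05, 0.005)
data = [[random.gauss(m, 1.0) for m in true_mean] for _ in range(n)]

def max_log_likelihood(free):
    total = 0.0
    for j in range(2):
        xs = [row[j] for row in data]
        mu_hat = sum(xs) / n if j in free else 0.0        # ML estimate or fixed 0
        total += sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu_hat) ** 2
                     for x in xs)
    return total

def bic(free):
    # -log(max prob of sample) + log(sample size) * (model complexity) * 1/2
    return -max_log_likelihood(free) + math.log(n) * len(free) * 0.5

for free in ([], [0], [1], [0, 1]):
    print(free, round(bic(set(free)), 2))
# The model with the least score is chosen; with effects this small, BIC keeps
# picking the simpler models until n gets very large (compare the plots below).
```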
Official BIC Property
In the limit, minimizing BIC finds a model with maximal conditional probability when the prior probability is flat over models and fairly flat over parameters within a model.
But it is also revision-efficient.
AIC in Simplest World
[Plots: AIC decision regions (Simple vs. Complex) for μ = (0, 0) at n = 2, n = 100, and n = 4,000,000. Retractions = 0 in each case.]
BIC in Simplest World
[Plots: BIC decision regions (Simple vs. Complex) for μ = (0, 0) at n = 2, n = 100, n = 4,000,000, and n = 20,000,000. Retractions = 0 in each case.]
Performance in Complex World
[Plots: decision regions for μ = (.05, .005). At n = 2 and n = 100: Retractions = 0. At n = 30,000: Retractions = 1. At n = 4,000,000 (!): Retractions = 2.]
Question
Does the statistical retraction minimization story extend to violations in less-than-simplest worlds?
Recall that the deterministic argument for higher retractions required the concept of minimizing retractions in each “subproblem”.
A “subproblem” is a proposition verified at a given time in a given world.
Some analogue “in probability” is required.
Subproblem
H is an α-subproblem in w at n iff there is a likelihood ratio test of {w} at significance < α such that this test has power < 1 - α at each world in H.
[Figure: acceptance and rejection probabilities at sample size n for w, the worlds in H, and the remaining worlds.]
Significance Schedules
A significance schedule α(.) is a monotone decreasing sequence of significance levels converging to zero that drops so slowly that power can be increased monotonically with sample size.
Ockham Violation ⇒ Inefficient
[Figure: the (μX, μY) parameter plane; the subproblem at sample size n.]
Ockham violation: you probably (p > r) say the blue hypothesis at the white world.
Ockham Violation ⇒ Inefficient
In the subproblem entered at the time of the violation, you can be forced to probably say blue, then probably say white, then probably say blue again.
Oops! Ockham ⇒ Inefficient
[Figure: in the full subproblem, Ockham too can be forced into two retractions.]
Local Retraction Efficiency
Ockham does as well as the best subproblem performance in some neighborhood of w.
[Figure: at most one retraction in the neighborhood, versus two retractions farther out in the subproblem.]
Ockham Violation ⇒ Inefficient
Note: no neighborhood around w avoids the extra retractions.
[Figure: the subproblem at the time of the violation, around w.]
Gonzo Ockham ⇒ Inefficient
Gonzo = probably saying the simplest answer in the entire subproblem entered in the simplest world.
Balance
Be Ockham (avoid complexity).
Don't be Gonzo Ockham (avoid bad fit).
Truth-directed: the sole aim is to find the true model with minimal revisions!
No circles: totally worst-case; no prior bias toward simple worlds.
THE END