Ockham’s Razor: What it is, What it isn’t, How it works, and How it doesn’t Kevin T. Kelly Department of Philosophy Carnegie Mellon University www.cmu.edu Further Reading “Efficient Convergence Implies Ockham's Razor”, Proceedings of the 2002 International Workshop on Computational Models of Scientific Reasoning and Applications, Las Vegas, USA, June 2427, 2002. (with C. Glymour) “Why Probability Does Not Capture the Logic of Scientific Justification”, C. Hitchcock, ed., Contemporary Debates in the Philosophy of Science, Oxford: Blackwell, 2004. “Justification as Truth-finding Efficiency: How Ockham's Razor Works”, Minds and Machines 14: 2004, pp. 485-505. “Learning, Simplicity, Truth, and Misinformation”, The Philosophy of Information, under review. “Ockham's Razor, Efficiency, and the Infinite Game of Science”, proceedings, Foundations of the Formal Sciences 2004: Infinite Game Theory, Springer, under review. Which Theory to Choose? Compatible with data T1 T2 T3 T4 T5 ??? Use Ockham’s Razor Simple T1 T2 T3 Complex Dilemma If you know the truth is simple, then you don’t need Ockham. Simple T1 T2 T3 Complex Dilemma If you don’t know the truth is simple, then how could a fixed simplicity bias help you if the truth is complex? Simple T1 T2 T3 Complex Puzzle A fixed bias is like a broken thermometer. How could it possibly help you find unknown truth? Cold! I. Ockham Apologists Wishful Thinking Simple theories are nice if true: Testability Unity Best explanation Aesthetic appeal Compress data So is believing that you are the emperor. Overfitting Maximum likelihood estimates based on overly complex theories can have greater predictive error (AIC, Cross-validation, etc.). Same is true even if you know the true model is complex. Doesn’t converge to true model. Depends on random data. Thanks, but a simpler model still has lower predictive error. The truth is complex. -God.. . . Ignorance = Knowledge Messy worlds are legion Tidy worlds are few. That is why the tidy worlds Are those most likely true. (Carnap) unity Ignorance = Knowledge Messy worlds are legion Tidy worlds are few. That is why the tidy worlds Are those most likely true. (Carnap) 1/3 1/3 1/3 Ignorance = Knowledge Messy worlds are legion Tidy worlds are few. That is why the tidy worlds Are those most likely true. (Carnap) 2/6 2/6 1/6 1/6 Depends on Notation But mess depends on coding, which Goodman noticed, too. The picture is inverted if we translate green to grue. 2/6 2/6 1/6 1/6 Notation Indicates truth? Same for Algorithmic Complexity Goodman’s problem works against every fixed simplicity ranking (independent of the processes by which data are generated and coded prior to learning). Extra problem: any pair-wise ranking of theories can be reversed by choosing an alternative computer language. So how could simplicity help us find the true theory? Notation Indicates truth? Just Beg the Question Assign high prior probability to simple theories. Why should you? Preference for complexity has the same “explanation”. You presume simplicity Therefore you should presume simplicity! Miracle Argument Simple data would be a miracle if a complex theory were true (Bayes, BIC, Putnam). Begs the Question between theories bias against complex worlds. “Fairness” S C Two Can Play That Game between worlds bias against simple theory. “Fairness” S C Convergence At least a simplicity bias doesn’t prevent convergence to the truth (MDL, BIC, Bayes, SGS, etc.). Neither do other biases. May as well recommend flat tires since they can be fixed. P P O O Does Ockham Have No Frock? Ash Heap of History Philosopher’s stone, Perpetual motion, Free lunch ... Ockham’s Razor??? II. How Ockham Helps You Find the Truth What is Guidance? Indication or tracking Too strong Fixed bias can’t indicate anything Convergence S C S C Too weak True of other biases “Straightest” convergence Just right? S C A True Story Niagara Falls Clarion Pittsburgh A True Story Niagara Falls Clarion Pittsburgh A True Story Niagara Falls Clarion Pittsburgh A True Story ! Clarion Pittsburgh Niagara Falls A True Story Niagara Falls Clarion Pittsburgh A True Story ? A True Story A True Story Ask directions! A True Story Where’s … What Does She Say? Turn around. The freeway ramp is on the left. You Have Better Ideas Phooey! The Sun was on the right! You Have Better Ideas !! You Have Better Ideas You Have Better Ideas You Have Better Ideas Stay the Course! Ahem… Stay the Course! Stay the Course! Don’t Flip-flop! Don’t Flip-flop! Don’t Flip-flop! Then Again… Then Again… Then Again… @@$##! One Good Flip Can Save a Lot of Flop Pittsburgh The U-Turn Pittsburgh The U-Turn Pittsburgh The U-Turn Pittsburgh The U-Turn Pittsburgh The U-Turn Pittsburgh The U-Turn Pittsburgh The U-Turn Pittsburgh Told ya! The U-Turn Pittsburgh The U-Turn Pittsburgh The U-Turn Pittsburgh The U-Turn Pittsburgh The U-Turn Pittsburgh Your Route Pittsburgh Needless U-turn The Best Route Pittsburgh Told ya! The Best Route Anywhere from There Pittsburgh Told ya! NY, DC The Freeway to the Truth Complex Simple Told ya! Fixed advice for all destinations Disregarding it entails an extra course reversal… The Freeway to the Truth Complex Simple Told ya! …even if the advice points away from the goal! Counting Marbles Counting Marbles Counting Marbles May come at any time… Counting Marbles May come at any time… Counting Marbles May come at any time… Counting Marbles May come at any time… Counting Marbles May come at any time… Counting Marbles May come at any time… Counting Marbles May come at any time… Ockham’s Razor If you answer, answer with the current count. 3 ? Analogy Marbles = detectable “effects”. Late appearance = difficulty of detection. Count = model (e.g., causal graph). Appearance times = free parameters. Analogy U-turn = model revision (with content loss) Highway = revision-efficient truth-finding method. T T The U-turn Argument Suppose you converge to the truth but violate Ockham’s razor along the way. 3 The U-turn Argument Where is that extra marble, anyway? 3 The U-turn Argument It’s not coming, is it? 3 The U-turn Argument If you never say 2 you’ll never converge to the truth…. 3 The U-turn Argument That’s it. You should have listened to Ockham. 3 2 The U-turn Argument Oops! Well, no method is infallible! 3 2 The U-turn Argument If you never say 3, you’ll never converge to the truth…. 3 2 The U-turn Argument Embarrassing to be back at that old theory, eh? 3 23 The U-turn Argument And so forth… 3 234 The U-turn Argument And so forth… 3 2345 The U-turn Argument And so forth… 3 23456 The U-turn Argument And so forth… 3 234567 The Score You: Subproblem 3 234567 The Score Ockham: Subproblem 234567 Ockham is Necessary If you converge to the truth, and you violate Ockham’s razor then some convergent method beats your worstcase revision bound in each answer in the subproblem entered at the time of the violation. Ockham is Sufficient If you converge to the truth, and you never violate Ockham’s razor then You achieve the worst-case revision bound of each convergent solution in each answer in each subproblem. Efficiency Efficiency = achievement of the best worstcase revision bound in each answer in each subproblem. Ockham Efficiency Theorem Among the convergent methods… Ockham = Efficient! Efficient Inefficient “Mixed” Strategies mixed strategy = chance of output depends only on actual experience. convergence in probability = chance of producing true answer approaches 1 in the limit. efficiency = achievement of best worst-case expected revision bound in each answer in each subproblem. Ockham Efficiency Theorem Among the mixed methods that converge in probability… Ockham = Efficient! Efficient Inefficient Dominance and “Support” Every convergent method is weakly dominated in revisions by a clone who says “?” until stage n. 1. Convergence Must leap eventually. 2. Efficiency Only leap to simplest. 3. Dominance Could always wait longer. Can’t wait forever! III. Ockham on Steroids Ockham Wish List General definition of Ockham’s razor. Compare revisions even when not bounded within answers. Prove theorem for arbitrary empirical problems. Empirical Problems Problem = partition of a topological space. Potential answers = partition cells. Evidence = open (verifiable) propositions. Example: Symmetry Example: Parameter Freeing Euclidean topology. Say which parameters are zero. Evidence = open neighborhood. •Curve fitting a1 a1 = 0 a1 > 0 a1 = 0 a1 > 0 a2 = 0 a2 = 0 a2 > 0 a2 > 0 0 a2 The Players Scientist: Produces an answer in response to current evidence. Demon: Chooses evidence in response to scientist’s choices Winning Scientist wins… by default if demon doesn’t present an infinite nested sequence of basic open sets whose intersection is a singleton. else by merit if scientist eventually always produces the true answer for world selected by demon’s choices. Comparing Revisions One answer sequence maps into another iff there is an order and answer-preserving map from the first to the second (? is wild). Then the revisions of first are as good as those of the second. ? ? ? ? ? ... ... Comparing Revisions The revisions of the first are strictly better if, in addition, the latter doesn’t map back into the former. ? ? ? ? ? ? ... ... Comparing Methods F is as good as G iff each output sequence of F is as good as some output sequence of G. F as good as G Comparing Methods F is better than G iff F is as good as G and G is not as good as F F not as good as G Comparing Methods F is strongly better than G iff each output sequence of F is strictly better than an output sequence of G but … strictly better than Comparing Methods … no output sequence of G is as good as any of F. not as good as Terminology Efficient solution: as good as any solution in any subproblem. What Simplicity Isn’t Only by accident!! Syntactic length. Data-compression (MDL). Computational ease. Social “entrenchment” (Goodman). Symptoms… Number of free parameters (BIC, AIC). Euclidean dimensionality What Simplicity Is Simpler theories are compatible with deeper problems of induction. Worst demon Smaller demon Problem of Induction No true information entails the true answer. Happens in answer boundaries. Demonic Paths A demonic path from w is a sequence of alternating answers that a demon can force an arbitrary convergent method through starting from w. 0…1…2…3…4 Simplicity Defined The A-sequences are the demonic sequences beginning with answer A. A is as simple as B iff each B-sequence is as good as some A-sequence. ... 3 < 3, 4 < 3, 4, 5 < 2, 3 2, 3, 4 2, 3, 4, 5 So 2 is simpler than 3! Ockham Answer An answer as simple as any other answer. = number of observed particles. n < n, n+1 < n, n+1, n+2 < ... 2, …, n 2, …, n, n+1 2, …, n, n+1, n+2 So 2 is Ockham! Ockham Lemma A is Ockham iff for all demonic p, (A*p) ≤ some demonic sequence. 3 I can force you through 2 but not through 3,2. So 3 isn’t Ockham Ockham Answer E.g.: Only simplest curve compatible with data is Ockham. a1 Demonic sequence: Non-demonic sequences: 0 a2 General Efficiency Theorem If the topology is metrizable and separable and the question is countable then: Ockham = Efficient. Proof: uses Martin’s Borel Determinacy theorem. Stacked Problems There is an Ockham answer at every stage. ... 1 3 2 1 0 Non-Ockham Strongly Worse If the problem is a stacked countable partition over a restricted Polish space: Each Ockham solution is strongly better than each non-Ockham solution in the subproblem entered at the time of the violation. Simplicity Low Dimension Suppose God says the true parameter value is rational. Simplicity Low Dimension Topological dimension and integration theory dissolve. Does Ockham? Simplicity Low Dimension The proposed account survives in the preserved limit point structure. IV. Ockham and Symmetry Respect for Symmetry If several simplest alternatives are available, don’t break the symmetry. •Count the marbles of each color. •You hear the first marble but don’t see it. •Why red rather than green? ? Respect for Symmetry Before the noise, (0, 0) is Ockham. After the noise, no answer is Ockham: Demonic Non-demonic (0, 0) (1, 0) (0, 1) (0, 1) (1, 0) (1, 0) (0, 1) Right! ? Goodman’s Riddle Count oneicles--- a oneicle is a particle at any stage but one, when it is a non-particle. Oneicle tranlation is auto-homeomorphism that does not preserve the problem. Unique Ockham answer is current oneicle count. Contradicts unique Ockham answer in particle counting. Supersymmetry Say when each particle appears. Refines counting problem. Every auto-homeomorphism preserves problem. No answer is Ockham. No solution is Ockham. No method is efficient. Dual Supersymmetry Say only whether particle count is even or odd. Coarsens counting problem. Particle/Oneicle auto-homeomorphism preserves problem. Every answer is Ockham. Every solution is Ockham. Every solution is efficient. Broken Symmetry Count the even or just report odd. Coarsens counting problem. Refines the even/odd problem. Unique Ockham answer at each stage. Exactly Ockham solutions are efficient. Simplicity Under Refinement Supersymmetry No answer is Ockham Broken symmetry Unique Ockham answer Dual supersymmetry Both answers are Ockham Time of particle appearance Particle counting Oneicle counting Twoicle counting Particle counting or odd particles Oneicle counting or odd oneicles Twoicle counting or odd twoicles Even/odd Proposed Theory is Right Objective efficiency is grounded in problems. Symmetries in the preceding problems would wash out stronger simplicity distinctions. Hence, such distinctions would amount to mere conventions (like coordinate axes) that couldn’t have anything to do with objective efficiency. Furthermore… If Ockham’s razor is forced to choose in the supersymmetrical problems then either: following Ockham’s razor increases revisions in some counting problems Or Ockham’s razor leads to contradictions as a problem is coarsened or refined. V. Conclusion What Ockham’s Razor Is “Only output Ockham answers” Ockham answer = a topological invariant of the empirical problem addressed. What it Isn’t preference for brevity, computational ease, entrenchment, past success, Kolmogorov complexity, dimensionality, etc…. How it Works Ockham’s razor is necessary for mininizing revisions prior to convergence to the truth. How it Doesn’t No possible method could: Point at the truth; Indicate the truth; Bound the probability of error; Bound the number of future revisions. Spooky Ockham Science without support or safety nets. Spooky Ockham Science without support or safety nets. Spooky Ockham Science without support or safety nets. Spooky Ockham Science without support or safety nets. VI. Stochastic Ockham “Mixed” Strategies mixed strategy = chance of output depends only on actual experience. e Pe(M = H at n) = Pe|n(M = H at n). Stochastic Case Ockham = at each stage, you produce a non-Ockham answer with prob = 0. Efficiency = achievement of the best worst-case expected revision bound in each answer in each subproblem over all methods that converge to the truth in probability. Stochastic Efficiency Theorem Among the stochastic methods that converge in probability, Ockham = Efficient! Efficient Inefficient Stochastic Methods Your chance of producing an answer is a function of observations made so far. 2 p Urn selected in light of observations. Stochastic U-turn Argument Suppose you converge in probability to the truth but produce a non-Ockham answer with prob > 0. 3 r>0 Stochastic U-turn Argument Choose small e > 0. Consider answer 4. 3 r>0 Stochastic U-turn Argument By convergence in probability to the truth: 3 2 r>0 p > 1 - e/3 Stochastic U-turn Argument Etc. 3 2 r>0 p> 1-e/3 3 p > 1-e/3 4 p > 1-e/3 Stochastic U-turn Argument Since e can be chosen arbitrarily small, sup prob of ≥ 3 revisions ≥ r. sup prob of ≥ 2 revisions =1 3 2 r>0 p> 1-e/3 3 p > 1-e/3 4 p > 1-e/3 Stochastic U-turn Argument So sup Exp revisions is ≥ 2 + 3r. But for Ockham = 2. 3 2 r>0 p> 1-e/3 3 p > 1-e/3 Subproblem 4 p > 1-e/3 VII. Statistical Inference (Beta Version) The Statistical Puzzle of Simplicity Assume: Normal distribution, s = 1, m ≥ 0. Question: m = 0 or m > 0 ? Intuition: m = 0 is simpler than m > 0 . m=0 mean Analogy Marbles: Time: Simplicity: Counting: potentially small “effects” sample size fewer free parameters tied to potential “effects” freeing parameters in a model U-turn “in Probability” Convergence in probability: chance of producing true model goes to unity Retraction in probability: chance of producing a model drops from above r > .5 to below 1 – r. 1 r Chance of producing true model Chance of producing alternative model 1-r 0 Sample size Suppose You (Probably) Choose a Model More Complex than the Truth m=0 m>0 mean Revision Counter: 0 > r zone for choosing m > 0 sample mean Eventually You Retract to the Truth (In Probability) m=0 mean Revision Counter: 1 > r zone for choosing m = 0 sample mean So You (Probably) Output an Overly Simple Model Nearby m>0 mean Revision Counter: 1 > r zone for choosing m = 0 sample mean Eventually You Retract to the Truth (In Probability) m>0 mean zone for choosing m > 0 sample mean Revision Counter: 2 > r But Standard (Ockham) Testing Practice Requires Just One Retraction! m=0 mean Revision Counter: 0 > r zone for choosing m = 0 sample mean In The Simplest World, No Retractions m=0 mean Revision Counter: 0 > r zone for choosing m = 0 sample mean In The Simplest World, No Retractions m=0 mean Revision Counter: 0 > r zone for choosing m = 0 sample mean In Remaining Worlds, at Most One Retraction m>0 mean Revision Counter: 0 > r zone for choosing m > 0 In Remaining Worlds, at Most One Retraction m>0 mean Revision Counter: 0 zone for choosing m > 0 In Remaining Worlds, at Most One Retraction m>0 mean Revision Counter: 1 > r zone for choosing m > 0 So Ockham Beats All Violators Ockham: at most one revision. Violator: at least two revisions in worst case Summary Standard practice is to “test” the point hypothesis rather than the composite alternative. This amounts to favoring the “simple” hypothesis a priori. It also minimizes revisions in probability! Two Dimensional Example Assume: independent bivariate normal distribution of unit variance. Question: how many components of the joint mean are zero? Intuition: more nonzeros = more complex Puzzle: How does it help to favor simplicity in lessthan-simplest worlds? A Real Model Selection Method Bayes Information Criterion (BIC) BIC(M, sample) = - log(max prob that M can assign to sample) + + log(sample size) model complexity ½. BIC method: choose M with least BIC score. Official BIC Property In the limit, minimizing BIC finds a model with maximal conditional probability when the prior probability is flat over models and fairly flat over parameters within a model. But it is also revision-efficient. AIC in Simplest World n=2 m = (0, 0). 3 Simple 2 1 0 -1 Complex -2 -2 -1 0 1 2 3 Retractions = 0 AIC in Simplest World n = 100 m = (0, 0). 0.4 Simple 0.2 0 -0.2 Complex -0.4 -0.4 -0.2 0 0.2 0.4 Retractions = 0 AIC in Simplest World n = 4,000,000 m = (0, 0). 0.004 Simple 0.002 0 -0.002 Complex -0.004 -0.004 -0.002 0 0.002 0.004 Retractions = 0 BIC in Simplest World n=2 m = (0, 0). 3 Simple 2 1 0 -1 Complex -2 -2 -1 0 1 2 3 Retractions = 0 BIC in Simplest World n = 100 m = (0, 0). 1 Simple 0.75 0.5 0.25 0 -0.25 Complex -0.5 -0.75 -0.75 -0.5 -0.25 0 0.25 0.5 0.75 1 Retractions = 0 BIC in Simplest World n = 4,000,000 m = (0, 0). Simple 0.0075 0.005 0.0025 0 -0.0025 Complex -0.005 -0.0075 -0.0075-0.005-0.0025 0 0.0025 0.005 0.0075 Retractions = 0 BIC in Simplest World n = 20,000,000 m = (0, 0). Simple 0.006 0.004 0.002 0 -0.002 Complex -0.004 -0.006 -0.006 -0.004 -0.002 0 0.002 0.004 0.006 Retractions = 0 Performance in Complex World n=2 m = (.05, .005). 3 Simple 2 1 0 Complex -1 -2 95% -2 -1 0 1 2 3 Retractions = 0 Performance in Complex World n = 100 m = (.05, .005). 1 Simple 0.75 0.5 0.25 0 -0.25 Complex -0.5 -0.75 -0.75 -0.5 -0.25 0 0.25 0.5 0.75 1 Retractions = 0 Performance in Complex World Simple n = 30,000 m = (.05, .005). 0.1 0.05 0 Complex -0.05 -0.1 -0.1 -0.05 0 0.05 0.1 Retractions = 1 Performance in Complex World Simple n = 4,000,000 (!) m = (.05, .005). 0.04 0.02 0 Complex -0.02 -0.04 -0.04 -0.02 0 0.02 0.04 Retractions = 2 Question Does the statistical retraction minimization story extend to violations in less-than-simplest worlds? Recall that the deterministic argument for higher retractions required the concept of minimizing retractions in each “subproblem”. A “subproblem” is a proposition verified at a given time in a given world. Some analogue “in probability” is required. Subproblem. H is an a -subroblem in w at n: There is a likelihood ratio test of {w} at significance < a such that this test has power < 1 - a at each world in H. worlds H w sample size = n > 1- a >a reject >a accept reject Significance Schedules A significance schedule a(.) is a monotone decreasing sequence of significance levels converging to zero that drop so slowly that power can be increased monotonically with sample size. n+1 n a(n+1) a(n) Ockham Violation Inefficient (mX, mY) Subproblem At sample size n Ockham Violation Inefficient (mX, mY) Subproblem At sample size n Ockham violation: Probably say blue hypothesis at white world (p > r) Ockham Violation Inefficient (mX, mY) Subproblem at time of violation Probably say blue Probably say white Ockham Violation Inefficient (mX, mY) Subproblem at time of violation Probably say blue Probably say white Ockham Violation Inefficient (mX, mY) Subproblem at time of violation Probably say blue Probably say white Probably say blue Oops! Ockham Inefficient (mX, mY) Subproblem Oops! Ockham Inefficient (mX, mY) Subproblem Oops! Ockham Inefficient (mX, mY) Subproblem Oops! Ockham Inefficient (mX, mY) Subproblem Oops! Ockham Inefficient (mX, mY) Subproblem Two retractions Local Retraction Efficiency Ockham does as well as best subproblem performance in some neighborhood of w. (mX, mY) Subproblem At most one retraction Two retractions Ockham Violation Inefficient Note: no neighborhood around w avoids extra retractions. (mX, mY) Subproblem at time of violation w Gonzo Ockham Inefficient Gonzo = probably saying simplest answer in entire subproblem entered in simplest world. (mX, mY) Subproblem Balance Be Ockham (avoid complexity) Don’t be Gonzo Ockham (avoid bad fit). Truth-directed: sole aim is to find true model with minimal revisions! No circles: totally worst-case; no prior bias toward simple worlds. THE END