Truth-conduciveness Without Reliability:
A Non-Theological Explanation of Ockham's Razor

Kevin T. Kelly
Department of Philosophy
Carnegie Mellon University
www.cmu.edu

I. The Puzzle

Which theory is true? Ockham says: choose the simplest! But why?

Puzzle
An indicator must be sensitive to what it indicates, pointing one way when the world is simple and another way when it is complex. But Ockham's razor always points at simplicity. If a broken compass is known to point North, then we already know where North is; but then who needs the compass?

Proposed Answers
1. Evasive
2. Circular
3. Magical

A. Evasions

Virtues
Simple theories have virtues: they are testable, unified, explanatory, symmetrical, bold, and they compress the data. But to assume that the truth has these virtues is wishful thinking. [van Fraassen]

Convergence
At least a simplicity bias doesn't prevent convergence to the truth. But convergence allows for any theory choice whatever in the short run, so this is not an argument for Ockham's razor now; an alternative ranking would converge just as well.

Overfitting
Empirical estimates based on complex models have greater expected distance from the truth...
Overfitting (continued)
Clamping the estimate to the simple theory reduces expected distance from the truth even if the simple theory is known to be false.

C. Circles

Prior Probability
Assign high prior probability to simple theories. Then simplicity is plausible now only because it was plausible yesterday.

Miracle Argument
Evidence e would not be a miracle given C; e would be a miracle given P. However, e would also not be a miracle given P(q), so why not compare C with P(q)?

The Real Miracle
Ignorance about the model: p(C) ≈ p(P);
plus ignorance about the parameter setting: p(P(q) | P) ≈ p(P(q') | P);
equals knowledge about C vs. P(q): p(P(q)) << p(C).
Is it "knognorance" or "ignoredge"?

The Ellsberg Paradox
An urn contains 30 red balls and 60 balls that are black or yellow in unknown proportion. Human betting preferences: a bet on red (known chance 1/3) is preferred to a bet on black (unknown chance), yet a bet on black-or-yellow (known chance 2/3) is preferred to a bet on red-or-yellow (unknown chance).
Human view: the 1/3 is knowledge; the black/yellow proportion is ignorance.
Bayesian view: both are the same "ignoredge" (probability 1/3 for each color).

Moral
Even in the most mundane contexts, when Bayesians offer to replace our ignorance with ignoredge, we vote with our feet.

Probable Tracking
1. If the simple theory S were true, then the data would probably be simple, so Ockham's razor would probably believe S.
2. If the simple theory S were false, then the complex alternative theory C would be true, so the data would probably be complex, so you would probably believe C rather than S.

Probable Tracking
Given that you use Ockham's razor:
p(B(S) | S) = p(eS | S) = 1;
p(not-B(S) | not-S) = 1 - p(eS | C) = 1.
That is: p(B(C) | C) = 1 = the probability that the data look simple given C, and p(B(C) | not-C) = 0 = the probability that the data look simple given the alternative theory P.

B. Magic
Simplicity informs us of the truth via hidden causes.
B. Magic (continued)
Leibniz, evolution: Simple → B(Simple).
Kant: Simple → B(Simple).
Ouija board: Simple → B(Simple).
It is simpler to explain Ockham's razor without hidden causes.

Reductio of Naturalism (Koons 2000)
Suppose that the crucial probabilities p(Tq | T) in the Bayesian miracle argument are natural chances, so that Ockham's razor really is reliable.
Suppose that T is the fundamental theory of natural chance, so that Tq determines the true chance distribution pq for some choice of q.
But if pt(Tq) is defined at all, it should be 1 if t = q and 0 otherwise.
So natural science can produce fundamental knowledge of natural chance only if there are non-natural chances.

Diagnosis
Indication or tracking: too strong; circles, evasions, or magic are required.
Convergence: too weak; it doesn't single out simplicity.
"Straightest" convergence: just right?

II. Straightest Convergence

Empirical Problems
A set K of infinite input sequences, with a partition of K into alternative theories T1, T2, T3, ...

Empirical Methods
A method maps finite input sequences to theories or to "?".

Method Choice
At each stage, the scientist can choose a new method (agreeing with past theory choices).

Aim: Converge to the Truth
E.g., output T3, ?, T2, ?, T1, T1, T1, ..., eventually stabilizing to the true theory.

Retraction
Choosing T and then not choosing T next.

Aims
Eliminate needless retractions.
Eliminate needless delays to retractions: a theory accumulates applications and corollaries, so a late retraction is costlier than an early one.

Easy Retraction Time Comparisons
Method 1: T1 T1 T2 T2 T2 T2 T4 T4 T4 ...
Method 2: T1 T1 T2 T2 T2 T3 T3 T4 T4 ...
Method 2 retracts at least as many times, and at least as late, as Method 1.

Worst-case Retraction Time Bounds (1, 2, ∞)
Output sequences:
T1 T2 T3 T3 T3 T3 T4 ...
T1 T2 T3 T3 T3 T4 T4 ...
T1 T2 T3 T3 T4 T4 T4 ...
T1 T2 T3 T4 T4 T4 T4 ...

III. Ockham Without Circles, Evasions, or Magic

Curve Fitting
Data = open intervals around Y at rational values of X. Hypotheses: no effects, a first-order effect, a second-order effect, ...

Empirical Effects
Effects may take arbitrarily long to discover.

Empirical Theories
The true theory is determined by which effects appear.

Empirical Complexity
More effects = more complex; background constraints can change which theories count as simpler.

Ockham's Razor
Don't select a theory unless it is uniquely simplest in light of experience.

Weak Ockham's Razor
Don't select a theory unless it is among the simplest in light of experience.

Stalwartness
Don't retract your answer while it is uniquely simplest.

Uniform Problems
All paths of accumulating effects starting at a level have the same length.

Timed Retraction Bounds
r(M, e, n) = the least timed retraction bound covering the total timed retractions of M along input streams of complexity n that extend e.

Efficiency of Method M at e
M converges to the truth no matter what, and for each convergent M' that agrees with M up to the end of e, and for each n:
r(M, e, n) ≤ r(M', e, n).
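The retraction-time comparisons above can be made concrete in code. A minimal sketch (the helper `retraction_times` is mine, not the talk's): it lists the stages at which a method's output sequence retracts, i.e., changes from a chosen theory to something else.

```python
def retraction_times(outputs):
    """Stages n at which the method retracts: its output at stage n
    differs from the (non-'?') theory it chose at stage n - 1."""
    return [n for n in range(1, len(outputs))
            if outputs[n - 1] != "?" and outputs[n] != outputs[n - 1]]

method_1 = ["T1", "T1", "T2", "T2", "T2", "T2", "T4", "T4", "T4"]
method_2 = ["T1", "T1", "T2", "T2", "T2", "T3", "T3", "T4", "T4"]

print(retraction_times(method_1))  # [2, 6]
print(retraction_times(method_2))  # [2, 5, 7]
```

Method 2 retracts three times to Method 1's two, so Method 2 loses the easy comparison.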
M is Strongly Beaten at e
There exists a convergent M' that agrees with M up to the end of e such that, for each n, r(M, e, n) > r(M', e, n).

M is Weakly Beaten at e
There exists a convergent M' that agrees with M up to the end of e such that:
for each n, r(M, e, n) ≥ r(M', e, n); and
for some n, r(M, e, n) > r(M', e, n).

Idea
No matter what a convergent M has done in the past, nature can force M to produce each answer down an arbitrary effect path, arbitrarily often. Nature can also force violators of Ockham's razor or stalwartness either into an extra retraction or a late retraction in each complexity class.

Ockham Violation with Retraction
The violation costs an extra retraction in each complexity class.

Ockham Violation without Retraction
The violation costs a late retraction in each complexity class.

Uniform Ockham Efficiency Theorem
Let M be a solution to a uniform problem. The following are equivalent:
1. M is strongly Ockham and stalwart at e;
2. M is efficient at e;
3. M is not strongly beaten at e.

Idea (non-uniform case)
Similar, but if a convergent M already violates strong Ockham's razor by favoring an answer T at the root of a longer path, sticking with T may reduce retractions in complexity classes reached only along the longer path.

Violation Favoring the Shorter Path (non-uniform problem)
A late or extra retraction in each complexity class.

Violation Favoring the Longer Path without Retraction (non-uniform problem)
An extra retraction in each complexity class.

But at the First Violation...
The violator breaks even in each complexity class, yet loses in class 0 when the truth is the simplest theory.

Ockham Efficiency Theorem
Let M be a solution. The following are equivalent:
1. M is always strongly Ockham and stalwart;
2. M is always efficient;
3. M is never weakly beaten.
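The two "beaten" relations reduce to pointwise comparisons of the retraction-bound profiles r(M, e, ·) and r(M', e, ·). A minimal sketch, with profiles represented as lists indexed by complexity class (the function names are mine, and the timed bounds are simplified to single numbers per class for illustration):

```python
def strongly_beaten(r_m, r_m_prime):
    """M is strongly beaten by M': strictly worse in every complexity class."""
    return all(a > b for a, b in zip(r_m, r_m_prime))

def weakly_beaten(r_m, r_m_prime):
    """M is weakly beaten by M': no better anywhere, strictly worse somewhere."""
    pairs = list(zip(r_m, r_m_prime))
    return all(a >= b for a, b in pairs) and any(a > b for a, b in pairs)

print(strongly_beaten([2, 3, 4], [1, 2, 3]))  # True: worse in every class
print(weakly_beaten([1, 3, 4], [1, 2, 3]))    # True: ties in class 0, worse in 1 and 2
print(weakly_beaten([1, 2, 3], [1, 2, 3]))    # False: never strictly worse
```

Strong beating implies weak beating, which is why the non-uniform theorem needs the weaker relation.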
Application: Causal Inference
Causal graph theory: more partial correlations mean more causes.
Idealized data = the list of conditional dependencies discovered so far.
Anomaly = the addition of a conditional dependency to the list.

Causal Path Rule
X and Y are dependent conditional on a set S of variables not containing X, Y iff X and Y are connected by at least one path in which no non-collider is in S and each collider has a descendant in S. [Pearl, SGS]

Forcible Sequence of Models
Over variables X, Y, Z, W, nature can force the conditional dependencies to appear in stages:
X dep Y | {Z}, {W}, {Z,W};
then Y dep Z | {X}, {W}, {X,W} and X dep Z | {Y}, {W}, {Y,W};
then Z dep W | {X}, {Y}, {X,Y} and Y dep W | {X}, {Z}, {X,Z}.

Policy Prediction
A consistent policy estimator can be forced into retractions ("failure of uniform consistency"); there is no non-trivial confidence interval. [Robins, Wasserman, Zhang]

Moral
The issue is not true model vs. prediction, but actual vs. counterfactual model selection and prediction. In counterfactual prediction, the form of the model matters and retractions are unavoidable.

IV. Simplicity

Aim
Give a general definition of simplicity and prove the Ockham efficiency theorem for it.

Approach
Empirical complexity reflects the nested problems of induction posed by the problem. Hence, simplicity is problem-relative.

Empirical Problems
A set K of infinite input sequences, with a partition of K into alternative theories.

Grove Systems
A sphere system for K is just a downward-nested sequence of subsets of K starting with K.
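The sphere-system definition can be illustrated with a toy effect space. Below, worlds are identified with the set of effects they eventually present, and complexity level = number of effects; everything here (the tiny K, the update rule) is an illustrative assumption, not the talk's formalism.

```python
# Worlds tagged by which effects (from {"a", "b"}) they present.
K = {frozenset(), frozenset("a"), frozenset("b"), frozenset("ab")}

def sphere(n):
    """Downward-nested spheres: worlds with at most n effects."""
    return {w for w in K if len(w) <= n}

def update(worlds, observed_effect):
    """Update by restriction: keep only worlds compatible with the datum
    (here, an effect known to have appeared)."""
    return {w for w in worlds if observed_effect in w}

spheres = [sphere(n) for n in range(3)]
assert spheres[0] <= spheres[1] <= spheres[2] == K  # nested, topping out at K

print(sorted(len(w) for w in update(K, "a")))  # [1, 2]: effect "a" rules out level 0
```

Observing an effect prunes the simplest worlds but leaves every higher level inhabited, which is the engine behind forcibility.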
Grove Systems
Think of the successive differences of spheres as levels 0, 1, 2, ... of increasing empirical complexity in K.

Answer-preserving Grove Systems
No answer is split across levels; refine the offending answer if necessary.

Data-driven Grove Systems
Each answer is decidable given a complexity level, and each upward union of levels is verifiable.

Grove System Update
Update by restriction to the data.

Forcible Grove Systems
At each stage, the data presented by a world at a level are compatible with the next level up (if there is a next level).

Forcible Path
A forcible restriction of a Grove system; a forcible path to the top intersects every level.

Simplicity Concept
A data-driven, answer-preserving Grove system for which each restriction to a possible data event has a forcible path to the top.

Uniform Simplicity Concepts
If a data event intersects a level, it intersects each higher level.

Uniform Ockham Efficiency Theorem
Let M be a solution to a uniform problem. The following are equivalent:
1. M is strongly Ockham and stalwart at e;
2. M is efficient at e;
3. M is not strongly beaten at e.

Ockham Efficiency Theorem
Let M be a solution. The following are equivalent:
1. M is always strongly Ockham and stalwart;
2. M is always efficient;
3. M is never weakly beaten.

V. Stochastic Ockham

Mixed Strategies
Require that the strategy converge in chance to the true model: the chance of producing the true model at parameter q approaches 1 as sample size grows.

Retractions in Chance
The total drop in the chance of producing an arbitrary answer as sample size increases. This counts retractions in the signal, not actual retractions due to noise.

Ockham Efficiency
Bound retractions in chance by easy comparisons of time and magnitude; Ockham efficiency still follows.
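Retractions in chance can be read directly off an answer's chance-of-production curve: sum the drops between successive sample sizes. A minimal sketch (the function name is mine):

```python
def retractions_in_chance(chances):
    """Total drop in the chance of producing an answer as sample size grows:
    the sum of all decreases between successive sample sizes."""
    return sum(max(0.0, a - b) for a, b in zip(chances, chances[1:]))

# Two drops of .5 each, for a total retraction in chance of 1.0.
print(retractions_in_chance([0, 0, .5, 0, 0, 0, .5, 0, 0]))  # 1.0
```

A monotonically rising curve counts zero retractions, so only the signal's reversals are charged, not its noise-free climbs.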
Example: the chance sequence (0, 0, .5, 0, 0, 0, .5, 0, 0, ...) for producing the true model exhibits two drops of .5 each.

Classification Problems
Points from the plane are sampled IID and labeled with half-plane membership. The edge of the half-plane is some polynomial; what is its degree? The uniform Ockham efficiency theorem applies. [Cosma Shalizi]

Model Selection Problems
Random variables, IID sampling, joint distribution continuously parametrized. A partition over the parameter space; each partition cell is a "model". A method maps sample sequences to models.

Two-Dimensional Example
Assume an independent bivariate normal distribution of unit variance. Question: how many components of the joint mean are zero? Intuition: more nonzero components = more complex. Puzzle: how does it help to favor simplicity in less-than-simplest worlds?

A Standard Model Selection Method: the Bayes Information Criterion (BIC)
BIC(M, sample) = -log(max probability that M can assign to the sample) + (1/2) × (model complexity) × log(sample size).
BIC method: choose the M with the least BIC score.

Official BIC Property
In the limit, minimizing BIC finds a model with maximal conditional probability when the prior probability is flat over models and fairly flat over parameters within a model. But it is also mind-change-efficient.

Toy Problem
The truth is bivariate normal of known covariance; count the non-zero components of the mean vector.

Pure Method
Acceptance zones for the different answers in sample mean space: a "simple" zone around the origin, "complex" zones outside.

Performance in the Simplest World (m = (0, 0))
At n = 2, n = 100, and n = 4,000,000, the sample mean stays inside the (shrinking) simple acceptance zone: retractions = 0.
At n = 20,000,000 likewise: retractions = 0.

Performance in a Complex World (m = (.05, .005))
n = 2: retractions = 0.
n = 100: retractions = 0.
n = 30,000: retractions = 1.
n = 4,000,000 (!): retractions = 2.

Causal Inference from Stochastic Data
Suppose that the true linear causal model is over X, Y, Z, W with edge coefficients .998, .99, -.99, and .1, all variables standard normal.
[Scheines, Mayo-Wilson, and Fancsali] At sample size 40, the PC algorithm outputs an incorrect graph in 9 out of 10 samples; at sample size 100,000, PC outputs the true graph in 9 out of 10 samples.

Deterministic Sub-problems
Membership degree mem(w, w') is 0 or 1.
Worst-case cost at w = sup over w' of mem(w, w') × cost(w').
Worst-case cost = sup over w of the worst-case cost at w.

Statistical Sub-problems
Membership(p, p') = 1 - r(p, p').
Worst-case cost at p = sup over p' of mem(p, p') × cost(p').
Worst-case cost = sup over p of the worst-case cost at p.

Future Directions
a-Consistency: converge to production of the true answer with chance > 1 - a.
Compare worst-case timed bounds on retractions in chance of a-consistent methods over each complexity class.
Generalized power: minimizing retraction time forces the simple acceptance zones to be powerful.
Generalized significance: minimizing retractions forces the simple zone to have size a.
The balance between the two depends on a.
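The acceptance-zone behavior in the toy problem can be reproduced in simulation. This sketch uses a hypothetical shrinking-zone rule (radius n^(-1/4)); the talk's actual zones are not specified here, so the numbers are only qualitatively comparable.

```python
import math
import random

def answer(sample_mean, n):
    """Hypothetical rule: say 'simple' (mean = 0) while the sample mean lies
    inside a shrinking acceptance zone of radius n ** -0.25, else 'complex'."""
    return "simple" if math.hypot(*sample_mean) < n ** -0.25 else "complex"

def count_retractions(true_mean, max_n, seed=1):
    """Draw unit-variance bivariate normal samples and count answer changes."""
    random.seed(seed)
    sx = sy = 0.0
    outputs = []
    for n in range(1, max_n + 1):
        sx += random.gauss(true_mean[0], 1)
        sy += random.gauss(true_mean[1], 1)
        outputs.append(answer((sx / n, sy / n), n))
    return sum(outputs[i] != outputs[i - 1] for i in range(1, len(outputs)))

print("mean (0, 0):     ", count_retractions((0.0, 0.0), 20000))
print("mean (.05, .005):", count_retractions((0.05, 0.005), 20000))
```

In the simplest world the answer should settle on "simple" early; in the slightly complex world the retraction to "complex" arrives only at large sample sizes (if at all within the horizon), mirroring the late retractions reported above.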
VI. Conclusion: Ockham's Razor
It is necessary for staying on the straightest path to the truth.
It does not point at or indicate the truth.
It works without circles, evasions, or magic.
Such a theory is also motivated in counterfactual inference and estimation.

Further Reading
(with C. Glymour) "Why Probability Does Not Capture the Logic of Scientific Justification", in C. Hitchcock, ed., Contemporary Debates in the Philosophy of Science, Oxford: Blackwell, 2004.
"Justification as Truth-finding Efficiency: How Ockham's Razor Works", Minds and Machines 14: 2004, pp. 485-505.
"Ockham's Razor, Efficiency, and the Unending Game of Science", forthcoming in proceedings, Foundations of the Formal Sciences 2004: Infinite Game Theory, Springer, under review.
"How Simplicity Helps You Find the Truth Without Pointing at It", forthcoming in V. Harazinov, M. Friend, and N. Goethe, eds., Philosophy of Mathematics and Induction, Dordrecht: Springer.
"Ockham's Razor, Empirical Complexity, and Truth-finding Efficiency", forthcoming, Theoretical Computer Science.
"Learning, Simplicity, Truth, and Misinformation", forthcoming in J. van Benthem and P. Adriaans, eds., Philosophy of Information.

Appendix: Navigation Without a Compass

Asking for Directions
"Where's ...?" "Turn around. The freeway ramp is on the left."
The helpful advice does not indicate the goal; it is the best route to any goal. Disregarding the advice is bad: it costs an extra U-turn on the best route to any goal.
...so fixed advice can help you reach a hidden goal without circles, evasions, or magic.

"There is no difference whatsoever in It. He goes from death to death, who sees difference, as it were, in It." [Brihadaranyaka 4.4.19-20]
"Living in the midst of ignorance and considering themselves intelligent and enlightened, the senseless people go round and round, following crooked courses, just like the blind led by the blind." [Katha Upanishad I.ii.5]
Academic
If there weren't an apple on the table, I wouldn't be a brain in a vat, so I wouldn't see one. Poof!