Ockham’s Razor in Causal Discovery: A New Explanation

Kevin T. Kelly and Conor Mayo-Wilson
Department of Philosophy, Joint Program in Logic and Computation, Carnegie Mellon University
www.hss.cmu.edu/philosophy/faculty-kelly.php

I. Prediction vs. Policy

Predictive Links. Correlation or co-dependency allows one to predict Y from X. (Scientist: "Ash trays linked to lung cancer!")

Policy. Policy manipulates X to achieve a change in Y. (Policy maker: "Prohibit ash trays!" ... "We failed!")

Correlation is not Causation. Manipulation of X can destroy the correlation of X with Y.

Standard Remedy. A randomized controlled study shows what happens if you actually carry out the policy. But randomized studies face obstacles of infeasibility, expense, and morality. ("Let me force a few thousand children to eat lead... Just joking!")

Ironic Alliance. Industry: "Ha! You will never prove that lead affects IQ... And you can't throw my people out of work on a mere whim. So I will keep on polluting, which will never settle the matter, because it is not a randomized trial."

II. Causes From Correlations

Causal Discovery. Patterns of conditional correlation can imply unambiguous causal conclusions (Pearl, Spirtes, Glymour, Scheines, etc.). ("Eliminate protein C!")

Basic Idea. Causation is a directed, acyclic network over variables. What makes a network causal is a relation of compatibility between networks and joint probability distributions.

Compatibility. A joint distribution p is compatible with a directed, acyclic network G iff:

Causal Markov Condition: each variable X is independent of its non-effects given its immediate causes.
Faithfulness Condition: every conditional independence relation that holds in p is a consequence of the Causal Markov Condition.

Common Cause (B ← A → C). B yields info about C (Faithfulness); B yields no further info about C given A (Markov).

Causal Chain (B → A → C). B yields info about C (Faithfulness); B yields no further info about C given A (Markov).

Common Effect (B → A ← C). B yields no info about C (Markov); B yields extra info about C given A (Faithfulness).

Distinguishability. The common-cause and causal-chain patterns are indistinguishable from one another; the common-effect pattern is distinctive.

Immediate Connections. There is an immediate causal connection between X and Y iff X is dependent on Y given every subset of variables not containing X and Y (Spirtes, Glymour, and Scheines): for mediated connections, some conditioning set breaks the dependency; for immediate connections, no conditioning set breaks it.

Recovery of Skeleton. Apply the preceding condition to recover every non-oriented immediate causal connection.

Orientation of Skeleton. Look for the distinctive pattern of common effects, then draw all deductive consequences of these orientations (e.g., where Y is not a common effect, the remaining edge at Y must be oriented away from Y).

Causation from Correlation. The protein network in the example is causally unambiguous if all variables are observed; the red arrow is also immune to latent confounding causes.

Brave New World for Policy. Experimental (confounder-proof) conclusions from correlational data! ("Eliminate protein C!")

III. The Catch

Metaphysics vs. Inference. The above results all assume that the true statistical independence relations for p are given. But they must be inferred from finite samples.
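The common-effect pattern can be checked numerically. The following is a minimal simulation sketch (the variable names, noise level, and sample size are my own illustration, not from the talk): B and C are independent causes of A, so their marginal correlation is near zero (Markov), while their partial correlation given A is strongly negative (Faithfulness).

```python
# Toy collider B -> A <- C: B and C are marginally independent,
# but become dependent once we condition on their common effect A.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
B = rng.normal(size=n)
C = rng.normal(size=n)
A = B + C + 0.5 * rng.normal(size=n)  # common effect of B and C

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def partial_corr(x, y, z):
    # correlation of residuals after linearly regressing x and y on z
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return corr(rx, ry)

marginal = corr(B, C)                # near 0: B yields no info about C
conditional = partial_corr(B, C, A)  # near -0.8: extra info about C given A
```

For this covariance structure the population partial correlation works out to exactly -0.8, which is why conditioning on a collider is such a distinctive signature.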
Sample → inferred statistical dependencies → causal conclusions.

Problem of Induction. Independence is indistinguishable from sufficiently small dependence at sample size n.

Bridging the Inductive Gap. Assume conditional independence until the data show otherwise. Ockham’s razor: assume no more causal complexity than necessary.

Inferential Instability. There is no guarantee that small dependencies will not be detected later, and their detection can have a spectacular impact on prior causal conclusions: "Eliminate protein C!"; as sample size increases, a weak dependency appears: "Rescind that order!"; as sample size increases again: "Eliminate protein C again!"; etc.

Typical Applications. Linear causal case: each variable X is a linear function of its parents and a normally distributed hidden variable called an "error term"; the error terms are mutually independent. Discrete multinomial case: each variable X takes on a finite range of values.

An Optimistic Concession. Assume no unobserved latent confounding causes (e.g., no genetic common cause of smoking and cancer).

Causal Flipping Theorem. No matter what a consistent causal discovery procedure has seen so far, there exists a pair (G, p) satisfying the above assumptions such that the current sample is arbitrarily likely in p and the procedure produces arbitrarily many opposite conclusions in p about an arbitrary causal arrow in G as sample size increases. ("Oops, I meant... oops, I meant...") Every consistent causal inference method is covered; therefore, multiple instability is an intrinsic feature of the causal discovery problem.
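The inductive gap can be sketched numerically (the correlation strength, test, and sample sizes below are my own illustration): a weak dependence is statistically indistinguishable from independence at small samples and is only detected once the sample is large enough, so an early "independence" verdict may later flip.

```python
# A weak correlation rho looks like independence until n is large:
# the z-statistic |r| * sqrt(n) only crosses the critical value ~1.96
# once the sample size overwhelms the sampling noise.
import numpy as np

def dependence_detected(n, rho, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
    r = np.corrcoef(x, y)[0, 1]
    return bool(abs(r) * np.sqrt(n) > 1.96)  # approximate 5%-level test

verdict_small_n = dependence_detected(100, rho=0.02)        # typically False
verdict_large_n = dependence_detected(1_000_000, rho=0.02)  # True
```

A method that assumes independence until the data show otherwise is therefore forced to retract when the weak dependence finally surfaces, which is exactly the instability the flipping theorem generalizes.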
The Crooked Course. "Living in the midst of ignorance and considering themselves intelligent and enlightened, the senseless people go round and round, following crooked courses, just like the blind led by the blind." (Katha Upanishad, I. ii. 5)

Extremist Reaction. Since causal discovery cannot lead straight to the truth, it is not justified. ("I must remain silent. Therefore, I win.")

Moderate Reaction. "Many explanations have been offered to make sense of the here-today-gone-tomorrow nature of medical wisdom — what we are advised with confidence one year is reversed the next — but the simplest one is that it is the natural rhythm of science." ("Do We Really Know What Makes Us Healthy?", NY Times Magazine, Sept. 16, 2007)

Skepticism Inverted. Unavoidable retractions are justified because they are unavoidable; avoidable retractions are not justified because they are avoidable. So the best possible methods for causal discovery are those that minimize causal retractions, and the best possible means for finding the truth are justified.

Larger Proposal. The same holds for Ockham’s razor in general when the aim is to find the true theory.

IV. Ockham’s Razor

Which Theory is Right? Ockham says: choose the simplest! But why? ("Gotcha!")

Puzzle. An indicator must be sensitive to what it indicates, but Ockham’s razor always points at simplicity, whether the truth is simple or complex. How can a broken compass help you find something unless you already know where it is?

Standard Accounts. 1. Prior simplicity bias: Bayes, BIC, MDL, MML, etc. 2. Risk minimization: SRM, AIC, cross-validation, etc.

1. Bayesian Account. Ockham’s razor is a feature of one’s personal prior belief state. Short run: no objective connection with finding the truth (the flipping theorem applies).
Long run: converges to the truth, but other prior biases would also lead to convergence.

2. Risk Minimization Account. Risk minimization is about prediction rather than truth; it urges using a false causal theory rather than the known true theory for predictive purposes. Therefore it is not suited to exact science or to practical policy applications.

V. A New Foundation for Ockham’s Razor

Connections to the Truth. Short-run reliability is too strong to be feasible when theory matters. Long-run convergence is too weak to single out Ockham’s razor. The middle path, "straightest" convergence, may be just right.

Empirical Problems. A set K of infinite input sequences and a partition of K into alternative theories T1, T2, T3, ...

Empirical Methods. Maps from finite input sequences to theories or to "?".

Method Choice. At each stage, the scientist can choose a new method (agreeing with past theory choices).

Aim: Converge to the Truth. E.g., output ?, T2, ?, T1, T1, T1, T1, ... converging to the true T1.

Retraction. Choosing T and then not choosing T next.

Aims. Eliminate needless retractions; eliminate needless delays to retractions (a standing theory accumulates applications and corollaries, which all fall with it).

Why Timed Retractions? Retraction minimization = generalized significance level. Retraction time minimization = generalized power.

Easy Retraction Time Comparisons.
Method 1: T1 T1 T2 T2 T2 T2 T4 T4 T4 ...
Method 2: T1 T1 T2 T2 T2 T3 T3 T4 T4 ...
Method 2 retracts at least as many times, at least as late.

Worst-case Retraction Time Bounds (1, 2, ∞). Output sequences:
T1 T2 T3 T3 T3 T3 T4 ...
T1 T2 T3 T3 T3 T4 T4 ...
T1 T2 T3 T3 T4 T4 T4 ...
T1 T2 T3 T4 T4 T4 T4 ...

Curve Fitting. Data = open intervals around Y at rational values of X. No effects: constant. First-order effect: linear. Second-order effect: quadratic.

Ockham. Ockham advances from constant to linear to quadratic to cubic only as effects appear. ("There yet?" "Maybe.")

Ockham Violation. The violator leaps ahead ("I know you're coming!"), settles in ("Hmm, it's quite nice here..."), and is forced back ("You're back! Learned your lesson?"). Violator's path: "See, you shouldn't run ahead, even if you are right!"

More General Argument Required. Cover the case in which the demon has branching paths (causal discovery), and the case in which the scientist lags behind (using time as a cost). ("Come on!")

Empirical Effects. Effects may take arbitrarily long to discover, but they can't be taken back.

Empirical Theories. The true theory is determined by which effects appear.

Empirical Complexity. The more effects, the more complex the theory, relative to background constraints.

Ockham’s Razor. Don’t select a theory unless it is uniquely simplest in light of experience.
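The retraction-time comparison between methods can be made concrete. This is a small sketch (the helper name is mine): a retraction occurs when a method outputs a theory and then drops it at the next stage, and the comparison tracks both how many retractions occur and when.

```python
# Count retraction times in an output sequence: a retraction at stage i
# means the previous output was a theory (not "?") different from outputs[i].
def retraction_times(outputs):
    return [i for i in range(1, len(outputs))
            if outputs[i - 1] not in ("?", outputs[i])]

method_1 = ["T1", "T1", "T2", "T2", "T2", "T2", "T4", "T4", "T4"]
method_2 = ["T1", "T1", "T2", "T2", "T2", "T3", "T3", "T4", "T4"]

times_1 = retraction_times(method_1)  # [2, 6]
times_2 = retraction_times(method_2)  # [2, 5, 7]
```

Here method 2 retracts at least as many times as method 1 (three against two), and each of method 1's retractions can be matched with a method-2 retraction that comes no earlier, which is the "at least as many, at least as late" comparison from the slides.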
Weak Ockham’s Razor. Don’t select a theory unless it is among the simplest in light of experience.

Stalwartness. Don’t retract your answer while it is uniquely simplest.

Timed Retraction Bounds. r(M, e, n) = the least timed retraction bound covering the total timed retractions of M along input streams of complexity n that extend e.

Efficiency of Method M at e. M converges to the truth no matter what; and for each convergent M′ that agrees with M up to the end of e, and for each n: r(M, e, n) ≤ r(M′, e, n).

M is Beaten at e. There exists a convergent M′ that agrees with M up to the end of e such that for each n, r(M, e, n) ≥ r(M′, e, n), and for some n, r(M, e, n) > r(M′, e, n).

Ockham Efficiency Theorem. Let M be a solution. The following are equivalent: M is always strongly Ockham and stalwart; M is always efficient; M is never weakly beaten.

Example: Causal Inference. Effects are conditional statistical dependence relations, e.g.:
X dep Y | {Z}, {W}, {Z,W}
Y dep Z | {X}, {W}, {X,W}
X dep Z | {Y}, {W}, {Y,W}
Z dep W | {X}, {Y}, {X,Y}
Y dep W | {X}, {Z}, {X,Z}

Causal Discovery = Ockham’s Razor. As these dependencies are discovered, the network over X, Y, Z, W is built up and oriented accordingly.

VI. Simplicity Defined

Approach. Empirical complexity reflects the nested problems of induction posed by the problem.
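The immediate-connection criterion behind these dependence lists can be sketched with an independence oracle (the oracle below encodes a toy chain X → Y → Z; all names and the example are mine, not from the talk): two variables are adjacent in the skeleton iff they are dependent given every conditioning subset of the remaining variables.

```python
# Skeleton recovery via the immediate-connection criterion:
# X - Y is an edge iff X and Y are dependent given every subset
# of the other variables.
from itertools import chain, combinations

def all_subsets(vs):
    return chain.from_iterable(combinations(vs, r) for r in range(len(vs) + 1))

def skeleton(variables, dependent):
    edges = set()
    for x, y in combinations(variables, 2):
        rest = [v for v in variables if v not in (x, y)]
        if all(dependent(x, y, set(s)) for s in all_subsets(rest)):
            edges.add(frozenset((x, y)))
    return edges

def chain_oracle(a, b, cond):
    # In X -> Y -> Z, only X and Z become independent, and only given Y.
    if frozenset((a, b)) == frozenset(("X", "Z")):
        return "Y" not in cond
    return True

edges = skeleton(["X", "Y", "Z"], chain_oracle)  # edges X-Y and Y-Z, no X-Z
```

Only the X-Y and Y-Z adjacencies survive; the mediated X-Z dependency is broken by conditioning on Y, exactly as in the skeleton-recovery slides.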
Hence, simplicity is problem-relative but topologically invariant.

Empirical Problems. A set K of infinite input sequences and a partition Q of K into alternative theories.

Simplicity Concepts. A simplicity concept for (K, Q) is a well-founded order < on a partition S of K, with ascending chains of order type not exceeding omega, such that: (1) each element of S is included in some answer in Q; (2) each downward union in (S, <) is closed; (3) incomparable sets share no boundary point; (4) each element of S is included in the boundary of its successor.

Empirical Complexity Defined. Let K|e denote the set of all possibilities compatible with observations e, and let (S, <) be a simplicity concept for (K|e, Q). Define c(w, e) = the length of the longest <-path to the cell of S that contains w, and c(T, e) = the least c(w, e) such that T is true in w.

Applications. Polynomial laws: complexity = degree. Conservation laws: complexity = particle types − conserved quantities. Causal networks: complexity = number of logically independent conditional dependencies entailed by faithfulness.

General Ockham Efficiency Theorem. Let M be a solution. The following are equivalent: M is always strongly Ockham and stalwart; M is always efficient; M is never beaten.

Conclusions. Causal truths are necessary for counterfactual predictions. Ockham’s razor is necessary for staying on the straightest path to the true theory, but it does not point at the true theory. No evasions or circles are required.

Future Directions. Extension of the unique efficiency theorem to stochastic model selection; latent variables as Ockham conclusions; degrees of retraction; pooling of marginal Ockham conclusions; retraction-efficiency assessment of MDL and SRM.

Suggested Reading. "Ockham’s Razor, Truth, and Information", in Handbook of the Philosophy of Information, J. van Benthem and P. Adriaans, eds., to appear.
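The "polynomial laws: complexity = degree" application can be illustrated with a small sketch (the tolerance, data, and function name are my own assumptions): an Ockham method selects the lowest polynomial degree whose least-squares fit stays within tolerance of the observations, rather than leaping to a higher-degree curve.

```python
# Ockham's razor for polynomial laws: try degrees in increasing order
# and return the first one that fits the data within tolerance.
import numpy as np

def ockham_degree(xs, ys, tol=1e-8, max_degree=6):
    for d in range(max_degree + 1):
        coeffs = np.polyfit(xs, ys, d)
        if np.max(np.abs(np.polyval(coeffs, xs) - ys)) <= tol:
            return d
    return max_degree

xs = np.linspace(-1.0, 1.0, 9)
d_linear = ockham_degree(xs, 2.0 * xs + 1.0)  # first-order effect
d_quadratic = ockham_degree(xs, xs ** 2)      # second-order effect
```

Each "effect" in the data forces one more retraction from the previously simplest answer, which is why degree serves as the empirical complexity of the law.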
"Ockham’s Razor, Empirical Complexity, and Truth-finding Efficiency", Theoretical Computer Science 383: 270-289, 2007. Both are available as preprints at www.hss.cmu.edu/philosophy/faculty-kelly.php.

1. Prior Simplicity Bias. The simple theory is more plausible now because it was more plausible yesterday.

More Subtle Version. Simple data are a miracle in the complex theory but not in the simple theory. (Regularity: the retrograde motion of Venus at solar conjunction "has to be" on one theory, and holds only for a special parameter setting on the other.) However, e would not be a miracle given P(q); why not favor P(q)?

The Real Miracle. Ignorance about the model, p(C) ≈ p(P), plus ignorance about the parameter setting, p(P(q) | P) ≈ p(P(q′) | P), together yield knowledge about C vs. P(q): p(P(q)) << p(C). Lead into gold; perpetual motion; a free lunch. ("Sounds good!")

Standard Paradox of Indifference. Ignorance of red vs. not-red, plus ignorance over the not-red alternatives, equals knowledge about red vs. white. "Knognorance" = all the privileges of knowledge with none of the responsibilities. ("Sounds good!")

The Ellsberg Paradox. An urn holds red balls in known proportion 1/3 and an unknown mixture of two other colors. Humans prefer a bet on the known 1/3 chance to a bet on an unknown color, yet prefer a bet on the two unknown colors together to a mixed bet: knowledge is preferred to ignorance. Bayesian "rationality" sees knognorance on both sides and brands the preference incoherent.

In Any Event. The coherentist foundations of Bayesianism have nothing to do with short-run truth-conduciveness. ("Not so loud!")

Bayesian Convergence. Too-simple theories get shot down ("Blam!"); plausibility is transferred to the next-simplest theory ("Plink!"); the true theory is never shot down.
Convergence. But alternative strategies also converge: any theory choice in the short run is compatible with convergence in the long run.

Summary of Bayesian Approach. Prior-based explanations of Ockham’s razor are circular and based on a faulty model of ignorance. Convergence-based explanations of Ockham’s razor fail to single out Ockham’s razor.

2. Risk Minimization. Ockham’s razor minimizes the expected distance of empirical estimates from the true value. Unconstrained estimates are centered on the truth but spread around it; constrained ("clamped") estimates are off-center but less spread, for an overall improvement in expected distance from the truth.

Doesn’t Find the True Theory. The theory that minimizes estimation risk can be quite false.

Makes Sense when the loss of an answer is similar in nearby distributions ("close is good enough"), but not when truth matters, i.e., when the loss of an answer is discontinuous with similarity ("close is no cigar").
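The "off-center but less spread" picture can be sketched numerically (the true value, shrinkage factor, and sample sizes below are my own illustration): shrinking the sample mean toward zero introduces bias yet reduces variance, and the net effect can be a lower expected squared distance from the truth, exactly the trade risk minimization exploits.

```python
# Clamped (shrunk) estimates vs. unconstrained estimates: the biased
# estimator can have lower mean squared error despite being off-center.
import numpy as np

rng = np.random.default_rng(3)
truth = 0.3
trials, n = 20_000, 10
means = rng.normal(truth, 1.0, size=(trials, n)).mean(axis=1)

unconstrained_mse = np.mean((means - truth) ** 2)  # variance only, ~1/n
clamped_mse = np.mean((0.5 * means - truth) ** 2)  # bias^2 + variance/4
```

This also shows why risk minimization doesn't find the true theory: the clamped estimator wins on expected distance while being systematically wrong about the parameter, which is acceptable for prediction but not when the truth itself matters.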