Rationality and Reasonableness,
a Ramseyian Distinction
Jim Joyce
Department of Philosophy
University of Michigan
jjoyce@umich.edu
Pitt
October 13, 2006
Minimal Bayesianism
• Beliefs come in varying gradations of strength.
A believer’s opinions at a time can be faithfully modeled by a family of
functions C that map propositions/events into [0, 1]. The person is more
confident in X than in Y iff c(X) > c(Y) for all c ∈ C. C is her credal state.[1]
• Rational degrees of belief are governed by the laws of probability.
Probabilistic Coherence. Every credence function c ∈ C is such that
c(T) = 1, c(⊥) = 0, c(X ∨ Y) + c(X ∧ Y) = c(X) + c(Y).
• Learning proceeds by Bayesian Updating.
A person in state C who learns new information E (and no more) should
revise her beliefs by conditioning in accordance with Bayes’ Theorem, so
that her new credal state is C_E = {c(·|E) = c(· & E)/c(E) : c ∈ C}.
Bayes’ Theorem: c(H|E) = [c(H)c(E|H)]/[c(H)c(E|H) + c(~H)c(E|~H)]
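A minimal numeric sketch of the update rule for a single hypothesis H (all numbers hypothetical):

```python
# Bayesian updating on a single hypothesis H, with hypothetical numbers:
# prior c(H) = 0.3, likelihoods c(E|H) = 0.9 and c(E|~H) = 0.2.
def update(prior, like_h, like_not_h):
    """Posterior c(H|E) via Bayes' Theorem."""
    return (prior * like_h) / (prior * like_h + (1 - prior) * like_not_h)

print(update(0.3, 0.9, 0.2))  # c(H|E) ~= 0.659
```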
The Problem of the Priors
According to Bayes’ theorem, a believer’s response to evidence E depends on both
(a) her “prior” degrees of belief in various hypotheses, c(X), on the basis of the
evidence she has before learning E, and (b) “likelihoods” c(E|X), which reflect E’s
predictability given that various hypotheses obtain.
Common Claim (often made by non-Bayesians): Likelihoods represent “objective”
features of the situation; priors represent the individual’s “subjective” contribution.
Note: The “objective” character of likelihoods is suspect, at least for composite
hypotheses that can be partitioned into disjoint parts X = Y ∨ Z, each of which has a
determinate likelihood. In this case we seem to have dependence on priors via
c(E|X) = c(Y|X)c(E|Y) + c(Z|X)c(E|Z), where c(Y|X) = c(Y)/[c(Y) + c(Z)].
The Problem: Even if likelihoods are handed down from on high, the range of
allowable responses to evidence, as reflected by posterior probabilities, remains
wide due to variation in priors.[2] Bayesian statistical reasoning seems infected with
a rampant subjectivism that permits almost any response to any evidence.
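To see how wide the range is, the formula in note [2] can be run in reverse: holding the likelihoods fixed, a suitable prior reaches any target posterior. A sketch (the likelihood values are hypothetical):

```python
# Note [2] in reverse: with likelihoods fixed, any posterior p in (0, 1) is
# reachable by choosing the right prior. The likelihoods are hypothetical.
like_h, like_not_h = 0.9, 0.2

def prior_for(p):
    """The prior c(X) that makes c(X|E) = p, per note [2]."""
    return p * like_not_h / ((1 - p) * like_h + p * like_not_h)

def posterior(prior):
    return prior * like_h / (prior * like_h + (1 - prior) * like_not_h)

for p in (0.01, 0.5, 0.99):
    print(p, round(posterior(prior_for(p)), 10))  # recovers each target p
```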
Response-1: Personalism (Savage, de Finetti)
Personalists suggest that we learn to live with subjectivism in statistical reasoning.
While degrees of belief must be coherent, there are no further constraints on
rational opinion, and so no legitimate epistemological basis on which to
criticize believers who are obeying the laws of probability.
Useful Picture: The prior evidence is a set of constraints that fix expected
values, variances, and so on, for certain quantities. Let A be the set of all
probability functions that satisfy the constraints.
(Caution: Do not think of constraints as propositions to be conditioned upon!)
Personalism says that any subset of A is a legitimate credal state.
To sweeten the relativist pill personalists offer intersubjective agreement as a
surrogate for impersonal objectivity.
• Washing out results.
• Sensitivity analyses.
• Dominant likelihoods.
• Shared views about incremental evidence despite differences in
posterior probabilities (e.g., h-d confirmation, likelihood ratios).
Response 2: “Objective” Bayesianism
“The most elementary requirement of consistency demands that two
persons with the same relevant prior information should assign the same
prior probabilities. Personalistic doctrine makes no attempt to meet this
requirement.” E.T. Jaynes
Objective Bayesians single out certain subsets of A as legitimate credal states
given the evidence, allegedly on the basis of epistemological considerations.
• They see the central problem of inductive/statistical reasoning as that of
choosing some single prior from A that best represents our uncertainty given
the objective information in the constraints.
• This “informationless” prior is portrayed as the one that “goes least beyond
the evidence” and “treats symmetrical cases equally.”
The Principle of Indifference.
If prior data fails to distinguish between two events, if it provides no ‘sufficient
reason’ to regard one as more probable than the other, then the events should be
assigned the same probability.
E.g., if you only know that you lost your keys somewhere between work and home, you
should assign equal probability to finding them at every point along your route.
Standard Objection (Venn): Results obtained from PI depend on the way
possibilities are described (e.g., as values of x or as values of x²).
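A numeric sketch of the objection (hypothetical setup: a quantity x known only to lie in [0, 1]):

```python
import numpy as np

# Venn's objection, numerically: indifference over x and indifference over
# y = x**2 are different priors and answer the same question differently.
rng = np.random.default_rng(0)
x_uniform = rng.uniform(0, 1, 10**6)          # PI applied to x
x_from_y = np.sqrt(rng.uniform(0, 1, 10**6))  # PI applied to x**2 instead

print((x_uniform <= 0.5).mean())  # ~0.50: P(x <= 1/2) with x uniform
print((x_from_y <= 0.5).mean())   # ~0.25: same event with x**2 uniform
```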
Responses: (1) Find some privileged description of possibilities.
This always seems to require empirical information that goes
beyond the constraints and thus invokes “personal” probabilities!
(2) Augment PI with symmetry principles.
Symmetries to the Rescue?
Translation Invariance. It can be reasonable to think that priors should not
depend on the zero-point or the unit used to measure a quantity of interest x.
• For zero-invariance c must derive from a probability density p such that
p(x)dx = p(x + z)dx for each z. So, p must be uniform.
(E.g., find the prior for the mean of a normal distribution of known variance.)
• For unit-invariance c must derive from a density with p(x)dx = p(ux)u dx for
each u > 0. So, p(x) must be proportional to 1/x or, equivalently, p(log(x))
must be uniform. This is Jeffreys’ prior.[3]
(E.g., find the prior for the variance of a normal distribution of known mean.)
• For more general symmetry groups: Haar measures.
Appeals to such symmetry requirements can seem to solve Venn’s problem.
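A numeric check of the unit-invariance claim, as a sketch (the [2, 3] range is borrowed from note [3]):

```python
import numpy as np
from scipy.integrate import quad

# The Jeffreys prior p(x) ~ 1/x assigns the same probability to an interval
# however the unit is rescaled: over [2u, 3u] the density keeps the 1/x form
# and the same normalizing constant k = ln(3) - ln(2).
k = np.log(3) - np.log(2)

def prob(a, b, u=1.0):
    """P(x in [u*a, u*b]) under p(x) = 1/(x*k) on [2u, 3u]."""
    return quad(lambda x: 1.0 / (x * k), u * a, u * b)[0]

print(prob(2.2, 2.7))         # in meters
print(prob(2.2, 2.7, u=100))  # in centimeters: identical probability
```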
“The use of Jeffreys priors realizes R. A. Fisher’s ideal of ‘allowing the data
to speak for themselves’.” Rosenkrantz (2006)
Here is William Jefferys, a Bayesian astronomer, commenting on a use of Venn’s
argument by Elliott Sober in which x = length in meters and x² = area in square meters.
“A satisfactory objective Bayesian solution has been known since the time of
Harold Jeffreys in the 1930s… I don’t know of any experienced Bayesian
who would want to put a uniform prior on either length or area if nothing else
were known…. If one is considering quantities like length and area, the
natural prior is the Haar prior that is invariant under the natural invariance
group of scale changes… When one takes this constraint into account, one
arrives at a prior that is inversely proportional to the length or the area,
respectively. An easy calculation shows that such a prior gives identical
results, regardless of whether one decides to look at length or area.”
W. Jefferys (2003)
The problem is the emphasized phrase.
This approach is only feasible if one already knows that probabilities depend
on distances measured in (some transformation of) meters, or on areas
measured in (some transformation of) meters squared, and so on. But, this
is a substantive empirical assumption.
Example: Crazy Henri thinks the universe is like the disk-world of Poincaré’s
Science and Hypothesis, and is therefore convinced that meter sticks shrink
logarithmically in the direction of measurement.
• Probabilities, Henri says, depend on “real distance” = log(distance x
in meters) in just the way we think they depend on distance in meters.
• When he applies Jeffreys’ rule he obtains a prior that is uniform over
log(log(x)), not over log(x)!
• Henri’s probabilities are just as “scale-invariant” as ours.[4]
Henri may be mistaken, but his error is not an a priori one. Henri looks at
the same evidence we do, but interprets it differently because he divides up
the possibilities differently than we do.
To rule him out it looks like we need personal probabilities!
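A numeric sketch of the disagreement (the densities come from note [4]):

```python
import numpy as np

# Note [4] in numbers: our log-uniform prior and Henri's log-log-uniform
# prior over x in [2, 3] assign different probabilities to the same interval.
a, b = 2.0, 2.5
ours = (np.log(b) - np.log(a)) / (np.log(3) - np.log(2))
henri = ((np.log(np.log(b)) - np.log(np.log(a)))
         / (np.log(np.log(3)) - np.log(np.log(2))))

print(round(ours, 3))   # ~0.550: uniform over log(x)
print(round(henri, 3))  # ~0.606: uniform over log(log(x))
```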
Entropy Maximization as a Way Out?
Jaynes. When each constraint has the form ∑j c(Xj)f(Xj) = constant, where
{X1,…, Xk} is a partition (common to all constraints) and f is a convex real
function, Jaynes advises us to choose the unique member of A that maximizes
Entropy(c) = −∑j c(Xj)log(c(Xj)),
thereby minimizing the additional information (about the Xj) that c
encapsulates. (See the sketch below.)
• Does this solve the Venn problem? Not really. See Seidenfeld (1986).
• Additional issue. It matters whether a piece of information is treated as a
constraint on a prior or as data to be conditioned on. The probability obtained
by imposing two constraints can differ from the probability obtained by imposing
one constraint and then conditioning on the information in the other. (Same cite.)
Note: Jaynes maintains that there is a clear distinction between constraints and
items of data in well-posed problems. There may be such a distinction, but it cannot be
drawn a priori in the way Jaynes’ views require.
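A sketch of MAXENT under a single mean-value constraint, using Jaynes’ familiar dice illustration (all we are told is that the average toss is 4.5):

```python
import numpy as np
from scipy.optimize import minimize

# MAXENT for a six-sided die constrained to have mean 4.5 (Jaynes' example).
faces = np.arange(1, 7)

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))  # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},    # probabilities sum to 1
    {"type": "eq", "fun": lambda p: faces @ p - 4.5},  # the evidential constraint
]
res = minimize(neg_entropy, np.full(6, 1 / 6), bounds=[(0, 1)] * 6,
               constraints=constraints)
print(np.round(res.x, 4))  # skews toward high faces: an exponential-family prior
```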
The Real Problem with OB (Fisher)
PI and MAXENT, whether consistent or not, are defective epistemology because
they treat ignorance as if it is knowledge.
Someone who knows only that a die has six sides is treated as being on an epistemic
par, as far as predicting the next toss is concerned, with a person who also knows the
die to be fair.
General Moral. It is an error to try to capture states of ambiguous or incomplete
evidence using a single prior. Going from a set of priors A that meet all the
evidential constraints to a single “informationless” prior in A always involves
importing loads of information into the problem.
(Note: Adding information can be OK, but not under the guise of ‘logic’.)
• Reply. Statistical reasoning can’t get off the ground without a prior to work
with, and the MAXENT prior adds the least new information because, e.g., it
treats equally the possibilities that the data does not distinguish, by assigning
them equal probabilities.
I concede that there is a sense in which the MAXENT prior adds the least new information
of any sharp prior, but this is consistent with it adding a lot of new information.
• The fallacious step is the last one: equal treatment does not require equal
probability. Symmetries in evidence are naturally captured, and best
characterized, by symmetries among elements of A.
– E.g., to represent the idea that a prior is independent of the unit of
measure one should not look for a single prior with p(x)dx = p(ux)u dx.
One should, rather, notice that for each c ∈ A and u > 0 there is a cu ∈ A
such that c(x) = cu(ux) for all x. (See the sketch below.)
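A sketch of the contrast, with a hypothetical credal set A made up of Gamma densities: no single member is scale-invariant, but the set is closed under changes of unit.

```python
import numpy as np
from scipy import stats

# A hypothetical credal set A: all Gamma(alpha, scale) densities over x > 0.
# Rescaling the unit (x -> u*x) carries Gamma(alpha, s) to Gamma(alpha, u*s),
# another member of A. The symmetry lives in A, not in any one prior.
u, alpha, s = 2.5, 3.0, 1.7
x = np.linspace(0.1, 10, 5)
c = stats.gamma(alpha, scale=s)        # some c in A
c_u = stats.gamma(alpha, scale=u * s)  # its image under the unit change

# c assigns [a, b] the probability c_u assigns [u*a, u*b] (density version):
print(np.allclose(c.pdf(x), u * c_u.pdf(u * x)))  # True
```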
Response 3: Don’t Ask Don’t Tell
“Subjective priors do not have any probative force… If science is about the objective
public evaluation of hypotheses, these subjective feelings do not have any scientific
standing. When scientists read research papers, they want information about the
phenomena under study, not autobiographical remarks about the authors of the study.
A report of the authors’ subjective probabilities blends these two inputs together. This
is why it would be better to expunge the subjective element and let the objective
likelihoods speak for themselves… I am not suggesting that we should avoid
Bayesian thinking in the privacy of our own homes. If you have a subjective degree
of belief in a hypothesis by all means use Bayes’ theorem to update [it] as you
obtain new evidence… [but] disagreement[s] cannot be resolved by pointing to the
fact that different agents have different priors.” Sober (2002, pp. 23-24)
Same idea: Statisticians should strive to present results in ways that do not
incorporate priors (using p-values, confidence intervals, and so on). It’s fine if a
‘client’ wants to process the data through his or her own priors, but that’s not
the statistician’s business.
M. Woodroofe: Bayesianism is OK for business, but does it belong in science?
Should Bayesianism be Jettisoned?
No. There is nothing wrong with Bayesianism that a little common sense and
a dose of externalist epistemology won’t cure.
• Nothing in the Bayesian idea requires an “anything goes” subjectivism.
As Ramsey (1926) noted, not all rational (= coherent) beliefs are equally
reasonable, since not all represent the world accurately, not all accord
well with observed frequencies, and not all are generated by reliable
belief-forming mechanisms.
• Remember: Subjective ≠ Inaccurate (Also, Objective ≠ Accurate)
We can evaluate and criticize degrees of belief on the basis of features other
than probabilistic coherence, thereby pursuing Ramsey’s “idea of human
logic which shall not attempt to be reducible to formal logic.” (1926, p. 193)
Steps Toward a Theory of Reasonableness for Priors
• Ramsey’s idea: we can evaluate and criticize priors on the basis of the
reliability of the belief-forming processes that generated them.
• “Empirical” Bayes Methods. In cases (involving hierarchical models) where the
posterior depends on certain parameters about which we have vague or
unreliable priors, we can sometimes let the data fix the relevant aspects of our priors.
• “Calibrated” Bayes Methods. Use frequentist methods to settle on the right
probability model (likelihoods and priors); use Bayesian methods for inference,
estimation, and hypothesis testing.
• There are problems with doing this “internally” (using, e.g., the method of Rubin
(1984)), but it is useful for “externalistically” motivated interventions.
• Sometimes the best thing to do, from the perspective of accuracy, is to junk one’s
prior and start fresh by treating all one’s empirical evidence as constraints.
• Joyce (1998, 2007). Nothing prevents us from evaluating priors on the basis
of overall accuracy, as well as other epistemically desirable characteristics.
Calibrated Bayes
“Bayesian statistics is strong for inference under an assumed model, but relatively weak for
the development and assessment of models. Frequentist statistics is a useful tool for model
development and assessment, but a weak tool for inference under an assumed model… the
natural compromise is to use frequentist methods for model development and assessment,
and Bayesian methods for inference under a model. This capitalizes on the strengths of
both paradigms.” Little (2005)
“Sampling theory is needed for exploration and ultimate criticism of the entertained model in
the light of the current data, while Bayes’ theory is needed for estimation of parameters
conditional on adequacy of the model.” Box (1980)
“Bayesianism, like classical Logic, is a system for keeping one’s internal beliefs
self-consistent. Neither theory is concerned with whether those beliefs are in any sense
‘true’ beliefs about the real world… there is a need for both approaches.” Dawid (1982)
“The applied statistician should be Bayesian in principle and calibrated to the real world in
practice – appropriate frequency calculations help to define such a tie… [such] calculations
are useful for making Bayesian statements scientific in the sense of capable of being shown
wrong by empirical test; here the technique is the calibration of Bayesian probabilities to the
frequencies of actual events.” Rubin (1984)
Assessing the Accuracy of Degrees of Belief
Question. Given a partition of hypotheses X = X1, X2,…, XN, how does one
assess the accuracy of the degrees of belief c = c1, c2,…, cN when the
truth-values are given by v = v1, v2,…, vN? (Note: c need not be coherent.)
An Answer (Joyce, 1998, 2007).
1. Alethic View of Degrees of Belief. Each cn is the believer’s ‘estimate’ of vn.
Such estimates are assessed on a ‘gradational’ or ‘closeness counts’ scale.
Note: For coherent believers ‘estimation’ = expectation. I claim it makes sense
to speak of estimation (of truth-values) for incoherent believers as well.
2. Accuracy for truth-value estimates is measured using scoring rules that obey
certain epistemologically motivated requirements.
• A scoring rule takes each (c, v) pair to a real number S(c, v) that
measures the inaccuracy of c’s estimates of the truth-values in v.
• S(1 − v, v) ≥ S(c, v) ≥ S(v, v) = 0 for all c.
Some Examples
Additive: S(c, v) = ∑n λn·sn(cn, vn), where each sn(c, v) gives the inaccuracy of
c as an estimate of the truth-value of Xn on a scale that decreases/increases in c
when v is 1/0, and where the weights λn (∑n λn = 1 and λn > 0) reflect the degree
to which the accuracy of credences for Xn matters to overall accuracy.
Extensional: sn(c, v) = sk(c, v) and λn = λk for all n and k, c ∈ ℝ and v ∈ {0, 1}.
Absolute(c, v) = ∑n |vn − cn|/N; s(c, 1) = 1 − c and s(c, 0) = c
Brier(c, v) = ∑n (vn − cn)²/N; s(c, 1) = (1 − c)² and s(c, 0) = c²
Lp(c, v) = (1/N)[∑n |vn − cn|^p]^(1/p) (not additive)
Powerp(c, v): s(c, 1) = 1 − [p·c^(p−1) − (p − 1)·c^p] and s(c, 0) = (p − 1)·c^p
Spherical(c, v): s(c, 1) = 1 − [c/(c² + (1 − c)²)^(1/2)] and s(c, 0) = 1 − [(1 − c)/(c² + (1 − c)²)^(1/2)]
Log(c, v): s(c, 1) = −ln(c) and s(c, 0) = −ln(1 − c)
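A sketch implementing three of these component scores (each gives a perfect estimate, c = v, inaccuracy 0):

```python
import numpy as np

# Component inaccuracy scores s(c, v) for a single proposition; the v = 0
# case of each rule below is its v = 1 case applied to 1 - c.
def brier(c, v):     return (v - c) ** 2
def log_score(c, v): return -np.log(c if v == 1 else 1 - c)
def spherical(c, v):
    norm = np.sqrt(c ** 2 + (1 - c) ** 2)
    return 1 - (c / norm if v == 1 else (1 - c) / norm)

for s in (brier, log_score, spherical):
    print(s.__name__, [round(float(s(c, 1)), 3) for c in (0.1, 0.5, 0.9, 1.0)])
    # inaccuracy falls to 0 as the credence in a truth rises to 1
```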
3. The following can be given epistemologically compelling motivations:
• S is Truth-Directed. If c’s truth-value estimates are uniformly closer than
those of b to the truth-values in v, then b is less accurate than c is at v.
• S is Strictly Proper. If c is coherent, then its own expected inaccuracy is less
than that of any other credence function b (coherent or incoherent), so that
∑n c(Xn)S(b, vn) > ∑n c(Xn)S(c, vn),
where vn is the truth-value assignment in which Xn = 1 and all other Xm = 0.
Note: this rules out Absolute and Lp. (See the numeric checks below.)
• S is Convex. If d is an even mixture of c and b, so that dn = (cn + bn)/2 for
every n, then d’s inaccuracy is smaller than the average of the inaccuracies
of c and b, so that ½S(c, v) + ½S(b, v) > S((½c + ½b), v).
Note: this rules out Spherical.
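Numeric checks of both notes, as a sketch (two-cell partition):

```python
import numpy as np

# Expected inaccuracy of credence b by the lights of coherent c = (c, 1 - c).
def expected(score, c, b):
    return c * score(b, 1) + (1 - c) * score(b, 0)

brier = lambda b, v: (v - b) ** 2
absolute = lambda b, v: abs(v - b)

c = 0.7
bs = np.linspace(0.01, 0.99, 99)
print(bs[np.argmin([expected(brier, c, b) for b in bs])])     # ~0.70: proper
print(bs[np.argmin([expected(absolute, c, b) for b in bs])])  # 0.99: improper;
# Absolute is minimized by pushing the credence to an extreme, not to c.

# Spherical fails convexity in places: at v = 1 the even mixture of 0.1 and
# 0.3 is *more* inaccurate than the average of their inaccuracies.
sph = lambda b, v: 1 - ((b if v == 1 else 1 - b) / np.sqrt(b**2 + (1 - b)**2))
print(sph(0.2, 1), 0.5 * sph(0.1, 1) + 0.5 * sph(0.3, 1))  # ~0.7575 > ~0.7478
```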
Why Should S be Strictly Proper?
• A coherent person uses expectations based on her subjective probabilities
to make estimates. If S measures epistemic inaccuracy and c is her prior, then
Ŝc(b) = ∑n c(Xn)S(b, vn) is her estimate of b’s overall inaccuracy (where vn is the
truth-value assignment for which vn(Xn) = 1).
• If S is not strictly proper, then for some coherent c there is a b ≠ c with
Ŝc(c) ≥ Ŝc(b). A person with beliefs c will, by her own estimation, regard the
beliefs b as providing at least as accurate a picture of the world as her own.
• Principle: A coherent believer cannot rationally hold a set of beliefs when
some alternative has an equally low (or lower) expected inaccuracy.
Note: If S is convex, we can always find such a b with Ŝc(c) > Ŝc(b).
• So, if S is not strictly proper, some coherent subjective probabilities are not
even potential states of rational belief.
• But, all coherent subjective probabilities are potential states of rational belief.
Why Should S be Convex?
• Convexity (at a point) encourages ‘Cliffordian’ conservatism by making the
accuracy costs of moving away from a truth greater than the benefits of moving
the same distance toward it, thereby placing greater emphasis on the
‘avoidance of error’ as opposed to the ‘pursuit of truth’ (at that point).
This makes it risky to change degrees of belief, and so discourages believers from
making such changes without being compelled by their evidence.
• Concavity fosters ‘Jamesian’ extremism by making the costs of moving
away from a truth smaller than the benefits of moving the same distance toward
it, thereby emphasizing the ‘pursuit of truth’ over the ‘avoidance of error’.
This can encourage believers to alter their credences without corresponding
changes in their evidence.
• Flatness sets the accuracy costs of error and the benefits of believing the
truth equal, so that small changes in belief become a matter of indifference.
The Problem with Non-convex Scoring Rules.
Using a concave or flat measure of accuracy leads to an epistemology in which
the pursuit of accuracy is furthered, or at least not hindered, by the employment
of belief-forming or belief-altering processes that permit degrees of belief to vary
randomly and independently of the truth-values of the propositions believed.
This encourages changes of opinion that are inadequately tied to corresponding
changes in evidence: believers can make themselves better off, in terms of accuracy,
by ignoring evidence and letting their opinions be guided by random processes that
have nothing to do with the truth-value of the proposition believed.
Example: Your probabilities are m = (½c + ½b), where c and b may or may not
be coherent, and S(m, v) > ½S(c, v) + ½S(b, v). If (unbeknownst to you) the
truth-values are as described in v, then you are objectively better off, in terms of
accuracy, taking a pill that randomly shifts your subjective probabilities from m to
either b or c. (The pill makes your objective expected inaccuracy lower!)
It does not matter here whether S is strictly proper. For then your reason for refraining
from taking the pill is not that it causes belief-revisions that are only randomly
correlated with the truth, but that you are unsure whether or not such a process will
improve expected accuracy.
We don’t want believers in this situation.
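A sketch of the pill with a toy concave rule, s(c, v) = |v − c|^(1/2) (hypothetical numbers):

```python
# The pill, numerically, under the concave rule s(c, v) = |v - c| ** 0.5.
# Unknown to you, v = 1; your credence m is the even mixture of b and c.
s = lambda cred, v: abs(v - cred) ** 0.5
b, c = 0.2, 0.8
m = 0.5 * (b + c)

print(s(m, 1))                        # your inaccuracy standing pat: ~0.707
print(0.5 * s(b, 1) + 0.5 * s(c, 1))  # the 50/50 pill's expectation: ~0.671
# Randomizing away from m lowers objective expected inaccuracy.
```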
Prospects for a Non-pragmatic Vindication of Coherence
(Joyce 2007) shows that if S is truth-directed, strictly proper, and convex, then:
1. For any incoherent credence function b there is a coherent credence function
c that is strictly more accurate than b under every logically possible
assignment of truth-values.
2. No coherent credence function c is accuracy-dominated in this way by any
other credence function b, whether coherent or incoherent.
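A numeric sketch of result 1, assuming the Brier score and a two-cell partition: an incoherent b is beaten, at every truth-value assignment, by its projection onto the coherent credences.

```python
import numpy as np

# Accuracy dominance under the Brier score: the incoherent b = (0.8, 0.5)
# sums to 1.3; its Euclidean projection c onto c1 + c2 = 1 dominates it.
def brier(cred, v):
    return float(np.mean((np.asarray(v) - np.asarray(cred)) ** 2))

b = np.array([0.8, 0.5])
c = b - (b.sum() - 1) / 2  # projection onto the simplex: (0.65, 0.35)

for v in ([1, 0], [0, 1]):  # every logically possible truth-value assignment
    assert brier(c, v) < brier(b, v)
    print(v, round(brier(b, v), 4), ">", round(brier(c, v), 4))
```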
Further, it is argued that this provides a new way of justifying the requirement of
probabilistic coherence for degrees of belief, one that does not rely on “Dutch
book” arguments or representation theorems.
This sort of criticism of degrees of belief, based as it is on considerations of
accuracy, should extend to properties other than coherence and incoherence.
It will help us show that Bayesianism is OK for science as well as business!
Notes
[1] This terminology is due to Isaac Levi.
[2] With c(E|X) and c(E|~X) given any values, one can obtain any 0 ≤ p ≤ 1 as a value for
c(X|E) by setting c(X) = pc(E|~X)/[(1 − p)c(E|X) + pc(E|~X)].
[3] More generally, Jeffreys showed this: Assume, for simplicity, that the range of x is the
interval [2, 3]. Let t(x) = t be any differentiable increasing function of x. If our
probability for x is p(x) = 1/(xk), where k = ∫[2,3] (1/x) dx = ln(3) − ln(2), and our
probability for t is q(t) = [p(t⁻¹(t))·dt⁻¹(t)/dt]/n, where n = ∫[t(2),t(3)] p(t⁻¹(t))·(dt⁻¹(t)/dt) dt,
then for any [a, b] ⊆ [2, 3] one will have
p(x ∈ [a, b]) = ∫[a,b] p(x) dx = ∫[t(a),t(b)] q(t) dt = q(t ∈ [t(a), t(b)]).
Special case (change of distance scale): t(x) = ux for u > 0.
[4] When we start with x and think of Henri’s “real distance” as a transformation of x, t(x) =
ln(x), our invariant probabilities are p(x) = 1/(x[ln(3) − ln(2)]) and q(t) = 1/[ln(3) − ln(2)].
And we get p(x ∈ [a, b]) = [ln(b) − ln(a)]/[ln(3) − ln(2)] = q(t ∈ [ln(a), ln(b)]).
But Henri, who starts with t and treats meters as x(t) = e^t, ends up with the different
probabilities p*(x) = 1/(x·ln(x)·[ln(ln(3)) − ln(ln(2))]) and q*(t) = 1/(t[ln(ln(3)) − ln(ln(2))]).
And p*(x ∈ [a, b]) = [ln(ln(b)) − ln(ln(a))]/[ln(ln(3)) − ln(ln(2))] = q*(t ∈ [t(a), t(b)]).
References
Bernardo, J. and Smith, A. F. M. (1994) Bayesian Theory.
Box, G. E. P. (1980) “Sampling and Bayes Inference in Scientific Modelling and Robustness,”
Journal of the Royal Statistical Society, Series A 143: 383-430.
Dawid, A. P. (1982) “The Well-Calibrated Bayesian,” Journal of the American Statistical
Association 77: 605-610.
de Finetti, B. (1974) Theory of Probability.
Edwards, W., Lindman, H. and Savage, L. J. (1963) “Bayesian Statistical Inference for
Psychological Research,” Psychological Review 70: 193-242.
Jaynes, E. T. (1968) “Prior Probabilities,” IEEE Transactions on Systems Science and
Cybernetics SSC-4: 227.
Jaynes, E. T. (1973) “The Well-Posed Problem,” Foundations of Physics 3: 477; reprinted in
Jaynes (1983).
Jefferys, W. (2003) Journal of Scientific Exploration 17(3): 537-542.
Jeffreys, H. (1961) Theory of Probability.
Joyce, J. (1998) “A Nonpragmatic Vindication of Probabilism,” Philosophy of Science 65: 575-603.
Little, R. (2005) “Calibrated Bayes: A Bayes/Frequentist Roadmap,”
http://sitemaker.umich.edu/rlittle/files/roadmap.pdf
Ramsey, F. P. (1926) “Truth and Probability,” in The Foundations of Mathematics and Other
Logical Essays (1931).
Rosenkrantz, R. (2006) “Bayesianism,” in Sarkar and Pfeifer, eds., The Philosophy of Science:
An Encyclopedia (Routledge): 41-60.
Rubin, D. (1984) “Bayesianly Justifiable and Relevant Frequency Calculations for the Applied
Statistician,” Annals of Statistics 12: 1151-1172.
Seidenfeld, T. (1986) “Entropy and Uncertainty,” Philosophy of Science 53: 467-491.
Sober, E. (2002) “Bayesianism—Its Scope and Limits,” in R. Swinburne, ed., Bayes’s Theorem
(Oxford University Press): 21-38.