Rationality and Reasonableness,
a Ramseyian Distinction
Jim Joyce
Department of Philosophy
University of Michigan
jjoyce@umich.edu
Pitt
October 13, 2006
Minimal Bayesianism
• Beliefs come in varying gradations of strength.
A believer’s opinions at a time can be faithfully modeled by a family of
functions C that map propositions/events into [0, 1]. The person is more
confident in X than in Y iff c(X) > c(Y) for all c ∈ C. C is her credal state.[1]
• Rational degrees of belief are governed by the laws of probability.
Probabilistic Coherence. Every credence function c ∈ C is such that
c(T) = 1, c(⊥) = 0, c(X ∨ Y) + c(X ∧ Y) = c(X) + c(Y).
• Learning proceeds by Bayesian Updating.
A person in state C who learns new information E (and no more) should
revise her beliefs by conditioning in accordance with Bayes’ Theorem, so
that her new credal state is C_E = {c(·|E) = c(· & E)/c(E) : c ∈ C}.
Bayes’ Theorem: c(H|E) = [c(H)c(E|H)]/[c(H)c(E|H) + c(~H)c(E|~H)]
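A minimal numeric sketch of the update rule for a single hypothesis H (all numbers hypothetical):

```python
# Bayesian updating on a single hypothesis H, with hypothetical numbers:
# prior c(H) = 0.3, likelihoods c(E|H) = 0.9 and c(E|~H) = 0.2.
def update(prior, like_h, like_not_h):
    """Posterior c(H|E) via Bayes' Theorem."""
    return (prior * like_h) / (prior * like_h + (1 - prior) * like_not_h)

print(update(0.3, 0.9, 0.2))  # c(H|E) ~= 0.659
```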
The Problem of the Priors
According to Bayes’ theorem, a believer’s response to evidence E depends on both
(a) her “prior” degrees of belief in various hypotheses, c(X), on the basis of the
evidence she has before learning E, and (b) “likelihoods” c(E|X), which reflect E’s
predictability given that various hypotheses obtain.
Common Claim (often made by non-Bayesians): Likelihoods represent “objective”
features of the situation; priors represent the individual’s “subjective” contribution.
Note: The “objective” character of likelihoods is suspect, at least for composite
hypotheses that can be partitioned into disjoint parts X = Y ∨ Z, each of which has a
determinate likelihood. In this case we seem to have dependence on priors via
c(E|X) = c(Y|X)c(E|Y) + c(Z|X)c(E|Z), where c(Y|X) = c(Y)/[c(Y) + c(Z)].
The Problem: Even if likelihoods are handed down from on high, the range of
allowable responses to evidence, as reflected by posterior probabilities, remains
wide due to variation in priors.[2] Bayesian statistical reasoning seems infected with
a rampant subjectivism that permits almost any response to any evidence.
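To see how wide the range is, the formula in note [2] can be run in reverse: holding the likelihoods fixed, a suitable prior reaches any target posterior. A sketch (the likelihood values are hypothetical):

```python
# Note [2] in reverse: with likelihoods fixed, any posterior p in (0, 1) is
# reachable by choosing the right prior. The likelihoods are hypothetical.
like_h, like_not_h = 0.9, 0.2

def prior_for(p):
    """The prior c(X) that makes c(X|E) = p, per note [2]."""
    return p * like_not_h / ((1 - p) * like_h + p * like_not_h)

def posterior(prior):
    return prior * like_h / (prior * like_h + (1 - prior) * like_not_h)

for p in (0.01, 0.5, 0.99):
    print(p, round(posterior(prior_for(p)), 10))  # recovers each target p
```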
Response-1: Personalism (Savage, de Finetti)
Personalists suggest that we learn to live with subjectivism in statistical reasoning.
While degrees of belief must be coherent, there are no further constraints on
rational opinion, and so no legitimate epistemological basis on which to
criticize believers who are obeying the laws of probability.
Useful Picture: The prior evidence is a set of constraints that fix expected
values, variances, and so on, for certain quantities. Let A be the set of all
probability functions that satisfy the constraints.
(Caution: Do not think of constraints as propositions to be conditioned upon!)
Personalism says that any subset of A is a legitimate credal state.
To sweeten the relativist pill personalists offer intersubjective agreement as a
surrogate for impersonal objectivity.
• Washing out results.
• Sensitivity analyses.
• Dominant likelihoods.
• Shared views about incremental evidence despite differences in
posterior probabilities (e.g., h-d confirmation, likelihood ratios).
Response 2: “Objective” Bayesianism
“The most elementary requirement of consistency demands that two
persons with the same relevant prior information should assign the same
prior probabilities. Personalistic doctrine makes no attempt to meet this
requirement.” E.T. Jaynes
Objective Bayesians single out certain subsets of A as legitimate credal states
given the evidence, allegedly on the basis of epistemological considerations.
• They see the central problem of inductive/statistical reasoning as that of
choosing some single prior from A that best represents our uncertainty given
the objective information in the constraints.
• This “informationless” prior is portrayed as the one that “goes least beyond
the evidence” and “treats symmetrical cases equally.”
The Principle of Indifference.
If prior data fails to distinguish between two events, if it provides no ‘sufficient
reason’ to regard one as more probable than the other, then the events should be
assigned the same probability.
E.g., if you only know that you lost your keys somewhere between work and home, you
should assign equal probability to finding them at every point along your route.
Standard Objection (Venn): Results obtained from PI depend on the way
possibilities are described (e.g., as values of x or as values of x²).
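A numeric sketch of the objection (hypothetical setup: a quantity x known only to lie in [0, 1]):

```python
import numpy as np

# Venn's objection, numerically: indifference over x and indifference over
# y = x**2 are different priors and answer the same question differently.
rng = np.random.default_rng(0)
x_uniform = rng.uniform(0, 1, 10**6)          # PI applied to x
x_from_y = np.sqrt(rng.uniform(0, 1, 10**6))  # PI applied to x**2 instead

print((x_uniform <= 0.5).mean())  # ~0.50: P(x <= 1/2) with x uniform
print((x_from_y <= 0.5).mean())   # ~0.25: same event with x**2 uniform
```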
Responses: (1) Find some privileged description of possibilities.
This always seems to require empirical information that goes
beyond the constraints and thus invokes “personal” probabilities!
(2) Augment PI with symmetry principles.
Symmetries to the Rescue?
Translation Invariance. It can be reasonable to think that priors should not
depend on the zero-point or the unit used to measure a quantity of interest x.
• For zero-invariance c must derive from a probability density p such that
p(x)dx = p(x + z)dx for each z. So, p must be uniform.
(E.g., find the prior for the mean of a normal distribution of known variance.)
• For unit-invariance c must derive from a density with p(x)dx = p(ux)u dx for
each u > 0. So, p(x) must be proportional to 1/x or, equivalently, p(log(x))
must be uniform. This is Jeffreys’ prior.[3]
(E.g., find the prior for the variance of a normal distribution of known mean.)
• For more general symmetry groups: Haar measures.
Appeals to such symmetry requirements can seem to solve Venn’s problem.
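A numeric check of the unit-invariance claim, as a sketch (the [2, 3] range is borrowed from note [3]):

```python
import numpy as np
from scipy.integrate import quad

# The Jeffreys prior p(x) ~ 1/x assigns the same probability to an interval
# however the unit is rescaled: over [2u, 3u] the density keeps the 1/x form
# and the same normalizing constant k = ln(3) - ln(2).
k = np.log(3) - np.log(2)

def prob(a, b, u=1.0):
    """P(x in [u*a, u*b]) under p(x) = 1/(x*k) on [2u, 3u]."""
    return quad(lambda x: 1.0 / (x * k), u * a, u * b)[0]

print(prob(2.2, 2.7))         # in meters
print(prob(2.2, 2.7, u=100))  # in centimeters: identical probability
```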
“The use of Jeffreys priors realizes R. A. Fisher’s ideal of ‘allowing the data
to speak for themselves’.” Rosenkrantz (2006)
Here is William Jefferys, a Bayesian astronomer, commenting on a use of Venn’s
argument by Elliott Sober in which x = length in meters and x² = area in square meters.
“A satisfactory objective Bayesian solution has been known since the time of
Harold Jeffreys in the 1930s… I don’t know of any experienced Bayesian
who would want to put a uniform prior on either length or area if nothing else
were known…. If one is considering quantities like length and area, the
natural prior is the Haar prior that is invariant under the natural invariance
group of scale changes… When one takes this constraint into account, one
arrives at a prior that is inversely proportional to the length or the area,
respectively. An easy calculation shows that such a prior gives identical
results, regardless of whether one decides to look at length or area.”
W. Jefferys (2003)
The problem is the emphasized phrase.
This approach is only feasible if one already knows that probabilities depend
on distances measured in (some transformation of) meters, or on areas
measured in (some transformation of) meters squared, and so on. But, this
is a substantive empirical assumption.
Example: Crazy Henri thinks the universe is like the disk-world of Poincaré’s
Science and Hypothesis, and is therefore convinced that meter sticks shrink
logarithmically in the direction of measurement.
• Probabilities, Henri says, depend on “real distance” = log(distance x
in meters) in just the way we think they depend on distance in meters.
• When he applies Jeffreys’ rule he obtains a prior that is uniform over
log(log(x)), not over log(x)!
• Henri’s probabilities are just as “scale-invariant” as ours.[4]
Henri may be mistaken, but his error is not an a priori one. Henri looks at
the same evidence we do, but interprets it differently because he divides up
the possibilities differently than we do.
To rule him out it looks like we need personal probabilities!
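A numeric sketch of the disagreement (the densities come from note [4]):

```python
import numpy as np

# Note [4] in numbers: our log-uniform prior and Henri's log-log-uniform
# prior over x in [2, 3] assign different probabilities to the same interval.
a, b = 2.0, 2.5
ours = (np.log(b) - np.log(a)) / (np.log(3) - np.log(2))
henri = ((np.log(np.log(b)) - np.log(np.log(a)))
         / (np.log(np.log(3)) - np.log(np.log(2))))

print(round(ours, 3))   # ~0.550: uniform over log(x)
print(round(henri, 3))  # ~0.606: uniform over log(log(x))
```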
Entropy Maximization as a Way Out?
Jaynes. When each constraint has the form ∑j c(Xj)f(Xj) = constant, where
{X1,…, Xk} is a partition (common to all constraints) and f is a convex real
function, Jaynes advises us to choose the unique member of A that maximizes
Entropy(c) = −∑j c(Xj)log(c(Xj)),
thereby minimizing the additional information (about the Xj) that c
encapsulates. (See the sketch below.)
• Does this solve the Venn problem? Not really. See Seidenfeld (1986).
• Additional issue. It matters whether a piece of information is treated as a
constraint on a prior or as data to be conditioned on. The probability obtained
by imposing two constraints can differ from the probability obtained by imposing
one constraint and then conditioning on the information in the other. (Same cite.)
Note: Jaynes maintains that there is a clear distinction between constraints and
items of data in well-posed problems. There may be such a distinction, but it cannot be
drawn a priori in the way Jaynes’ views require.
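A sketch of MAXENT under a single mean-value constraint, using Jaynes’ familiar dice illustration (all we are told is that the average toss is 4.5):

```python
import numpy as np
from scipy.optimize import minimize

# MAXENT for a six-sided die constrained to have mean 4.5 (Jaynes' example).
faces = np.arange(1, 7)

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))  # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},    # probabilities sum to 1
    {"type": "eq", "fun": lambda p: faces @ p - 4.5},  # the evidential constraint
]
res = minimize(neg_entropy, np.full(6, 1 / 6), bounds=[(0, 1)] * 6,
               constraints=constraints)
print(np.round(res.x, 4))  # skews toward high faces: an exponential-family prior
```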
The Real Problem with OB (Fisher)
PI and MAXENT, whether consistent or not, are defective epistemology because
they treat ignorance as if it is knowledge.
Someone who knows only that a die has six sides is treated as being on an epistemic
par, as far as predicting the next toss is concerned, with a person who also knows the
die to be fair.
General Moral. It is an error to try to capture states of ambiguous or incomplete
evidence using a single prior. Going from a set of priors A that meet all the
evidential constraints to a single “informationless” prior in A always involves
importing loads of information into the problem.
(Note: Adding information can be OK, but not under the guise of ‘logic’.)
• Reply. Statistical reasoning can’t get off the ground without a prior to work
with, and the MAXENT prior adds the least new information because, e.g., it
treats equally the possibilities that the data does not distinguish, by assigning
them equal probabilities.
I concede that there is a sense in which the MAXENT prior adds the least new information
of any sharp prior, but this is consistent with it adding a lot of new information.
• The fallacious step is the last one: equal treatment does not require equal
probability. Symmetries in evidence are naturally captured, and best
characterized, by symmetries among elements of A.
– E.g., to represent the idea that a prior is independent of the unit of
measure one should not look for a single prior with p(x)dx = p(ux)u dx.
One should, rather, notice that for each c ∈ A and u > 0 there is a cu ∈ A
such that c(x) = cu(ux) for all x. (See the sketch below.)
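A sketch of the contrast, with a hypothetical credal set A made up of Gamma densities: no single member is scale-invariant, but the set is closed under changes of unit.

```python
import numpy as np
from scipy import stats

# A hypothetical credal set A: all Gamma(alpha, scale) densities over x > 0.
# Rescaling the unit (x -> u*x) carries Gamma(alpha, s) to Gamma(alpha, u*s),
# another member of A. The symmetry lives in A, not in any one prior.
u, alpha, s = 2.5, 3.0, 1.7
x = np.linspace(0.1, 10, 5)
c = stats.gamma(alpha, scale=s)        # some c in A
c_u = stats.gamma(alpha, scale=u * s)  # its image under the unit change

# c assigns [a, b] the probability c_u assigns [u*a, u*b] (density version):
print(np.allclose(c.pdf(x), u * c_u.pdf(u * x)))  # True
```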
Response 3: Don’t Ask Don’t Tell
“Subjective priors do not have any probative force… If science is about the objective
public evaluation of hypotheses, these subjective feelings do not have any scientific
standing. When scientists read research papers, they want information about the
phenomena under study, not autobiographical remarks about the authors of the study.
A report of the authors’ subjective probabilities blends these two inputs together. This
is why it would be better to expunge the subjective element and let the objective
likelihoods speak for themselves… I am not suggesting that we should avoid
Bayesian thinking in the privacy of our own homes. If you have a subjective degree
of belief in a hypothesis by all means use Bayes’ theorem to update [it] as you
obtain new evidence… [but] disagreement[s] cannot be resolved by pointing to the
fact that different agents have different priors.” Sober (2002, pp. 23-24)
Same idea: Statisticians should strive to present results in ways that do not
incorporate priors (using p-values, confidence intervals, and so on). It’s fine if a
‘client’ wants to process the data through his or her own priors, but that’s not
the statistician’s business.
M. Woodroofe: Bayesianism is OK for business, but does it belong in science?
Should Bayesianism be Jettisoned?
No. There is nothing wrong with Bayesianism that a little common sense and
a dose of externalist epistemology won’t cure.
• Nothing in the Bayesian idea requires an “anything goes” subjectivism.
As Ramsey (1926) noted, not all rational (= coherent) beliefs are equally
reasonable, since not all represent the world accurately, not all accord
well with observed frequencies, and not all are generated by reliable
belief-forming mechanisms.
• Remember: Subjective ≠ Inaccurate (Also, Objective ≠ Accurate)
We can evaluate and criticize degrees of belief on the basis of features other
than probabilistic coherence, thereby pursuing Ramsey’s “idea of human
logic which shall not attempt to be reducible to formal logic.” (1926, p. 193)
Steps Toward a Theory of Reasonableness for Priors
• Ramsey’s idea: we can evaluate and criticize priors on the basis of the
reliability of the belief-forming processes that generated them.
• “Empirical” Bayes Methods. In cases (involving hierarchical models) where the
posterior depends on certain parameters about which we have vague or
unreliable priors, we can sometimes let the data fix the relevant aspects of our priors.
• “Calibrated” Bayes Methods. Use frequentist methods to settle on the right
probability model (likelihoods and priors); use Bayesian methods for inference,
estimation, and hypothesis testing.
• There are problems with doing this “internally” (using, e.g., the method of Rubin
(1984)), but it is useful for “externalistically” motivated interventions.
• Sometimes the best thing to do, from the perspective of accuracy, is to junk one’s
prior and start fresh by treating all one’s empirical evidence as constraints.
• Joyce (1998, 2007). Nothing prevents us from evaluating priors on the basis
of overall accuracy, as well as other epistemically desirable characteristics.
Calibrated Bayes
“Bayesian statistics is strong for inference under an assumed model, but relatively weak for
the development and assessment of models. Frequentist statistics is a useful tool for model
development and assessment, but a weak tool for inference under an assumed model… the
natural compromise is to use frequentist methods for model development and assessment,
and Bayesian methods for inference under a model. This capitalizes on the strengths of
both paradigms.” Little (2005)
“Sampling theory is needed for exploration and ultimate criticism of the entertained model in
the light of the current data, while Bayes’ theory is needed for estimation of parameters
conditional on adequacy of the model.” Box (1980)
“Bayesianism, like classical Logic, is a system for keeping one’s internal beliefs
self-consistent. Neither theory is concerned with whether those beliefs are in any sense
‘true’ beliefs about the real world… there is a need for both approaches.” Dawid (1982)
“The applied statistician should be Bayesian in principle and calibrated to the real world in
practice – appropriate frequency calculations help to define such a tie… [such] calculations
are useful for making Bayesian statements scientific in the sense of capable of being shown
wrong by empirical test; here the technique is the calibration of Bayesian probabilities to the
frequencies of actual events.” Rubin (1984)
Assessing the Accuracy of Degrees of Belief
Question. Given a partition of hypotheses X = X1, X2,…, XN, how does one
assess the accuracy of the degrees of belief c = c1, c2,…, cN when the
truth-values are given by v = v1, v2,…, vN? (Note: c need not be coherent.)
An Answer (Joyce, 1998, 2007).
1. Alethic View of Degrees of Belief. Each cn is the believer’s ‘estimate’ of vn.
Such estimates are assessed on a ‘gradational’ or ‘closeness counts’ scale.
Note: For coherent believers ‘estimation’ = expectation. I claim it makes sense
to speak of estimation (of truth-values) for incoherent believers as well.
2. Accuracy for truth-value estimates is measured using scoring rules that obey
certain epistemologically motivated requirements.
• A scoring rule takes each (c, v) pair to a real number S(c, v) that
measures the inaccuracy of c’s estimates of the truth-values in v.
• S(1 − v, v) ≥ S(c, v) ≥ S(v, v) = 0 for all c.
Some Examples
Additive: S(c, v) = ∑n λn·sn(cn, vn), where each sn(c, v) gives the inaccuracy of
c as an estimate of the truth-value of Xn on a scale that decreases/increases in c
when v is 1/0, and where the weights λn (∑n λn = 1 and λn > 0) reflect the degree
to which the accuracy of credences for Xn matters to overall accuracy.
Extensional: sn(c, v) = sk(c, v) and λn = λk for all n and k, c ∈ ℝ and v ∈ {0, 1}.
Absolute(c, v) = ∑n |vn − cn|/N; s(c, 1) = 1 − c and s(c, 0) = c
Brier(c, v) = ∑n (vn − cn)²/N; s(c, 1) = (1 − c)² and s(c, 0) = c²
Lp(c, v) = (1/N)[∑n |vn − cn|^p]^(1/p) (not additive)
Powerp(c, v): s(c, 1) = 1 − [p·c^(p−1) − (p − 1)·c^p] and s(c, 0) = (p − 1)·c^p
Spherical(c, v): s(c, 1) = 1 − [c/(c² + (1 − c)²)^(1/2)] and s(c, 0) = 1 − [(1 − c)/(c² + (1 − c)²)^(1/2)]
Log(c, v): s(c, 1) = −ln(c) and s(c, 0) = −ln(1 − c)
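A sketch implementing three of these component scores (each gives a perfect estimate, c = v, inaccuracy 0):

```python
import numpy as np

# Component inaccuracy scores s(c, v) for a single proposition; the v = 0
# case of each rule below is its v = 1 case applied to 1 - c.
def brier(c, v):     return (v - c) ** 2
def log_score(c, v): return -np.log(c if v == 1 else 1 - c)
def spherical(c, v):
    norm = np.sqrt(c ** 2 + (1 - c) ** 2)
    return 1 - (c / norm if v == 1 else (1 - c) / norm)

for s in (brier, log_score, spherical):
    print(s.__name__, [round(float(s(c, 1)), 3) for c in (0.1, 0.5, 0.9, 1.0)])
    # inaccuracy falls to 0 as the credence in a truth rises to 1
```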
3. The following can be given epistemologically compelling motivations:
• S is Truth-Directed. If c’s truth-value estimates are uniformly closer than
those of b to the truth-values in v, then b is less accurate than c is at v.
• S is Strictly Proper. If c is coherent, then its own expected inaccuracy is less
than that of any other credence function b (coherent or incoherent), so that
∑n c(Xn)S(b, vn) > ∑n c(Xn)S(c, vn),
where vn is the truth-value assignment in which Xn = 1 and all other Xm = 0.
Note: this rules out Absolute and Lp. (See the numeric checks below.)
• S is Convex. If d is an even mixture of c and b, so that dn = (cn + bn)/2 for
every n, then d’s inaccuracy is smaller than the average of the inaccuracies
of c and b, so that ½S(c, v) + ½S(b, v) > S((½c + ½b), v).
Note: this rules out Spherical.
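Numeric checks of both notes, as a sketch (two-cell partition):

```python
import numpy as np

# Expected inaccuracy of credence b by the lights of coherent c = (c, 1 - c).
def expected(score, c, b):
    return c * score(b, 1) + (1 - c) * score(b, 0)

brier = lambda b, v: (v - b) ** 2
absolute = lambda b, v: abs(v - b)

c = 0.7
bs = np.linspace(0.01, 0.99, 99)
print(bs[np.argmin([expected(brier, c, b) for b in bs])])     # ~0.70: proper
print(bs[np.argmin([expected(absolute, c, b) for b in bs])])  # 0.99: improper;
# Absolute is minimized by pushing the credence to an extreme, not to c.

# Spherical fails convexity in places: at v = 1 the even mixture of 0.1 and
# 0.3 is *more* inaccurate than the average of their inaccuracies.
sph = lambda b, v: 1 - ((b if v == 1 else 1 - b) / np.sqrt(b**2 + (1 - b)**2))
print(sph(0.2, 1), 0.5 * sph(0.1, 1) + 0.5 * sph(0.3, 1))  # ~0.7575 > ~0.7478
```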
Why Should S be Strictly Proper?
• A coherent person uses expectations based on her subjective probabilities
to make estimates. If S measures epistemic inaccuracy and c is her prior, then
Ŝc(b) = ∑n c(Xn)S(b, vn) is her estimate of b’s overall inaccuracy (where vn is the
truth-value assignment for which vn(Xn) = 1).
• If S is not strictly proper, then for some coherent c there is a b ≠ c with
Ŝc(c) ≥ Ŝc(b). A person with beliefs c will, by her own estimation, regard the
beliefs b as providing at least as accurate a picture of the world as her own.
• Principle: A coherent believer cannot rationally hold a set of beliefs when
some alternative has an equally low (or lower) expected inaccuracy.
Note: If S is convex, we can always find such a b with Ŝc(c) > Ŝc(b).
• So, if S is not strictly proper, some coherent subjective probabilities are not
even potential states of rational belief.
• But, all coherent subjective probabilities are potential states of rational belief.
Why Should S be Convex?
• Convexity (at a point) encourages ‘Cliffordian’ conservatism by making the
accuracy costs of moving away from a truth greater than the benefits of moving
the same distance toward it, thereby placing greater emphasis on the
‘avoidance of error’ as opposed to the ‘pursuit of truth’ (at that point).
This makes it risky to change degrees of belief, and so discourages believers from
making such changes without being compelled by their evidence.
• Concavity fosters ‘Jamesian’ extremism by making the costs of moving
away from a truth smaller than the benefits of moving the same distance toward
it, thereby emphasizing the ‘pursuit of truth’ over the ‘avoidance of error’.
This can encourage believers to alter their credences without corresponding
changes in their evidence.
• Flatness sets the accuracy costs of error and the benefits of believing the
truth equal, so that small changes in belief become a matter of indifference.
The Problem with Non-convex Scoring Rules.
Using a concave or flat measure of accuracy leads to an epistemology in which
the pursuit of accuracy is furthered, or at least not hindered, by the employment
of belief-forming or belief-altering processes that permit degrees of belief to vary
randomly and independently of the truth-values of the propositions believed.
This encourages changes of opinion that are inadequately tied to corresponding
changes in evidence: believers can make themselves better off, in terms of accuracy,
by ignoring evidence and letting their opinions be guided by random processes that
have nothing to do with the truth-value of the proposition believed.
Example: Your probabilities are m = (½c + ½b), where c and b may or may not
be coherent, and S(m, v) > ½S(c, v) + ½S(b, v). If (unbeknownst to you) the
truth-values are as described in v, then you are objectively better off, in terms of
accuracy, taking a pill that randomly shifts your subjective probabilities from m to
either b or c. (The pill makes your objective expected inaccuracy lower!)
It does not matter here whether S is strictly proper. For then your reason for refraining
from taking the pill is not that it causes belief-revisions that are only randomly
correlated with the truth, but that you are unsure whether or not such a process will
improve expected accuracy.
We don’t want believers in this situation.
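A sketch of the pill with a toy concave rule, s(c, v) = |v − c|^(1/2) (hypothetical numbers):

```python
# The pill, numerically, under the concave rule s(c, v) = |v - c| ** 0.5.
# Unknown to you, v = 1; your credence m is the even mixture of b and c.
s = lambda cred, v: abs(v - cred) ** 0.5
b, c = 0.2, 0.8
m = 0.5 * (b + c)

print(s(m, 1))                        # your inaccuracy standing pat: ~0.707
print(0.5 * s(b, 1) + 0.5 * s(c, 1))  # the 50/50 pill's expectation: ~0.671
# Randomizing away from m lowers objective expected inaccuracy.
```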
Prospects for a Non-pragmatic Vindication of Coherence
(Joyce 2007) shows that if S is truth-directed, strictly proper, and convex, then:
1. For any incoherent credence function b there is a coherent credence function
c that is strictly more accurate than b under every logically possible
assignment of truth-values.
2. No coherent credence function c is accuracy-dominated in this way by any
other credence function b, whether coherent or incoherent.
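A numeric sketch of result 1, assuming the Brier score and a two-cell partition: an incoherent b is beaten, at every truth-value assignment, by its projection onto the coherent credences.

```python
import numpy as np

# Accuracy dominance under the Brier score: the incoherent b = (0.8, 0.5)
# sums to 1.3; its Euclidean projection c onto c1 + c2 = 1 dominates it.
def brier(cred, v):
    return float(np.mean((np.asarray(v) - np.asarray(cred)) ** 2))

b = np.array([0.8, 0.5])
c = b - (b.sum() - 1) / 2  # projection onto the simplex: (0.65, 0.35)

for v in ([1, 0], [0, 1]):  # every logically possible truth-value assignment
    assert brier(c, v) < brier(b, v)
    print(v, round(brier(b, v), 4), ">", round(brier(c, v), 4))
```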
Further, it is argued that this provides a new way of justifying the requirement of
probabilistic coherence for degrees of belief, one that does not rely on “Dutch
book” arguments or representation theorems.
This sort of criticism of degrees of belief, based as it is on considerations of
accuracy, should extend to properties other than coherence and incoherence.
It will help us show that Bayesianism is OK for science as well as business!
Notes
[1] This terminology is due to Isaac Levi.
[2] With c(E|X) and c(E|~X) given any values, one can obtain any 0 ≤ p ≤ 1 as a value for
c(X|E) by setting c(X) = pc(E|~X)/[(1 − p)c(E|X) + pc(E|~X)].
[3] More generally, Jeffreys showed this: Assume, for simplicity, that the range of x is the
interval [2, 3]. Let t(x) = t be any differentiable increasing function of x. If our
probability for x is p(x) = 1/(xk), where k = ∫[2,3] (1/x) dx = ln(3) − ln(2), and our
probability for t is q(t) = [p(t⁻¹(t))·dt⁻¹(t)/dt]/n, where n = ∫[t(2),t(3)] p(t⁻¹(t))·(dt⁻¹(t)/dt) dt,
then for any [a, b] ⊆ [2, 3] one will have
p(x ∈ [a, b]) = ∫[a,b] p(x) dx = ∫[t(a),t(b)] q(t) dt = q(t ∈ [t(a), t(b)]).
Special case (change of distance scale): t(x) = ux for u > 0.
[4] When we start with x and think of Henri’s “real distance” as a transformation of x, t(x) =
ln(x), our invariant probabilities are p(x) = 1/(x[ln(3) − ln(2)]) and q(t) = 1/[ln(3) − ln(2)].
And we get p(x ∈ [a, b]) = [ln(b) − ln(a)]/[ln(3) − ln(2)] = q(t ∈ [ln(a), ln(b)]).
But Henri, who starts with t and treats meters as x(t) = e^t, ends up with the different
probabilities p*(x) = 1/(x·ln(x)·[ln(ln(3)) − ln(ln(2))]) and q*(t) = 1/(t[ln(ln(3)) − ln(ln(2))]).
And p*(x ∈ [a, b]) = [ln(ln(b)) − ln(ln(a))]/[ln(ln(3)) − ln(ln(2))] = q*(t ∈ [t(a), t(b)]).
References
Bernardo, J. and Smith, A. F. M. (1994) Bayesian Theory.
Box, G. E. P. (1980) “Sampling and Bayes Inference in Scientific Modelling and Robustness,”
Journal of the Royal Statistical Society, Series A 143: 383-430.
Dawid, A. P. (1982) “The Well-Calibrated Bayesian,” Journal of the American Statistical
Association 77: 605-610.
de Finetti, B. (1974) Theory of Probability.
Edwards, W., Lindman, H. and Savage, L. J. (1963) “Bayesian Statistical Inference for
Psychological Research,” Psychological Review 70: 193-242.
Jaynes, E. T. (1968) “Prior Probabilities,” IEEE Transactions on Systems Science and
Cybernetics SSC-4: 227.
Jaynes, E. T. (1973) “The Well-Posed Problem,” Foundations of Physics 3: 477; reprinted in
Jaynes (1983).
Jefferys, W. (2003) Journal of Scientific Exploration 17(3): 537-542.
Jeffreys, H. (1961) Theory of Probability.
Joyce, J. (1998) “A Nonpragmatic Vindication of Probabilism,” Philosophy of Science 65: 575-603.
Little, R. (2005) “Calibrated Bayes: A Bayes/Frequentist Roadmap,”
http://sitemaker.umich.edu/rlittle/files/roadmap.pdf
Ramsey, F. P. (1926) “Truth and Probability,” in The Foundations of Mathematics and Other
Logical Essays (1931).
Rosenkrantz, R. (2006) “Bayesianism,” in Sarkar and Pfeifer, eds., The Philosophy of Science:
An Encyclopedia (Routledge): 41-60.
Rubin, D. (1984) “Bayesianly Justifiable and Relevant Frequency Calculations for the Applied
Statistician,” Annals of Statistics 12: 1151-1172.
Seidenfeld, T. (1986) “Entropy and Uncertainty,” Philosophy of Science 53: 467-491.
Sober, E. (2002) “Bayesianism—Its Scope and Limits,” in R. Swinburne, ed., Bayes’s Theorem
(Oxford University Press): 21-38.