Statistics 550 Notes 2
Reading: Section 1.2.
I will add one office hour, Wed., 9-10.
Prof. Small’s office hours: Tues., 4:45-5:45; Wed., 9-10;
Thurs., 4:45-5:45.
TA Dan Yang’s office hours, 431.2 Huntsman Hall, Tues.,
2-3.
I. Frequentist vs. subjective probability (Section 1.2)
Model: We toss a coin 3 times. The tosses are iid Bernoulli
trials with probability p of landing heads.
What does p mean here?
Mathematical definition – A probability function for a
random experiment is a function on the sample space S of
possible outcomes of the experiment that satisfies:
Axiom 1: For all (measurable) events $E$: $0 \le P(E) \le 1$.
Axiom 2: $P(S) = 1$.
Axiom 3 (countable additivity): For any sequence of mutually exclusive (measurable) events $E_1, E_2, \ldots$,
$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i).$$
For a single coin toss, the axioms imply that $p$ must be between 0 and 1 and that the probability of the coin landing tails must be $1 - p$.
Meaning of probability as a model for the real world:
Frequentist probability – In “many” independent coin tosses, the proportion of heads would be about $p$.
• The French naturalist Count Buffon (1707-1788) tossed a coin 4040 times. Result: 2048 heads, or a relative frequency of 2048/4040 = 0.5069 for heads.
• Around 1900, the English statistician Karl Pearson heroically tossed a coin 24,000 times. Result: 12,012 heads, a relative frequency of 0.5005.
• While imprisoned by the Germans during World War II, the South African mathematician John Kerrich tossed a coin 10,000 times. Result: 5067 heads, a relative frequency of 0.5067.
Subjective (personal) probability – Probability describes a
person’s degree of belief about a statement.
“The probability that the coin lands heads is 0.5”
Subjective interpretation – This represents a person’s
personal degree of belief that a toss of the coin will land
heads.
Specifically, if offered the chance to make a bet in which the person will win $1 - p$ dollars if the coin lands heads and lose $p$ dollars if the coin lands tails, the largest $p$ at which the person would be willing to play the game is $p = 0.5$.
In general, a person’s subjective probability of an event $A$, $P(A)$, is the value of $p$ for which the person thinks a bet in which she will win $1 - p$ dollars if $A$ occurs and lose $p$ dollars if $A$ does not occur is a fair bet (i.e., has expected profit zero).
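As a check, the expected profit of such a bet under the person's beliefs is
$$(1 - p)\,P(A) - p\,\big(1 - P(A)\big) = P(A) - p,$$
which is zero exactly when $p = P(A)$.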
Subjective probability rejects the view of probability as a
physical feature of the world and interprets probability as a
statement about an individual’s state of knowledge.
“Coins don’t have probabilities, people have probabilities.”
– Persi Diaconis.
An advantage of subjective probability is that it can be applied to uncertain statements that cannot be envisioned as part of a sequence of repeated trials:
“Will the Philadelphia Eagles win the Super Bowl this
year?”
“Was there life on Mars 1 billion years ago?”
“Did Lee Harvey Oswald act alone in assassinating John F.
Kennedy?”
“What is the 353rd digit of $\pi$?”
If I say the probability that the 353rd digit of $\pi$ is 5 is 0.2, I mean that I would consider it a fair bet for me to gain 0.8 dollars if the 353rd digit of $\pi$ is 5 and lose 0.2 dollars if the 353rd digit of $\pi$ is not 5.
II. Coherence and the axioms of probability
A rational person should have a “coherent” system of
subjective probabilities: a system is said to be incoherent if
there exists some set of bets such that the bettor will take
the set of bets but will lose no matter what happens.
If a person’s system of probabilities is coherent, then they
must satisfy the axioms of probability.
Example:
Proposition 1: If $A$ and $B$ are mutually exclusive events, then $P(A \cup B) = P(A) + P(B)$.
Proof: Let $P(A) = p_1$, $P(B) = p_2$ and $P(A \cup B) = p_3$ be a person’s subjective probabilities for these events. Suppose that $p_3$ differs from $p_1 + p_2$. Then the person thinks that the following bets are fair: (i) a bet in which the person will win $1 - p_1$ dollars if $A$ occurs and lose $p_1$ dollars if $A$ does not occur; (ii) a bet in which the person will win $1 - p_2$ dollars if $B$ occurs and lose $p_2$ dollars if $B$ does not occur; and (iii) a bet in which the person will win $1 - p_3$ dollars if $A \cup B$ occurs and lose $p_3$ dollars if $A \cup B$ does not occur.
Say $p_3 < p_1 + p_2$, and let the difference be $d = (p_1 + p_2) - p_3$.
A gambler offers this person the following bets: (a) the person will lose $1 - p_3 - d/4$ dollars if $A \cup B$ occurs and win $p_3 + d/4$ dollars if $A \cup B$ does not occur; (b) the person will win $1 - p_1 + d/4$ dollars if $A$ occurs and lose $p_1 - d/4$ dollars if $A$ does not occur; (c) the person will win $1 - p_2 + d/4$ dollars if $B$ occurs and lose $p_2 - d/4$ dollars if $B$ does not occur.
According to the person’s subjective probabilities, bets (a)-(c) each have positive expected profit (of $d/4$), so the person takes all of them.
However,
(1) Suppose $A$ occurs. Then the person loses $1 - p_3 - d/4$ from bet (a), wins $1 - p_1 + d/4$ from bet (b) and loses $p_2 - d/4$ from bet (c). The person’s profit is
$$(1 - p_1 + d/4) - (1 - p_3 - d/4) - (p_2 - d/4) = p_3 + 3d/4 - p_1 - p_2 = (p_1 + p_2 - d) + 3d/4 - p_1 - p_2 = -d/4.$$
(2) Suppose $B$ occurs. The person’s profit is
$$(1 - p_2 + d/4) - (1 - p_3 - d/4) - (p_1 - d/4) = p_3 + 3d/4 - p_1 - p_2 = (p_1 + p_2 - d) + 3d/4 - p_1 - p_2 = -d/4.$$
(3) Suppose neither $A$ nor $B$ occurs. The person’s profit is
$$(p_3 + d/4) - (p_1 - d/4) - (p_2 - d/4) = p_3 - p_1 - p_2 + 3d/4 = (p_1 + p_2 - d) - p_1 - p_2 + 3d/4 = -d/4.$$
Thus, the gambler has put the person in a position in which the person is guaranteed to lose $d/4$ no matter what happens (when a person accepts a set of bets that guarantees that they will lose no matter what happens, it is said that a Dutch book has been set against them). So it is irrational for the person to assign $P(A \cup B) < P(A) + P(B)$.
A similar argument can be made that it is irrational to assign $P(A \cup B) > P(A) + P(B)$.
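A minimal numerical sketch of this Dutch book (the values of $p_1$, $p_2$, $p_3$ below are hypothetical, chosen only to be incoherent):

```python
# Sketch: verify that bets (a)-(c) above produce a sure loss of d/4.
p1, p2 = 0.3, 0.4        # hypothetical subjective P(A), P(B)
p3 = 0.6                 # hypothetical subjective P(A or B), incoherently less than p1 + p2
d = (p1 + p2) - p3       # amount of incoherence, d > 0

def profit(A_occurs, B_occurs):
    """Total profit from bets (a)-(c); A and B are mutually exclusive."""
    a_or_b = A_occurs or B_occurs
    bet_a = -(1 - p3 - d / 4) if a_or_b else (p3 + d / 4)
    bet_b = (1 - p1 + d / 4) if A_occurs else -(p1 - d / 4)
    bet_c = (1 - p2 + d / 4) if B_occurs else -(p2 - d / 4)
    return bet_a + bet_b + bet_c

# The person loses d/4 = 0.025 in every possible case.
print(profit(True, False), profit(False, True), profit(False, False))
```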
Similar arguments can be made that a rational person’s
subjective probabilities should satisfy the other axioms of
probability: (1) for an event $E$, $0 \le P(E) \le 1$; (2) $P(S) = 1$,
where S is the sample space.
Although, from Proposition 1, it is clear that a rational person’s personal probabilities should obey finite additivity (i.e., if $E_1, \ldots, E_n$ are mutually exclusive events, $P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$), there is some controversy about additivity for a countably infinite sequence of mutually exclusive events (see J. Williamson, “Countable Additivity and Subjective Probability,” The British Journal for the Philosophy of Science, 1999). We assume countable additivity holds for subjective probability.
The mathematical axioms of probability, and hence all
results in probability theory, hold for both subjective
probabilities and frequentist probabilities -- it is just a
matter of how we interpret the probabilities.
In particular, Bayes theorem holds for subjective
probability.
Let $A$ be an event and $B_1, \ldots, B_n$ be mutually exclusive and exhaustive events with $P(B_i) > 0$ for all $i$. Then
$$P(B_j \mid A) = \frac{P(A \mid B_j)\,P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}.$$
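A minimal sketch of applying this formula (the prior probabilities and likelihoods below are hypothetical):

```python
# Sketch: Bayes theorem for events B_1, ..., B_n that partition the sample space.
prior = [0.5, 0.3, 0.2]        # hypothetical P(B_i)
likelihood = [0.1, 0.4, 0.8]   # hypothetical P(A | B_i)

p_a = sum(l * q for l, q in zip(likelihood, prior))           # P(A), law of total probability
posterior = [l * q / p_a for l, q in zip(likelihood, prior)]  # P(B_i | A)
print(posterior)  # the posterior probabilities sum to 1
```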
III. The Bayesian Approach to Statistics
Example 1 from Notes 1: Yao Ming’s free throws in the
2007-2008 season and future seasons are iid Bernoulli trials
with probability p of success. In the 2007-2008 season,
Yao made 345 out of the 406 free throws he attempted
(85.0%). What inferences can we make about p ?
The Bayesian approach to statistics uses subjective probability: it regards $p$ as a fixed but unknown quantity about which we have beliefs, given by a subjective probability distribution, and then uses the data to update our beliefs via Bayes rule.
Prior distribution: Subjective probability distribution about the parameter vector ($p$) before seeing any data.
Posterior distribution: Updated subjective probability distribution about the parameter vector ($p$) after seeing the data.
Bernoulli trials example: Suppose that $X_1, \ldots, X_n$ are iid Bernoulli with probability of success $p$.
We want our prior distribution for p to be a distribution
that concentrates on the interval [0,1]. A class of
distributions that concentrates on [0,1] is the two-parameter
beta family.
The beta family of distributions: The density function $f$ of the beta distribution with parameters $r$ and $s$ is
$$f(x) = \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\, x^{r-1} (1-x)^{s-1}, \quad 0 \le x \le 1.$$
The mean and variance of the Beta($r$,$s$) distribution are
$$\frac{r}{r+s} \quad \text{and} \quad \frac{rs}{(r+s+1)(r+s)^2},$$
respectively. See Appendix B.2 for details.
Suppose our prior distribution for $p$ is Beta($r$,$s$) with density $\pi(p)$ and we observe $X_1 = x_1, \ldots, X_n = x_n$.
Using Bayes theorem, our posterior distribution for $p$ is
$$\pi(p \mid x_1, \ldots, x_n) = \frac{\pi(p)\, f(x_1, \ldots, x_n \mid p)}{\int_0^1 \pi(t)\, f(x_1, \ldots, x_n \mid t)\, dt}$$
$$\propto \pi(p)\, f(x_1, \ldots, x_n \mid p)$$
$$= \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\, p^{r-1}(1-p)^{s-1} \cdot p^{\sum_{i=1}^n x_i}(1-p)^{\,n - \sum_{i=1}^n x_i}$$
$$\propto p^{\,r-1+\sum_{i=1}^n x_i}\,(1-p)^{\,s-1+n-\sum_{i=1}^n x_i}.$$
The last expression is proportional to the Beta($r + \sum_{i=1}^n x_i$, $s + n - \sum_{i=1}^n x_i$) density, so the posterior density for $p$ is Beta($r + \sum_{i=1}^n x_i$, $s + n - \sum_{i=1}^n x_i$).
Families of distributions for which the prior and posterior
distributions belong to the same family are called conjugate
families.
The posterior mean is thus equal to
$$\frac{r + \sum_{i=1}^n x_i}{r + s + n}$$
and the posterior variance is
$$\frac{\left(r + \sum_{i=1}^n x_i\right)\left(s + n - \sum_{i=1}^n x_i\right)}{(r + s + n)^2 (r + s + n + 1)}.$$
Returning to our example in which Yao Ming made 345 out
of the 406 free throws he attempted (85.0%) in 2007-2008,
if we had a Beta(1,1) [uniform] prior, then the posterior
distribution for Yao’s probability of making a free throw in
the next season is Beta(345+1, 406-345+1) = Beta(346, 62). The
posterior mean is 0.848 and the posterior standard deviation
is 0.018.
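A short sketch (plain Python) reproducing these posterior calculations from the Beta posterior formula above:

```python
from math import sqrt

# Beta-Bernoulli posterior for the Yao Ming example with a Beta(1,1) (uniform) prior.
r, s = 1, 1
made, attempts = 345, 406

r_post = r + made                  # r + sum of x_i
s_post = s + attempts - made       # s + n - sum of x_i
post_mean = r_post / (r_post + s_post)
post_var = r_post * s_post / ((r_post + s_post) ** 2 * (r_post + s_post + 1))

print(r_post, s_post)              # 346, 62
print(round(post_mean, 3))         # 0.848
print(round(sqrt(post_var), 3))    # 0.018
```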
A valuable feature of the Bayesian approach is its ability to
incorporate prior information.
Returning to our example in which Yao Ming made 345 out
of the 406 free throws he attempted (85.0%) in 2007-2008,
we could use information from Yao’s previous seasons.
Season       Free Throws Made   Free Throws Attempted   FT%
2002-2003    301                371                     0.811
2003-2004    361                446                     0.809
2004-2005    389                497                     0.783
2005-2006    337                395                     0.853
2006-2007    356                413                     0.861
Total        1744               2122                    0.822
A way to think about the Beta( r , s ) prior is the following:
Suppose that if we had no informative prior information on
p , we would use a uniform (Beta (1,1)) prior. Then a
Beta( r , s ) prior means that our prior information is
equivalent to seeing r  1 successes and s  1 failures prior
to seeing the current data. So, for example, a Beta(82,18)
prior is equivalent to saying that our prior information is
equivalent to seeing 81 made free throws and 17 missed
free throws prior to seeing our data. If we consider all the
free throws from seasons prior to 2007-2008 to be
equivalent to free throws in 2007-2008, we would use a
Beta(1744+1, 2122-1744+1) = Beta(1745, 379) prior. If we consider each free throw in prior seasons to be equivalent to half a free throw of information, we would use a Beta(1744*0.5+1, (2122-1744)*0.5+1) = Beta(873, 190) prior.
When there is more data and prior beliefs are less strong,
the prior distribution does not play as strong a role.
The posterior mean is
$$\frac{r + \sum_{i=1}^n x_i}{r + s + n} = \frac{n}{r + s + n} \cdot \frac{\sum_{i=1}^n x_i}{n} + \frac{r + s}{r + s + n} \cdot \frac{r}{r + s}.$$
The posterior mean is a weighted average of the sample mean and the prior mean, with the weight on the sample mean increasing to 1 as $n \to \infty$.
Example 2: A cofferdam protecting a construction site was
designed to withstand flows of up to 1870 cubic feet per
second (cfs). An engineer wishes to compute the
probability that the dam will be overtopped during the
upcoming year. Over the previous 25-year period, the
annual maximum flood levels of the dam had ranged from
629 to 4720 cfs and 1870 cfs had been exceeded 5 times.
Modeling the 25 years as 25 independent Bernoulli trials
with the same probability p that the flood level will exceed
1870 cfs and using a uniform prior distribution for $p$ (which corresponds to a Beta(1,1) prior), the posterior distribution for $p$ is Beta(1+5, 1+25-5) = Beta(6, 21); the prior and posterior densities are shown in the figure below.
[Figure: prior Beta(1,1) and posterior Beta(6,21) densities for $p$.]
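A minimal sketch of this calculation, assuming scipy is available (the grid of $p$ values is hypothetical, used only to stand in for the figure):

```python
# Sketch: uniform prior, 5 exceedances of 1870 cfs in 25 years.
from scipy.stats import beta

exceedances, years = 5, 25
prior = beta(1, 1)                                          # Beta(1,1) = uniform prior on p
posterior = beta(1 + exceedances, 1 + years - exceedances)  # Beta(6, 21)

# Posterior mean of p: the probability the dam is overtopped in the upcoming year.
print(posterior.mean())   # 6/27, about 0.222

# Values of the prior and posterior densities on a grid (what the figure displays).
for p in (0.1, 0.2, 0.3, 0.4):
    print(p, prior.pdf(p), posterior.pdf(p))
```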
IV. Bayesian Inference for the Normal Distribution
Suppose $X_1, \ldots, X_n$ are iid $N(\theta, \sigma^2)$ with $\sigma^2$ known, and our prior on $\theta$ is $N(\eta, b^2)$.
The posterior distribution is proportional to
$$f(x \mid \theta)\,\pi(\theta) = \left[\prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x_i - \theta)^2}{2\sigma^2}\right\}\right] \frac{1}{\sqrt{2\pi}\,b} \exp\left\{-\frac{(\theta - \eta)^2}{2b^2}\right\}$$
$$\propto \exp\left\{-\frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2n\bar{x}\theta + n\theta^2\right) - \frac{1}{2b^2}\left(\theta^2 - 2\theta\eta + \eta^2\right)\right\}$$
$$\propto \exp\left\{-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{b^2}\right)\theta^2 - 2\theta\left(\frac{n\bar{x}}{\sigma^2} + \frac{\eta}{b^2}\right)\right]\right\}$$
$$\propto \exp\left\{-\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{b^2}\right)\left(\theta - \frac{n\bar{x}/\sigma^2 + \eta/b^2}{n/\sigma^2 + 1/b^2}\right)^2\right\}.$$
Thus, the posterior distribution is
$$N\left(\frac{n/\sigma^2}{n/\sigma^2 + 1/b^2}\,\bar{x} + \frac{1/b^2}{n/\sigma^2 + 1/b^2}\,\eta,\ \ \frac{1}{n/\sigma^2 + 1/b^2}\right).$$
Note that the posterior mean is a weighted average of the sample mean and the prior mean, with the weight on the sample mean increasing to 1 as $n \to \infty$.
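A minimal sketch of the posterior computation for the normal model above (the numerical values of $\eta$, $b^2$, $\sigma^2$, $n$ and $\bar{x}$ are hypothetical):

```python
from math import sqrt

# Posterior for a normal mean theta with known variance sigma^2 and prior N(eta, b^2).
def normal_posterior(xbar, n, sigma2, eta, b2):
    """Return the posterior mean and variance of theta given the sample mean xbar."""
    precision = n / sigma2 + 1 / b2            # posterior precision
    post_mean = (n * xbar / sigma2 + eta / b2) / precision
    return post_mean, 1 / precision

# Hypothetical example: prior N(0, 4), sigma^2 = 1, n = 10 observations with mean 1.5.
m, v = normal_posterior(xbar=1.5, n=10, sigma2=1.0, eta=0.0, b2=4.0)
print(m, sqrt(v))   # posterior mean is pulled slightly toward the prior mean of 0
```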
V. Bickel and Doksum’s perspective on Bayesian models.
(a) Bayesian models are useful as a way to generate
statistical procedures which incorporate prior information
when appropriate.
However, statistical procedures should be evaluated in a
frequentist (repeated sampling) way:
For example, for the iid Bernoulli trials example, if we use the posterior mean with a uniform prior distribution to estimate $p$, i.e., $\hat{p} = \frac{1 + \sum_{i=1}^n x_i}{n + 2}$, then we should look at how this estimate would perform in many repetitions of the experiment when the true parameter is $p$, for various values of $p$. More to come on this frequentist perspective in Section 1.3.
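A minimal sketch of such a repeated-sampling evaluation (the choices of $n$, the number of repetitions, and the true values of $p$ are hypothetical):

```python
import random

# Frequentist evaluation of the posterior-mean estimator (1 + sum x_i)/(n + 2)
# by simulating many repetitions of the experiment for a fixed true p.
def evaluate(p_true, n=25, reps=10_000, seed=0):
    rng = random.Random(seed)
    estimates = [
        (1 + sum(rng.random() < p_true for _ in range(n))) / (n + 2)
        for _ in range(reps)
    ]
    bias = sum(estimates) / reps - p_true
    mse = sum((e - p_true) ** 2 for e in estimates) / reps
    return bias, mse

for p_true in (0.1, 0.5, 0.85):
    print(p_true, evaluate(p_true))
```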
(b) We can view the parameter as random and view the
Bayesian model as providing a joint probability distribution
on the parameter and the data in a frequentist probability
sense.
Consider the frequentist probability model $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$ for the probability distribution of the data $X$.
The subjective Bayesian perspective is that there is a true unknown $\theta$ and our goal is to describe our beliefs about $\theta$ after seeing the data $X$. This requires specifying a prior distribution $\pi(\theta)$ for our beliefs about $\theta$; the posterior distribution describes our beliefs about $\theta$ after seeing the data $X$.
Bickel and Doksum’s viewpoint is to see a Bayesian model as a model for the joint probability distribution of $(\theta, X)$ where $\theta$ is considered random. Specifically, a Bayesian model consists of a specification of the marginal distribution of $\theta$, which is the prior distribution $\pi(\theta)$, and a specification of the conditional distribution of the data $X \mid \theta$, i.e., $P_\theta$.
For example, if the data $X_1, \ldots, X_n$ are iid Bernoulli trials with probability $p$ of success and the prior distribution for $p$ is Beta($r$,$s$), then the joint probability distribution for $(p, X_1, \ldots, X_n)$ is generated by:
1. We first generate $p$ from a Beta($r$,$s$) distribution.
2. Conditional on $p$, we generate $X_1, \ldots, X_n$ iid Bernoulli trials with probability $p$ of success.
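A minimal sketch of this two-step generation (the prior parameters and $n$ below are hypothetical):

```python
import random

# Generate (p, X_1, ..., X_n) from the Bayesian model:
# p ~ Beta(r, s), then X_i | p iid Bernoulli(p).
rng = random.Random(0)
r, s, n = 2, 3, 10                              # hypothetical Beta(r, s) prior and sample size

p = rng.betavariate(r, s)                       # step 1: draw p from the prior
x = [int(rng.random() < p) for _ in range(n)]   # step 2: draw the Bernoulli trials given p
print(p, x)
```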
Although the conceptual basis for regarding $\theta$ as random in a frequentist sense is not clear, this point of view is useful for developing properties of Bayesian procedures.