Lecture 6: Decision-theoretic/Bayesian inference: Hypothesis testing for parameters, univariate case

One-sided testing
$$H_0: \theta \le \theta_0 \quad \text{vs.} \quad H_1: \theta > \theta_0 \qquad\text{or}\qquad H_0: \theta \ge \theta_0 \quad \text{vs.} \quad H_1: \theta < \theta_0$$
The Classical (frequentist) way:
Decide what the level of significance (α) should be.
Calculate the P-value

$$P\text{-value} = \Pr\bigl(g(\hat\theta) \ge g(\hat\theta_{\mathrm{obs}}) \mid \theta = \theta_0\bigr) \qquad\text{or}\qquad \Pr\bigl(g(\hat\theta) \le g(\hat\theta_{\mathrm{obs}}) \mid \theta = \theta_0\bigr)$$

where $g$ is a function of the data involving a point estimate $\hat\theta$ of $\theta$.
If the P-value ≤ α, reject H0 in favour of H1; otherwise do not reject (until sufficient information makes a rejection possible).
Pros and Cons with the classical approach
• Well-developed framework for several kinds of parameters in standard
statistical software
• Widespread and acknowledged in many scientific fields
• Free (!) from subjective input
• How many actually understand the meaning of a P-value? It is one of the most misunderstood concepts in statistical theory.
• The P-value is calculated as the probability of the set of values of the function $g$ as extreme as or more extreme than $g(\hat\theta_{\mathrm{obs}})$, but the only thing observed is $\hat\theta_{\mathrm{obs}}$.
• Is it generally preferable to discard subjective input to statistical inference? What if, e.g., a proportion is known to always be less than a certain value (for physical or medical reasons)? Allowing the whole range from 0 to 1 is then a kind of waste.
A decision-theoretic approach to one-sided testing
As we have seen before, the decision to be taken is which of the two hypotheses should be taken as true.
Let d0 be the decision that H0 is true and d1 the decision that H1 is true.
A general loss function suitable for one-sided testing
$$L(d_i, \theta) = \begin{cases} 0 & \text{if } H_i \text{ is true} \\ k_i\,\lvert\theta - \theta_0\rvert^{\,b} & \text{if } H_i \text{ is false} \end{cases} \qquad i = 0, 1$$
where b ≥ 0, and k0 and k1 can be chosen differently to reflect different degrees of severity of making a wrongful decision. Choosing b = 1 is common (linear loss).
Minimizing the expected loss (Bayesian hypothesis testing)
Let h(θ) be a probability density function representing the uncertainty about the true value of θ. h can be the prior density or the posterior density given data.
Expected loss with decision $d_0$ (taking $b = 1$, linear loss):

$$L(d_0, h) = \begin{cases} \displaystyle\int_{\theta > \theta_0} k_0(\theta - \theta_0)\,h(\theta)\,d\theta = k_0\!\int_{\theta > \theta_0}\!\theta\,h(\theta)\,d\theta - k_0\,\theta_0\Pr_h(\theta > \theta_0) & \text{if } H_0: \theta \le \theta_0 \\[2ex] \displaystyle\int_{\theta < \theta_0} k_0(\theta_0 - \theta)\,h(\theta)\,d\theta = k_0\,\theta_0\Pr_h(\theta < \theta_0) - k_0\!\int_{\theta < \theta_0}\!\theta\,h(\theta)\,d\theta & \text{if } H_0: \theta \ge \theta_0 \end{cases}$$
Expected loss with decision $d_1$:

$$L(d_1, h) = \begin{cases} \displaystyle\int_{\theta \le \theta_0} k_1(\theta_0 - \theta)\,h(\theta)\,d\theta = k_1\,\theta_0\Pr_h(\theta \le \theta_0) - k_1\!\int_{\theta \le \theta_0}\!\theta\,h(\theta)\,d\theta & \text{if } H_0: \theta \le \theta_0 \\[2ex] \displaystyle\int_{\theta \ge \theta_0} k_1(\theta - \theta_0)\,h(\theta)\,d\theta = k_1\!\int_{\theta \ge \theta_0}\!\theta\,h(\theta)\,d\theta - k_1\,\theta_0\Pr_h(\theta \ge \theta_0) & \text{if } H_0: \theta \ge \theta_0 \end{cases}$$
In particular, with h(θ) = q(θ | x), i.e. the posterior density, these become the posterior expected losses.

Take action $d_0$ if

$$L(d_0, q(\theta \mid x)) < L(d_1, q(\theta \mid x))$$

and take action $d_1$ if

$$L(d_0, q(\theta \mid x)) > L(d_1, q(\theta \mid x))$$

The test is inconclusive if

$$L(d_0, q(\theta \mid x)) = L(d_1, q(\theta \mid x))$$
Example
Assume we would like to test whether the proportion $\theta$ of voters supporting a certain political party is at least 4 percent (the magical threshold for entering the Swedish Parliament in general elections). An opinion poll among 200 respondents found that 6 persons said they supported this party.
History tells us that this party has, over the previous 20 years, always obtained between 4 and 8 percent of the votes. Hence it is wise to assign a beta prior for this proportion centred around 6 percent and with a range of about 5 percentage units. This can be interpreted as the mean being 6 % and the standard deviation being about 5/6 percentage units.
For a normal distribution, an interval of length 6 standard deviations covers about 99.7 % of the variation if it is placed symmetrically around the mean.
Now, for a beta distribution with shape parameters $a$ and $b$, the mean is $\mu = a/(a+b)$ and the variance is $\sigma^2 = ab/[(a+b)^2(a+b+1)]$. Solving for $a$ and $b$ gives

$$a = \mu\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right) \qquad\text{and}\qquad b = (1-\mu)\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right)$$
Hence, a reasonable prior distribution for $\theta$ may be a beta distribution with parameters

$$a = 0.06\left(\frac{0.06(1-0.06)}{(0.05/6)^2} - 1\right) \approx 49 \qquad\text{and}\qquad b = (1-0.06)\left(\frac{0.06(1-0.06)}{(0.05/6)^2} - 1\right) \approx 763$$
The posterior distribution given the data (6 out of 200) is again a beta distribution
with parameters a* = a + 6 = 55 and b* = b + 200 – 6 = 957.
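A quick numeric check of this prior elicitation and the conjugate beta-binomial update (my own sketch, not part of the lecture):

```python
# Minimal sketch: solve for the beta parameters from the stated prior mean and
# standard deviation, then apply the conjugate beta-binomial update.
mu, sd = 0.06, 0.05 / 6                # prior mean and standard deviation
common = mu * (1 - mu) / sd**2 - 1
a, b = mu * common, (1 - mu) * common
print(a, b)                            # about 48.7 and 762.5, rounded above to 49 and 763

a, b = 49, 763                         # rounded values used in the lecture
successes, n = 6, 200                  # 6 supporters among 200 respondents
a_post, b_post = a + successes, b + n - successes
print(a_post, b_post)                  # 55 and 957
```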
Here we can use a symmetric linear loss function with $k_0 = k_1 = 1$. Thus, since $H_0: \theta \ge 0.04$, we get the two expected posterior losses
L d 0 , beta (55,957 )   
  0.04
 54 1   
956
0 .4
0.04  
0
B 55,957 
 0.04  P
beta ( 55 , 957 
d 
1  0.04    
 55 1   
0 .4
956
 B55,957 
0
 54 1   
956
B 55,957 
d 
d 
B 56,957  beta ( 56 ,957 
  0.04  
  0.04   3.02 E  5
P
B 55,957 
L d1 , beta (55,957 )   
  0.04
 55 1   
956
1

0.04

Since
B 55,957 
d  0.04
1    0.04  
1

0.04
 54 1   
956
B 55,957 
 54 1   
956
B 55,957 
d 
d 
B 56,957  beta ( 56 ,957 
  0.04   0.04  P beta (55,957    0.04   0.014
P
B 55,957 
L d 0 , beta (55,957   L d1 , beta (55,957 
action d0 should be
taken, i.e. decide that the proportion of voters supporting the party is at least
4 percent.
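These two expected posterior losses are easy to check numerically; the sketch below (my own, not from the lecture) uses the cdf of scipy.stats.beta together with the identity B(56, 957)/B(55, 957) = 55/1012.

```python
# Minimal sketch verifying the expected posterior losses for beta(55, 957).
from scipy.stats import beta

a, b, theta0 = 55, 957, 0.04
m = a / (a + b)                  # B(a + 1, b) / B(a, b) = a / (a + b)

# L(d0): cost of deciding theta >= 0.04 when in fact theta < 0.04
L_d0 = theta0 * beta.cdf(theta0, a, b) - m * beta.cdf(theta0, a + 1, b)
# L(d1): cost of deciding theta < 0.04 when in fact theta >= 0.04
L_d1 = m * beta.sf(theta0, a + 1, b) - theta0 * beta.sf(theta0, a, b)

print(L_d0, L_d1)                # roughly 3e-5 and 0.014, so d0 is preferred
```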
Compare with the classical significance test:
$$z = \frac{\hat\theta - 0.04}{\sqrt{\dfrac{0.04 \cdot 0.96}{200}}} = \frac{0.03 - 0.04}{\sqrt{\dfrac{0.04 \cdot 0.96}{200}}} \approx -0.72 \qquad\qquad P\text{-value: } \Phi(-0.72) \approx 0.24$$
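The same classical calculation as a quick check (my own sketch, not part of the lecture):

```python
# Minimal sketch of the classical one-sided test based on the normal approximation.
from math import sqrt
from scipy.stats import norm

theta_hat, theta0, n = 6 / 200, 0.04, 200
z = (theta_hat - theta0) / sqrt(theta0 * (1 - theta0) / n)
print(z, norm.cdf(z))            # about -0.72 and a one-sided P-value of about 0.24
```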
Comparisons
Consider the issue of testing $H_0: \theta_1 \ge \theta_2$ (or $\theta_1 \le \theta_2$) against $H_1: \theta_1 < \theta_2$ (or $\theta_1 > \theta_2$), where $\theta_1$ and $\theta_2$ are the same type of parameter but from two different populations.

By rewriting $H_0$ as $\theta_1 - \theta_2 \ge 0$ (and correspondingly $H_1$ as $\theta_1 - \theta_2 < 0$) we have transferred the problem back to a test of the value of the compound parameter $\theta = \theta_1 - \theta_2$.
Statistical calculations for the “merging” of two populations will then be needed to sort out the problem: independent populations and independent samples give product priors and product likelihoods.
Two-sided testing
So far the hypotheses specified through parameters have been of the kind
$$H_0: \theta \in \Theta_0 \qquad\qquad H_1: \theta \in \Theta_1 \;(= \Theta \setminus \Theta_0)$$

where $\Theta$ is the parameter space.
This may be viewed as a classification problem: the issue is to choose, between two hypotheses (two models), the one that best describes the data, given prior probabilities for both hypotheses.
The specification through a parameter induces the need for integrating the
likelihood function as soon as any of the hypotheses is composite.
Very often, though, the issue is to analyse two sets of data to make inference about whether they originate from the same model.
H0: Model for data set 1 = Model for data set 2
H1: Model for data set 1 ≠ Model for data set 2
Not so much about the models themselves, but essentially about whether they are
common or not.
Examples
• Are the effects of the two treatments equal or not?
• Do the two seizures of amphetamine come from the same manufacturing
batch?
• Was the recovered shoeprint made by the suspect's shoe?
In many situations a parametric model is common for the two data sets.
The difference is then expressed in terms of one or several parameters.
H 0 : 1   2
H 1 : 1   2
Assuming independent data sets, the general likelihood function is

$$L(\theta_1, \theta_2 \mid \text{Data}) = L(\theta_1 \mid \text{Data set 1}) \cdot L(\theta_2 \mid \text{Data set 2})$$

However, the likelihood function under $H_0$ is a function of one parameter value only:

$$L(H_0 \mid \text{Data}) = L(\theta \mid \text{Data set 1}) \cdot L(\theta \mid \text{Data set 2})$$

since $H_0$ states $\theta_1 = \theta_2 = \theta$.
With prior density $p(\theta)$ for $\theta$ (common to both hypotheses) the Bayes factor becomes

$$B = \frac{\displaystyle\int L(\theta \mid \text{Data set 1})\, L(\theta \mid \text{Data set 2})\, p(\theta)\, d\theta}{\displaystyle\iint L(\theta_1 \mid \text{Data set 1})\, L(\theta_2 \mid \text{Data set 2})\, p(\theta_1)\, p(\theta_2)\, d\theta_1\, d\theta_2} = \frac{\displaystyle\int L(\theta \mid \text{Data set 1})\, L(\theta \mid \text{Data set 2})\, p(\theta)\, d\theta}{\displaystyle\int L(\theta_1 \mid \text{Data set 1})\, p(\theta_1)\, d\theta_1 \cdot \int L(\theta_2 \mid \text{Data set 2})\, p(\theta_2)\, d\theta_2}$$
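For a one-dimensional parameter this Bayes factor is straightforward to evaluate numerically: one integral in the numerator and a product of two integrals in the denominator. The sketch below is my own illustration (not part of the lecture); lik1, lik2 and prior are assumed to be ordinary Python callables for L(θ | data set 1), L(θ | data set 2) and p(θ).

```python
# Minimal sketch of the Bayes factor for H0: theta1 = theta2 vs H1: theta1 != theta2.
import numpy as np
from scipy import integrate

def bayes_factor(lik1, lik2, prior, lower=-np.inf, upper=np.inf):
    num, _ = integrate.quad(lambda t: lik1(t) * lik2(t) * prior(t), lower, upper)
    den1, _ = integrate.quad(lambda t: lik1(t) * prior(t), lower, upper)
    den2, _ = integrate.quad(lambda t: lik2(t) * prior(t), lower, upper)
    return num / (den1 * den2)
```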
Example
Assume we are comparing two seizures of cannabis with respect to their mean
concentration of THC (Tetrahydrocannabinol, the active narcotic substance).
Denote the two means $\mu_1$ and $\mu_2$, respectively.
The concentration for the type of material in question (typically inflorescences) usually varies between 5 and 20 % with a peak around 12 %. We therefore assign a prior density for the mean concentration of any material of this type as

$$\mu \sim N\!\left(\text{mean} = 12,\ \text{standard deviation} = \frac{20 - 5}{6} = 2.5\right)\ (\%)$$

For symmetric, close-to-normal distributions a reasonable estimate of the standard deviation is Range/6, since the range is almost covered by the mean $\pm 3$ standard deviations for a normal distribution.
To simplify formulas, denote the prior $N(\theta, \tau)$ where $\theta = 12$ and $\tau$, the standard deviation, is 2.5:

$$p(\mu) = \frac{1}{\tau\sqrt{2\pi}} \exp\!\left(-\frac{(\mu - \theta)^2}{2\tau^2}\right)$$
Now, let's say we have a method of measurement that gives a value $x$ which is normally distributed, $x \sim N(\mu, \sigma)$, where $\mu$ is the true mean concentration and $\sigma$ is the standard deviation. Let's further assume that the method has been validated to provide a standard deviation of about 0.1 percentage points, so that $x \sim N(\mu, 0.1)$.
For $n_1$ measurements $x_{11}, \ldots, x_{1n_1}$ on the first seizure we obtain the likelihood function

$$L(\mu_1 \mid x_{11}, \ldots, x_{1n_1}) = \prod_{i=1}^{n_1} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x_{1i} - \mu_1)^2}{2\sigma^2}\right) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n_1} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n_1}(x_{1i} - \mu_1)^2\right)$$

This can be shown to be

$$L(\mu_1 \mid x_{11}, \ldots, x_{1n_1}) = g_1(x_{11}, \ldots, x_{1n_1}, \sigma) \cdot \exp\!\left(-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2\right) = g_1 \cdot \exp\!\left(-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2\right)$$

where the function $g_1$ does not depend on $\mu_1$.
Analogously, for $n_2$ measurements $x_{21}, \ldots, x_{2n_2}$ on the second seizure we obtain the likelihood function

$$L(\mu_2 \mid x_{21}, \ldots, x_{2n_2}) = g_2(x_{21}, \ldots, x_{2n_2}, \sigma) \cdot \exp\!\left(-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu_2)^2\right) = g_2 \cdot \exp\!\left(-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu_2)^2\right)$$
Hence, the Bayes factor is

$$B = \frac{\displaystyle\int g_1 e^{-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu)^2}\, g_2 e^{-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu)^2}\, \frac{1}{\tau\sqrt{2\pi}} e^{-\frac{(\mu - \theta)^2}{2\tau^2}}\, d\mu}{\displaystyle\int g_1 e^{-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2}\, \frac{1}{\tau\sqrt{2\pi}} e^{-\frac{(\mu_1 - \theta)^2}{2\tau^2}}\, d\mu_1 \cdot \int g_2 e^{-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu_2)^2}\, \frac{1}{\tau\sqrt{2\pi}} e^{-\frac{(\mu_2 - \theta)^2}{2\tau^2}}\, d\mu_2}$$

$$= \frac{\displaystyle\int e^{-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu)^2}\, e^{-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu)^2}\, e^{-\frac{(\mu - \theta)^2}{2\tau^2}}\, d\mu}{\dfrac{1}{\tau\sqrt{2\pi}}\displaystyle\int e^{-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2}\, e^{-\frac{(\mu_1 - \theta)^2}{2\tau^2}}\, d\mu_1 \cdot \int e^{-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu_2)^2}\, e^{-\frac{(\mu_2 - \theta)^2}{2\tau^2}}\, d\mu_2}$$

A little bit tedious to sort out.
Simplification when comparing two normal means
Express the hypotheses as
H 0 : 1   2    0
H 1 : 1   2    0


2
 n
L 1 x11 ,..., x1n1  g1  exp  1 2 x1  1  
 2

Above was shown that
If we “reduce” the sample $x_{11}, \ldots, x_{1n_1}$ to its sample mean $\bar x_1$, the likelihood function for $\mu_1$ becomes

$$L(\mu_1 \mid \bar x_1) = \frac{1}{\dfrac{\sigma}{\sqrt{n_1}}\sqrt{2\pi}} \exp\!\left(-\frac{(\bar x_1 - \mu_1)^2}{2\sigma^2/n_1}\right)$$

since

$$\bar x_1 \sim N\!\left(\text{mean} = \mu_1,\ \text{standard deviation} = \frac{\sigma}{\sqrt{n_1}}\right)$$
Note that

$$L(\mu_1 \mid x_{11}, \ldots, x_{1n_1}) \propto \exp\!\left(-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2\right)$$

and

$$L(\mu_1 \mid \bar x_1) \propto \exp\!\left(-\frac{(\bar x_1 - \mu_1)^2}{2\sigma^2/n_1}\right) = \exp\!\left(-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2\right)$$

Both likelihood functions are proportional to a common essential part, i.e. the part that contains $\mu_1$. This is because the sample mean is a sufficient statistic for the population mean.
Analogously,

$$L(\mu_2 \mid \bar x_2) \propto \exp\!\left(-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu_2)^2\right)$$
Now, for independent samples,

$$\bar x_1 - \bar x_2 \sim N\!\left(\mu_1 - \mu_2,\ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right)$$

With $\mu_1 - \mu_2 = \Delta$ and $\sigma_1 = \sigma_2 = \sigma$ we get

$$\bar x_1 - \bar x_2 \sim N\!\left(\Delta,\ \sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right)$$
$\bar x_1 - \bar x_2$ is a sufficient statistic for $\mu_1 - \mu_2 = \Delta$, and “reducing” the two data sets to $\bar x_1 - \bar x_2$ gives the likelihood function for $\Delta$ as

$$L(\Delta \mid \bar x_1 - \bar x_2) = \frac{1}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}\,\sqrt{2\pi}} \exp\!\left(-\frac{(\bar x_1 - \bar x_2 - \Delta)^2}{2\sigma^2\!\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}\right)$$
The Bayes factor becomes
$$B = \frac{L(\Delta = 0 \mid \bar x_1 - \bar x_2)}{\displaystyle\int L(\Delta \mid \bar x_1 - \bar x_2)\, p(\Delta)\, d\Delta} = \frac{\dfrac{1}{\sigma\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}\,\sqrt{2\pi}} \exp\!\left(-\dfrac{(\bar x_1 - \bar x_2)^2}{2\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)}\right)}{\displaystyle\int \dfrac{1}{\sigma\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}\,\sqrt{2\pi}} \exp\!\left(-\dfrac{(\bar x_1 - \bar x_2 - \Delta)^2}{2\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)}\right) \cdot \dfrac{1}{2\tau\sqrt{\pi}} \exp\!\left(-\dfrac{\Delta^2}{4\tau^2}\right) d\Delta}$$

$$= \frac{\exp\!\left(-\dfrac{(\bar x_1 - \bar x_2)^2}{2\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)}\right)}{\displaystyle\int \exp\!\left(-\dfrac{(\bar x_1 - \bar x_2 - \Delta)^2}{2\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)}\right) \dfrac{1}{2\tau\sqrt{\pi}} \exp\!\left(-\dfrac{\Delta^2}{4\tau^2}\right) d\Delta}$$

Here $p(\Delta)$ is the prior density of $\Delta = \mu_1 - \mu_2$ under $H_1$: since $\mu_1$ and $\mu_2$ are a priori independent $N(\theta, \tau)$, we have $\Delta \sim N(0, \tau\sqrt{2})$, i.e. $p(\Delta) = \dfrac{1}{2\tau\sqrt{\pi}} \exp\!\left(-\dfrac{\Delta^2}{4\tau^2}\right)$.
A little bit easier to sort out. It can be proven to be

$$B = \sqrt{1 + \frac{2\tau^2\, n_1 n_2}{\sigma^2 (n_1 + n_2)}}\ \exp\!\left\{-\frac{(\bar x_1 - \bar x_2)^2}{2}\left[\frac{1}{\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)} - \frac{1}{\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right) + 2\tau^2}\right]\right\}$$
Inserting the values of $\sigma$ (0.1) and $\tau$ (2.5) we obtain

$$B = \sqrt{1 + \frac{12.5\, n_1 n_2}{0.01\, (n_1 + n_2)}}\ \exp\!\left\{-\frac{(\bar x_1 - \bar x_2)^2}{2}\left[\frac{1}{0.01\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)} - \frac{1}{0.01\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right) + 12.5}\right]\right\}$$
Let’s say we have obtained the sample means 18.1 % and 18.2 % and that our prior
odds for the two hypotheses are 1 (non-informative prior).
  n1     n2       B      Posterior odds   Classical P-value
   2      2     21.46        21.46             0.3173
   2      3     21.27        21.27             0.2733
   5      5     16.03        16.03             0.1138
   5      7     14.05        14.05             0.0877
  10     10      6.49         6.49             0.0253
  30     30      0.08         0.08             0.0001
 100    100      0.00         0.00             0.0000
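The table can be reproduced with a few lines of code. The sketch below (my own, not part of the lecture) implements the closed-form Bayes factor with σ = 0.1 and τ = 2.5 together with the classical two-sided test of equal means with known σ; the function names bayes_factor_normal and classical_p_value are hypothetical.

```python
# Minimal sketch reproducing the table: Bayes factor, posterior odds (prior odds 1)
# and classical two-sided P-value for sample means 18.1 and 18.2.
from math import exp, sqrt
from scipy.stats import norm

def bayes_factor_normal(n1, n2, x1bar, x2bar, sigma=0.1, tau=2.5):
    v = sigma**2 * (1 / n1 + 1 / n2)       # variance of x1bar - x2bar for given Delta
    scale = sqrt(1 + 2 * tau**2 * n1 * n2 / (sigma**2 * (n1 + n2)))
    return scale * exp(-(x1bar - x2bar)**2 / 2 * (1 / v - 1 / (v + 2 * tau**2)))

def classical_p_value(n1, n2, x1bar, x2bar, sigma=0.1):
    z = (x1bar - x2bar) / (sigma * sqrt(1 / n1 + 1 / n2))
    return 2 * norm.sf(abs(z))             # two-sided P-value, sigma assumed known

for n1, n2 in [(2, 2), (2, 3), (5, 5), (5, 7), (10, 10), (30, 30), (100, 100)]:
    B = bayes_factor_normal(n1, n2, 18.1, 18.2)
    # with prior odds 1 the posterior odds equal the Bayes factor B
    print(n1, n2, round(B, 2), round(classical_p_value(n1, n2, 18.1, 18.2), 4))
```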