Lecture 6: Decision-theoretic/Bayesian inference: Hypothesis testing for parameters, univariate case

One-sided testing
$$H_0: \theta \le \theta_0 \quad \text{vs.} \quad H_1: \theta > \theta_0 \qquad\text{or}\qquad H_0: \theta \ge \theta_0 \quad \text{vs.} \quad H_1: \theta < \theta_0$$
The Classical (frequentist) way:
Decide what the level of significance (α) should be.
Calculate the P-value

$$P\text{-value} = \Pr\bigl(g(\hat\theta) \ge g(\hat\theta_{\mathrm{obs}}) \mid \theta = \theta_0\bigr) \qquad\text{or}\qquad \Pr\bigl(g(\hat\theta) \le g(\hat\theta_{\mathrm{obs}}) \mid \theta = \theta_0\bigr)$$

where $g$ is a function of the data involving a point estimate $\hat\theta$ of $\theta$.
If the P-value ≤ α, reject H0 in favour of H1; otherwise do not reject (until sufficient information makes a rejection possible).
Pros and Cons with the classical approach
• Well-developed framework for several kinds of parameters in standard
statistical software
• Widespread and acknowledged in many scientific fields
• Free (!) from subjective input
• How many actually understand the meaning of a P-value? It is one of the most misunderstood concepts in statistical theory.
• The P-value is calculated as the probability of the set of values of the function $g$ as extreme as or more extreme than $g(\hat\theta_{\mathrm{obs}})$, but the only thing observed is $\hat\theta_{\mathrm{obs}}$.
• Is it generally preferable to discard subjective input to statistical inference? What if, e.g., a proportion is known to always be less than a certain value (for physical or medical reasons)? Allowing the whole range from 0 to 1 is then a kind of waste.
A decision-theoretic approach to one-sided testing
As we have seen before, the decision to be taken is which of the two hypotheses should be taken as true.
Let d0 be the decision that H0 is true and d1 the decision that H1 is true.
A general loss function suitable for one-sided testing
$$L(d_i, \theta) = \begin{cases} 0 & \text{if } H_i \text{ is true} \\ k_i\,\lvert\theta - \theta_0\rvert^{\,b} & \text{if } H_i \text{ is false} \end{cases} \qquad i = 0, 1$$
where b ≥ 0, and k0 and k1 can be chosen differently to reflect different degrees of severity of making a wrongful decision. Choosing b = 1 is common (linear loss).
Minimizing the expected loss (Bayesian hypothesis testing)
Let h(θ) be a probability density function representing the uncertainty about the true value of θ. h can be the prior density or the posterior density given data.
Expected loss with decision $d_0$ (taking $b = 1$, linear loss):

$$L(d_0, h) = \begin{cases} \displaystyle\int_{\theta > \theta_0} k_0(\theta - \theta_0)\,h(\theta)\,d\theta = k_0\!\int_{\theta > \theta_0}\!\theta\,h(\theta)\,d\theta - k_0\,\theta_0\Pr_h(\theta > \theta_0) & \text{if } H_0: \theta \le \theta_0 \\[2ex] \displaystyle\int_{\theta < \theta_0} k_0(\theta_0 - \theta)\,h(\theta)\,d\theta = k_0\,\theta_0\Pr_h(\theta < \theta_0) - k_0\!\int_{\theta < \theta_0}\!\theta\,h(\theta)\,d\theta & \text{if } H_0: \theta \ge \theta_0 \end{cases}$$
Expected loss with decision $d_1$:

$$L(d_1, h) = \begin{cases} \displaystyle\int_{\theta \le \theta_0} k_1(\theta_0 - \theta)\,h(\theta)\,d\theta = k_1\,\theta_0\Pr_h(\theta \le \theta_0) - k_1\!\int_{\theta \le \theta_0}\!\theta\,h(\theta)\,d\theta & \text{if } H_0: \theta \le \theta_0 \\[2ex] \displaystyle\int_{\theta \ge \theta_0} k_1(\theta - \theta_0)\,h(\theta)\,d\theta = k_1\!\int_{\theta \ge \theta_0}\!\theta\,h(\theta)\,d\theta - k_1\,\theta_0\Pr_h(\theta \ge \theta_0) & \text{if } H_0: \theta \ge \theta_0 \end{cases}$$
In particular, with h(θ) = q(θ | x), i.e. the posterior density, these become the posterior expected losses.

Take action $d_0$ if

$$L(d_0, q(\theta \mid x)) < L(d_1, q(\theta \mid x))$$

and take action $d_1$ if

$$L(d_0, q(\theta \mid x)) > L(d_1, q(\theta \mid x))$$

The test is inconclusive if

$$L(d_0, q(\theta \mid x)) = L(d_1, q(\theta \mid x))$$
Example
Assume we would like to test whether the proportion $\theta$ of voters supporting a certain political party is at least 4 percent (the magical threshold for entering the Swedish Parliament in general elections). An opinion poll among 200 respondents found that 6 persons said they supported this party.
History tells us that this party has, over the previous 20 years, always obtained between 4 and 8 percent of the votes. Hence it is wise to assign a beta prior for this proportion centred around 6 percent and with a range of about 5 percentage units. This can be interpreted as the mean being 6 % and the standard deviation being about 5/6 percentage units.
For a normal distribution, an interval of length 6 standard deviations covers about 99.7 % of the variation if it is placed symmetrically around the mean.
Now, for a beta distribution with shape parameters $a$ and $b$, the mean is $\mu = a/(a+b)$ and the variance is $\sigma^2 = ab/[(a+b)^2(a+b+1)]$. Solving for $a$ and $b$ gives

$$a = \mu\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right) \qquad\text{and}\qquad b = (1-\mu)\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right)$$
Hence, a reasonable prior distribution for $\theta$ may be a beta distribution with parameters

$$a = 0.06\left(\frac{0.06(1-0.06)}{(0.05/6)^2} - 1\right) \approx 49 \qquad\text{and}\qquad b = (1-0.06)\left(\frac{0.06(1-0.06)}{(0.05/6)^2} - 1\right) \approx 763$$
The posterior distribution given the data (6 out of 200) is again a beta distribution
with parameters a* = a + 6 = 55 and b* = b + 200 – 6 = 957.
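A quick numeric check of this prior elicitation and the conjugate beta-binomial update (my own sketch, not part of the lecture):

```python
# Minimal sketch: solve for the beta parameters from the stated prior mean and
# standard deviation, then apply the conjugate beta-binomial update.
mu, sd = 0.06, 0.05 / 6                # prior mean and standard deviation
common = mu * (1 - mu) / sd**2 - 1
a, b = mu * common, (1 - mu) * common
print(a, b)                            # about 48.7 and 762.5, rounded above to 49 and 763

a, b = 49, 763                         # rounded values used in the lecture
successes, n = 6, 200                  # 6 supporters among 200 respondents
a_post, b_post = a + successes, b + n - successes
print(a_post, b_post)                  # 55 and 957
```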
Here we can use a symmetric linear loss function with $k_0 = k_1 = 1$. Thus, since $H_0: \theta \ge 0.04$, we get the two expected posterior losses
L d 0 , beta (55,957 )   
  0.04
 54 1   
956
0 .4
0.04  
0
B 55,957 
 0.04  P
beta ( 55 , 957 
d 
1  0.04    
 55 1   
0 .4
956
 B55,957 
0
 54 1   
956
B 55,957 
d 
d 
B 56,957  beta ( 56 ,957 
  0.04  
  0.04   3.02 E  5
P
B 55,957 
L d1 , beta (55,957 )   
  0.04
 55 1   
956
1

0.04

Since
B 55,957 
d  0.04
1    0.04  
1

0.04
 54 1   
956
B 55,957 
 54 1   
956
B 55,957 
d 
d 
B 56,957  beta ( 56 ,957 
  0.04   0.04  P beta (55,957    0.04   0.014
P
B 55,957 
L d 0 , beta (55,957   L d1 , beta (55,957 
action d0 should be
taken, i.e. decide that the proportion of voters supporting the party is at least
4 percent.
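These two expected posterior losses are easy to check numerically; the sketch below (my own, not from the lecture) uses the cdf of scipy.stats.beta together with the identity B(56, 957)/B(55, 957) = 55/1012.

```python
# Minimal sketch verifying the expected posterior losses for beta(55, 957).
from scipy.stats import beta

a, b, theta0 = 55, 957, 0.04
m = a / (a + b)                  # B(a + 1, b) / B(a, b) = a / (a + b)

# L(d0): cost of deciding theta >= 0.04 when in fact theta < 0.04
L_d0 = theta0 * beta.cdf(theta0, a, b) - m * beta.cdf(theta0, a + 1, b)
# L(d1): cost of deciding theta < 0.04 when in fact theta >= 0.04
L_d1 = m * beta.sf(theta0, a + 1, b) - theta0 * beta.sf(theta0, a, b)

print(L_d0, L_d1)                # roughly 3e-5 and 0.014, so d0 is preferred
```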
Compare with the classical significance test:
$$z = \frac{\hat\theta - 0.04}{\sqrt{\dfrac{0.04 \cdot 0.96}{200}}} = \frac{0.03 - 0.04}{\sqrt{\dfrac{0.04 \cdot 0.96}{200}}} \approx -0.72 \qquad\qquad P\text{-value: } \Phi(-0.72) \approx 0.24$$
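The same classical calculation as a quick check (my own sketch, not part of the lecture):

```python
# Minimal sketch of the classical one-sided test based on the normal approximation.
from math import sqrt
from scipy.stats import norm

theta_hat, theta0, n = 6 / 200, 0.04, 200
z = (theta_hat - theta0) / sqrt(theta0 * (1 - theta0) / n)
print(z, norm.cdf(z))            # about -0.72 and a one-sided P-value of about 0.24
```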
Comparisons
Consider the issue of testing $H_0: \theta_1 \ge \theta_2$ (or $\theta_1 \le \theta_2$) against $H_1: \theta_1 < \theta_2$ (or $\theta_1 > \theta_2$), where $\theta_1$ and $\theta_2$ are the same type of parameter but from two different populations.

By rewriting $H_0$ as $\theta_1 - \theta_2 \ge 0$ (and correspondingly $H_1$ as $\theta_1 - \theta_2 < 0$) we have transferred the problem back to a test of the value of the compound parameter $\theta = \theta_1 - \theta_2$.
Statistical calculations for the “merging” of two populations will then be needed to sort out the problem: independent populations and independent samples give product priors and product likelihoods.
Two-sided testing
So far the hypotheses specified through parameters have been of the kind
$$H_0: \theta \in \Theta_0 \qquad\qquad H_1: \theta \in \Theta_1 \;(= \Theta \setminus \Theta_0)$$

where $\Theta$ is the parameter space.
This may be viewed as a classification problem: the issue is to choose, between two hypotheses (two models), the one that best describes the data, given prior probabilities for both hypotheses.
The specification through a parameter induces the need for integrating the
likelihood function as soon as any of the hypotheses is composite.
Very often, though, the issue is to analyse two sets of data to make inference about whether they originate from the same model.
H0: Model for data set 1 = Model for data set 2
H1: Model for data set 1 ≠ Model for data set 2
Not so much about the models themselves, but essentially about whether they are
common or not.
Examples
• Are the effects of the two treatments equal or not?
• Do the two seizures of amphetamine come from the same manufacturing
batch?
• Was the recovered shoeprint made by the suspect's shoe?
In many situations a parametric model is common for the two data sets.
The difference is then expressed in terms of one or several parameters.
H 0 : 1   2
H 1 : 1   2
Assuming independent data sets, the general likelihood function is

$$L(\theta_1, \theta_2 \mid \text{Data}) = L(\theta_1 \mid \text{Data set 1}) \cdot L(\theta_2 \mid \text{Data set 2})$$

However, the likelihood function under $H_0$ is a function of one parameter value only:

$$L(H_0 \mid \text{Data}) = L(\theta \mid \text{Data set 1}) \cdot L(\theta \mid \text{Data set 2})$$

since $H_0$ states $\theta_1 = \theta_2 = \theta$.
With prior density $p(\theta)$ for $\theta$ (common to both hypotheses) the Bayes factor becomes

$$B = \frac{\displaystyle\int L(\theta \mid \text{Data set 1})\, L(\theta \mid \text{Data set 2})\, p(\theta)\, d\theta}{\displaystyle\iint L(\theta_1 \mid \text{Data set 1})\, L(\theta_2 \mid \text{Data set 2})\, p(\theta_1)\, p(\theta_2)\, d\theta_1\, d\theta_2} = \frac{\displaystyle\int L(\theta \mid \text{Data set 1})\, L(\theta \mid \text{Data set 2})\, p(\theta)\, d\theta}{\displaystyle\int L(\theta_1 \mid \text{Data set 1})\, p(\theta_1)\, d\theta_1 \cdot \int L(\theta_2 \mid \text{Data set 2})\, p(\theta_2)\, d\theta_2}$$
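For a one-dimensional parameter this Bayes factor is straightforward to evaluate numerically: one integral in the numerator and a product of two integrals in the denominator. The sketch below is my own illustration (not part of the lecture); lik1, lik2 and prior are assumed to be ordinary Python callables for L(θ | data set 1), L(θ | data set 2) and p(θ).

```python
# Minimal sketch of the Bayes factor for H0: theta1 = theta2 vs H1: theta1 != theta2.
import numpy as np
from scipy import integrate

def bayes_factor(lik1, lik2, prior, lower=-np.inf, upper=np.inf):
    num, _ = integrate.quad(lambda t: lik1(t) * lik2(t) * prior(t), lower, upper)
    den1, _ = integrate.quad(lambda t: lik1(t) * prior(t), lower, upper)
    den2, _ = integrate.quad(lambda t: lik2(t) * prior(t), lower, upper)
    return num / (den1 * den2)
```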
Example
Assume we are comparing two seizures of cannabis with respect to their mean
concentration of THC (Tetrahydrocannabinol, the active narcotic substance).
Denote the two means $\mu_1$ and $\mu_2$, respectively.
The concentration for the type of material in question (typically inflorescences) usually varies between 5 and 20 % with a peak around 12 %. We therefore assign a prior density for the mean concentration of any material of this type as

$$\mu \sim N\!\left(\text{mean} = 12,\ \text{standard deviation} = \frac{20 - 5}{6} = 2.5\right)\ (\%)$$

For symmetric, close-to-normal distributions a reasonable estimate of the standard deviation is Range/6, since the range is almost covered by the mean $\pm 3$ standard deviations for a normal distribution.
To simplify formulas, denote the prior $N(\theta, \tau)$ where $\theta = 12$ and $\tau$, the standard deviation, is 2.5:

$$p(\mu) = \frac{1}{\tau\sqrt{2\pi}} \exp\!\left(-\frac{(\mu - \theta)^2}{2\tau^2}\right)$$
Now, let's say we have a method of measurement that gives a value $x$ which is normally distributed, $x \sim N(\mu, \sigma)$, where $\mu$ is the true mean concentration and $\sigma$ is the standard deviation. Let's further assume that the method has been validated to provide a standard deviation of about 0.1 percentage points, so that $x \sim N(\mu, 0.1)$.
For $n_1$ measurements $x_{11}, \ldots, x_{1n_1}$ on the first seizure we obtain the likelihood function

$$L(\mu_1 \mid x_{11}, \ldots, x_{1n_1}) = \prod_{i=1}^{n_1} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x_{1i} - \mu_1)^2}{2\sigma^2}\right) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n_1} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n_1}(x_{1i} - \mu_1)^2\right)$$

This can be shown to be

$$L(\mu_1 \mid x_{11}, \ldots, x_{1n_1}) = g_1(x_{11}, \ldots, x_{1n_1}, \sigma) \cdot \exp\!\left(-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2\right) = g_1 \cdot \exp\!\left(-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2\right)$$

where the function $g_1$ does not depend on $\mu_1$.
Analogously, for $n_2$ measurements $x_{21}, \ldots, x_{2n_2}$ on the second seizure we obtain the likelihood function

$$L(\mu_2 \mid x_{21}, \ldots, x_{2n_2}) = g_2(x_{21}, \ldots, x_{2n_2}, \sigma) \cdot \exp\!\left(-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu_2)^2\right) = g_2 \cdot \exp\!\left(-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu_2)^2\right)$$
Hence, the Bayes factor is

$$B = \frac{\displaystyle\int g_1 e^{-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu)^2}\, g_2 e^{-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu)^2}\, \frac{1}{\tau\sqrt{2\pi}} e^{-\frac{(\mu - \theta)^2}{2\tau^2}}\, d\mu}{\displaystyle\int g_1 e^{-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2}\, \frac{1}{\tau\sqrt{2\pi}} e^{-\frac{(\mu_1 - \theta)^2}{2\tau^2}}\, d\mu_1 \cdot \int g_2 e^{-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu_2)^2}\, \frac{1}{\tau\sqrt{2\pi}} e^{-\frac{(\mu_2 - \theta)^2}{2\tau^2}}\, d\mu_2}$$

$$= \frac{\displaystyle\int e^{-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu)^2}\, e^{-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu)^2}\, e^{-\frac{(\mu - \theta)^2}{2\tau^2}}\, d\mu}{\dfrac{1}{\tau\sqrt{2\pi}}\displaystyle\int e^{-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2}\, e^{-\frac{(\mu_1 - \theta)^2}{2\tau^2}}\, d\mu_1 \cdot \int e^{-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu_2)^2}\, e^{-\frac{(\mu_2 - \theta)^2}{2\tau^2}}\, d\mu_2}$$

A little bit tedious to sort out.
Simplification when comparing two normal means
Express the hypotheses as
H 0 : 1   2    0
H 1 : 1   2    0


2
 n
L 1 x11 ,..., x1n1  g1  exp  1 2 x1  1  
 2

Above was shown that
If we “reduce” the sample $x_{11}, \ldots, x_{1n_1}$ to its sample mean $\bar x_1$, the likelihood function for $\mu_1$ becomes

$$L(\mu_1 \mid \bar x_1) = \frac{1}{\dfrac{\sigma}{\sqrt{n_1}}\sqrt{2\pi}} \exp\!\left(-\frac{(\bar x_1 - \mu_1)^2}{2\sigma^2/n_1}\right)$$

since

$$\bar x_1 \sim N\!\left(\text{mean} = \mu_1,\ \text{standard deviation} = \frac{\sigma}{\sqrt{n_1}}\right)$$
Note that

$$L(\mu_1 \mid x_{11}, \ldots, x_{1n_1}) \propto \exp\!\left(-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2\right)$$

and

$$L(\mu_1 \mid \bar x_1) \propto \exp\!\left(-\frac{(\bar x_1 - \mu_1)^2}{2\sigma^2/n_1}\right) = \exp\!\left(-\frac{n_1}{2\sigma^2}(\bar x_1 - \mu_1)^2\right)$$

Both likelihood functions are proportional to a common essential part, i.e. the part that contains $\mu_1$. This is because the sample mean is a sufficient statistic for the population mean.
Analogously,

$$L(\mu_2 \mid \bar x_2) \propto \exp\!\left(-\frac{n_2}{2\sigma^2}(\bar x_2 - \mu_2)^2\right)$$
Now, for independent samples,

$$\bar x_1 - \bar x_2 \sim N\!\left(\mu_1 - \mu_2,\ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right)$$

With $\mu_1 - \mu_2 = \Delta$ and $\sigma_1 = \sigma_2 = \sigma$ we get

$$\bar x_1 - \bar x_2 \sim N\!\left(\Delta,\ \sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right)$$
$\bar x_1 - \bar x_2$ is a sufficient statistic for $\mu_1 - \mu_2 = \Delta$, and “reducing” the two data sets to $\bar x_1 - \bar x_2$ gives the likelihood function for $\Delta$ as

$$L(\Delta \mid \bar x_1 - \bar x_2) = \frac{1}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}\,\sqrt{2\pi}} \exp\!\left(-\frac{(\bar x_1 - \bar x_2 - \Delta)^2}{2\sigma^2\!\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}\right)$$
The Bayes factor becomes
$$B = \frac{L(\Delta = 0 \mid \bar x_1 - \bar x_2)}{\displaystyle\int L(\Delta \mid \bar x_1 - \bar x_2)\, p(\Delta)\, d\Delta} = \frac{\dfrac{1}{\sigma\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}\,\sqrt{2\pi}} \exp\!\left(-\dfrac{(\bar x_1 - \bar x_2)^2}{2\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)}\right)}{\displaystyle\int \dfrac{1}{\sigma\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}\,\sqrt{2\pi}} \exp\!\left(-\dfrac{(\bar x_1 - \bar x_2 - \Delta)^2}{2\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)}\right) \cdot \dfrac{1}{2\tau\sqrt{\pi}} \exp\!\left(-\dfrac{\Delta^2}{4\tau^2}\right) d\Delta}$$

$$= \frac{\exp\!\left(-\dfrac{(\bar x_1 - \bar x_2)^2}{2\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)}\right)}{\displaystyle\int \exp\!\left(-\dfrac{(\bar x_1 - \bar x_2 - \Delta)^2}{2\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)}\right) \dfrac{1}{2\tau\sqrt{\pi}} \exp\!\left(-\dfrac{\Delta^2}{4\tau^2}\right) d\Delta}$$

Here $p(\Delta)$ is the prior density of $\Delta = \mu_1 - \mu_2$ under $H_1$: since $\mu_1$ and $\mu_2$ are a priori independent $N(\theta, \tau)$, we have $\Delta \sim N(0, \tau\sqrt{2})$, i.e. $p(\Delta) = \dfrac{1}{2\tau\sqrt{\pi}} \exp\!\left(-\dfrac{\Delta^2}{4\tau^2}\right)$.
A little bit easier to sort out. It can be proven to be

$$B = \sqrt{1 + \frac{2\tau^2\, n_1 n_2}{\sigma^2 (n_1 + n_2)}}\ \exp\!\left\{-\frac{(\bar x_1 - \bar x_2)^2}{2}\left[\frac{1}{\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)} - \frac{1}{\sigma^2\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right) + 2\tau^2}\right]\right\}$$
Inserting the values of $\sigma$ (0.1) and $\tau$ (2.5) we obtain

$$B = \sqrt{1 + \frac{12.5\, n_1 n_2}{0.01\, (n_1 + n_2)}}\ \exp\!\left\{-\frac{(\bar x_1 - \bar x_2)^2}{2}\left[\frac{1}{0.01\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)} - \frac{1}{0.01\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right) + 12.5}\right]\right\}$$
Let’s say we have obtained the sample means 18.1 % and 18.2 % and that our prior
odds for the two hypotheses are 1 (non-informative prior).
  n1     n2       B      Posterior odds   Classical P-value
   2      2     21.46        21.46             0.3173
   2      3     21.27        21.27             0.2733
   5      5     16.03        16.03             0.1138
   5      7     14.05        14.05             0.0877
  10     10      6.49         6.49             0.0253
  30     30      0.08         0.08             0.0001
 100    100      0.00         0.00             0.0000
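The table can be reproduced with a few lines of code. The sketch below (my own, not part of the lecture) implements the closed-form Bayes factor with σ = 0.1 and τ = 2.5 together with the classical two-sided test of equal means with known σ; the function names bayes_factor_normal and classical_p_value are hypothetical.

```python
# Minimal sketch reproducing the table: Bayes factor, posterior odds (prior odds 1)
# and classical two-sided P-value for sample means 18.1 and 18.2.
from math import exp, sqrt
from scipy.stats import norm

def bayes_factor_normal(n1, n2, x1bar, x2bar, sigma=0.1, tau=2.5):
    v = sigma**2 * (1 / n1 + 1 / n2)       # variance of x1bar - x2bar for given Delta
    scale = sqrt(1 + 2 * tau**2 * n1 * n2 / (sigma**2 * (n1 + n2)))
    return scale * exp(-(x1bar - x2bar)**2 / 2 * (1 / v - 1 / (v + 2 * tau**2)))

def classical_p_value(n1, n2, x1bar, x2bar, sigma=0.1):
    z = (x1bar - x2bar) / (sigma * sqrt(1 / n1 + 1 / n2))
    return 2 * norm.sf(abs(z))             # two-sided P-value, sigma assumed known

for n1, n2 in [(2, 2), (2, 3), (5, 5), (5, 7), (10, 10), (30, 30), (100, 100)]:
    B = bayes_factor_normal(n1, n2, 18.1, 18.2)
    # with prior odds 1 the posterior odds equal the Bayes factor B
    print(n1, n2, round(B, 2), round(classical_p_value(n1, n2, 18.1, 18.2), 4))
```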