Notes 16 - Wharton Statistics Department


Statistics 550 Notes 16

Reading: Sections 4.1-4.2

I. Hypothesis Testing Motivation and Framework

Motivating Example: A graphologist claims to be able to distinguish the writing of a schizophrenic person from a nonschizophrenic person. The graphologist is given a set of 10 folders, each containing handwriting samples of two persons, one nonschizophrenic and the other schizophrenic.

Her task is to identify which of the writings are the work of the schizophrenics. When this experiment was actually performed, the graphologist made 6 correct identifications (Pascal and Suttell, 1947, Journal of Personality). Is there strong evidence that the graphologist can identify the writing of schizophrenics better than a person who was randomly guessing would?

Probability model: Let $p$ be the probability that the graphologist successfully identifies the writing of a randomly chosen schizophrenic vs. nonschizophrenic person. A reasonable model is that the 10 trials are iid Bernoulli with probability of success $p$.

Hypotheses: A randomly guessing person would have probability $p = 0.5$ of correctly identifying a schizophrenic's writing. The graphologist is claiming that she has $p > 0.5$.


General setup of hypothesis testing problems

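Model: We observe data $X$ from a distribution $P(X \mid \theta)$, $\theta \in \Theta$, where $\theta$ is unknown.

Goal: Decide between two hypotheses about a parameter of interest $\theta$:

$$H_0: \theta \in \Theta_0 \qquad \text{vs.} \qquad H_1: \theta \in \Theta_1,$$

where $\Theta_0$ and $\Theta_1$ are disjoint subsets of $\Theta$. (Bickel and Doksum denote "$H_1$" by "$K$" instead; $H_1$ is also sometimes denoted by $H_a$.)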

For the motivating example, $H_0: p = 0.5$ and $H_1: p > 0.5$ (we assume that $p$ cannot be less than 0.5).

Null vs. alternative hypothesis: The alternative hypothesis should be the hypothesis that we're trying to establish (the alternative hypothesis is sometimes called the research hypothesis). The null hypothesis should be the hypothesis that we will retain if there's not strong evidence in favor of the alternative.

Courtroom analogy: $H_0$: defendant is innocent, $H_1$: defendant is guilty.

Test statistic: On the basis of the data, we will either decide that there is strong evidence in favor of the alternative hypothesis (so that we can reject the null hypothesis) or that there is not strong evidence in favor of the alternative hypothesis, in which case we will retain the null hypothesis (also called "accepting the null" or "failing to reject the null"). To make this decision, we use a test statistic that is a function of the data. If our sample is $X_1, \ldots, X_n$, our test statistic is a function $W(X_1, \ldots, X_n)$.

Bickel and Doksum adopt the convention that a test statistic "tends" to be small if $H_0$ is true and large if $H_1$ is true.

In the motivating example, a natural test statistic is $W(X_1, \ldots, X_{10}) = \sum_{i=1}^{10} X_i$, the number of successful identifications of schizophrenics the graphologist makes.

The observed value of this test statistic is 6.

Critical region: The set $C$ of values of the test statistic for which we reject the null hypothesis. For example, if $C = \{6, 7, 8, 9, 10\}$, we would reject the null hypothesis if the graphologist successfully identified 6 or more schizophrenics in the 10 trials (we would thus reject the null hypothesis for the actual data).

Errors in Hypothesis Testing:

| Decision | True state of nature: $H_0$ is true | True state of nature: $H_1$ is true |
|---|---|---|
| Reject $H_0$ | Type I error | Correct decision |
| Accept (retain) $H_0$ | Correct decision | Type II error |

The best critical region would make the probability of a Type I error small when $H_0$ is true and the probability of a Type II error small when $H_1$ is true. But in general there is a tradeoff between these two types of errors.

Size of a test: We say that a test with test statistic $W(X_1, \ldots, X_n)$ and critical region $C$ is of size $\alpha$ if

$$\alpha = \max_{\theta \in \Theta_0} P_\theta[W(X_1, \ldots, X_n) \in C].$$

The size of the test is the maximum probability of making a Type I error when the null hypothesis is true (where the maximum is taken over all $\theta$ that are part of the null hypothesis; for the motivating example, the null hypothesis only has one value of $\theta$ in it). A test with size $\alpha$ is said to be a test of level (of significance) $\alpha$; more generally, any test whose size is at most $\alpha$ is a level $\alpha$ test.

Suppose we use the critical region $C = \{6, 7, 8, 9, 10\}$ with the test statistic $W(X_1, \ldots, X_{10}) = \sum_{i=1}^{10} X_i$ for the motivating example. The size of the test is $P_{p=0.5}(Y \ge 6) = 0.377$, where $Y$ has a binomial distribution with n=10 and probability p=0.5.
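The size calculation can be checked numerically. The following is a minimal sketch in Python using only the standard library; the helper names `binom_pmf` and `rejection_prob` are illustrative and not part of the notes.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def rejection_prob(region, n: int, p: float) -> float:
    """P_p(W in region): chance the test statistic lands in the critical region."""
    return sum(binom_pmf(k, n, p) for k in region)


# The null hypothesis contains the single value p = 0.5, so the size is
# just the rejection probability at p = 0.5.
size = rejection_prob(range(6, 11), n=10, p=0.5)
print(round(size, 3))  # 0.377
```

Because the null hypothesis here is simple, no maximization over the null parameter set is needed; for a composite null one would evaluate `rejection_prob` over all null values and take the largest.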

Power: The power of a test at an alternative $\theta \in \Theta_1$ is the probability of making a correct decision when $\theta$ is the true parameter (i.e., the probability of not making a Type II error when $\theta$ is the true parameter).

The power of the test with test statistic $W(X_1, \ldots, X_{10}) = \sum_{i=1}^{10} X_i$ and critical region $C = \{6, 7, 8, 9, 10\}$ at p=0.6 is $P_{p=0.6}(Y \ge 6) = 0.633$ and at p=0.7 is $P_{p=0.7}(Y \ge 6) = 0.850$, where $Y$ has a binomial distribution with n=10 and probability p. The power depends on the specific parameter in the alternative hypothesis that is being considered.
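To see how the power varies with the alternative, one can tabulate the rejection probability over a grid of values of p. This is a small sketch with an illustrative function name `power`; it should reproduce the values 0.633 and 0.850 above up to rounding.

```python
from math import comb


def power(p: float, cutoff: int = 6, n: int = 10) -> float:
    """P_p(Y >= cutoff) for Y ~ Binomial(n, p): the chance of rejecting H0."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(cutoff, n + 1))


# Rejection probability of the region {6, ..., 10} at several values of p.
# At p = 0.5 this is the size; at p > 0.5 it is the power.
for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"p = {p:.1f}: rejection probability = {power(p):.3f}")
```

The rejection probability increases with p, which matches the intuition that a more skilled graphologist is easier to detect.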

Power function: $\beta_C(\theta) = P_\theta[W(X_1, \ldots, X_n) \in C]$; $\theta \in \Theta_1$.

Neyman-Pearson paradigm: Set the size of the test to be at most some small level, typically 0.10, 0.05 or 0.01 (most commonly 0.05) in order to protect against Type I errors.

Then among tests that have this size, choose the one that has the “best” power function. Below, we will define more precisely what we mean by “best” power function and derive optimal tests for certain situations.

For the test statistic $W(X_1, \ldots, X_{10}) = \sum_{i=1}^{10} X_i$, the critical region $C = \{6, 7, 8, 9, 10\}$ has a size of 0.377; this gives too high a probability of Type I error. The critical region $C = \{8, 9, 10\}$ has a size of 0.0547, which makes the probability of a Type I error reasonably small. Using $C = \{8, 9, 10\}$, we retain the null hypothesis for the actual experiment, for which $W$ was equal to 6.

P-value: For a test statistic $W(X_1, \ldots, X_n)$, consider a family of critical regions $\{C_\gamma : \gamma \in \Gamma\}$, each with a different size. For the observed value of the test statistic $W_{obs}$ from the sample, consider the subset of critical regions for which we would reject the null hypothesis, $\{C_\gamma : W_{obs} \in C_\gamma\}$. The p-value is the minimum size of the tests in this subset:

$$\text{p-value} = \min_{\{C_\gamma :\, W_{obs} \in C_\gamma\}} \text{Size(test with critical region } C_\gamma).$$

The p-value is a measure of how much evidence there is against the null hypothesis; it is the minimum significance level for which we would still reject the null hypothesis.

Consider the family of critical regions $C_i = \{\sum_{j=1}^{10} X_j \ge i\}$ for the motivating example. Since the graphologist made 6 correct identifications, we reject the null hypothesis for the critical regions $C_i$ with $i \le 6$. The minimum size among the critical regions $C_i$, $i \le 6$, is attained at i=6 and equals 0.377. The p-value is thus 0.377.
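The definition of the p-value as a minimum over rejecting critical regions can be traced step by step for the graphologist data. The sketch below assumes the family of regions "reject when the count is at least i" for i = 0, ..., 10; the names `upper_tail`, `sizes`, and `p_value` are illustrative.

```python
from math import comb


def upper_tail(i: int, n: int = 10, p0: float = 0.5) -> float:
    """Size of the region {W >= i} under the null value p0."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(i, n + 1))


w_obs = 6  # observed number of correct identifications

# Regions {W >= i} that contain the observed value are those with i <= w_obs.
sizes = {i: upper_tail(i) for i in range(0, 11) if i <= w_obs}

# The p-value is the smallest size among the regions that reject.
p_value = min(sizes.values())
print(round(p_value, 3))  # 0.377
```

For this nested family the minimum is attained at the region whose boundary is the observed value, so the p-value coincides with the upper tail probability at the observed count.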

Scale of evidence:

| p-value | Evidence |
|---|---|
| <0.01 | Very strong evidence against the null hypothesis |
| 0.01-0.05 | Strong evidence against the null hypothesis |
| 0.05-0.10 | Weak evidence against the null hypothesis |
| >0.1 | Little or no evidence against the null hypothesis |

Warnings:

(1) A large p-value is not strong evidence in favor of $H_0$. A large p-value can occur for two reasons: (i) $H_0$ is true or (ii) $H_0$ is false but the test has low power at the true alternative.

(2) Do not confuse the p-value with $P(H_0 \mid \text{Data})$. The p-value is not the probability that the null hypothesis is true.

II. Testing simple versus simple hypotheses: Bayes procedures

Consider testing a simple null hypothesis $H_0: \theta = \theta_0$ versus a simple alternative hypothesis $H_1: \theta = \theta_1$, i.e., under the null hypothesis $X \sim P(X \mid \theta_0) \equiv P_0$ and under the alternative hypothesis $X \sim P(X \mid \theta_1) \equiv P_1$.

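Example 1: $X_1, \ldots, X_n$ iid $N(\theta, 1)$. $H_0: \theta = 0$, $H_1: \theta = 1$.

Example 2: $X$ has one of the following two distributions:

| $x$ | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| $P_0(X = x)$ | 0.1 | 0.1 | 0.1 | 0.2 | 0.5 |
| $P_1(X = x)$ | 0.3 | 0.3 | 0.2 | 0.1 | 0.1 |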

Bayes procedures: Consider 0-1 loss, i.e., the loss is 1 if we choose the incorrect hypothesis and 0 if we choose the correct hypothesis. Let the prior probabilities be $\pi(\theta_0) = \pi$ on $\theta_0$ and $\pi(\theta_1) = 1 - \pi$ on $\theta_1$. The posterior probability for $\theta_0$ is

$$P(\theta_0 \mid X) = \frac{P(X \mid \theta_0)\,\pi(\theta_0)}{P(X \mid \theta_0)\,\pi(\theta_0) + P(X \mid \theta_1)\,\pi(\theta_1)}.$$

The posterior risk of choosing hypothesis $a$, i.e., of deciding $\theta = \theta_a$ after observing $X = x$, is $P(\theta \ne \theta_a \mid X)$.

The posterior risk for 0-1 loss is minimized by choosing the hypothesis with higher posterior probability.

Thus, the Bayes rule is to choose $H_0$ (equivalently $\theta_0$) if

$$P(X \mid \theta_0)\,\pi(\theta_0) > P(X \mid \theta_1)\,\pi(\theta_1)$$

and choose $H_1$ (equivalently $\theta_1$) otherwise.

For $\pi(\theta_0) = \pi(\theta_1) = \frac{1}{2}$, the Bayes rule is choose $H_0$ if $P(X \mid \theta_0) > P(X \mid \theta_1)$ and choose $H_1$ otherwise.

Note that the Bayes risk for the prior $\pi(\theta_0) = \pi(\theta_1) = \frac{1}{2}$ of a test is 0.5*P(Type I error)+0.5*P(Type II error). Thus, the Bayes procedure for the prior $\pi(\theta_0) = \pi(\theta_1) = \frac{1}{2}$ minimizes the sum of the probability of a Type I error and the probability of a Type II error.

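Example 1 continued: Suppose $(X_1, \ldots, X_5) = (1.1064, 1.1568, -0.1602, 1.0343, -0.1079)$, so that $\bar{X} = 0.6059$. Then

$$P(X \mid \theta = 0) = \left(\frac{1}{\sqrt{2\pi}}\right)^5 \exp\left(-\sum_{i=1}^{5} \frac{(X_i - 0)^2}{2}\right) = 0.001613,$$

$$P(X \mid \theta = 1) = \left(\frac{1}{\sqrt{2\pi}}\right)^5 \exp\left(-\sum_{i=1}^{5} \frac{(X_i - 1)^2}{2}\right) = 0.002739.$$

We choose $H_0: \theta = 0$ if $P(X \mid \theta = 0)\,\pi(0) > P(X \mid \theta = 1)\,\pi(1)$, or equivalently

$$\frac{P(X \mid \theta = 0)\,\pi(0)}{P(X \mid \theta = 1)\,\pi(1)} > 1,$$

which for this data is

$$\frac{0.001613\,\pi(0)}{0.002739\,\pi(1)} > 1.$$

Writing $\pi(1) = 1 - \pi(0)$, we have that we choose $H_0: \theta = 0$ for priors with $\pi(0) > 0.6294$ and $H_1: \theta = 1$ for priors with $\pi(0) < 0.6294$.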
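The two likelihoods and the prior cutoff in Example 1 can be recomputed directly from the five data values. The sketch below is illustrative (the names `normal_likelihood`, `prior_cutoff`, and `bayes_choice` are not from the notes), and since the data are given to four decimals the results agree with the figures above only up to rounding.

```python
from math import exp, pi, sqrt

x = [1.1064, 1.1568, -0.1602, 1.0343, -0.1079]


def normal_likelihood(data, theta: float) -> float:
    """Joint density of iid N(theta, 1) observations evaluated at `data`."""
    n = len(data)
    return (1 / sqrt(2 * pi)) ** n * exp(-sum((xi - theta) ** 2 for xi in data) / 2)


lik0 = normal_likelihood(x, 0.0)  # about 0.0016
lik1 = normal_likelihood(x, 1.0)  # about 0.0027

# H0 is chosen when prior0 * lik0 > (1 - prior0) * lik1,
# i.e. when prior0 exceeds lik1 / (lik0 + lik1).
prior_cutoff = lik1 / (lik0 + lik1)  # about 0.629


def bayes_choice(prior0: float) -> str:
    """Bayes rule under 0-1 loss for prior probability prior0 on theta = 0."""
    return "H0" if prior0 * lik0 > (1 - prior0) * lik1 else "H1"


print(lik0, lik1, prior_cutoff)
print(bayes_choice(0.5), bayes_choice(0.7))
```

With equal prior weights the rule picks the hypothesis with the larger likelihood, which is $H_1$ here; a prior that puts enough weight on $\theta = 0$ reverses the decision.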

III. Neyman-Pearson Lemma (Section 4.2)

In the Neyman-Pearson paradigm, the hypotheses are not treated symmetrically: Fix $0 < \alpha < 1$. Among tests having level (Type I error probability) $\alpha$, find the one that has the "best" power function.

For simple vs. simple hypotheses, the best power function means the best power at $H_1: \theta = \theta_1$. Such a test is called the most powerful level $\alpha$ test.

The Neyman-Pearson lemma provides us with a most powerful level $\alpha$ test for simple vs. simple hypotheses.

Analogy: To fill up a bookshelf with books with the least cost, we should start by picking the one with the largest width/$ and continue. Similarly, to find a most powerful level $\alpha$ test, we should start by including in the critical region those sample points that are most likely under the alternative relative to the null hypothesis and continue.

Define the likelihood ratio statistic by

$$L(X, \theta_0, \theta_1) = \frac{p(X \mid \theta_1)}{p(X \mid \theta_0)},$$

where $p(x \mid \theta)$ is the probability mass function or probability density function of the data $X$.

The statistic $L$ takes on the value $\infty$ when $p(X \mid \theta_1) > 0$ and $p(X \mid \theta_0) = 0$, and by convention equals 0 when both $p(X \mid \theta_1) = 0$ and $p(X \mid \theta_0) = 0$.

We can describe a test by a test function $\phi(x)$. When $\phi(x) = 1$, we always reject $H_0$. When $\phi(x) = c$, $0 < c < 1$, we conduct a Bernoulli trial and reject with probability $c$ (thus we allow for randomized tests). When $\phi(x) = 0$, we always accept $H_0$.

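We call $\phi_k$ a likelihood ratio test if

$$\phi_k(x) = \begin{cases} 1 & \text{if } L(x, \theta_0, \theta_1) > k \\ c & \text{if } L(x, \theta_0, \theta_1) = k \\ 0 & \text{if } L(x, \theta_0, \theta_1) < k \end{cases}$$

where $0 \le c \le 1$.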

Theorem 4.2.1 (Neyman-Pearson Lemma): Consider testing $H_0: \theta = \theta_0$ (with $P(X \mid \theta_0) \equiv P_0$) vs. $H_1: \theta = \theta_1$ (with $P(X \mid \theta_1) \equiv P_1$).

(a) If $\alpha > 0$ and $\phi_k$ is a size $\alpha$ likelihood ratio test, then $\phi_k$ is a most powerful level $\alpha$ test.

(b) For each $0 \le \alpha \le 1$, there exists a most powerful size $\alpha$ likelihood ratio test.

(c) If $\phi$ is a most powerful level $\alpha$ test, then it must be a likelihood ratio test except perhaps on a set $A$ satisfying $P_0(X \in A) = P_1(X \in A) = 0$.

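Example 1: $X_1, \ldots, X_n$ iid $N(\theta, 1)$. $H_0: \theta = 0$, $H_1: \theta = 1$.

The likelihood ratio statistic is

$$L(X, \theta_0, \theta_1) = \frac{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x_i - 1)^2\right)}{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x_i^2\right)} = \exp\left(\sum_{i=1}^{n} \left(x_i - \frac{1}{2}\right)\right) = \exp\left(\sum_{i=1}^{n} x_i - n/2\right).$$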

Rejecting the null hypothesis for large values of $L(X, \theta_0, \theta_1)$ is equivalent to rejecting the null hypothesis for large values of $\sum_{i=1}^{n} X_i$.
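As a numerical sanity check on this simplification, the ratio of the two joint normal densities can be compared with the closed form $\exp(\sum_i x_i - n/2)$. The sketch below reuses the five data values from the Bayes example purely for illustration; the function names are hypothetical.

```python
from math import exp, pi, prod, sqrt

x = [1.1064, 1.1568, -0.1602, 1.0343, -0.1079]


def normal_pdf(value: float, mean: float) -> float:
    """Density of N(mean, 1) at `value`."""
    return exp(-((value - mean) ** 2) / 2) / sqrt(2 * pi)


def likelihood_ratio(data) -> float:
    """Joint density under theta = 1 divided by joint density under theta = 0."""
    numerator = prod(normal_pdf(xi, 1.0) for xi in data)
    denominator = prod(normal_pdf(xi, 0.0) for xi in data)
    return numerator / denominator


def likelihood_ratio_closed_form(data) -> float:
    """exp(sum of the data minus n/2), the simplified form of the ratio."""
    return exp(sum(data) - len(data) / 2)


print(likelihood_ratio(x), likelihood_ratio_closed_form(x))
```

Because the closed form is an increasing function of the sum, ordering samples by the likelihood ratio is the same as ordering them by the sum of the observations.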

What should the cutoff be? The distribution of $\sum_{i=1}^{n} X_i$ under the null hypothesis is $N(0, n)$, so the most powerful level $\alpha$ test rejects for

$$\frac{\sum_{i=1}^{n} X_i}{\sqrt{n}} \ge \Phi^{-1}(1 - \alpha),$$

where $\Phi$ is the CDF of a standard normal.

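Example 2:

| $x$ | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| $P_0(X = x)$ | 0.1 | 0.1 | 0.1 | 0.2 | 0.5 |
| $P_1(X = x)$ | 0.3 | 0.3 | 0.2 | 0.1 | 0.1 |
| $L(x, \theta_0, \theta_1)$ | 3 | 3 | 2 | 0.5 | 0.2 |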

The most powerful level 0.2 test rejects if and only if X =0 or 1.

There are multiple most powerful level 0.1 tests, e.g., 1) reject the null hypothesis if and only if X=0; 2) reject the null hypothesis if and only if X=1; 3) when X=0 or X=1, flip a fair coin to decide whether to reject the null hypothesis, and otherwise do not reject.
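The claims about Example 2 can be verified by computing the size and power of each candidate test from the two probability tables. In this sketch a test is represented by a dictionary mapping each value of X to its rejection probability (values not listed are never rejected); this representation and the names used are illustrative assumptions.

```python
p0 = {0: 0.1, 1: 0.1, 2: 0.1, 3: 0.2, 4: 0.5}  # distribution under H0
p1 = {0: 0.3, 1: 0.3, 2: 0.2, 3: 0.1, 4: 0.1}  # distribution under H1

# Likelihood ratio at each sample point: approximately 3, 3, 2, 0.5, 0.2.
lr = {x: p1[x] / p0[x] for x in p0}


def size_and_power(phi: dict) -> tuple:
    """Rejection probability of test function `phi` under H0 and under H1."""
    size = sum(phi.get(x, 0.0) * p0[x] for x in p0)
    power = sum(phi.get(x, 0.0) * p1[x] for x in p0)
    return size, power


tests = {
    "reject iff X = 0": {0: 1.0},
    "reject iff X = 1": {1: 1.0},
    "fair coin when X in {0, 1}": {0: 0.5, 1: 0.5},
    "reject iff X in {0, 1}": {0: 1.0, 1: 1.0},
}

for name, phi in tests.items():
    size, power = size_and_power(phi)
    print(f"{name}: size = {size:.2f}, power = {power:.2f}")
```

The first three tests all have size 0.1 and power 0.3, which illustrates why the most powerful level 0.1 test is not unique: the points X=0 and X=1 share the same likelihood ratio, so rejection probability can be moved between them freely.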

Proof of Neyman-Pearson Lemma:


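(a) We prove the result here for continuous random variables $X$. The proof for discrete random variables follows by replacing integrals with sums.

Let $\phi^*$ be the test function of any other level $\alpha$ test besides $\phi_k$. Because $\phi^*$ is level $\alpha$, $E_{P_0}\phi^*(X) \le \alpha$. We want to show that

$$\int \phi_k(x)\,dP_1(x) \ge \int \phi^*(x)\,dP_1(x).$$

We examine $\int (\phi_k(x) - \phi^*(x))(p_1(x) - k p_0(x))\,dx$ and show that

$$\int (\phi_k(x) - \phi^*(x))(p_1(x) - k p_0(x))\,dx \ge 0.$$

From this, we conclude that

$$\int (\phi_k(x) - \phi^*(x))\,p_1(x)\,dx \ge k \int (\phi_k(x) - \phi^*(x))\,p_0(x)\,dx \ge 0,$$

where the last inequality holds because $\int \phi_k(x)\,p_0(x)\,dx = \alpha$, $\int \phi^*(x)\,p_0(x)\,dx \le \alpha$, and $k \ge 0$. Hence, we conclude that $\int (\phi_k(x) - \phi^*(x))\,p_1(x)\,dx \ge 0$ as desired.

To show that $\int (\phi_k(x) - \phi^*(x))(p_1(x) - k p_0(x))\,dx \ge 0$, let

$$S^+ = \{x : \phi_k(x) - \phi^*(x) > 0\}, \quad S^- = \{x : \phi_k(x) - \phi^*(x) < 0\}, \quad S^0 = \{x : \phi_k(x) = \phi^*(x)\}.$$

Suppose $x \in S^+$. This implies $\phi_k(x) > 0$, which implies that $p_1(x) \ge k p_0(x)$. Thus,

$$\int_{S^+} (\phi_k(x) - \phi^*(x))(p_1(x) - k p_0(x))\,dx \ge 0.$$

Similarly, if $x \in S^-$ then $\phi_k(x) < 1$, which implies $p_1(x) \le k p_0(x)$, so that

$$\int_{S^-} (\phi_k(x) - \phi^*(x))(p_1(x) - k p_0(x))\,dx \ge 0,$$

and

$$\int_{S^0} (\phi_k(x) - \phi^*(x))(p_1(x) - k p_0(x))\,dx = 0$$

(since $\phi_k(x) - \phi^*(x) = 0$ for $x \in S^0$).

Thus, $\int (\phi_k(x) - \phi^*(x))(p_1(x) - k p_0(x))\,dx \ge 0$, and this shows that $\int (\phi_k(x) - \phi^*(x))\,p_1(x)\,dx \ge 0$ as argued above.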

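(b) Let $\alpha(c) = P_0[p_1(X) > c\,p_0(X)] = 1 - F_0(c)$, where $F_0$ is the cdf of $p_1(X)/p_0(X)$ under $P_0$. By the properties of CDFs, $\alpha(c)$ is nonincreasing in $c$ and right continuous.

By the right continuity of $\alpha(c)$, there exists $c_0$ such that $\alpha(c_0) \le \alpha \le \alpha(c_0^-)$, where $\alpha(c_0^-)$ denotes the left limit of $\alpha(\cdot)$ at $c_0$. So define

$$\phi(x) = \begin{cases} 1 & \text{if } p_1(x)/p_0(x) > c_0 \\ \dfrac{\alpha - \alpha(c_0)}{\alpha(c_0^-) - \alpha(c_0)} & \text{if } p_1(x)/p_0(x) = c_0 \\ 0 & \text{if } p_1(x)/p_0(x) < c_0 \end{cases}$$

(if $\alpha(c_0^-) = \alpha(c_0)$, the middle event has $P_0$-probability zero and $\phi$ may be set to 0 there). Then,

$$E_{P_0}\phi(X) = P_0\left(\frac{p_1}{p_0} > c_0\right) + \frac{\alpha - \alpha(c_0)}{\alpha(c_0^-) - \alpha(c_0)}\,P_0\left(\frac{p_1}{p_0} = c_0\right) = \alpha(c_0) + \frac{\alpha - \alpha(c_0)}{\alpha(c_0^-) - \alpha(c_0)}\,\bigl(\alpha(c_0^-) - \alpha(c_0)\bigr) = \alpha.$$

So we can take $k$ to be $c_0$.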
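The construction in part (b) can be made concrete for the graphologist example, where the only sizes available to non-randomized tests of the form "reject when the count is large" are 0.0107, 0.0547, and so on. The sketch below relies on the fact that for any alternative p > 0.5 the binomial likelihood ratio is increasing in the count W, so the likelihood ratio test can be expressed through W; the function name `randomized_cutoff` and the target level 0.05 are illustrative choices.

```python
from math import comb


def pmf(w: int, n: int = 10, p: float = 0.5) -> float:
    """P(W = w) for W ~ Binomial(n, p)."""
    return comb(n, w) * p**w * (1 - p) ** (n - w)


def randomized_cutoff(alpha: float, n: int = 10, p0: float = 0.5):
    """Return (w0, gamma): reject if W > w0, reject with probability gamma if W == w0."""
    tail = 0.0  # running value of P(W > w) under the null
    for w in range(n, -1, -1):
        mass = pmf(w, n, p0)
        if tail + mass > alpha:
            # Rejecting outright at w would overshoot alpha, so randomize here.
            return w, (alpha - tail) / mass
        tail += mass
    return 0, 1.0  # only reached when alpha >= 1: always reject


w0, gamma = randomized_cutoff(0.05)

# Size of the randomized test: certain rejection above w0, partial rejection at w0.
size = sum(pmf(w) for w in range(w0 + 1, 11)) + gamma * pmf(w0)
print(w0, round(gamma, 4), round(size, 4))
```

For level 0.05 this yields rejection for 9 or 10 correct identifications and rejection with probability of about 0.89 when exactly 8 are correct, which attains size 0.05 exactly; without randomization the size would have to be either 0.0107 or 0.0547.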

(c) Let $\phi^*$ be the test function for any most powerful level $\alpha$ test. By parts (a) and (b), a likelihood ratio test $\phi_k$ with size $\alpha$ can be found that is most powerful. Since $\phi_k$ and $\phi^*$ are both most powerful, it follows that

$$\int (\phi_k(x) - \phi^*(x))\,p_1(x)\,dx = 0. \qquad (1.1)$$

Following the proof in part (a), (1.1) implies that

$$\int (\phi_k(x) - \phi^*(x))(p_1(x) - k p_0(x))\,dx = 0,$$

which can be the case if and only if $\phi^*(x) = \phi_k(x) = 1$ when $p_1(x) > k p_0(x)$ (i.e., $L(X, \theta_0, \theta_1) > k$) and $\phi^*(x) = \phi_k(x) = 0$ when $p_1(x) < k p_0(x)$ (i.e., $L(X, \theta_0, \theta_1) < k$), except on a set $A$ satisfying $P_0(X \in A) = P_1(X \in A) = 0$.

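Connection of likelihood ratio tests to Bayes tests: Consider the 0-1 loss for hypothesis testing. The Bayes test chooses $H_1: \theta = \theta_1$ over $H_0: \theta = \theta_0$ if

$$\frac{p(\theta_1 \mid X)}{p(\theta_0 \mid X)} > 1,$$

which is equivalent to

$$L(x, \theta_0, \theta_1) > \frac{\pi(\theta_0)}{\pi(\theta_1)} = \frac{\pi(\theta_0)}{1 - \pi(\theta_0)}.$$

The likelihood ratio test

$$\phi_k(x) = \begin{cases} 1 & \text{if } L(x, \theta_0, \theta_1) > k \\ c & \text{if } L(x, \theta_0, \theta_1) = k \\ 0 & \text{if } L(x, \theta_0, \theta_1) < k \end{cases}$$

is a Bayes test for the prior $\pi(\theta_0) = \frac{k}{1+k}$, $\pi(\theta_1) = \frac{1}{1+k}$, because for this prior

$$\frac{p(\theta_1 \mid X)}{p(\theta_0 \mid X)} = L(x, \theta_0, \theta_1)\,\frac{1}{k}$$

is greater than 1 if and only if $L(x, \theta_0, \theta_1) > k$.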

The difference between the Bayes approach and the Neyman-Pearson approach is that the Bayes approach starts with a prior and this determines the cutoff k for the likelihood ratio test, while the Neyman-Pearson approach starts with a significance level (a maximum acceptable Type I error rate) and this determines the cutoff k for the likelihood ratio test.
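To make the contrast concrete for the normal example with n = 5, one can start from a significance level, derive the cutoff on the sum and the corresponding likelihood ratio cutoff k, and then read off the prior under which the Bayes test uses the same k. This is a minimal sketch: the level 0.05 and all variable names are illustrative, and it uses `statistics.NormalDist` from the Python standard library for the normal quantile.

```python
from math import exp, sqrt
from statistics import NormalDist

n, alpha = 5, 0.05

# Neyman-Pearson: the level fixes the cutoff on the sum, since the sum is N(0, n) under H0.
sum_cutoff = sqrt(n) * NormalDist().inv_cdf(1 - alpha)

# The likelihood ratio is exp(sum - n/2), so the cutoff on the sum maps to a cutoff k on L.
k = exp(sum_cutoff - n / 2)

# Bayes: the prior pi(theta_0) = k / (1 + k) leads to the same cutoff k.
implied_prior0 = k / (1 + k)

observed_sum = 1.1064 + 1.1568 - 0.1602 + 1.0343 - 0.1079
np_rejects = observed_sum >= sum_cutoff
bayes_rejects = (1 - implied_prior0) * exp(observed_sum - n / 2) > implied_prior0

print(sum_cutoff, k, implied_prior0, np_rejects, bayes_rejects)
```

Under these assumptions a level 0.05 test corresponds to a prior probability of roughly 0.76 on the null hypothesis, and both procedures retain $H_0$ for the observed data.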
