

IEOR 165 – Lecture 25

Review I

1 Hypothesis Testing Frameworks

1.1 Null Hypothesis Testing

The idea of null hypothesis testing is:

1. We begin with a base assumption about our system.

2. Data is measured from the system.

3. Then, using the measured data, we compute the probability of making observations that are as (or more) extreme than what was measured. This probability is computed under the base assumption about the system, and it is called the p-value.

4. We make a decision on the basis of this probability. The reasoning is that if the above probability is small, then it is unlikely that we would have observed the measured data if the base assumption were true. The terms accept and reject denote deciding that the null is not invalidated by the data or that it is invalidated by the data, respectively.

There is a tension in this framework because we must make a decision regarding truth, but the framework of null hypothesis testing does not directly address truth; instead, we make decisions on the basis of trying to invalidate some base assumption about the system.

Moreover, the quantity we use to make this decision is not the probability that the null hypothesis is true. Rather, the quantity is the probability of observing extreme events under the assumption that the null is true.
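The four steps above can be sketched numerically. The following is a minimal illustration with an assumed example (a fair-coin null and a hypothetical observation of 58 heads in 100 flips), not an example from the lecture:

```python
from math import comb

# Step 1: base assumption (null) -- the coin is fair, so the number of heads
# in n flips is Binomial(n, 0.5).  Step 2: hypothetical measured data.
n, observed = 100, 58

# Exact Binomial(n, 0.5) probability mass function under the null.
pmf = [comb(n, k) * 0.5**n for k in range(n + 1)]

# Step 3: p-value = probability of observations as (or more) extreme than
# what was measured, i.e. at least as far from the null mean n/2.
deviation = abs(observed - n / 2)
p_value = sum(p for k, p in enumerate(pmf) if abs(k - n / 2) >= deviation)

# Step 4: decide by comparing the p-value to a chosen threshold.
print(p_value)
```

Here the p-value is not small, so the data do not invalidate the fair-coin assumption.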

1.2 Neyman-Pearson Testing

In this framework, we define two hypotheses that we would like to choose between. Because there are two possible choices, there are two possible errors. The main idea is that one error is more important, and this determines how we label the hypotheses. The hypothesis for which mistakenly not choosing it is the more important error is called the null hypothesis, and the other hypothesis is called the alternative hypothesis. An example from scientific fields is where the null hypothesis might represent "no effect". We still talk about accepting (choosing) and rejecting (not choosing) the null hypothesis.

With respect to the two hypotheses, there is certain associated terminology.


A type I error occurs when rejecting (not choosing) the null hypothesis when the null is in fact true, and it is also known as a false positive.

Similarly, a type II error occurs when accepting (choosing) the null hypothesis when the alternative is in fact true, and it is also known as a false negative.

A parametric hypothesis class consists of a family of distributions {P_θ : θ ∈ Θ}, with the null and alternative hypotheses consisting of specific sets of parameters

H_0 : θ ∈ Θ_0 and H_1 : θ ∈ Θ_1,

where Θ_0, Θ_1 ⊆ Θ and Θ_0 ∩ Θ_1 = ∅.

The idea is to begin by defining the level of significance α to be a bound on the probability of committing a type I error. Because a test with significance level α also has significance level α̃ for any α̃ ≥ α, we also define the size of a test to be its smallest significance level, which is the maximum probability of committing a type I error over the null.

For instance, if we have a parametric hypothesis test defined by comparing a test statistic T(X), which depends on some data X, to a threshold c, then the size is defined by

α(c) = sup{ P_θ(T(X) ≥ c) : θ ∈ Θ_0 }.

Once the size of the test is fixed, the probability of a type II error becomes fixed as well. It is conventional to refer to the power of the test, which gives the probability of choosing the alternative hypothesis when it is true; the power of a test is given by 1 − P(Type II Error).

Moreover, the power function for all θ ∈ Θ is defined as

β(θ) = P_θ(Rejecting H_0) = P_θ(T(X) ≥ c).

Perhaps confusingly, Neyman-Pearson hypothesis testing still has a concept of a p-value. In fact, the p-value can be defined in the same way as in the case of null hypothesis testing.

There is an alternative definition of a p-value that is worth mentioning, and this alternative definition applies to both null hypothesis testing and Neyman-Pearson testing: the p-value for a certain set of data and a specific test statistic is the smallest significance level at which the null hypothesis would be rejected.

2 Null Hypothesis Testing

2.1 One-Sample Location Tests

In many cases, we are interested in deciding if the mean (or median) of one group of random variables is equal to a particular constant. Even for this seemingly simple question, there are actually multiple different null hypotheses that are possible, and corresponding to each of these possible nulls, there is a separate hypothesis test. The term hypothesis test is used to refer to three objects:

• an abstract null hypothesis;

• a concept of extreme deviations;


• an equation to compute a p -value, using measured data, corresponding to the null.

The reason that there are multiple nulls is that we might have different assumptions about the distribution of the data, the variances, or the possible deviations. One of the most significant differences between various null hypotheses is the distinction between one-tailed and two-tailed tests. In a one-tailed test, we make an assumption that deviations in only one direction are possible. In a two-tailed test, we make an assumption that deviations in both directions are possible.
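The one-tailed versus two-tailed distinction can be sketched with a z-test for the mean of normal data with known variance (the data, function name, and numbers below are hypothetical, not from the lecture):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def z_test_pvalues(xs, mu0, sigma):
    """p-values for testing mean = mu0 when the variance sigma^2 is known."""
    n = len(xs)
    z = (sum(xs) / n - mu0) / (sigma / sqrt(n))
    one_tailed = 1 - phi(z)              # deviations upward only
    two_tailed = 2 * (1 - phi(abs(z)))   # deviations in both directions
    return one_tailed, two_tailed

# Illustrative data: is the mean equal to 2.0, given sigma = 0.5?
one, two = z_test_pvalues([2.1, 1.8, 2.4, 2.6, 1.9, 2.2], mu0=2.0, sigma=0.5)
print(one, two)
```

For a positive z statistic, the two-tailed p-value is exactly twice the upper one-tailed p-value, which is why the same data can look "extreme" under one null but not the other.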

2.2 Confidence Intervals

Suppose we have a distribution f(u; ν, ϕ) that depends on some parameters ν ∈ R^p, ϕ ∈ R^q. The distinction between ν and ϕ is that ν are parameters whose values we care about, while ϕ are parameters whose values we do not care about. The ϕ are sometimes called nuisance parameters. The distinction can be situational: an example is a normal distribution

f(u; µ, σ²) = (1/√(2πσ²)) exp(−(u − µ)²/(2σ²));

in some cases we might care about only the mean, only the variance, or both. We make the following definitions:

A statistic ν̲ is called a level (1 − α) lower confidence bound for ν if for all ν, ϕ,

P(ν̲ ≤ ν) ≥ 1 − α.

A statistic ν̄ is called a level (1 − α) upper confidence bound for ν if for all ν, ϕ,

P(ν ≤ ν̄) ≥ 1 − α.

The random interval [ν̲, ν̄] formed by a pair of statistics ν̲, ν̄ is a level (1 − α) confidence interval for ν if for all ν, ϕ,

P(ν̲ ≤ ν ≤ ν̄) ≥ 1 − α.

These definitions should be interpreted carefully. A confidence interval does not mean that there is a (1 − α) chance that the true parameter value lies in this range. Instead, a confidence interval at level (1 − α) means that 100 · (1 − α)% of the time the computed interval contains the true parameter. The probabilistic statement should be interpreted within the frequentist framework of the frequency of correct answers. Another point to note about confidence intervals is that they are in some sense dual to null hypothesis testing.
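The frequentist interpretation can be checked by simulation (all names and numbers below are assumptions for illustration): a level 0.95 z-based interval for the mean of normal data with known variance should cover the true mean in roughly 95% of repeated samples.

```python
from math import sqrt
import random

def mean_confidence_interval(xs, sigma, z=1.96):
    """Level 0.95 two-sided interval for the mean, KNOWN variance sigma^2;
    1.96 is the standard normal quantile z_{1 - 0.05/2}."""
    n = len(xs)
    xbar = sum(xs) / n
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Repeated-sampling check of the coverage frequency.
random.seed(0)
mu, sigma, trials = 5.0, 2.0, 2000
covered = 0
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(30)]
    lo, hi = mean_confidence_interval(xs, sigma)
    covered += (lo <= mu <= hi)
print(covered / trials)  # close to 0.95
```

Note that each individual interval either contains µ or it does not; the 0.95 refers to the long-run frequency across repeated experiments, matching the interpretation above.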

2.3 Two-Sample Location Tests

In many cases, we are interested in deciding if the mean (or median) of one group of random variables is equal to the mean (or median) of another group of random variables. Even for this seemingly simple question, there are actually multiple different null hypotheses that are possible, and corresponding to each of these possible nulls, there is a separate hypothesis test. The reason that there are multiple nulls is that we might have different assumptions about the distribution of the data, the variances, or the possible deviations.

One of the most significant differences between various null hypotheses for two-sample tests is the distinction between paired and unpaired tests.

The idea of a paired test is that each point in group 1 is related to a corresponding point in group 2. For instance, group 1 might be the blood pressure of patients before taking a drug, and group 2 might be the blood pressure of the same patients after taking the drug.

Mathematically, an example of a null hypothesis for a paired two-sample test is

H_0 : X_i, Y_i ∼ N(µ_i, σ²), where X_i, Y_i are independent, and µ_i, σ² are unknown.

The idea of this null hypothesis is that the i-th measurement from each group (i.e., X_i and Y_i) have the same mean µ_i, and this mean can be different for each value of i.

The idea of an unpaired test is that each point in group 1 is unrelated to each point in group 2. For instance, we might want to compare the satisfaction rate of two groups of cellphone customers, where group 1 consists of customers with Windows Phones, and group 2 consists of customers with Android Phones. Mathematically, an example of a null hypothesis for an unpaired two-sample test is

H_0 : X_i, Y_i ∼ N(µ, σ²), where X_i, Y_i are independent, and µ, σ² are unknown.

The idea of this null hypothesis is that every measurement from both groups has the same mean µ; unlike in the paired case, this mean does not change with i.
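One way to see why pairing matters: the per-subject means µ_i cancel when we take differences. The sketch below uses a large-sample z approximation rather than the exact t-based tests of the lecture, and the function name and data are illustrative:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def paired_pvalue(xs, ys):
    """Two-tailed p-value for the paired null, large-sample z approximation."""
    # Pairing removes the per-subject means mu_i: work with differences.
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    dbar = sum(d) / n
    s2 = sum((di - dbar) ** 2 for di in d) / (n - 1)  # sample variance
    z = dbar / sqrt(s2 / n)
    return 2 * (1 - phi(abs(z)))

# Illustrative blood-pressure readings for the same patients before/after.
before = [140, 152, 133, 148, 161, 155, 144, 139]
after = [135, 150, 129, 141, 155, 153, 139, 133]
p = paired_pvalue(before, after)
print(p)
```

Here the between-patient variation is large, but the per-patient differences are consistently positive, so the paired test detects the effect easily; an unpaired comparison of the two groups would be far less powerful on the same data.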

2.4 Multiple Testing

Suppose there are k null hypotheses (H_0^1, H_0^2, ..., H_0^k) with k corresponding p-values (p_1, p_2, ..., p_k), and assume that the desired maximum probability of having one or more false positives is given by the significance level α. The probability of having one or more false positives is also known as the familywise error rate.

The idea of the Bonferroni correction is to modify the testing procedure to reject a hypothesis only when p_i < α/k. However, it turns out that the Bonferroni correction can be overly conservative, and more careful testing procedures can have greater power than the Bonferroni correction. An example is the Holm-Bonferroni method, which provides an improvement over the Bonferroni correction. This method can be summarized as an algorithm:

1. Sort the p-values into increasing order, label these p_(1), p_(2), ..., p_(k), and denote the corresponding null hypotheses as H_0^(1), H_0^(2), ..., H_0^(k);


2. Let m be the smallest index such that p_(m) > α/(k − m + 1); if no such m exists, then reject all hypotheses; if m = 1, then do not reject any hypothesis;

3. Otherwise, reject the null hypotheses H_0^(1), H_0^(2), ..., H_0^(m−1) and accept the null hypotheses H_0^(m), H_0^(m+1), ..., H_0^(k).
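The algorithm above can be sketched directly (the function name and example p-values are illustrative):

```python
def holm_bonferroni(pvalues, alpha):
    """Return a list of booleans: True means reject that null hypothesis."""
    k = len(pvalues)
    # Step 1: sort the p-values into increasing order (track original indices).
    order = sorted(range(k), key=lambda i: pvalues[i])
    reject = [False] * k
    for rank, idx in enumerate(order):  # rank 0 corresponds to m = 1
        # Step 2: stop at the smallest m with p_(m) > alpha / (k - m + 1).
        if pvalues[idx] > alpha / (k - rank):
            break  # step 3: accept this hypothesis and all later ones
        reject[idx] = True
    return reject

pvals = [0.001, 0.03, 0.012, 0.20]
print(holm_bonferroni(pvals, alpha=0.05))
```

For comparison, plain Bonferroni would use the single fixed threshold α/k = 0.0125 for every hypothesis; Holm-Bonferroni relaxes the threshold step by step, which is why it is never less powerful.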

2.5 Quality Control

Quality control is essentially a restatement of confidence intervals, but adapted towards a specific engineering application. There are X̄-control charts, S-control charts, and nonparametric control charts.

However, picking an appropriate value of α can be difficult. One convention is to pick z(1 − α/2) = 3, since this corresponds to α = 0.0027 (a level 0.9973 confidence interval). Another option is to pick the value of α such that we control the expected number of testing periods before concluding the process is out of control, under the assumption that the process never goes out of control. The reason that we will eventually conclude the process is out of control, even if the process were to never go out of control, is due to multiple testing. The choice of α only controls the probability of erroneously concluding the process is out of control for a single subgroup. However, we are repeatedly conducting tests, and so we will suffer a familywise error rate. It turns out that

E(Tests) = 1/α.

This equation gives us a way to choose the value of α given a desired stopping time. For example, if we would like the expected number of analyzed subgroups before stopping to be 100, then we should choose α = 0.01. The LCL and UCL can then be computed using the appropriate formulas.
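The relation E(Tests) = 1/α can be checked by simulation: for an in-control process, the number of subgroups tested until a false alarm is geometric with success probability α. The numbers below are illustrative assumptions:

```python
import random

random.seed(1)

alpha = 0.01  # chosen so that E(Tests) = 1/alpha = 100
runs = []
for _ in range(2000):
    # Each subgroup of an in-control process falsely signals "out of
    # control" independently with probability alpha.
    t = 1
    while random.random() >= alpha:
        t += 1
    runs.append(t)

average_run_length = sum(runs) / len(runs)
print(average_run_length)  # close to 1/alpha = 100
```

In the quality-control literature this quantity is known as the in-control average run length; the simulation simply confirms the geometric-distribution mean 1/α.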

2.6 Hierarchical Log-Linear Models

We can test for general interactions between categorical variables. Suppose we have p categories, and let [p] = {1, ..., p}. Define a simplicial complex Γ to be a set such that:

• Γ ⊆ 2^[p], where 2^[p] denotes the power set of [p]. Recall that the power set 2^[p] is the set of all subsets of [p].

• If F ∈ Γ and S ⊆ F, then S ∈ Γ.

The elements F ∈ Γ are called the faces of Γ, and the inclusion-maximal faces are the facets of Γ. Then we can define a null hypothesis

H_0 : P(x_1 = i_1, ..., x_p = i_p) = (1/Z(θ)) ∏_{F ∈ facets(Γ)} θ^(F)_{i_F},

where Z(θ) is a normalizing constant so that this distribution sums to one, and the θ^(F) are sets of parameters indexed by the values of i_F = {i_{f_1}, i_{f_2}, ...}, where F = {f_1, f_2, ...}. This probability distribution model is known as a hierarchical log-linear model.

This null hypothesis may be difficult to interpret, and so it is useful to consider some special cases. If Γ = [1][2] (meaning there are two facets, both singletons), then this corresponds to a null hypothesis in which the first and second categories are independent. If Γ = [12][13][23], then this corresponds to a null hypothesis in which there are no three-way interactions between the categories. Even more general null hypotheses are possible. For instance, we may have Γ = [12][23][345].
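For the special case Γ = [1][2], a short numerical sketch (with arbitrary, illustrative parameter values) confirms that the model factorizes into independent marginals:

```python
from itertools import product

# Gamma = [1][2]: facets {1} and {2}.  The parameters below are arbitrary
# positive values chosen for illustration; x_1 has 3 levels, x_2 has 2.
theta1 = [1.0, 2.0, 3.0]  # theta^({1})_i, one per level of x_1
theta2 = [4.0, 1.0]       # theta^({2})_j, one per level of x_2

# Joint distribution P(x_1 = i, x_2 = j) = theta1[i] * theta2[j] / Z.
Z = sum(theta1[i] * theta2[j] for i, j in product(range(3), range(2)))
joint = {(i, j): theta1[i] * theta2[j] / Z
         for i, j in product(range(3), range(2))}

# Independence check: the joint equals the product of its marginals.
p1 = {i: sum(joint[i, j] for j in range(2)) for i in range(3)}
p2 = {j: sum(joint[i, j] for i in range(3)) for j in range(2)}
assert all(abs(joint[i, j] - p1[i] * p2[j]) < 1e-12
           for i, j in product(range(3), range(2)))
```

The same check fails for a generic distribution that is not of this product form, which is exactly what a test of the null Γ = [1][2] looks for.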

3 Neyman-Pearson Testing

3.1 Neyman-Pearson Lemma

One of the benefits of Neyman-Pearson hypothesis testing is that there is powerful theory that can help guide us in designing parametric hypothesis tests. In particular, suppose we have simple hypotheses that each consist of a single point:

H_0 : θ = θ_0
H_1 : θ = θ_1.

For a given significance level α, we can ask: what is the most powerful test? By most powerful, we mean the test with the highest power over the set of all tests with level α.

We define the simple likelihood ratio to be

L(X, θ_0, θ_1) = ∏_{i=1}^n f_{θ_1}(X_i) / ∏_{i=1}^n f_{θ_0}(X_i),

where f_{θ_0}(·) and f_{θ_1}(·) denote the density function under the null and alternative, respectively. The likelihood ratio test (or Neyman-Pearson test) is defined by:

• Accept the null if L(X, θ_0, θ_1) < k.

• Reject the null if L(X, θ_0, θ_1) > k.

When the likelihood ratio is equal to k , theoretically we need to introduce randomization into the test; however, this is typically not an issue in practice.

The Neyman-Pearson lemma has several important consequences regarding the likelihood ratio test:

1. A likelihood ratio test with size α is most powerful.

2. A most powerful size α likelihood ratio test exists (provided randomization is allowed).

3. If a test is most powerful with level α, then it must be a likelihood ratio test with level α.
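A minimal sketch of the likelihood ratio test for a Gaussian mean (the choices θ_0 = 0, θ_1 = 1, unit variance, threshold k = 1, and the data are all illustrative assumptions):

```python
from math import exp

def likelihood_ratio(xs, theta0=0.0, theta1=1.0):
    """Simple likelihood ratio for X_i ~ N(theta, 1); the common
    1/sqrt(2*pi) factors in the Gaussian densities cancel."""
    num = 1.0
    den = 1.0
    for x in xs:
        num *= exp(-0.5 * (x - theta1) ** 2)  # density under H_1
        den *= exp(-0.5 * (x - theta0) ** 2)  # density under H_0
    return num / den

xs = [0.2, -0.1, 0.4, 0.1]
L = likelihood_ratio(xs)
k = 1.0
print("reject null" if L > k else "accept null")
```

For this Gaussian family, algebra shows L = exp(Σ_i x_i − n/2), which is increasing in the sample mean; thresholding L is therefore equivalent to thresholding the sample mean, an instance of how the lemma identifies the form of the most powerful test.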

