HypTest6Step

Hypothesis Testing Overview
1. Model and assumptions, e.g., Y ~ N(μ, σ);
   H0: μ = μ0 vs. HA: μ < μ0, μ > μ0, or μ ≠ μ0.
2. Design: choose the significance level α = 0.01, 0.05, or 0.10, and the sample size n, which together determine β and (1 − β) ≡ Power.
3. Perform the survey, experiment, or observational study.
4. Analyze: estimator, standard error, C.I., test criterion, P-value.
5. Decision: if P < α, reject H0; if P > α, do NOT reject H0.
6. Characterize the result as statistically significant, or NOT statistically significant.
Effect: δ = μ − μ0. The hypotheses, in terms of the effect, are
H0: δ = 0 vs. HA: δ < 0, δ > 0, or δ ≠ 0.
Estimated effect: δ̂ = Ȳ − μ0.
P-value = the probability that the estimated effect would be as great as or greater than that observed, in the direction specified by HA, if H0 were true.
Conclusion
Method. Assuming [Y ~ N(μ, σ), verbally], we tested [H0 vs. HA, verbally], using the z-test ([cite reference]) with significance level [α] and sample size [n].
Results. There is significant statistical evidence that [HA, verbally] ([P-value]); or:
There is NOT significant statistical evidence that [HA, verbally] ([P-value]).
Truth Table

                        True State of Nature
Decision                H0 true                  HA true
Reject H0               Type 1 error, α          Correct decision, (1 − β) ≡ Power
NOT Reject H0           Correct decision         Type 2 error, β ≡ O.C.
P P-value  The probability that the distance between
the estimator and the hypothetical value of the
parameter, in the direction specified by H A, would be
as great or greater as that observed, if H0 were true.

  significance level  Type 1 error rate, is set by investigator in Design step.
  operating characteristic  Type 2 error rate = f (; , n, ),-0, i.e., is a function of (i) the
effect, i.e., the difference between the true and the hypothetical (null) value of the parameter of
interest, (ii) the significance level, (iii) the sample size, and (iv) the underlying variability.
(1− )  Power  P{Reject H0 |; , n,  } = 1 − f (; , n, ).
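The functional relationship above can be sketched in a few lines of Python (illustrative only; the function name is mine, and the setting assumed is the two-sided z-test of H0: μ = μ0 with known σ, as in the model step above):

```python
from statistics import NormalDist

def power_z_test(delta, alpha, n, sigma):
    """Power = P{Reject H0 | delta; alpha, n, sigma} for the two-sided
    z-test of H0: mu = mu0, where delta = mu - mu0 is the true effect."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)     # critical value z_{alpha/2}
    se = sigma / n ** 0.5                  # standard error of the mean
    # Under the true effect delta, Z = (Ybar - mu0)/SE ~ N(delta/SE, 1)
    return (1 - nd.cdf(z_crit - delta / se)) + nd.cdf(-z_crit - delta / se)
```

Evaluating power_z_test(0, 0.05, 25, 1) returns approximately 0.05: at zero effect the rejection probability is just the significance level α. The power grows with δ and n, and shrinks with σ, exactly as f(δ; α, n, σ) above describes.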
Golde Holtzman
687291555
2/5/2016
Guiding objectives of Hypothesis Testing

Factor (increased)           Type 1 error rate, α   Type 2 error rate, β   Power, (1 − β)
Significance level, α        +                      −                      +
Sample size, n               unchanged              −                      +
Underlying variability, σ    unchanged              +                      −
Effect, δ = (μ − μ0)         unchanged              −                      +

The significance level α and the sample size n are controlled by the investigator; the underlying variability σ and the effect δ are not.

α = significance level = Type 1 error rate
= probability of rejecting a true null hypothesis.
β = operating characteristic = Type 2 error rate
= probability of not rejecting a false null hypothesis.
(1 − β) = power = sensitivity = probability of rejecting a false null hypothesis.
δ = (μ − μ0) = effect = unknown true value of the parameter, μ, minus the null hypothetical value, μ0.
Power = P{Reject Null}

[Figure: Power curve. Power plotted against the effect δ, rising from 0 toward 1 as the effect grows. A smaller n means a larger SE and a shallower curve; a larger n means a smaller SE and a steeper curve.]
Statistical Significance versus Practical (clinical, biological, economic, etc.)
Importance
Statistical significance. A test of significance/hypothesis is a test of the null hypothesis that the effect δ = 0 versus an alternative such as δ < 0, δ > 0, or δ ≠ 0:
H0: δ = 0, vs. HA: δ < 0, δ > 0, or δ ≠ 0,
where the effect δ is the actual difference between the true value of the population parameter and the hypothetical value from the null hypothesis. The result of the test, i.e., the decision and the conclusion, depend on the observed effect δ̂, which is the difference between the sample estimator of the parameter and the hypothetical value.
A non-zero observed effect does not necessarily indicate that the actual effect is non-zero and the null hypothesis is false. Even if the null hypothesis is true and the actual effect is zero, the observed effect will most likely be non-zero due to natural underlying variation, i.e., due to chance. In these terms, the question we seek to answer by performing the test of significance is the following:
Rejecting the null hypothesis is saying that the observed effect (in the sample) is due to an actual effect
(in the population). Not rejecting is saying that the observed effect may simply be due to chance, in which
case there may be no actual effect in the population.
We therefore reject the null hypothesis only if the P-value is small. The P-value is the probability of so
large an observed effect due to natural underlying variation when the null hypothesis is true and the
actual effect is zero. I.e., the P-value is the probability of so large an effect due to chance. If the P-value is
small, we reject the null hypothesis and declare that the results are statistically significant, and there is
statistically significant evidence in favor of the alternative. Thus,
Statistical significance means that the observed effect is the result of an actual
non-zero effect in the population, and not due merely to the chance variation of a
random sample.
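The point that chance alone makes the observed effect non-zero in virtually every sample can be illustrated with a small simulation (a sketch; the choice of n = 100 births per sample and 1,000 replicates is mine, not from the text):

```python
import random

random.seed(1)

n = 100          # births per simulated sample
reps = 1000      # number of simulated studies
effects = []
for _ in range(reps):
    # Simulate n births when H0 is TRUE: P(female) = 0.50 exactly
    females = sum(random.random() < 0.50 for _ in range(n))
    effects.append(females / n - 0.50)   # observed effect, p - 0.50

# Almost every simulated study shows a non-zero observed effect,
# even though the actual effect is exactly zero.
nonzero = sum(1 for d in effects if d != 0)
```

Roughly nine out of ten simulated samples show a non-zero observed effect purely by chance, while the average observed effect across samples stays near zero.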
On the other hand, if the sample size is extremely large, such that the power of the test is more than enough to detect an important difference (effect, relationship, etc., whatever is specified by the alternative hypothesis), then the results of the study can be statistically significant but not biologically (clinically, economically, etc.) important. I.e.,
If the sample size is large, then statistical significance does not necessarily imply
practical importance.
If the sample size is so small that the power of the test is insufficient to detect an important difference,
then the results of the study can be of practical importance, yet not achieve statistical significance. I.e.,
If the sample size is small, then a failure to achieve statistical significance does not
necessarily imply a lack of practical importance.
Regardless of the result of a test of significance, it is always a good idea to estimate the parameter(s) of interest with a confidence interval. Doing so refocuses attention on the magnitude of the parameter of interest, which is, of course, the subject of the study. [Note: In many studies, rather than reporting confidence intervals, point estimates and standard errors are reported. Point estimates and standard errors are sufficient, as everyone knows that approximate 95% confidence limits = (point estimate) ± (2)(standard error).]
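The ± 2·SE rule in the note can be sketched in one helper (the function name is mine; the example numbers are the proportion and SE that appear in Example 1-A below):

```python
def approx_ci95(estimate, se):
    """Approximate 95% confidence limits = (point estimate) +/- 2 * SE."""
    return estimate - 2 * se, estimate + 2 * se

# E.g., a reported proportion of 0.5114 with SE = 0.00496 implies
lo, hi = approx_ci95(0.5114, 0.00496)   # about (0.5015, 0.5213)
```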
Example 1-A. Mendel’s theory of genetics implies that the proportion of females among humans ought to be 50%. To test this research hypothesis, we will observe the sex (female, male) of newborn humans, i.e., the random variable of interest is whether or not sex = female for the i-th randomly sampled newborn. The parameter of interest is the proportion, say π, of human births that are female. The null and alternative hypotheses are, respectively,
H0: π = 0.50, vs. HA: π ≠ 0.50.
We will observe the sample proportion p, and use the test criterion
Z = (p − 0.50) / √(0.50 × 0.50 / n).
We plan to use a small significance level of α = 0.01, because only strong evidence against a theory that has stood for over a century would be convincing, and a sample of size n = 10,000, because birth records are easy and inexpensive to obtain.
Performing the study, we find that out of 10,172 births (why throw away the 172 extra records we found?), 5,202 are female. Thus, we have observed a sample proportion of
p = 5,202 / 10,172 = 0.5114 = 51.14%,
and a test criterion of
Z = (0.5114 − 0.5000) / √(0.50 × 0.50 / 10,172) = 0.01140 / 0.004958 = 2.300.
Notice that the observed effect δ̂ = 0.01140, which is hardly more than 1%, is rather a small departure from what is predicted by the null hypothesis. On the other hand, this small observed effect amounts to 2.3 standard errors because of the large sample size, and the P-value for the two-tailed test is
P = 2 · P{Z ≥ 2.300} = 2 × 0.01072 = 0.02145 ≈ 0.02.
Now, because P = 0.02 is not as small as the significance level of α = 0.01, the decision is to not reject
the null hypothesis, and we conclude that
There is no significant statistical evidence that the proportion of female births is different from the
50% predicted by Mendel’s theory (P = 0.02).
We might also report an estimate:
We are 99% confident that the proportion of females among human births is between 49.9% and
52.4%.
or simply that
The observed proportion of females among human births was 51.1% (n = 10,172, SE = 0.496%).
and simply let readers calculate confidence limits themselves, taking for granted that it is universally
known that n represents the sample size and SE represents the standard error.
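The arithmetic of Example 1-A can be checked in a few lines of Python (a sketch; the standard library's statistics.NormalDist supplies the normal CDF and quantiles):

```python
from statistics import NormalDist

nd = NormalDist()
n, females = 10_172, 5_202

p = females / n                            # sample proportion, 0.5114
se0 = (0.50 * 0.50 / n) ** 0.5             # SE under H0, about 0.004958
z = (p - 0.50) / se0                       # test criterion, about 2.30
p_value = 2 * (1 - nd.cdf(z))              # two-tailed P, about 0.021

# 99% confidence interval, using the SE evaluated at p
z005 = nd.inv_cdf(0.995)                   # about 2.576
se_p = (p * (1 - p) / n) ** 0.5
ci = (p - z005 * se_p, p + z005 * se_p)    # about (0.499, 0.524)
```

The computed Z, P-value, and 99% limits agree with the 2.300, P ≈ 0.02, and 49.9%–52.4% reported above.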
Concluding that the results are not statistically significant means that the observed effect might be due merely to chance. Revealing that the estimated effect is only δ̂ = p − π0 = 0.011 gives the further information that the observed effect is not biologically important.
Example 1-B. Now suppose that we repeat the study but use a sample size of n = 100,000 birth records,
and observe 51,142 females. We then have observed a sample proportion of
p = 51,142 / 100,000 = 0.5114 = 51.14%,
and a test criterion of
Z = (0.5114 − 0.5000) / √(0.50 × 0.50 / 100,000) = 0.01140 / 0.001581 = 7.210,
and P < 0.0001 (in fact, P < 0.000000000001).
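The same check for Example 1-B (a sketch; note that the unrounded p = 0.51142 gives Z ≈ 7.22, slightly above the 7.210 obtained from the rounded 0.5114):

```python
from statistics import NormalDist

n, females = 100_000, 51_142

p = females / n                            # 0.51142
se0 = (0.50 * 0.50 / n) ** 0.5             # SE under H0, about 0.001581
z = (p - 0.50) / se0                       # about 7.22
p_value = 2 * (1 - NormalDist().cdf(z))    # far below 0.0001
```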
We report
There is highly significant statistical evidence that the proportion of female births is different from the 50% predicted by Mendel’s theory (P < 0.0001). The observed proportion of females among human births was 51.1% (n = 100,000, SE = 0.158%), i.e., we are 99% confident that the proportion of human births that are female is between 50.7% and 51.5%.
Concluding that the results are highly statistically significant means that the observed effect cannot be explained by chance, and must reflect an actual departure from the 0.50 of the null hypothesis. Revealing that the estimated effect is only 0.011 gives the further information that the observed effect is not biologically important, albeit statistically significant, because there are many reasonable explanations for the small discrepancy (differential survival of the zygotes, etc.).
                            Example 1-A                  Example 1-B
sample size, n              10,172                       100,000
observed effect, δ̂          0.011                        0.011
standard error, SE          0.00496                      0.00158
test criterion, Z           2.3                          7.2
P-value                     0.02                         < 0.0001
Statistical significance    slightly significant,        very highly significant,
                            statistically                statistically
Practical importance        little or no biological      little or no biological
                            importance                   importance
The significance test of Example 1-B with a huge
sample size and a small standard error has a
very small Type 2 error rate, very high power,
and is very sensitive to departures from the null
hypothesis. A very small observed effect—so
small that it is of no practical importance—is very
highly statistically significant.
The figure below is a graph of power curves for sample sizes of n = 10,000 and n = 100,000. The effect = (π − π0), where π is the true probability of a female and π0 = 0.50 is the hypothetical null value. An effect of 0.01 therefore corresponds to π = 0.51. You can see that both of these tests are quite sensitive.
[Figure: Power = P{Reject H0} plotted against the effect, from 0 to .03, for sample sizes n = 10,000 and n = 100,000; the n = 100,000 curve rises toward 1 much more steeply.]
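A sketch of how such power curves can be computed (the function name is mine; this uses the SE under the null, as the test criterion in Examples 1-A and 1-B does):

```python
from statistics import NormalDist

def power_prop(effect, alpha, n, pi0=0.50):
    """Power of the two-sided z-test of H0: pi = pi0
    at the true value pi = pi0 + effect."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = (pi0 * (1 - pi0) / n) ** 0.5
    # Under the true effect, Z ~ N(effect/SE, 1); reject if |Z| > z_crit
    return (1 - nd.cdf(z_crit - effect / se)) + nd.cdf(-z_crit - effect / se)

# At an effect of 0.01 (pi = 0.51) with alpha = 0.01,
# n = 100,000 already gives power very near 1.
```

Evaluating the function over a grid of effects from 0 to 0.03 for n = 10,000 and n = 100,000 reproduces the two curves in the figure.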