Lecture 7 - The Department of Statistics and Applied Probability, NUS

Hypothesis Testing
Testing your beliefs
Ultimately in most research, the aim is to investigate whether the data
supports a particular hypothesis, or whether there is evidence to reject this
hypothesis.
For example:
How do we know that male students tend to weigh more than
female students?
How do we know that smoking is associated with increased risk of lung
cancer?
How do we know that possessing a particular genetic profile means
someone is more likely to be obese than others with a different genetic
profile?
Data exploration and Statistical analysis
1. Data checking, identifying problems and characteristics
2. Understanding chance and uncertainty
3. How will the data for one attribute behave, in a theoretical framework?
4. The theoretical framework assumes complete information; we need to address uncertainties in real data
5. Testing your beliefs: do the data support what you think is true?

Data → Data exploration (categorical / numerical outcomes) → Model each outcome with a theoretical distribution → Estimation of parameters, quantifying uncertainty → Hypothesis testing → Parametric tests (t-tests, ANOVA, tests of proportions)
Hypothesis Testing
• Null hypothesis
A statement of status quo, or of no changes
• Alternative hypothesis
Hypothesis which the researcher wishes to investigate
• Commonly, the alternative hypothesis is first formulated,
and the null hypothesis is the negation of the alternative
hypothesis.
Pregnancy Test Kit
A woman buys a pregnancy test kit, and is interested to find
out whether she is pregnant.
The null hypothesis in this case (status quo), is that she is
not pregnant.
The alternative hypothesis (hypothesis of interest), is that
she is pregnant.
Test kit may show:
+ve: indicating there is evidence to suggest pregnancy
–ve: indicating lack of evidence to suggest pregnancy
Pregnancy Test Kit
The test kit may either be accurate, or inaccurate.
                         Actually pregnant          Actually not pregnant
Test kit shows +ve       Correct +ve diagnosis      Incorrect +ve diagnosis
Test kit shows –ve       Incorrect –ve diagnosis    Correct –ve diagnosis
Pregnancy Test Kit
Types of Errors

Type 1 Error (p-value): False +ve conclusion (+ve when the woman is in fact not pregnant)
Type 2 Error: False –ve conclusion (–ve when the woman is in fact pregnant)
Power (Sensitivity): True +ve conclusion (+ve when the woman is in fact pregnant)
Specificity: True –ve conclusion (–ve when the woman is in fact not pregnant)
P-values
• Probability of observing a false positive result; the threshold used is also known as the significance level of the test.
• If the p-value is small, we are more confident that the null hypothesis can be rejected.
• At a threshold of 0.05, we expect on average 1 false positive out of every 20 tests of a true null hypothesis.
• So if we perform a large study with 1 million variables, on average we expect about 50,000 variables to display p-values of < 0.05!
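To make this concrete, here is a minimal simulation sketch (Python with numpy; it assumes every null hypothesis is true, in which case p-values are uniformly distributed):

```python
import numpy as np

rng = np.random.default_rng(42)
n_tests, alpha = 1_000_000, 0.05

# Under a true null hypothesis, p-values are uniform on (0, 1)
p_values = rng.uniform(0, 1, n_tests)

false_positives = np.sum(p_values < alpha)
print(false_positives)  # roughly n_tests * alpha = 50,000
```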
Statistical tests for comparing means
One of the most common statistical tests in biomedical sciences is the comparison of averages.

Let’s revisit Example 1 from the previous lecture.

Example 1:
The Science Faculty is interested to compare the weights of male and female students in NUS.

So here we have the mean weights for male and female students. Can we compare these values, after accounting for the uncertainties in the estimation, and quantify the statistical evidence for observing a difference?
Recall we:
- Randomly sample 200 male students and 200 female students and
measure their weight.
- Calculate the mean weight of these 200 male students, and use this
quantity to estimate the mean weight of all the male students in NUS.
- Similarly calculate the mean weight of these 200 female students and use
this to estimate the mean weight of all the female students in NUS.
Test statistic
A numerical quantification of the amount of evidence against the null
hypothesis.
Usually takes the form of:

Test statistic = (Observed summary value − Hypothesized summary value) / (Standard error of the observed summary value)
Degrees of freedom
Number of independent observations that are allowed to take any values.
Degrees of freedom (df) = Number of independent observations (usually the sample size) − Number of estimated parameters
For example: In an assessment of whether the mean weight of 120 male
students in the Science faculty exceeds 70kg, we actually have 120
independent observations.
However, we need to estimate the mean weight of these 120 students, thus
the degrees of freedom remaining = 120 – 1 = 119.
Another way to think about it: upon collecting the weights of 120 students, IF I know the average weight, then only the weights of 119 students are free to ‘vary’, as the weight of the last person must take the specific value that yields the average weight I know of.
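A small numerical sketch of this intuition (Python with numpy; the weights are simulated placeholders, not real data):

```python
import numpy as np

rng = np.random.default_rng(7)
weights = rng.normal(70, 10, 120)   # 120 simulated weights (kg)
known_mean = weights.mean()

# With the mean known, only 119 weights are free to vary;
# the last one is forced to a specific value:
determined_last = 120 * known_mean - weights[:119].sum()
print(np.isclose(determined_last, weights[119]))  # True -> df = 120 - 1 = 119
```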
t-tests
1-Sample t-test
Useful if we are keen to compare a collection of values against a
hypothesized mean value.
For example:
Previous surveys from the 1980s found that the mean weight of male
students in the Science Faculty was 63kg. The lecturer of ST1232 believes
that this figure is likely to be an under-estimate for male students in 2010.
Design an experiment to test this hypothesis.
- Randomly sample 200 male students from the Science Faculty, and
measure their weights.
- Calculate the mean weight of these 200 students and the corresponding
standard error of the mean, and see whether this is significantly higher than
63kg.
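As a sketch of how this test could be run outside SPSS (Python with scipy.stats; the weights array is a simulated stand-in for the 200 measurements, and the one-sided `alternative` argument requires a recent scipy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
weights = rng.normal(66, 9, 200)   # stand-in for the 200 sampled weights (kg)

# H0: mean weight = 63  vs  H1: mean weight > 63 (one-sided)
t_stat, p_value = stats.ttest_1samp(weights, popmean=63,
                                    alternative='greater')
print(t_stat, p_value)
```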
Identifying the hypotheses
In hypothesis testing, it is extremely important to identify the hypotheses that
are being tested, since this affects the calculation of the statistical evidence.
Null hypothesis, H0: Mean weight, μ = 63
Alternative hypothesis, H1: Mean weight, μ > 63

But there are two other possible alternative hypotheses, each giving a different estimate of the statistical evidence against the null hypothesis.

To see whether the weight of current students differs from 63kg:
H1: Mean weight, μ ≠ 63

To see whether current students are lighter than 63kg:
H1: Mean weight, μ < 63
One-sided versus two-sided tests
Two-sided alternative hypothesis:
Unbiased test, without assuming prior knowledge or expectation of the
direction of the effect size.
So the difference could be greater or smaller than the test value.
Remember the test statistic: the numerator (the value estimated from the data minus the value under the null hypothesis) measures how different the data are from the null hypothesis, while the denominator measures the uncertainty of the estimation.
Example of a two-sided alternative hypothesis: there is a difference between the heights of men and women.
One-sided alternative hypothesis:
Biased test, where the direction of the effect, if genuine, is known.
Examples: (i) men are taller than women; (ii) the weight loss pill successfully
reduces weight.
One-sided versus two-sided tests
[Figure: distributions of the test statistic, with shaded rejection regions for one-sided and two-sided tests]
Interpreting statistical evidence:
P-values are assessed by the probability found in the shaded areas: the
shaded areas represent the probability of obtaining a test statistic at least as
extreme as from the observed data, under the null hypothesis.
SPSS calculates the two-tailed p-value by default; we need to work a little harder to get the one-tailed p-value.
Converting a two-tailed p-value to a one-tailed p-value:
- If the observed effect (or test statistic) is in the same direction as the alternative hypothesis, halve the p-value.
- If the observed effect (or test statistic) is in the opposite direction to the alternative hypothesis, one-tailed p-value = 1 − 0.5 × (two-tailed p-value)
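These two rules are simple enough to capture in a small helper (a Python sketch; the function name and arguments are ours, not an SPSS facility):

```python
def one_tailed_p(two_tailed_p, observed_effect, h1_sign):
    """Convert a two-tailed p-value to a one-tailed p-value.

    observed_effect: the observed effect (or test statistic)
    h1_sign: +1 if H1 predicts a positive effect, -1 if negative
    """
    if observed_effect * h1_sign > 0:      # same direction as H1
        return two_tailed_p / 2
    return 1 - two_tailed_p / 2            # opposite direction to H1

print(one_tailed_p(0.03, -3.5, -1))  # 0.015
print(one_tailed_p(0.03, +3.5, -1))  # 0.985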
Example 1:
A pharmaceutical company is interested in testing a new weight-loss treatment, and recruited 120 volunteers to take part in a research trial. The weights of the participants are measured before and after taking the weight-loss treatment for the prescribed duration.
- Null hypothesis here is that the weight-loss treatment has no effect (or
average difference in weight = 0)
- Alternative hypothesis here is that the weight-loss treatment is effective (or
average difference in weight < 0)
Suppose the evidence obtained yields a two-sided p-value of 0.03, and the
average difference (after – before) is -3.5kg, what is the correct evidence for
the trial?
One-tailed p-value = 0.03 / 2 = 0.015 (since the direction of the observed effect is in the same direction as the alternative hypothesis)
If however, the average difference (after – before) is 3.5kg, the one-tailed p-value will instead be 1 – 0.03 / 2 = 0.985. This intuitively makes sense!
Back to t-tests
1-Sample t-test
Useful if we are keen to compare a collection of values against a
hypothesized mean value.
Null hypothesis: Mean = hypothesized value
ASSUMPTIONS
- Data is normally distributed (theoretical assumption)
- Data is symmetrically distributed (practical application)
- Observations made are all independent
Two independent samples t-test
For comparing the means between two groups.
Null hypothesis: Mean of group 1 = Mean of group 2
Or effectively: Difference in means = 0
Example: Comparing the weights between male and female students in
Science faculty.
ASSUMPTIONS
- Data is normally (symmetrically) distributed within each group
- Observations made within each group are all independent
- Observations are also independent across the groups
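A minimal sketch in Python (scipy.stats), with simulated weights standing in for the two samples; equal_var=False gives the Welch version that does not assume equal variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
male = rng.normal(68, 10, 200)     # simulated male weights (kg)
female = rng.normal(58, 8, 200)    # simulated female weights (kg)

# H0: mean(male) = mean(female), i.e. difference in means = 0
t_stat, p_value = stats.ttest_ind(male, female, equal_var=False)
print(t_stat, p_value)
```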
Paired-sample t-test
For comparing the difference within each pair of observations
Null hypothesis: No difference between the observations in each pairing
Or effectively: Difference within each pairing = 0
Example: Comparing the efficacy of a diet treatment, thus comparing the
weight of an individual before and after the treatment.
ASSUMPTIONS
- Difference within each pairing follows a Normal (symmetric) distribution
- Independence between pairs of observations
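A sketch of the paired version (Python, scipy.stats; data simulated). It also verifies numerically that the paired t-test is equivalent to a 1-sample t-test on the within-pair differences, which foreshadows the comparison below:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(80, 12, 120)             # weight before treatment (kg)
after = before - rng.normal(3.5, 2.0, 120)   # weight after treatment (kg)

# H0: mean within-pair difference = 0
t_stat, p_value = stats.ttest_rel(after, before)

# Equivalent: a 1-sample t-test of the differences against 0
t2, p2 = stats.ttest_1samp(after - before, popmean=0)
print(np.isclose(t_stat, t2), np.isclose(p_value, p2))  # True True
```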
Two independent samples vs
paired-sample t-test
Is there any difference between these two? Mathematically they seem similar.

The main difference is in the calculation of the denominators:
- The two-independent-samples t-test takes into account the uncertainty of two separate sets of outcomes.
- The paired-sample t-test first calculates the difference within each pairing, then calculates the standard error of the resulting differences.
More than 2 groups
Suppose we are interested in comparing the means between three groups,
what can we do?
(A) Perform 3 sets of two-independent samples t-tests (between groups 1
and 2; groups 2 and 3; groups 1 and 3)
(B) Perform a ‘global’ test, checking whether the means are all the same.
Analysis of Variance
(ANOVA)
- A statistical test for comparing the means of multiple independent groups
- Test the null hypothesis that the means of ALL the groups are identical
Assumptions
- Data within each group is normally (symmetrically) distributed
- Independent observations within each group
- Independent observations between the groups
ANOVA
Analysis of variance (ANOVA) is an interesting name: we are effectively analysing the variance to decide whether there are any differences in the means!

Null hypothesis: Means of all the groups are identical
Alternative hypothesis: At least one of the groups has a different mean
So, observing a significant p-value in this instance means that at least one
group is different, but we don’t actually know which group that is!
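A minimal one-way ANOVA sketch (Python, scipy.stats.f_oneway; the three groups are simulated placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group1 = rng.normal(60, 5, 40)
group2 = rng.normal(62, 5, 40)
group3 = rng.normal(65, 5, 40)

# H0: all group means are identical
# H1: at least one group has a different mean
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)
```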
Practical Example
• Previous studies suggest that restricting caloric intake can increase life expectancy.
• Perform an experiment with mice, each randomly assigned to one of six diet treatments.
• Measure the time of death for each mouse (in months).
Experimental Design and Visualising the Data: [figures omitted]
Practical Example
Research Questions
• Is there any difference in life expectancy across the different diet treatments?
• If there is, which diet treatment contributes to this difference?
• Which diet treatment significantly increases life expectancy?
Practical Example
ANOVA
Q: Is there any difference in life expectancy across the
different diet treatments?
Consider the following null hypothesis:
The mean life expectancies of the six groups are all identical
Alternative hypothesis:
At least one group has a different mean life expectancy
Significant differences! But which treatment?
Multiple Comparisons
• We can compare every possible pair of treatments. DANGER!
• More tests → more chances of making a false judgement.
• Remember, a p-value threshold of 0.05 → 1 out of 20 judgements may be false.
• There are 15 possible pairings for the 6 treatment groups → very likely to make a false judgement!
Bonferroni Correction
• Make it harder to define a result as significant.
• By lowering the p-value threshold. But to lower by how
much?
Solution
Divide the threshold by the total number of tests performed. Thus instead of a critical threshold of 0.05, we now use a critical threshold of 0.05 / 15 ≈ 0.0033.
Bonferroni Correction
• A much preferred approach is to calculate the “Bonferroni
corrected p-value” instead
- Different p-value thresholds may be used, and difficult
to decide what the Bonferroni-corrected thresholds are.
- By calculating the Bonferroni-corrected p-value, it can
be up to the researcher / author / editor / reviewer to
decide what the global p-value threshold should be.
Bonferroni-corrected p-value
Multiply the obtained p-values by the number of tests performed.
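A minimal sketch of the calculation (plain Python; the cap at 1 is needed since a probability cannot exceed 1):

```python
def bonferroni_correct(p_values):
    """Bonferroni-corrected p-values: multiply each p-value by the
    number of tests performed, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

print(bonferroni_correct([0.003, 0.02, 0.40]))  # [0.009, 0.06, 1.0]
```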
Post-Hoc Analysis

Multiple Comparisons
Dependent Variable: LIFETIME
Bonferroni

(I) GROUP  (J) GROUP  Mean Diff. (I-J)  Std. Error  Sig.   95% CI (Lower, Upper)
lopro      NN85            6.9945*      1.25652     .000   (  3.2803,  10.7086)
lopro      NP             12.2837*      1.30637     .000   (  8.4222,  16.1452)
lopro      NR40           -5.4310*      1.24086     .000   ( -9.0988,  -1.7631)
lopro      NR50           -2.6115       1.19355     .440   ( -6.1395,    .9166)
lopro      RR50           -3.2000       1.26207     .175   ( -6.9306,    .5306)
NN85       lopro          -6.9945*      1.25652     .000   (-10.7086,  -3.2803)
NN85       NP              5.2892*      1.30101     .001   (  1.4435,   9.1348)
NN85       NR40          -12.4254*      1.23521     .000   (-16.0766,  -8.7743)
NN85       NR50           -9.6060*      1.18768     .000   (-13.1166,  -6.0953)
NN85       RR50          -10.1945*      1.25652     .000   (-13.9086,  -6.4803)
NP         lopro         -12.2837*      1.30637     .000   (-16.1452,  -8.4222)
NP         NN85           -5.2892*      1.30101     .001   ( -9.1348,  -1.4435)
NP         NR40          -17.7146*      1.28588     .000   (-21.5156, -13.9137)
NP         NR50          -14.8951*      1.24030     .000   (-18.5613, -11.2289)
NP         RR50          -15.4837*      1.30637     .000   (-19.3452, -11.6222)
NR40       lopro           5.4310*      1.24086     .000   (  1.7631,   9.0988)
NR40       NN85           12.4254*      1.23521     .000   (  8.7743,  16.0766)
NR40       NP             17.7146*      1.28588     .000   ( 13.9137,  21.5156)
NR40       NR50            2.8195       1.17110     .249   (  -.6422,   6.2811)
NR40       RR50            2.2310       1.24086    1.000   ( -1.4369,   5.8988)
NR50       lopro           2.6115       1.19355     .440   (  -.9166,   6.1395)
NR50       NN85            9.6060*      1.18768     .000   (  6.0953,  13.1166)
NR50       NP             14.8951*      1.24030     .000   ( 11.2289,  18.5613)
NR50       NR40           -2.8195       1.17110     .249   ( -6.2811,    .6422)
NR50       RR50            -.5885       1.19355    1.000   ( -4.1166,   2.9395)
RR50       lopro           3.2000       1.26207     .175   (  -.5306,   6.9306)
RR50       NN85           10.1945*      1.25652     .000   (  6.4803,  13.9086)
RR50       NP             15.4837*      1.30637     .000   ( 11.6222,  19.3452)
RR50       NR40           -2.2310       1.24086    1.000   ( -5.8988,   1.4369)
RR50       NR50             .5885       1.19355    1.000   ( -2.9395,   4.1166)

*. The mean difference is significant at the .05 level.
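As a rough illustration of how a table like the one above could be approximated outside SPSS, here is a Python sketch that runs all 15 pairwise t-tests with Bonferroni-corrected p-values. Note this is a simplification: SPSS's Bonferroni post-hoc uses the pooled error term from the ANOVA, so the numbers will not match exactly, and the group means below are hypothetical stand-ins for the mice data:

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
# Hypothetical stand-in for the lifetimes (months) of the six diet groups
groups = {name: rng.normal(mu, 6, 50)
          for name, mu in [('lopro', 40), ('NN85', 33), ('NP', 28),
                           ('NR40', 45), ('NR50', 42), ('RR50', 43)]}

pairs = list(combinations(groups, 2))   # 15 pairs for 6 groups
m = len(pairs)
for a, b in pairs:
    _, p = stats.ttest_ind(groups[a], groups[b])
    print(a, b, min(1.0, p * m))        # Bonferroni-corrected p-value
```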
Questions
• So why don’t we perform the post-hoc analyses all the
time then?
• p-values: is there really a difference between 0.049 and 0.051?
• p-values and effect sizes: which is more informative, p-values or confidence intervals?
• Power (sensitivity) and specificity, can we attempt to
maximise both?
t-tests in SPSS
Consider the mathematics.xls dataset again.
1. The average mark of the mathematics exam before starting the omega 3 trial is 70. Is there any evidence that the marks after the omega 3 trial are higher than 70?
2. It is traditionally believed that male students tend to outperform female students in mathematics. Based on the marks before the start of the trial, is there any evidence in support of this hypothesis?
3. Is there any evidence that consuming omega 3 improves the performance
in the mathematics exam?
4. Is there any difference in the marks before the trial between the three
schools? If there is, which school exhibited the best performance?
5. Is there any difference in the omega 3 consumption between male and
female students?
Average marks after = 70?
It should be relatively straightforward to see that we need to perform a one-sample test of the mean value.
However, before we can decide on the use of a one-sample t-test, we need
to check whether the data is indeed symmetrically distributed (and to go
through the usual data exploratory procedures).
Recall there are at least two ways of doing this in SPSS:
1. Qualitative assessment using a histogram.
2. Quantitative assessment with a Shapiro-Wilk’s Test.
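A sketch of the quantitative check in Python (scipy.stats.shapiro; marks_after is a simulated stand-in for the column in mathematics.xls):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
marks_after = rng.normal(74, 9, 100)   # stand-in for the post-trial marks

w_stat, p_value = stats.shapiro(marks_after)
# A large p-value gives no evidence against normality, so the
# one-sample t-test assumption is reasonable
print(w_stat, p_value)
```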
Shapiro-Wilk’s test assesses whether the data is normally distributed.
H0: Data is normally distributed
H1: Data is not normally distributed

H0: Mean marks = 70
H1: Mean marks > 70 (1-tailed test)

Actual 2-tailed p-value = 3.28 × 10⁻⁵
1-tailed p-value = 1.64 × 10⁻⁵

Conclusion: There exists significant evidence that the marks after consuming omega 3 are greater than 70 (p-value = 1.64 × 10⁻⁵).
Males better than females?
It should be immediately clear that there are 2 groups (or “populations”)
here: one for male students, and one for female students.
Thus we can use the 2-independent samples t-test to compare the mean
marks for the two groups. However, as before, we need to assess whether
there is any violation of the normality or symmetrical distribution assumption.
Important: As there are two groups, we need to assess whether both groups
satisfy this assumption!
Assessing assumptions for 2-independent
samples t-test
H0: Mean marks for males = Mean marks for females
H1: Mean marks for males > Mean marks for females (1-tailed test)
Conclusion: There is no evidence (p-value = 0.77) that males perform better than females in the mathematics exam before the start of the omega 3 trial.
Which row to interpret?
Levene’s test for equality of variances:
H0: Variance for group 1 = Variance for group 2
H1: Variance for group 1 ≠ Variance for group 2

SPSS reports the 2-tailed p-value, which we need to convert to a 1-tailed p-value. As the mean difference = −0.935, which is in the opposite direction to H1:
1-tailed p-value = 1 − 0.5 × 0.460 = 0.77

However, regardless of what Levene’s test shows here, always go ahead and interpret the second row, which does not assume equal variances.
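The same workflow sketched in Python (scipy.stats; the marks are simulated stand-ins). Levene's test is shown for completeness, but following the advice above we interpret the Welch (unequal variances) result either way:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
male = rng.normal(68, 10, 60)      # stand-in for male pre-trial marks
female = rng.normal(69, 10, 60)    # stand-in for female pre-trial marks

# Levene's test: H0 equal variances vs H1 unequal variances
_, p_levene = stats.levene(male, female)

# Welch t-test, one-sided H1: males score higher than females
t_stat, p_one = stats.ttest_ind(male, female, equal_var=False,
                                alternative='greater')
print(p_levene, t_stat, p_one)
```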
Evidence that omega 3 improves performance?
Most appropriate analysis is to compare the marks after against the marks
before for the same individual.
Here the appropriate test is the paired-sample t-test.
H0: There is no difference between the marks before and after within each individual (or Difference = 0)
H1: The mark after taking omega 3 is higher than the mark before (or Difference > 0)
Actual 2-tailed p-value = 4.06 × 10⁻¹⁴
1-tailed p-value = 2.03 × 10⁻¹⁴, since the mean difference is in the same direction as the alternative hypothesis.

Conclusion: There is overwhelming evidence (p-value = 2.03 × 10⁻¹⁴) that the exam marks after the omega 3 trial are higher than the marks before the trial.
Is there any difference in the performance
across the 3 schools?
There are three groups whose average marks we want to compare. The use of an ANOVA should immediately come to mind for this purpose.
H0: There is no difference in the mean marks between the three schools.
H1: At least one school has a different mean mark when compared to the
rest of the schools.
Conclusion: There is no significant evidence (p-value = 0.063) of a difference in marks between the three schools (or at best, marginal evidence of a difference between schools 1 and 2).
Is there any difference in omega 3
consumption between males and females?
Feeling confident, we decided we can jump straight to performing the definitive
analysis without exploring the data.
So the appropriate test here is a 2-independent sample t-test.
Is this correct?
With histograms like these, there really isn’t a need to
perform the Shapiro-Wilk tests!
Parametric tests
It’s clear that any test that assumes the data to be symmetrically distributed will not be applicable for comparing omega 3 consumption, since this outcome is extremely right-skewed.
There is thus a need for statistical tests that do not explicitly require such assumptions on the distribution of the variable – these are known as non-parametric tests.
So far, the tests we have seen are considered parametric tests – they require assumptions on the distribution of the data to be satisfied before they can be correctly used.
REMEMBER! The computer does not recognise when a test is
inappropriate, you have to know!
Relationship between p-values and
confidence intervals
Let’s have a quick recap of all the valid analyses. Look at the p-values and the confidence intervals. Notice a trend?

- Significant at the 0.05 threshold ↔ 0 is not in the 95% CI
- Not significant at the 0.05 threshold ↔ 0 is in the 95% CI

Actually, even at the numerical EDA stage, the 95% confidence intervals about the means are already very informative for indicating whether there will be any differences between the two means in a formal comparison.
Relationship between p-values and
confidence intervals
Thus, if the test value under the null hypothesis falls within the 95% CI of the estimated quantity → the p-value from the hypothesis test will be > 0.05.
If the test value under the null hypothesis does not fall within the 95% CI → the p-value from the hypothesis test will be < 0.05.

Conversely,
If the p-value from the hypothesis test is < 0.05 → the 95% CI will not contain the test value under H0.
If the p-value from the hypothesis test is > 0.05 → the 95% CI is certain to contain the test value under H0.
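A quick numerical illustration of this duality (a Python sketch with simulated data; the agreement is exact for the t-test because the CI and the test are built from the same t distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(71, 8, 100)
n, mean = len(x), x.mean()
se = x.std(ddof=1) / np.sqrt(n)

# 95% confidence interval for the mean
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se

# Two-sided test of H0: mean = 70
_, p_value = stats.ttest_1samp(x, popmean=70)

# p < 0.05 exactly when 70 lies outside the 95% CI
print(p_value < 0.05, not (ci_low <= 70 <= ci_high))  # these agree
```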
Students should be able to
• understand the concept of the null and alternative hypotheses
• know what a test statistic is, and understand that it is always
calculated assuming the null hypothesis is true
• understand what is meant by power, sensitivity, specificity and
type I and type II errors
• understand and interpret a p-value from a hypothesis test
• know which statistical tests should be used
• know the assumptions for these statistical tests
• understand the relationship between a p-value and CI
• perform the appropriate analyses in SPSS and RExcel