OCW Epidemiology and Biostatistics, 2010
Steven A. Cohen, DrPH, MPH
Tufts University School of Medicine
November 2, 2010
HYPOTHESIS TESTING, THE ROLE OF CHANCE, COMMON STATISTICAL TESTS
Learning objectives for this session:
1) Be able to perform the key steps to hypothesis testing
2) Understand the concept and use of the alpha value
3) Understand in general terms how the p value is derived
4) Correctly interpret the p value
5) Distinguish between type I and type II error
6) Understand the relationships among sample size, power, type I and type II error
7) Have a general notion of the concept of power as it relates to study conclusions
8) Know how a two-sample t-test is applied
9) Know how to interpret the results of a two-sample t-test
10) Know when it is appropriate to use a chi-square test
11) Know how to interpret the results of a chi-square test
12) Define the following terms: hypothesis testing, null hypothesis, alternate hypothesis, type I
error, type II error, p-value, power, chi square test, t-test
Outside preparation:
Pagano & Gauvreau
Chapter 10, pages 232-246, 249-258
Chapter 11, pages 265-284
Chapter 15, pages 342-349
Note
The following two lectures are designed to orient students toward interpretation of the medical
literature, so that they have a more complete and structured understanding of how statistical analyses are used in
biomedical research. The way in which this material is presented is specifically designed for the
medical student who has minimal background in statistics, and as such, emphasizes interpretation,
rather than the theory behind the statistics. Those who have had formal coursework in statistics before
this course are encouraged to consult the instructors of the course if they wish to learn more of the
mathematical and conceptual details of the information presented in these lectures.
HYPOTHESIS TESTING
A researcher wishes to know if a new vaccine will reduce the likelihood of getting HIV in a sample of
adolescents. In statistics, we can frame this research question into two hypotheses and conduct what is
called hypothesis testing.
Null hypothesis, H0: The proportion of people who contract HIV will be the same for adolescents who
have received the new vaccine and those who have not received the vaccine.
Alternate hypothesis, HA: The proportion of people who contract HIV will be different for
adolescents who have received the new vaccine and those who have not received the vaccine.
Note that the alternate hypothesis states that the proportions in each group who contract HIV will be
different. This allows for the possibility that the new vaccine could reduce or possibly increase the
likelihood of contracting HIV. Under this scenario, the alternate hypothesis is called two-tailed.
One-tailed alternate hypotheses are rarely used in the medical literature; in this example, a one-tailed hypothesis would suggest that those
who were vaccinated would have a lower risk of contracting HIV than those who were not. It should be noted that the FDA
does not permit one-tailed hypotheses in most clinical trials.
In statistics, we assume the null hypothesis is true, unless there is substantial and convincing
evidence to reject the null hypothesis. Hypothesis testing basically seeks to determine whether there
is enough evidence to reject the null hypothesis in favor of the alternate, but the default is always the
null hypothesis, i.e., no effect or no difference.
Next, we move on to how we write and symbolize the null and alternate hypotheses in statistics. The
null hypothesis is typically denoted as H0, pronounced “H-naught,” whereas the alternate hypothesis is
denoted HA or Ha. Let Pvaccine = the proportion of those who were vaccinated who contract HIV and
Pnovaccine = the proportion of those who were not vaccinated who contract HIV. If we want to test the
above hypotheses pertaining to the proportion of vaccinated adolescents who contract HIV versus the
proportion of unvaccinated adolescents who contract HIV, we denote them as follows:
Scenario 1
H0: Pvaccine = Pnovaccine
HA: Pvaccine ≠ Pnovaccine
Scenario 2
H0: Pvaccine - Pnovaccine = 0
HA: Pvaccine - Pnovaccine ≠ 0
Scenario 3
H0: Pvaccine / Pnovaccine = 1
HA: Pvaccine / Pnovaccine ≠ 1
Each of these hypothesis formulation scenarios is equivalent to the others. If we wish to test a
difference between mean fasting blood glucose levels comparing those on the drug Reducagluce versus
those on placebo, we denote the null and alternate hypotheses as follows: Let μ represent the
population mean blood glucose level; therefore μReducagluce = the population mean blood glucose level
of those given Reducagluce and μplacebo = the population mean blood glucose level of those on placebo.
Scenario 1
H0: μReducagluce = μplacebo
HA: μReducagluce ≠ μplacebo
Scenario 2
H0: μReducagluce - μplacebo = 0
HA: μReducagluce - μplacebo ≠ 0
The hypothesis is framed in terms of the population parameter, not the sample statistic, given
that we wish to draw inferences about the larger population, not just the people in the study. For
instance, the researcher examining the role of a new HIV vaccine in adolescents finds that, of the
adolescents who are vaccinated, 6% contract HIV, compared to 9% of the adolescents who are not
vaccinated. We know there is a difference between the sample groups (in practice, there
will almost always be some difference). The goal of research is not to make inferences about a sample,
but to make inferences about the population. In this way we use the sample statistics, with appropriate
uncertainty, to make inferences about population differences and/or associations.
Research studies do not always state the hypothesis being tested explicitly. It is good practice in
reading articles to first identify the central hypothesis or hypotheses being tested in order to understand
the article more easily.
THE ROLE OF CHANCE
A major role of statistics in biomedical studies is to quantify the role of chance in producing the
associations or differences between groups observed. In the example above, if we find that adolescents
who are administered the vaccine have a lower risk of contracting HIV (6% vs. 9%, a relative risk of
about 0.67) compared to adolescents who were not administered the vaccine, is that difference between groups due to a real
effect of the vaccine or is it due to chance? In other words, under what circumstances do we reject the
null hypothesis? The following sections discussing Type I and II error and the p-value will help clarify
how hypothesis testing is conducted in biomedical research.
Alpha Value and Type I Error
The alpha value (α) represents a threshold beyond which the null hypothesis would be rejected in
hypothesis testing. The alpha value must be established by the investigator before the study begins.
In theory, the alpha value could be anything, but in practice it is often set to 0.05 or 0.10. If we subtract
alpha from 1, we get the confidence levels used in confidence intervals. For example, an alpha value
of 0.05 corresponds to a 1 – 0.05, or 0.95, or 95% confidence interval. Similarly, an alpha of 0.10
corresponds to a 90% confidence interval.
The alpha value is also called the type I error rate. Type I error occurs when the null hypothesis is
rejected, when in reality, the null hypothesis should not have been rejected. In other words, we
observed a difference in the sample when, in the population, there is no difference.
In the figure below, we see an example of determining the rejection region of a hypothesis test as it pertains to determining
differences between means under the null hypothesis μReducagluce - μ placebo = 0 for an alpha of 0.05. The total area under the
normal curve is equal to 1. Since we make the assumption that the data from which the means are obtained are normally
distributed, we can use certain facts about the Gaussian (or normal) distribution. First, we know that 95% of the observations of a
Gaussian distribution fall within ±1.96 standard deviations or z-score units of the mean, or in this case, the difference in
means. Next, we distribute alpha of 0.05 between the two extremes of the distribution, under the notion that the mean
glucose level for those on the drug could be higher or lower than those on placebo. If, when we run the test (a t-test,
discussed below), we find that the test statistic falls between -1.96 and 1.96, we do not reject the null hypothesis. If,
however, the test statistic is above 1.96 or below -1.96, this falls in the rejection region and we reject the null hypothesis. A
similar procedure is used for population proportions, odds ratios, and other population parameters.
[Figure: standard normal curve of z-scores; the central 95% of observations lies between -1.96 and +1.96, with a rejection region of α/2 (2.5% of observations) in each tail.]
Figure 1: Illustration of α using an example of the sampling distribution for a difference between means
We reject the null hypothesis when the p-value is smaller than the pre-established alpha value.
Therefore, if the alpha value is 0.05 and the p-value obtained is 0.31, we do not reject the null
hypothesis. If the alpha value is 0.05 and the p-value is 0.001, we reject the null hypothesis in favor of
the alternate hypothesis; these results suggest that the observed difference is not due to chance but
instead reflects a real difference.
The confidence interval and the p-value are related to each other. For the same alpha, when the
confidence interval excludes the null value, this means that the corresponding p-value for the
hypothesis test will be less than alpha, and vice versa.
If, for example, you wish to determine whether the difference in means between the drug and placebo groups is
significantly different from 0, there are two ways to do this: construct a confidence interval around the mean
difference and determine whether or not it includes the null value of 0, or conduct a hypothesis test (for example, a t-test, below) and
determine if the test statistic has a corresponding p-value below the alpha value. Below is a schematic diagram of this
concept.
Figure 2: Schematic diagram of the relationship between confidence intervals and p-values.
When the p-value is below the established alpha level, the corresponding confidence interval does not contain
the null value of the parameter estimate
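To make this duality concrete, here is a minimal Python sketch, assuming scipy and numpy are available; the glucose values are simulated, hypothetical data. The 95% confidence interval for the difference in means excludes 0 exactly when the pooled two-sample t-test p-value falls below 0.05.

    # Sketch: duality between a 95% CI for a difference in means and a
    # two-sided t-test at alpha = 0.05. The data below are simulated.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    drug = rng.normal(125, 15, size=50)      # hypothetical glucose values, drug group
    placebo = rng.normal(133, 15, size=50)   # hypothetical glucose values, placebo group

    n1, n2 = len(drug), len(placebo)
    diff = drug.mean() - placebo.mean()
    sp2 = ((n1 - 1) * drug.var(ddof=1) + (n2 - 1) * placebo.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))    # pooled standard error of the difference
    df = n1 + n2 - 2

    tcrit = stats.t.ppf(0.975, df)
    ci = (diff - tcrit * se, diff + tcrit * se)
    t_stat, p = stats.ttest_ind(drug, placebo)  # pooled two-sample t-test

    print(ci)   # when this interval excludes 0 ...
    print(p)    # ... this p-value is below 0.05, and vice versa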
Beta, Statistical Power, and Type II Error
Another important concept in statistics related to alpha (type I error) is beta (β, the type II error
rate). Type II error occurs when the null hypothesis is not rejected, when in fact it should have
been. This type of error occurs when a study is underpowered, meaning that the sample size is not
sufficiently large to yield a standard error small enough to detect a real difference or
association.
Power, defined as 1 – β, or 1 minus the type II error rate, is the probability that a statistical test will
result in the rejection of the null hypothesis, when in fact, the null hypothesis should be rejected.
A power analysis is often used in advance of a study to determine the minimum sample size required to
detect an expected effect size with a particular level of confidence. In a high-powered
study, it is easier to detect smaller differences between groups than in
lower powered studies. Power increases with increasing sample size and increasing expected effect
size for a fixed alpha level. As you learned in the previous lecture, studies with small sample sizes
may be underpowered, which may result in Type II error, as well as wide confidence intervals.
Alpha and beta are essentially independent of each other, and can take on any value that the researcher wants. However,
the alpha level is typically set at 5% or 0.05, while beta is often set at 10, 15, or 20%, depending upon the study and sample
size available.
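To illustrate how alpha, power, expected effect size, and sample size interact, the following Python sketch applies the standard normal-approximation formula for the required sample size per group when comparing two means; the sigma and delta values are hypothetical assumptions, and scipy is assumed available.

    # Sketch: sample size per group to compare two means, via the normal
    # approximation n = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2
    from scipy.stats import norm

    alpha, beta = 0.05, 0.20        # type I error 5%; type II error 20% (power = 80%)
    sigma = 15.0                    # assumed common standard deviation (hypothetical)
    delta = 8.0                     # smallest mean difference worth detecting (hypothetical)

    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
    z_beta = norm.ppf(1 - beta)         # ~0.84

    n_per_group = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    print(round(n_per_group))           # ~55 per group under these assumptions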
The two types of error involved in hypothesis testing are shown below.
                                    Reality or truth about the population parameter
Decision based on sample            H0 true                     HA true
and hypothesis testing
  Reject H0                         Type I error (α)            Correct decision
  Do not reject H0                  Correct decision            Type II error (β)
Figure 3: Relationship between two types of error in hypothesis testing, the decision to reject or not reject
the null hypothesis, and the truth about the unknown population parameter of study
There is a common analogy used to help explain the concepts of type I and type II error: the fire and fire alarm. In this
example, the state of “no fire” is analogous to the reality that the null hypothesis is true, and the state of “fire” is when, in
reality, there is a real difference between groups.
Pulling the fire alarm is analogous to rejecting the null hypothesis, and not pulling the fire alarm is analogous to not
rejecting the null hypothesis. If there is no fire and we do not pull the fire alarm, we are taking the correct action. This
would be analogous to not rejecting the null hypothesis when, in fact, the null hypothesis is true. Pulling the fire alarm
when there is a fire is also a correct action, and this is analogous to rejecting the null hypothesis when it should be rejected.
Pulling the fire alarm when there is no fire is analogous to rejecting the null hypothesis when the null hypothesis is true, a
type I error. Not pulling the fire alarm when there actually is a fire is akin to not rejecting the null hypothesis when the
alternate hypothesis is true, a type II error.
COMMON STATISTICAL TESTS
Putting it All Together
Thus far, we have discussed confidence intervals (previous class), Type I and II error, the p-value, and
power in the context of hypothesis testing in biomedical studies. A logical question that follows would
be how hypothesis testing is actually carried out in biomedical research.
Basically, there are two methods for hypothesis testing, each yielding identical results. The first is the
method described here, which is through statistical testing. Using statistical tests, such as those that
are described below (e.g. chi square tests, t-tests), the first step is to establish the Type I error rate, or α.
Next, we conduct the test, usually involving some sort of calculation resulting in a test statistic. Using
tables, computation, or computer programs, we find the p-value associated with that test statistic. We
then compare the p-value to the pre-established α. If the p-value is less than α, we reject the null
hypothesis and infer that there is a statistically significant difference or association between
groups, depending upon the test. If the p-value is greater than α, we do not reject the null hypothesis
and infer that the difference or association observed between groups is likely due to chance.
The second method, which produces identical results in terms of whether or not the null hypothesis is
rejected, is through the use of confidence intervals. This method involves first establishing a “null
value” for the confidence intervals. For differences between group means, this null value is usually 0;
for relative risks and odds ratios, the null value is usually 1. We also establish the α value and
confidence limits accordingly. In other words, when α is set to 0.05, this corresponds to 95%, or 1 – α,
confidence limits. Next, we calculate a point estimate, typically a relative risk, odds ratio, mean, or
any of a number of different statistics. Confidence intervals are then constructed around the point
estimate. If the confidence interval contains the pre-established null value, we do not reject the null
hypothesis, and infer that the association or difference observed is likely due to chance. If the
confidence interval excludes the null value, we reject the null hypothesis and infer that the difference
or association is likely not due to chance, but instead due to a real difference or association between
groups. A schematic diagram comparing statistical testing to the method of confidence intervals is
shown in Figure 4.
Figure 4: Schematic diagram illustrating how statistical testing and confidence interval methods are used
to test hypotheses in biomedical research
The “Table 1”
Most articles you will encounter in half-decent biomedical journals will contain a table that lists
descriptive statistics (e.g. means, standard deviations, frequencies, etc...) by treatment or control
group(s). The goal of presenting such tables is generally to compare groups with respect to their
demographic characteristics. As discussed previously in the course, randomization will almost always
ensure that the groups are similar to each other with respect to such characteristics, but in non-randomized and in some randomized studies, there remain important differences between, for example,
the treatment and control groups, that, if not properly accounted for, may bias the results of the study.
An example of what such a table might look like in a biomedical journal article is displayed in Table 1.
This table is from a cohort study assessing the use of anti-retroviral therapy (ART) in HIV-positive
mothers in reducing transmission of HIV to their infants. In this table, you will note that there are four
columns: one listing the demographic characteristic, one listing the averages or frequencies for the
treatment group, the next listing the averages or frequencies for the control group, and another listing
the p-value for the comparison of the two groups.
Table 1: Baseline characteristics of a sample of mother-infant pairs from HIV-positive mothers
Characteristic                        Treatment Group    Control Group    p-value
Mean age (years)                      26.2               25.9             0.21
Mean birthweight of infant (kg)       3.1                3.1              0.76
Married (%)                           93.5               94.1             0.04
Male sex of infant (%)                51.1               50.8             0.19
Commonly, two statistical tests are used to test the hypotheses that the two groups are similar with
respect to a variety of demographic and related characteristics: the chi-square test and the two-sample
independent t-test, which in this text will be referred to as the “t-test”.
The Chi-Square Test
A chi-square (χ2) test is used in hypothesis testing where the outcome variable and the exposure
variable are both binary (or dichotomous). The chi-square test can also be used when the outcome and
exposure variables are multinomial, meaning that they are categorical variables with more than two
possible values (e.g. an exposure variable of never smoker, current smoker, former smoker). For the
explanation and examples in this text, we will focus exclusively on binary exposure and outcome
variables, however.
The chi square test is used to calculate p-values associated with relative risks and odds ratios in
biomedical studies. It can also be used to determine the p-value in a comparison of percentages or
proportions among two or more groups, which is essentially an adaptation of relative risk. Although
the distributions that underlie the chi-square test differ considerably in appearance and critical values
from the t-distribution family used in the t-test, the procedure for calculating the chi-square statistic is
analogous to the calculation procedure for a t-test.
The first step is to start with the null hypothesis that the risk (or odds or proportions) is the same in
both exposure groups. Then, construct a 2 x 2 contingency table showing how many exposed did or
did not get the outcome and how many unexposed did or did not get the outcome of interest. Based on
these data, we then calculate an expected value of the numbers in each cell if the null hypothesis were
true (i.e. no difference between groups). Next, we compare the expected numbers in each cell of the
contingency table to the actual observed value. We then calculate the chi square statistic based on the
differences between the observed and expected cell counts. If the differences are small, the chi-square
statistic will be small, whereas if the differences are larger, the chi-square statistic will be large.
Small chi-square statistics suggest that the differences observed are likely due to chance, but large
chi-square statistics suggest that the observed result is not likely due to chance, but instead due to a real
association.
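The observed-versus-expected calculation can be sketched in a few lines of Python; the 2 x 2 counts below are hypothetical, and scipy and numpy are assumed available.

    # Sketch: chi-square statistic for a hypothetical 2x2 table, by hand and via scipy
    import numpy as np
    from scipy.stats import chi2, chi2_contingency

    observed = np.array([[30, 70],    # exposed:   outcome yes / outcome no (hypothetical)
                         [45, 55]])   # unexposed: outcome yes / outcome no

    # Expected cell counts under H0: (row total x column total) / grand total
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()

    chi_sq = ((observed - expected) ** 2 / expected).sum()
    df = (2 - 1) * (2 - 1)            # (exposure levels - 1) x (outcome levels - 1) = 1
    p = chi2.sf(chi_sq, df)
    print(chi_sq, p)                  # 4.8 here, which exceeds the critical value ...
    print(chi2.ppf(0.95, df))         # ... of ~3.84 for alpha = 0.05 and 1 d.f.

    stat, pval, dof, exp = chi2_contingency(observed, correction=False)
    print(stat, pval)                 # same statistic from scipy directly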
In order to determine whether or not to reject the null hypothesis, the critical value of the chi-square
test must be known. The critical value for the chi-square test depends upon the number of degrees of
freedom, which is derived from the numbers of exposure and outcome levels. The number of degrees
of freedom is just the product of the number of possible exposure levels minus one and the number of
possible outcome levels minus one [d.f. = (# exposure levels – 1) x (# outcome levels – 1)]. For
example, the critical value for a chi-square test with binary exposure and outcome variables is
approximately 3.84. Since chi-square values cannot be negative, the entire rejection region of the chi-square distribution is to the right of the critical value, even for a two-sided hypothesis, as depicted in
the figure below.
[Figure: chi-square distribution (probability vs. chi-square value); the critical value for 1 d.f. is 3.84, with small χ2 values to its left and the rejection region for H0 (large χ2 values) to its right.]
Figure 5: Schematic diagram of the chi-square test for one degree of freedom
Chi square tests were used to obtain the p-values in Table 1 above for the percent “married” and
percent “male sex of infant” characteristics. For male sex of infant, the p-value was 0.19; this means
that we do not reject the null hypothesis that the proportion of male infants is the same in the control and
treatment groups. On the other hand, the p-value associated with the chi square test for percent
married was 0.04. If we assume that the α value (type I error rate) was 0.05, we would reject the null
hypothesis that the proportion of married women is the same in the treatment and control groups.
The chi-square distribution is actually a family of distributions. As the number of degrees of freedom increases (i.e. from
increasing the number of levels of the exposure and/or outcome variables), the distributions shift to the right,
meaning that the critical values increase. This is depicted schematically in the figure below. The figure shows the
distributions of three chi-square distributions, for 2, 4, and 6 degrees of freedom. Note how the critical value for a
theoretical value of alpha increases, or shifts to the right, as the number of degrees of freedom increases.
[Figure: chi-square distributions for df = 2, df = 4, and df = 6, plotted against chi-square value.]
Figure 6: Diagram of chi-square distributions with three different degrees of freedom and denotations of
critical values for a theorized value of alpha
For small sample sizes, generally less than 30, a Fisher’s Exact test is used in place of the chi-square test for 2x2 table
comparisons. The Fisher’s Exact test is non-parametric and does not require large sample sizes, as the chi-square test does.
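A brief Python sketch of the same idea, again with hypothetical counts:

    # Sketch: Fisher's exact test for a small, hypothetical 2x2 table
    from scipy.stats import fisher_exact

    table = [[3, 7],    # exposed:   outcome yes / outcome no (hypothetical small counts)
             [9, 1]]    # unexposed: outcome yes / outcome no
    odds_ratio, p = fisher_exact(table)
    print(odds_ratio, p)   # exact p-value; no large-sample assumption required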
The Two-Sample t-Test
The two-sample t-test is used to compare the means of two groups. This implies that the outcome has
to be a continuous variable (so that we can derive the means) and the predictor has to be a 2-level
categorical variable (e.g. treatment vs. placebo; drug A vs. drug B; male vs. female). The
conventional minimal sample size for this test is 15 subjects per group, totaling 30.
For instance, in a randomized clinical trial, we wish to measure the effect of the drug Reducagluce by
treating one group of adults with the drug, and the other with placebo. After 9 months of treatment, the
mean fasting blood glucose level in the treatment group was 125 mg/dl, while the mean fasting blood
glucose level in the placebo group was 133 mg/dl. The mean difference is 8 mg/dl. However, is this
difference statistically significant? That’s when we should use a t-test.
The null hypothesis, H0, is Mean glucoseReducagluce = Mean glucoseplacebo. Alternately, we can also
write Mean glucoseReducagluce – Mean glucoseplacebo = 0, which is closer to the concept of the t-test.
The t-test standardizes the difference between groups into a t-value, which will be contrasted with one
of many theoretical t-distributions chosen based on the sample size. If the observed t-value is more
extreme than the threshold associated with an alpha of 0.05, we reject the null hypothesis. Otherwise,
we do not reject the null hypothesis; note that failing to reject the null does not prove that it is true.
Below is a slightly simplified version of the formula for a t-test. There are several variations of this
formula, most of which have to do with whether or not the standard deviations are assumed equal in the two
sample groups.
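t = (sample mean difference – expected difference) / SE(mean difference)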
The “expected difference” is usually 0. The standard error is a function of the sample standard
deviations divided by a function of the sample size for each group. The numerator tells you how far the
calculated mean difference is from the hypothesized value under the null hypothesis. The denominator
contains the standard error, which basically standardizes the value obtained in the numerator and puts
that value in standard error units, from which you can obtain the t-value and corresponding p-value.
A graphical depiction of the procedure for the t-test is shown below. If we were to calculate the t-value for the 8 mg/dl difference in blood glucose levels and found that the t-score is close to the
hypothesized value of 0, we would conclude that chance alone is a better explanation for the observed
difference between the two groups and would not reject the null hypothesis. If, however, the t-value is
relatively far from the hypothesized value of 0, we would conclude that chance alone or random error
does not explain the difference between the groups and reject the null hypothesis in favor of the
alternate hypothesis.
Several factors will increase the t-score for a given null hypothesis:
1. Increasing the sample mean difference
2. Decreasing the standard error, which generally involves (a.) a reduction in standard deviation
or (b.) an increase in sample size
In order to definitively determine whether the observed difference is more likely due to chance or due to
a real difference, a table of probabilities would need to be consulted. These are easily found in most
statistics textbooks and on the internet.
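The lookup can also be done in code; a short sketch using scipy, with a hypothetical t-value and degrees of freedom:

    # Sketch: converting a t-value into a two-sided p-value without a printed table
    from scipy.stats import t

    t_value, df = 2.3, 48             # hypothetical test statistic and degrees of freedom
    p = 2 * t.sf(abs(t_value), df)    # total area in both tails beyond +/- t_value
    print(p)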
[Figure: t-distribution centered at the null value μHo = 0 (t-value in SE units), with rejection regions of α/2 in each tail; a t-value near 0 is too close to the null value to reject H0, while a t-value far enough from the null value falls in a tail and we reject H0.]
Figure 7: Schematic diagram of t-test procedure
There are several variations of the t-test you might encounter when reading the medical literature or
conducting research. The following table summarizes the differences.
Two-sample t-test
- Outcome: Continuous
- Exposure: Categorical groups
- Number of groups: 2
- Groups are: Not related. A subject in group 1 is not associated with any subject in group 2.
- Important statistics: t-statistics
- Results: Allows us to tell if the mean difference between the two groups is significant.
- Non-parametric analogue: Wilcoxon Rank Sum test (also called Mann-Whitney U test)

Paired-sample t-test
- Outcome: Continuous
- Exposure: Categorical groups
- Number of groups: 2
- Groups are: Related. Subjects are either paired with themselves (pre-post study, left/right eyes, etc.) or subjects are related in some nature (twins, couples, etc.).
- Important statistics: t-statistics
- Results: Allows us to tell if the mean difference between the two groups is significant.
- Non-parametric analogue: Wilcoxon Signed Rank test

ANOVA (Analysis of variance)
- Outcome: Continuous
- Exposure: Categorical groups
- Number of groups: More than 2
- Groups are: Not related. A subject in group 1 is not associated with any subject in groups 2, 3, 4, … k.
- Important statistics: F-statistics
- Results: Allows us to tell if any one of the means is different from the others.
- Non-parametric analogue: Kruskal-Wallis test
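For reference, the following Python sketch shows how each test in the table, and its non-parametric analogue, is typically invoked with scipy; the data are small hypothetical samples.

    # Sketch: the tests from the table above, on hypothetical data
    from scipy import stats

    a = [5.1, 4.9, 6.0, 5.5, 5.8]   # hypothetical group 1
    b = [4.2, 4.8, 5.0, 4.4, 4.9]   # hypothetical group 2 (same length, for the paired test)
    c = [3.9, 4.1, 4.6, 4.0, 4.3]   # hypothetical group 3 (for ANOVA)

    print(stats.ttest_ind(a, b))     # two-sample t-test: unrelated groups
    print(stats.ttest_rel(a, b))     # paired-sample t-test: related pairs
    print(stats.f_oneway(a, b, c))   # one-way ANOVA: more than 2 groups (F-statistic)

    print(stats.mannwhitneyu(a, b))  # Wilcoxon Rank Sum / Mann-Whitney U
    print(stats.wilcoxon(a, b))      # Wilcoxon Signed Rank (paired differences)
    print(stats.kruskal(a, b, c))    # Kruskal-Wallis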
NOTE: The information on the following three pages is optional and intended to provide a small amount of theoretical
background and conceptual examples for the t-test. You are not required to know the material in this section.
On the t-distribution
Another feature of the t-test is the use of the t-distribution. Let’s consider the formula for a 95% confidence interval: sample mean ± 1.96 × SE.
The number “±1.96” is the range of the z-score (number of standard deviations away from zero) corresponding to covering the
central 95% of the area of a normal bell-shaped curve, leaving 2.5% of the area at each of the two ends. The “68-95-99.7” rule is based on the
normal distribution, also known as the Gaussian distribution. The Gaussian distribution is characterized only by its mean and
standard deviation; it does not consider sample size. Researchers soon found that sample means do not
necessarily form a perfect bell curve when the sample size is very small! To describe this process, a new distribution that
changes with sample size was introduced; it is known as the “t-distribution”. The following figure shows the difference
between the normal curve (black solid line) and a t-distribution with a sample size of 10 (red dotted line). The t-distribution
curve is flatter and it extends farther away from the mean, 0. Note that there are many t-distributions (one for each sample
size!), but when the sample size reaches 30 or so, the t-distribution curve will become very close to the normal distribution
curve.
The t-distribution compensates for the problem of small sample size by increasing the multiplier above 1.96. The
following figure describes the details. When using the Gaussian distribution, the 95% CI is always associated with a z-score of
1.96 (indicated by the horizontal dotted line), while if we use the t-distribution, the t-value for calculating the 95% CI starts
much higher, at around 2.1, and as the sample size increases, the critical value decreases and eventually becomes very
close to 1.96.
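This convergence is easy to verify in code; a short scipy sketch:

    # Sketch: the 95% CI multiplier from the t-distribution approaches 1.96 as n grows
    from scipy.stats import t

    for n in [5, 10, 30, 100, 1000]:
        print(n, round(t.ppf(0.975, df=n - 1), 3))
    # n = 10 gives ~2.262; by n = 1000 the multiplier is ~1.962, essentially the z of 1.96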
On the assumptions for the t-test
In order to conduct a t-test, there is the assumption that the underlying distribution must be approximately normal or
Gaussian. The Central Limit Theorem helps in this regard for large sample sizes, and allows the researcher to use
parametric methods, such as the t-test, even when the underlying distribution is somewhat non-normal, because the sample
means follow a normal distribution. However, in small sample sizes where the underlying distribution is non-normal, the
Central Limit Theorem does not apply easily. Therefore, there is a family of statistical tests used in hypothesis testing
known as non-parametric tests. These tests are essentially analogous to parametric tests, such as t-tests, paired t-tests, and
ANOVA, but before conducting the tests, the observed values are converted to ranks and the tests are based on these ranks,
not the actual observed values. The non-parametric analogues for the independent samples t-test, the paired t-test, and
ANOVA are the Wilcoxon Rank Sum test (also called the Mann-Whitney test), the Wilcoxon Signed Rank test, and the
Kruskal-Wallis test, respectively.
T-tests: Additional Examples and Explanation
The syllabus contains an explanation of the assumptions and uses of the two-sample t-test in biomedical studies. This
section will provide some supplementary examples of t-tests to help you understand the concept of a t-test and the methods
by which we use the t-test for hypothesis testing. These examples will walk you through conducting t-tests for two
biomedical studies, while focusing on the key concepts involved in hypothesis testing, and omitting the complex
mathematical details.
Please note that you are not responsible for knowing the detailed formula for the two-sample t-test presented below.
Recall that the steps for the t-test are as follows:
1. Establish α, the type I error rate. Typically α is 0.05, but can be any value, such as 0.01 or 0.10.
2. Conduct the t-test to obtain a t-score. The formula for a two-sample t-test is:
   t = (sample mean difference – ΔμHo) / SE(mean difference),
   where ΔμHo is the hypothesized population mean difference under the null hypothesis, H0, often 0, and SE(mean
   difference) is the standard error of the difference in means, which generally involves pooling the standard
   deviations and sample sizes. Knowledge of the details of those formulas for the pooled standard deviation and
   sample size is not required for this course.
3. Obtain degrees of freedom, which is just the total sample size, n, minus 2.
4. For the t-score and corresponding degrees of freedom, obtain a p-value either from a table or computer program.
5. If p < α, reject the null hypothesis, and if p > α, do not reject the null hypothesis.
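As a bridge to the worked example that follows, here is a sketch of these five steps in Python, using the Part A numbers below (the common SD of 16 is the simplified pooled value used in the example); scipy is assumed available.

    # Sketch: the five t-test steps in code, using the Part A numbers below
    from math import sqrt
    from scipy.stats import t

    alpha = 0.05                                   # step 1: establish alpha
    mean_trt, mean_pla = 207.0, 205.0              # baseline TC means (mg/dL)
    sd, n1, n2 = 16.0, 225, 240                    # simplified common SD; group sizes

    se = sd * sqrt(1 / n1 + 1 / n2)                # standard error of the mean difference
    t_score = ((mean_trt - mean_pla) - 0) / se     # step 2: hypothesized difference is 0
    df = n1 + n2 - 2                               # step 3: degrees of freedom = 463
    p = 2 * t.sf(abs(t_score), df)                 # step 4: two-sided p-value

    print(round(t_score, 2), round(p, 2))          # ~1.35 and ~0.18
    print("reject H0" if p < alpha else "do not reject H0")   # step 5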
Example: A randomized controlled trial of a new statin, Cholesteroff, was conducted to determine if the new drug reduces
total cholesterol in young adults. After randomization, each patient was assigned either a daily regimen of Cholesteroff or
placebo. We can assume that all bias was removed in the study, and that everybody followed their regimen perfectly.
Part A: Each person had their total cholesterol (TC) taken at baseline. The placebo group, n = 225, had a starting TC of 205
with a standard deviation of 14 mg/dL. The treatment group, n = 240, had a starting TC of 207, with a standard deviation
of 17 mg/dL. Were the groups comparable (the same) at baseline in terms of total cholesterol?
In this case, the null hypothesis, H0, is that the two groups have the same mean TC, μplacebo = μtrt.
The alternate hypothesis, HA, is that there is a difference in TC between the two groups, μplacebo ≠ μtrt
To test these hypotheses, we conduct a two-sample t-test using the steps outlined above.
1. First, we establish α. Let’s use 0.05.
2. Second, we calculate the t-score.
   t = (sample mean difference – ΔμHo) / SE(mean difference) = (sample mean difference – ΔμHo) / (SD × √(1/n1 + 1/n2))
     = ((207 – 205) – 0) / (16 × √(1/225 + 1/240)) = 1.35
3. Calculate degrees of freedom, d.f. = 225 + 240 – 2 = 463.
4. Obtain the corresponding p-value for a t-score of 1.35 and 463 degrees of freedom, which is 0.18.
5. Since p > α, we do not reject the null hypothesis. There is insufficient evidence to conclude that the difference we
   observe in mean TC between the placebo and treatment groups at baseline was not due to chance alone. In other
   words, we cannot rule out chance as the cause of the difference between the two groups.
Part B: The subjects are followed over a period of 4 months to determine whether or not mean TC was lower in the
Cholesteroff group compared to the placebo group.
We again establish null and alternate hypotheses.
H0: There is no difference in mean cholesterol at 4 months between those on Cholesteroff and those on placebo.
μplacebo = μtrt
HA: Mean total cholesterol at 4 months will be different comparing those on Cholesteroff and those on placebo.
μplacebo ≠ μtrt
Note: we are comparing those on Cholesteroff at 4 months and those on placebo at 4 months.
The results show that at 4 months after baseline, those on placebo had a mean TC of 203 with a standard deviation of 16
mg/dL. Those on Cholesteroff had a mean TC of 194, with a standard deviation of 20 mg/dL at 4 months.
We again conduct a two-sample t-test.
1. First, we again establish α. Let’s use 0.05.
2. Second, we calculate the t-score.
   t = (sample mean difference – ΔμHo) / SE(mean difference) = (sample mean difference – ΔμHo) / (SD × √(1/n1 + 1/n2))
     = ((194 – 203) – 0) / (18 × √(1/225 + 1/240)) = -5.39
3. Degrees of freedom = 463, same as above.
4. We then obtain the corresponding p-value for a t-score of -5.39 with 463 degrees of freedom, which is less than 0.0001.
5. Since p < α, we reject the null hypothesis. There is convincing evidence to suggest that the difference we
   observed in TC comparing the placebo to the treatment groups is not likely due to chance alone.
Let’s say that another team of researchers conducted the same study and got the same results, but, since they were poor
researchers, they were only able to recruit a sample size of 16 in each of the study arms.
We can conduct the test in Part B above using the new information.
1. First, we establish α. Let’s use 0.05 again.
2. Second, calculate the t-score.
   t = (sample mean difference – ΔμHo) / SE(mean difference) = ((194 – 203) – 0) / (18 × √(1/16 + 1/16)) = -1.41
3. Degrees of freedom = 30, which is 16 + 16 – 2.
4. We then obtain the corresponding p-value for a t-score of -1.41 with 30 degrees of freedom, which is 0.17.
5. Since p > α, we do not reject the null hypothesis. There is insufficient evidence to rule out chance. This may be
   because we do not have enough power to achieve statistical significance.
Dr. Kenneth Chui contributed to the writing of this set of course notes.