Introduction to Biostatistics for Clinical
Researchers
University of Kansas
Department of Biostatistics
&
University of Kansas Medical Center
Department of Internal Medicine
Schedule
Friday, December 10 in 1023 Orr-Major
Friday, December 17 in B018 School of Nursing
Possibility of a 5th lecture, TBD
All lectures will be held from 8:30a - 10:30a
Materials
 PowerPoint files can be downloaded from the Department of
Biostatistics website at http://biostatistics.kumc.edu
 A link to the recorded lectures will be posted in the same location
An Introduction to Hypothesis Testing: The Paired t-Test
Topics
 Comparing two groups: the paired-data situation
 Hypothesis testing: the Null and Alternative hypotheses
 Relationships between confidence intervals and hypothesis testing
when comparing means
 P-values: definitions, calculations, and more
The Paired t-test: the Confidence Interval Component
Two Group Designs
 For a continuous endpoint: Are the population means different?
 Subjects could be randomized to one of two treatments
(randomized parallel-group design)
 Compare the mean responses from each treatment
 Also referred to as independent groups
 Subjects could each be given both treatments with the ordering of
treatments randomized (paired design)
 Compare the mean difference to zero (or some other
interesting value)
 Pre-post data
 Matched case-control
Example: Pre- versus Post- Data
 Why pair the observations?
 Decrease variability in response
 Each subject acts as his or her own control (reducing the required sample size)
 Good way to get preliminary data/estimates to develop further
research
Example: Pre- versus Post- Data
 Ten non-pregnant, pre-menopausal women 16-49 years old who
were beginning a regimen of oral contraceptive (OC) use had their
blood pressures measured prior to starting OC use and three months after consistent OC use
 The goal of this small study was to see what, if any, changes in
average blood pressure were associated with OC use in such
women
 The table below shows the resulting pre- and post-OC use systolic BP measurements for the 10 women in the study
Blood Pressure and OC Use
Subject  | Before OC | After OC | Δ = After - Before
1        | 115       | 128      | 13
2        | 112       | 115      | 3
3        | 107       | 106      | -1
4        | 119       | 128      | 9
5        | 115       | 122      | 7
6        | 138       | 145      | 7
7        | 126       | 132      | 6
8        | 105       | 109      | 4
9        | 104       | 102      | -2
10       | 115       | 117      | 2
Average  | 115.6     | 120.4    | 4.8
Blood Pressure and OC Use
 The sample average of the differences is x̄_diff = 4.8
 Also note: x̄_diff = x̄_after - x̄_before
 The sample standard deviation of the differences is s_diff = 4.6
 The standard deviation of the differences follows the formula

s_diff = √[ Σ (x_diff,i - x̄_diff)² / (n - 1) ]   (sum over i = 1, ..., n)

where
 each x_diff,i represents an individual difference and
 x̄_diff is the mean difference
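As a quick illustration (not part of the original slides), these two summary statistics can be verified with a few lines of Python, assuming NumPy is available:

```python
import numpy as np

# Paired differences (After OC - Before OC) from the table above
diff = np.array([13, 3, -1, 9, 7, 7, 6, 4, -2, 2])

mean_diff = diff.mean()        # sample mean of the differences
sd_diff = diff.std(ddof=1)     # sample SD with n - 1 in the denominator

print(mean_diff)               # 4.8
print(sd_diff)                 # about 4.57, reported as 4.6 on the slide
```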
Note on Paired Data Designs
 The BP information is essentially reduced from two samples (prior to and after OC use) into one piece of information: the difference in BP between the two samples
 Response is “within-subject”
 This is standard protocol for comparing paired samples with a
continuous outcome measure
The Confidence Interval Approach
 Suppose we want to draw a conclusion about a population
parameter:
 In a population of women who use OC, is the average change in
blood pressure (after - before) zero?
 The CI approach allows us to create a range of plausible values for
the average change (μΔ) in blood pressure using data from a single,
imperfect, paired sample
The Confidence Interval Approach
 A 95% CI for μΔ in BP in the population of women taking OC is

x̄_diff ± t_{0.95,9} × SE(x̄_diff)
= x̄_diff ± t_{0.95,9} × (s_diff / √10)
= 4.8 ± 2.26 × (4.6 / √10)
= (1.5, 8.1) mmHg
Note
 The number 0 is NOT in the confidence interval (1.5-8.1)
 This suggests there is a non-zero change in BP over time
 The phrase “statistically significant” change is used to indicate
a non-zero mean change
Note
 The BP change could be due to factors other than OC
 Change in weather over pre- and post- period
 Changes in personal stress
 A control group of comparable women who were not taking OC
would strengthen this study
 This is an example of a pilot study: a small study done just to generate some evidence of a possible association
 This can be followed up with a larger, more scientifically
rigorous study
The Paired t-test: the Hypothesis Testing Component
The Hypothesis Testing Approach
 Suppose we want to draw a conclusion about a population
parameter:
 In a population of women who use OC, is the average change in
blood pressure (after - before) zero?
 The hypothesis testing approach allows us to choose between two
competing possibilities for the average change (μΔ) in blood
pressure using data from a single, imperfect, paired sample
Hypothesis Testing
 Two mutually exclusive, collectively exhaustive possibilities for
“truth” about mean change, μΔ
 Null hypothesis: HO: μΔ = 0 (what we wish to ‘nullify’)
 Alternative hypothesis: HA : μΔ ≠ 0 (what we wish to show
evidence in favor of)
 We use our data as ‘evidence’ in favor of or against the null
hypothesis (and alternative hypothesis, as a result)
Hypothesis Testing
 Null: Typically represents the hypothesis that there is no
association or difference
 HO: μΔ = 0 → There is no association between OC use and blood pressure
 Alternative: The very general complement to the null
 HA: μΔ ≠ 0 → There is an association between OC use and blood pressure
Hypothesis Testing
 Our result will allow us to either reject or fail to reject HO
 We start by assuming HO is true, and ask:
 How likely, given HO is true, is the result we got from our
sample?
 In other words, what are the chances of obtaining the sample
data we actually observed (“evidence”) if the truth is that
there is no association between blood pressure and OC use?
Hypothesis Testing
 HO, in combination with other information about our population
and the size of our sample, sets up (via the CLT) a theoretical
probability distribution of sample means computed from all
possible samples of size n = 10 where µ∆ = 0
[Figure: population distribution of BP change, centered at µ∆ = 0 with standard deviation σ∆]
Hypothesis Testing
 HO, in combination with other information about our population
and the size of our sample, sets up (via the CLT) a theoretical
probability distribution of sample means computed from all
possible samples of size n = 10 where µ∆ = 0
[Figure: sampling distribution of the sample mean difference, centered at µ∆ = 0 with spread SE(x̄_diff)]
Hypothesis Testing
 Theoretically, if the null hypothesis were true we would be more likely to observe values of the sample mean “close” to zero:

[Figure: sampling distribution centered at µ∆ = 0 with spread SE(x̄_diff); sample means near zero are most likely]
Hypothesis Testing
 Theoretically, if the null hypothesis were true it would be unlikely that we should observe values of the sample mean “far” from zero:

[Figure: sampling distribution centered at µ∆ = 0 with spread SE(x̄_diff); sample means far from zero fall in the tails]
Hypothesis Testing
 We observed a sample mean of x̄_diff = 4.8 mmHg—is it far enough from zero for us to conclude in favor of HA?

[Figure: sampling distribution centered at µ∆ = 0 with spread SE(x̄_diff); sample means near zero favor HO, sample means in either tail favor HA]
Hypothesis Testing
 We need some measure of how probable the result from our
sample is given the null hypothesis
 The sampling distribution of the sample mean allows us to evaluate
how unusual our sample statistic is by computing a probability
corresponding to the observed results—the p-value
 If p is small, the observed result would be unlikely to arise by chance from the hypothesized distribution
 In other words, either
1. The null hypothesis is actually true and, just by chance,
we got a sample that gave us an unlikely result; or
2. The null hypothesis is actually false, and we got a sample
with evidence of such
Hypothesis Testing
1. The null hypothesis is actually false, and we got a sample with
evidence of such
If we are using a random sample, we can be assured that
this is the case (95% confident, in fact)
Hypothesis Testing
 To compute a p-value, we need to find our value of x̄_diff on the sampling distribution and figure out how “unusual” it is
 Recall: x̄_diff = 4.8 mmHg

[Figure: sampling distribution centered at µ∆ = 0 with spread SE(x̄_diff)]
Hypothesis Testing
 Problem: What is σ∆?

SE(x̄_diff) = σ∆ / √n

[Figure: sampling distribution of the sample mean, centered at µ∆ = 0]
Hypothesis Testing
 Solution: the Student's t distribution (estimate σ∆ with the sample standard deviation s)

SE(x̄_diff) = s / √n

[Figure: sampling distribution of the sample mean, centered at µ∆ = 0]
Hypothesis Testing
 Where is x̄_diff = 4.8 mmHg located on the sampling distribution (t9)?

[Figure: t9 sampling distribution centered at µ∆ = 0 with SE(x̄_diff) = 1.45; the observed value x̄_diff = 4.8 lies in the right tail]
Hypothesis Testing
 The p-value is the probability of getting a sample result as (or more) extreme than what we observed, given the null hypothesis is true:

[Figure: sampling distribution centered at µ∆ = 0 (SE = 1.45), with the areas beyond -4.8 and +4.8 marked]
Hypothesis Testing
 The p-value is the area under the curve corresponding to values of the sample mean more extreme than 4.8
 P(|x̄_diff| ≥ 4.8)

[Figure: sampling distribution centered at µ∆ = 0 (SE = 1.45), with both tails beyond ±4.8 shaded]
Hypothesis Testing
 Strictly for convenience, we standardize our distribution
 We center it at zero by subtracting the mean
 We rescale so the standard deviation is 1 by dividing every observation by SE(x̄_diff)

t = (x̄_diff - μO) / (s / √n) ≈ 3.3

[Figure: standardized t distribution centered at 0, with -t and +t marked]
Hypothesis Testing
 The p-value is the area under the curve corresponding to values of
the sample mean more extreme than 4.8
 P(|x̄_diff| ≥ 4.8) = P(|t| ≥ 3.3), which is easily found in any t table

[Figure: standardized t distribution centered at 0, with the tails beyond -3.3 and +3.3 shaded]
Hypothesis Testing
 Note: this t is called a test statistic (and is analogous to a z-score)
 It represents the distance of the observation from the
hypothesized mean in standard errors
 In this case, our mean (4.8) is 3.3 SE away from the
hypothesized mean (0)
 Based on this, what do you think the p-value will look like? Is a
result 3.3 SE above its mean unusual?
Hypothesis Testing
 t = 3.3 on the sampling distribution (t9)
The p-value
 The p-value is the probability of getting a sample result as (or more) extreme than what we observed, given the null hypothesis is true
The p-value
 We can look this up in a t-table
 . . . or we can let Excel or another statistical package do it for us
 =TDIST(3.3,9,2)
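The same two-sided p-value can be reproduced outside of Excel; here is a small illustrative sketch in Python (SciPy), assuming either the t statistic or the raw paired data is at hand:

```python
from scipy import stats

# From the test statistic: the same quantity as =TDIST(3.3,9,2)
p_two_sided = 2 * stats.t.sf(3.3, df=9)
print(round(p_two_sided, 4))              # about 0.009

# Or directly from the raw paired data shown earlier
before = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
after  = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]
print(stats.ttest_rel(after, before))     # t about 3.3, p about 0.009
```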
Interpreting the p-value
 P = 0.0092: if the true before OC/after OC blood pressure
difference is zero among all women taking OCs, then the chance of
seeing a mean difference as extreme or more extreme than 4.8 in
a sample of 10 women is 0.0092
 We now need to use the p-value to make a decision—either reject
or fail to reject HO
 We need to decide if our sample result is unlikely enough to have
occurred by chance if the null was true
Using the p-value to Make a Decision
 Establishing a cutoff
 In general, to decide which p-values constitute “unusual” results, there needs to be a cutoff such that all p-values less than the cutoff result in rejection of the null
 The standard (but arbitrary) cutoff is p = 0.05
 This cutoff is referred to as the significance level of the test
and is usually represented by α
 For example, α = 0.05
Using the p-value to Make a Decision
 Establishing a cutoff
 Frequently, the result of a hypothesis test with p < 0.05 is
called statistically significant
 At the α = 0.05 level, we have a statistically significant blood
pressure difference in the BP/OC example
Example: BP/OC
 Statistical method
 The changes in blood pressures after oral contraceptive use
were calculated for 10 women
 A paired t-test was used to determine if there was a
statistically significant change in blood pressure, and a 95%
confidence interval was calculated for the mean blood pressure
change (after-before)
 Result
 Blood pressure measurements increased on average 4.8 mmHg
with standard deviation 4.6 mmHg
 The 95% confidence interval for the mean change was 1.5-8.1
mmHg
 The blood pressure measurements after OC use were
statistically significantly higher than before OC use
(p = 0.009)
Example: BP/OC
 Discussion
 A limitation of this study is that there was no comparison group
of women who did not use oral contraceptives
 We do not know if blood pressures may have risen without oral
contraceptive usage
Example: Clinical Agreement
 Two different physicians assessed the number of palpable lymph
nodes in 65 randomly selected male sexual contacts of men with
AIDS or AIDS-related conditions1
             | Doctor 1 | Doctor 2 | Difference
x̄            | 7.91     | 5.16     | -2.75
s            | 4.35     | 3.93     | 2.83

1 Example based on data taken from Rosner, B. (2005). Fundamentals of Biostatistics (6th ed.), Duxbury Press
95% Confidence Interval
 A 95% CI for the difference in mean number of lymph nodes (Doctor 2 compared to Doctor 1):

x̄_diff ± 1.99 × SE(x̄_diff)
= -2.75 ± 1.99 × (2.83 / √65)
= (-3.45, -2.05)
Getting a p-value
 Hypotheses:
 HO: µdiff = 0
 HA : µdiff ≠ 0
1. Assume the null is true
2. Compute the distance in SEs between x diff and the hypothesized
value (zero)
3. The sample result is 7.8 SEs below 0—is this unusual?
t = (x̄_diff - μO) / (s / √n) = -2.75 / (2.83 / √65) ≈ -7.8
Getting a p-value
 Sample result is 7.8 SEs below 0—is this unusual?
 See where this result falls on the sampling distribution (t64)
 The p-value corresponds to P(|t|> 7.8)—without looking it up we
know p < 0.001
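A sketch of this calculation from the summary statistics alone (Python/SciPy, illustrative only):

```python
import numpy as np
from scipy import stats

n, mean_diff, sd_diff = 65, -2.75, 2.83     # summary statistics from the table above
se = sd_diff / np.sqrt(n)
t_stat = mean_diff / se                     # about -7.8
p = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # far smaller than 0.001
print(round(t_stat, 1), p)
```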
Example: Oat Bran and LDL Cholesterol
 Cereal and cholesterol: 14 males with high cholesterol given oat
bran cereal as part of diet for two weeks, and corn flakes cereal as
part of diet for two weeks1
mmol/L       | Corn Flakes | Oat Bran | Difference
x̄            | 4.44        | 4.08     | 0.36
s            | 1.0         | 1.1      | 0.40

1 Example based on data taken from Pagano, M. (2000). Principles of Biostatistics (2nd ed.), Duxbury Press
95% Confidence Interval
 A 95% confidence interval for the difference in mean LDL (corn flakes versus oat bran):

x̄_diff ± t_{0.95,13} × SE(x̄_diff)
= 0.36 ± 2.16 × (0.40 / √14)
= (0.13, 0.6)
Getting a p-value
 Hypotheses:
 HO: µdiff = 0
 HA : µdiff ≠ 0
1. Assume the null is true
2. Compute the distance in SEs between x diff and the hypothesized
value (zero)
3. The sample result is 3.3 SEs above 0—is this unusual?
t = (x̄_diff - μO) / (s / √n) = 0.36 / (0.40 / √14) ≈ 3.3
Getting a p-value
 Sample result is 3.3 SEs above 0—is this unusual?
 See where this result falls on the sampling distribution (t13)
 The p-value corresponds to P(|t|> 3.3)
 Using a table or software package, we can find p = 0.005
Note on Direction of Comparison
 Whether we choose to examine the difference as (oat - corn) or (corn - oat) makes no difference to our results—only to the appropriate interpretation of estimates (including confidence intervals)
 The sign of the mean difference will change
 The limits of the CI will reverse and their signs will change
Summary
 Designate hypotheses
 The alternative is usually what we are interested in supporting
 The null is usually what we wish to nullify (“no association/no
change”)
 Collect data
 Compute difference in outcome for each paired set of observations
 Compute x̄_diff, the sample mean of the paired differences
 Compute s, the sample standard deviation of the differences
 Compute the 95% (or other level) CI for the true mean difference

x̄_diff ± t_{1-α,n-1} × SE(x̄_diff)
Summary
 To get the p-value
 Assume HO is true (sets up the sampling distribution)
 Measure the distance of the sample result from µO
t = (x̄_diff - μO) / (s / √n)
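Putting the whole recipe together, here is a hedged sketch of a small helper in Python (the function name and structure are my own, not from the lecture) applied to the blood pressure example:

```python
import numpy as np
from scipy import stats

def paired_t_summary(x_before, x_after, conf=0.95):
    # Paired differences (after - before), their mean, and their standard error
    diff = np.asarray(x_after, float) - np.asarray(x_before, float)
    n = diff.size
    mean_d = diff.mean()
    se = diff.std(ddof=1) / np.sqrt(n)
    t_stat = mean_d / se                              # distance from 0 in SEs
    p = 2 * stats.t.sf(abs(t_stat), df=n - 1)         # two-sided p-value
    mult = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)  # CI multiplier
    ci = (mean_d - mult * se, mean_d + mult * se)
    return mean_d, ci, t_stat, p

# Blood pressure / OC example from earlier in the lecture
before = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
after  = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]
print(paired_t_summary(before, after))  # mean 4.8, CI near (1.5, 8.1), p near 0.009
```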
Summary
 Compare the test statistic (distance) to the appropriate
distribution to get the p-value
Summary
 Paired t-test scenarios
 Blood pressure/OC use example
 Degree of clinical agreement (each pt received two
assessments)
 Diet example (each man received two different diets in random
order)
 Twin study
 Matched case-control
 Suppose we wish to compare levels of a certain
biomarker in pts with versus without a disease
More about P-values
P-values
 P-values are probabilities
 Have to be between 0 and 1
 Small p-values mean that the sample results are unlikely when the
null is true
 The p-value is the probability of obtaining a result as extreme or
more extreme than what was actually observed by chance alone,
assuming the null is true
P-values
 The p-value is not the probability that the null hypothesis is true
 It alone imparts no information about the scientific/substantive importance of a study's results
 From the previous example, the researchers found a statistically
significant (p = 0.005) difference in average LDL cholesterol levels
in men who had been on a diet including corn flakes versus the
same men on a diet including oat bran cereal
 Which diet showed lower average LDL levels?
 How much was the difference? Does it mean anything
nutritionally?
P-values
 If the p-value is small, then either
1. HO is true and a rare event occurred by chance; or
2. HO is false
Type I Error
 Claim HA is true when in fact HO is true
 The probability of making a Type I error is called the
significance level of a test (α)
Note on p and α
 If p < α the result is called statistically significant
 This cutoff is the significance (or alpha) level of the test
 It is the probability of falsely rejecting HO
 The idea is to keep the chances of making an error when HO is true
low and only reject if the sample evidence is highly against HO
Note on p and α
Decision           | Truth: HO        | Truth: HA
Reject HO          | Type I Error (α) | Power (1 - β)
Fail to Reject HO  | Correct (1 - α)  | Type II Error (β)
One- or Two-sided?
 A two-sided p-value corresponds to results as or more extreme
than what was observed in either direction
 We know a test is two-sided by observing the alternative
hypothesis, HA
 HA containing ≠ indicates we are interested in either an
increase or decrease in (greater or smaller) value than the
hypothesized value
 A one-sided p-value corresponds to results as or more extreme
than what was observed in a single direction of interest
 The direction of interest is stated explicitly in the alternative
hypothesis, HA
 HA containing > indicates we are interested in an increase or
greater value than the hypothesized value
 HA containing < indicates we are interested in a
decrease or smaller value than the hypothesized value
Null and Alternative Sampling Distributions
[Figure: three panels of null and alternative sampling distributions. Top panel: both distributions overlaid, showing α, 1 - α, β, and 1 - β around the critical value zα. Middle panel (assume H0 is true): values below zα lead to “fail to reject H0, conclude no difference” with probability 1 - α; values above zα lead to “reject H0, conclude difference” with probability α. Bottom panel (assume H1 is true): the probability of failing to reject H0 is β, and the probability of rejecting H0 (the power) is 1 - β.]
One- or Two-sided?
 In some cases, a one-sided alternative may not make scientific
sense
 In the absence of pre-existing information for the evaluation of
the relationship between BP and OC, wouldn’t either result be
interesting and useful (i.e., negative or positive association)?
 In other cases, a one-sided alternative makes scientific sense
 We may not care whether a new treatment is worse than the old one, only whether it is statistically significantly better
 However, for reasons already shown (and because of the sanctity
of “.05”), one-sided p-values are viewed with suspicion
Connection: Hypothesis Testing and CIs
 The confidence interval gives plausible values for the population
parameter
 “Data, take me to the truth”
 Hypothesis testing postulates two choices for the population
parameter
 “Here are two mutually exclusive possibilities for the truth—
data help me choose one”
95% Confidence Interval
 If zero is not in the 95% confidence interval, then we would reject
HO: µ = 0 at the 5% level of significance (α = 0.05)
 Why?
 With a confidence interval, we start at the sample mean and go
approximately two standard errors in either direction
95% Confidence Interval
 If zero is not in the 95% CI, then x̄_diff must be more than about 2 SEs away from zero (either above or below)
 Hence, the distance (t) will be either > 2 or < -2, and the resulting p-value will be < 0.05
95% Confidence Interval and p-value
 In the BP/OC example, the 95% CI tells us that p < 0.05, but it
doesn’t tell us that it is p = 0.009
 The confidence interval and p-value are complementary
 However, you can’t get the exact p-value from just looking at a
confidence interval, and you can’t get a sense of the
scientific/substantive significance of your study results by looking
at a p-value
More on the p-value
 Statistical significance does not imply or prove causation
 Example: in the BP/OC case, there could be other factors at play
that could explain the change in blood pressure
 A significant p-value is only ruling out random sampling (chance) as
the explanation
 We would need a randomized comparison group to better establish
causality
 A self-selected comparison group would be okay, but not ideal
More on the p-value
 Statistical significance is not the same as scientific significance
 Hypothetical example: blood pressure and oral contraceptives
 Suppose: n = 100,000; x̄_diff = 0.03 mmHg; s = 4.6 mmHg
 P = 0.04
 A big n can sometimes produce a small p-value even when there is no scientifically meaningful relationship
 The magnitude of the effect is small (not scientifically interesting)
 A difference this small may be little more than noise
 It is very important to always report a confidence interval
 95% CI: 0.002-0.058 mmHg
More on the p-value
 Lack of statistical significance is not the same as lack of
scientific significance
 Must evaluate results in the context of the study and sample
size
 Small n can sometimes produce a non-significant result even
though the magnitude of the association at the population level is
real and important—your study just may not be big enough to
detect it
 Underpowered, small studies make a failure to reject hard to interpret
 Sometimes small studies are designed without power in mind just
to generate preliminary data
Comparing Means among Two (or More) Independent Populations
Topics
 CIs for mean difference between two independent populations
 Two-sample t-test
 Non-parametric alternative
 Comparing means of more than two independent populations
Comparing Two Independent Groups
 “A Low Carbohydrate as Compared with a Low Fat Diet in Severe
Obesity”1
 132 severely obese subjects randomized to one of two diet
groups
 Subjects followed for six months
 At the end of the study period:
“Subjects on the low-carbohydrate diet lost more weight than
those on a low-fat diet (95% CI for the difference in weight loss
between groups, -1.6 to -6.2 kg; p < 0.01)”
1 Samaha, F., et al. A low-carbohydrate as compared with a low-fat diet in severe obesity, NEJM 348:21.
Comparing Two Independent Groups
 Is weight change associated with diet type?
Diet Group                                 | Low-Carb | Low-Fat
Number of subjects (n)                     | 64       | 68
Mean weight change (kg), post - pre        | -5.7     | -1.8
Standard deviation of weight changes (kg)  | 8.6      | 3.9
Diet Type and Weight Change
 95% CIs for weight change by diet group:
Carb: -5.7 ± 1.96 × (8.6 / √64) = (-7.807 kg, -3.593 kg)
Fat:  -1.8 ± 1.96 × (3.9 / √68) = (-2.728 kg, -0.873 kg)
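These two large-sample intervals are easy to check in code; a minimal sketch (Python/NumPy, with the helper name my own):

```python
import numpy as np

def ci_large_sample(mean, sd, n, z=1.96):
    # Large-sample CI for one group mean: mean +/- z * sd / sqrt(n)
    se = sd / np.sqrt(n)
    return mean - z * se, mean + z * se

print(ci_large_sample(-5.7, 8.6, 64))   # about (-7.81, -3.59) kg, low-carb group
print(ci_large_sample(-1.8, 3.9, 68))   # about (-2.73, -0.87) kg, low-fat group
```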
Comparing Two Independent Groups
 In statistical terms, is there a non-zero difference in the average
weight change for the subjects on the low-fat diet as compared to
subjects on the low-carbohydrate diet?
 95% CIs for each diet group mean weight change do not
overlap, but how do you quantify the difference?
 The comparison of interest is not “paired”: there are different subjects in each diet group
 For each subject, a change in weight (post-pre) was computed
 However, the authors compared the changes in weight between
two independent groups
Comparing Two Independent Groups
 How do we calculate
 CI for the difference?
 P-value to determine if the difference in two groups is
significant?
 Since we have large samples (both greater than 60) we know the
sampling distributions of the sample means in both groups are
approximately normal
 It turns out that the difference of two quantities that are each approximately normally distributed is also approximately normally distributed
Sampling Distribution of Difference in Sample Means
 The sampling distribution of the difference of two sample means,
each based on large samples, approximates a normal distribution
 This sampling distribution is centered at the true mean difference,
μ1 - μ2
Simulated Sampling Distribution
 The simulated sampling distribution of sample mean weight change
for the low-carbohydrate diet group is shown:
Simulated Sampling Distribution
 The simulated sampling distribution of sample mean weight change
for the low-fat diet group is shown:
Simulated Sampling Distribution
 The simulated sampling distribution of the difference in sample
means for the two groups is shown:
Simulated Sampling Distribution
 Side-by-side boxplots
95% CI for the Difference in Means
 Our most general formula is:

best estimate from sample ± multiplier × SE(best estimate from sample)

 The best estimate of the difference in population means, based on sample means, is x̄1 - x̄2
 Here, x̄1 may represent the sample mean weight loss for the 64 subjects on the low-carb diet, and x̄2 the mean weight loss for the 68 subjects on the low-fat diet
95% CI for the Difference in Means
 So x̄1 - x̄2 = -5.7 - (-1.8) = -3.9, and the 95% CI for μ1 - μ2 is

-3.9 ± 1.96 × SE(x̄1 - x̄2)

where SE(x̄1 - x̄2) is the standard deviation of the sampling distribution (i.e., the standard error of the difference in two sample means)
Two Independent Groups
 The standard error of the difference for two independent samples
is calculated differently than that for the paired design
 With the paired design, we reduced data on two samples to one
set of differences
 Statisticians have developed formulas for the standard error of the difference; they depend on the sample sizes and standard deviations in both groups
 Aside: SE(x̄1 - x̄2) is greater than either SE(x̄1) or SE(x̄2); any ideas why?
Principle
 Variation from independent sources can be added
SE(x̄1 - x̄2) = √(σ1²/n1 + σ2²/n2)

 We don't know σ1 or σ2, so we estimate them using s1 and s2 to get an estimated standard error:

SE(x̄1 - x̄2) = √(s1²/n1 + s2²/n2)
Comparing Two Independent Groups
 Recall from the weight change/diet type study:
Diet Group                                 | Low-Carb | Low-Fat
Number of subjects (n)                     | 64       | 68
Mean weight change (kg), post - pre        | -5.7     | -1.8
Standard deviation of weight changes (kg)  | 8.6      | 3.9

SE(x̄1 - x̄2) = √(s1²/n1 + s2²/n2) = √(8.6²/64 + 3.9²/68) ≈ 1.17
95% CI for Difference in Means
 In this example, the approximate 95% confidence interval for the
true mean difference in weight between the low-carb and low-fat
diet groups is:
-3.9 ± 1.96 × 1.17 = (-6.2 kg, -1.6 kg)
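A sketch of the standard-error and interval arithmetic in Python (illustrative, using only the summary statistics above):

```python
import numpy as np

s1, n1 = 8.6, 64                            # low-carb group: SD of weight change, n
s2, n2 = 3.9, 68                            # low-fat group
diff_means = -5.7 - (-1.8)                  # -3.9 kg

se_diff = np.sqrt(s1**2 / n1 + s2**2 / n2)  # about 1.17
print(round(diff_means - 1.96 * se_diff, 1),
      round(diff_means + 1.96 * se_diff, 1))   # about -6.2 and -1.6 kg
```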
From Article
 “Subjects on the low-carbohydrate diet lost more weight than those on a low-fat diet (95% CI: -1.6 to -6.2 kg; p < 0.01)”
 Those on the low-carb diet lost more on average by 3.9 kg; after accounting for sampling variability, this excess average loss over the low-fat diet group could be as small as 1.6 kg or as large as 6.2 kg
 This CI does not include zero, suggesting a real population level
association between type of diet and weight loss
Two-sample t-test: Getting a p-value
Hypothesis Test to Compare Two Independent Groups
 Two-sample t-test
 Is the (mean) weight change equal in the two diet groups?
 HO: μ1 = μ2
 HA: μ1 ≠ μ2
 In other words, is the expected difference in weight change zero?
 HO: μ1 - μ2 = 0
 HA: μ1 - μ2 ≠ 0
Hypothesis Test to Compare Two Independent Groups
 Recall, the general “recipe” for hypothesis testing is:
1. Assume HO is true
2. Measure the distance of the sample result from the
hypothesized result, μO (in most cases it’s 0)
3. Compare the test statistic (distance) to the appropriate
distribution to get the p-value
t = (observed difference - null difference) / SE(observed difference)

t = (x̄1 - x̄2 - μO) / SE(x̄1 - x̄2) = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
Diet Type and Weight Loss Study
 Recall: x̄1 - x̄2 = -3.9 and SE(x̄1 - x̄2) = 1.17
 For this study: t = -3.9 / 1.17 ≈ -3.33
 This study result was 3.33 standard errors below the hypothesized mean of 0; is this result unusual?
How are p-values calculated?
 The p-value is the probability of getting a result as extreme or
more extreme than what you observed if the null hypothesis were
true
 It comes from the sampling distribution of the difference in two
sample means
 What does this sampling distribution look like?
 If both groups are large, it is approximately normal
 It is centered at the true difference
 Under the null, the true difference is 0
Diet/Weight Loss
 To compute the p-value, we would need to compute the
probability of being 3.3 or more SEs away from 0
Diet/Weight Loss
 In Excel, use the “TTEST” function to test whether two
independent samples are significantly different
 If you’ve calculated the test statistic, t (-3.33, in this example),
you can use the “TDIST” function to compute the p-value
 For the diet example, p = 0.0013
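Outside of Excel, the same test can be run from summary statistics; a sketch in Python using SciPy's ttest_ind_from_stats (this is an alternative to the spreadsheet functions mentioned above, not what the lecture itself used):

```python
from scipy import stats

# Two-sample t-test from summary statistics (diet example)
result = stats.ttest_ind_from_stats(mean1=-5.7, std1=8.6, nobs1=64,
                                    mean2=-1.8, std2=3.9, nobs2=68,
                                    equal_var=False)   # Welch (unequal-variance) version
print(result)   # |t| about 3.3, two-sided p on the order of 0.001
```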
Summary: Weight Loss Example
 Statistical Methods
 “We randomly assigned 132 severely obese patients . . . to a carbohydrate-restricted (low-carbohydrate) diet or a calorie- and fat-restricted diet”
 “For comparison of continuous variables between the two
groups, we calculated the change from baseline to six months
in each subject and compared the mean changes in the two
diet groups using an unpaired (two-sample) t-test”
 Result
 “Subjects on the low-carbohydrate diet lost more weight than
those on a low-fat diet (95% CI: -1.6 to -6.2 kg; p < 0.01)”
Sampling Distribution Detail
 What exactly is the sampling distribution of the difference in
sample means?
 A Student's t distribution is used with n1 + n2 - 2 degrees of freedom (total sample size minus two)
Two-Sample t-test
 In a randomized design, 23 patients with hyperlipidemia were
randomized to either treatment A or treatment B for 12 weeks
 12 to A
 11 to B
 LDL cholesterol levels (mmol/L) measured on each subject at
baseline and 12 weeks
 The 12-week change in LDL cholesterol was computed for each
subject
Treatment Group                    | A     | B
N                                  | 12    | 11
Mean LDL change                    | -1.41 | -0.32
Standard deviation of LDL changes  | 0.55  | 0.65
Two-Sample t-test
 Is there a difference in LDL change between the two treatment
groups?
 Methods of inference
 CI for the difference in mean LDL cholesterol change between
the two groups
 Statistical hypothesis test
95% CI for Difference in Means
Treatment Group                    | A     | B
N                                  | 12    | 11
Mean LDL change                    | -1.41 | -0.32
Standard deviation of LDL changes  | 0.55  | 0.65

(x̄1 - x̄2) ± t_{1-α,n1+n2-2} × SE(x̄1 - x̄2)

= -1.41 - (-0.32) ± t_{1-α,n1+n2-2} × √(0.55²/12 + 0.65²/11)

= -1.09 ± t_{1-α,n1+n2-2} × (0.25)
95% CI for Difference in Means
 How many standard errors to add and subtract (i.e., what is the
correct multiplier)?
 The number we need comes from a t with 12 + 11 - 2 = 21 degrees
of freedom
 From a t table or Excel, this value is 2.08
 The 95% CI for true mean difference in change in LDL cholesterol,
drug A to drug B is:
-1.09 ± 2.08 × (0.25) = (-1.61, -0.57)
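A hedged check of the multiplier and interval (Python/SciPy, illustrative only):

```python
import numpy as np
from scipy import stats

diff_means = -1.41 - (-0.32)                     # -1.09 mmol/L
se_diff = np.sqrt(0.55**2 / 12 + 0.65**2 / 11)   # about 0.25
mult = stats.t.ppf(0.975, df=12 + 11 - 2)        # about 2.08
print(round(diff_means - mult * se_diff, 2),
      round(diff_means + mult * se_diff, 2))     # about -1.61 and -0.57
```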
Hypothesis Test to Compare Two Independent Groups
 Two-sample (unpaired) t-test: getting a p-value
 Is the change in LDL cholesterol the same in the two treatment
groups?
 HO: μ1 = μ2 ⇔ HO: μ1 - μ2 = 0
 HA: μ1 ≠ μ2 ⇔ HA: μ1 - μ2 ≠ 0
Hypothesis Test to Compare Two Independent Groups
 Recall the general “recipe” for hypothesis testing:
1. Assume HO is true
2. Measure the distance of the sample result from the
hypothesized result (here, it’s 0)
3. Compare the test statistic (distance) to the appropriate
distribution to get the p-value
t = (observed difference - null difference) / SE(observed difference)

t = (x̄1 - x̄2 - μO) / SE(x̄1 - x̄2) = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
Hyperlipidemia Example
 In the hyperlipidemia treatment study, recall: x̄1 - x̄2 = -1.09 and SE(x̄1 - x̄2) = 0.25
 In this study: t = -1.09 / 0.25 ≈ -4.4
 This study result was 4.4 standard errors below the null mean of 0
How are p-values Calculated?
 Is a result 4.4 standard errors below 0 unusual?
 It depends on what kind of distribution we are dealing with
 The p-value is the probability of getting a result as extreme or
more extreme than what was observed (-4.4) by chance, if the
null hypothesis were true
 The p-value comes from the sampling distribution of the difference
in two sample means
 What is the sampling distribution of the difference in sample means?
 A t distribution with 12 + 11 - 2 = 21 degrees of freedom
Hyperlipidemia Example
 To compute a p-value, we need to compute the probability of
being 4.4 or more SE away from 0 on the t with 21 degrees of
freedom
P = 0.0003
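The p-value can be checked either from the test statistic or from the summary statistics directly; a sketch (Python/SciPy, not part of the original slides):

```python
from scipy import stats

# From the test statistic and its degrees of freedom
print(2 * stats.t.sf(4.4, df=21))          # roughly 0.0003

# Or from the summary statistics (pooled, equal-variance t-test)
print(stats.ttest_ind_from_stats(-1.41, 0.55, 12, -0.32, 0.65, 11, equal_var=True))
```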
Summary: Hyperlipidemia Example
 Statistical Methods
 Twenty-three patients with hyperlipidemia were randomly
assigned to one of two treatment groups: A or B
 12 patients were assigned to receive A
 11 patients were assigned to receive B
 Baseline LDL cholesterol measurements were taken on each
subject and LDL was again measured after 12 weeks of
treatment
 The change in LDL cholesterol was computed for each subject
 The mean LDL changes in the two treatment groups were
compared using an unpaired t-test and a 95% confidence
interval was constructed for the difference in mean LDL
changes
Summary: Hyperlipidemia Example
 Result
 Patients on A showed a decrease in LDL cholesterol of 1.41
mmol/L and subjects on treatment B showed a decrease of
0.32 mmol/L (a difference of 1.09 mmol/L, 95% CI: 0.57 to
1.61 mmol/L)
 The difference in LDL changes was statistically significant (p <
0.001)
FYI: Equal Variances Assumption
 The “traditional” t-test assumes equal variances in the two groups
 This can be formally tested using another hypothesis test
 But why not just compare observed values of s1 to s2?
 There is a slight modification to allow for unequal variances; this modification adjusts the degrees of freedom for the test and uses a slightly different SE computation
 If you want to be truly ‘safe’, it is more conservative to use the
test that allows for unequal variances
 Makes little to no difference in large samples
FYI: Equal Variances Assumption
 If underlying population level standard deviations are equal, both
approaches give valid confidence intervals, but intervals assuming
unequal standard deviations are slightly wider (p-values slightly
larger)
 If underlying population level standard deviations are unequal, the
approach assuming equal variances does not give valid confidence
intervals and can severely under-cover the goal of 95%
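To see the practical difference, both versions can be run on the same summary statistics; a sketch using the LDL example (Python/SciPy, illustrative only):

```python
from scipy import stats

args = (-1.41, 0.55, 12, -0.32, 0.65, 11)   # mean, SD, n for groups A and B

pooled = stats.ttest_ind_from_stats(*args, equal_var=True)    # classical equal-variance test
welch  = stats.ttest_ind_from_stats(*args, equal_var=False)   # unequal-variance (Welch) test
print(pooled)
print(welch)   # typically a slightly larger p-value when the SDs differ
```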
Non-Parametric Analogue to the Two-Sample t
Alternative to the Two Sample t-test
 “Non-parametric” refers to a class of tests that do not assume a particular distributional form (such as normality) for the data
 Nonparametric tests for comparing two groups
 Mann-Whitney Rank-Sum test (Wilcoxon Rank Sum Test)
 Also called Wilcoxon-Mann-Whitney Test
 Attempts to answer: “Are the two populations distributions
different?”
 Advantages: does not assume populations being compared are
normally distributed, uses only ranks, and is not sensitive to
outliers
Alternative to the Two Sample t-test
 Disadvantages:
 often less sensitive (powerful) for finding true differences
because they throw away information (by using only ranks
rather than the raw data)
 need the full data set, not just summary statistics
 results do not include any CI quantifying range of possibility for
true difference between populations
Health Education Study
 Evaluate an intervention to educate high school students about
health and lifestyle over a two-month period
 10 students randomized to intervention or control group
 X = post-test score - pre-test score
 Compare between the two groups
Health Education Study
• Only five individuals in each sample
• We want to compare the control and intervention groups to assess whether the ‘improvement’ in scores is different, taking random sampling error into account
Intervention | 5  | 0  | 7 | 2 | 19
Control      | -5 | -6 | 1 | 4 | 6
• With such a small sample size, we need to be sure score
improvements are normally distributed if we want to use the t test
(BIG assumption)
• Possible approach: Wilcoxon-Mann-Whitney test
Health Education Study
 Step 1: rank the pooled data, ignoring groups
Intervention | 5  | 0  | 7 | 2 | 19
Control      | -5 | -6 | 1 | 4 | 6

 Step 2: reattach group status

Ranks:
Intervention | 7 | 3 | 9 | 5 | 10
Control      | 2 | 1 | 4 | 6 | 8

 Step 3: find the average rank in each of the two groups

Intervention: (7 + 3 + 9 + 5 + 10) / 5 = 6.8
Control:      (2 + 1 + 4 + 6 + 8) / 5 = 4.2
Health Education Study
 Statisticians have developed formulas and tables to determine the
probability of observing such an extreme discrepancy in ranks (6.8
versus 4.2) by chance alone (p)
 The p-value here is 0.17
 The interpretation is that the Mann-Whitney test did not show
any significant difference in test score ‘improvement’ between
the intervention and control group (p = 0.17)
 The two-sample t test would give a different answer (p = 0.14)
 Different statistical methods give different p-values
 If the largest observation were changed, the Mann-Whitney p-value would not change, but the t-test p-value would
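Both tests are one-liners in most statistical packages; a sketch with the five-per-group scores (Python/SciPy; exact small-sample p-values may differ slightly from the approximate values quoted above):

```python
from scipy import stats

intervention = [5, 0, 7, 2, 19]
control = [-5, -6, 1, 4, 6]

# Rank-based Wilcoxon-Mann-Whitney test (not statistically significant)
print(stats.mannwhitneyu(intervention, control, alternative='two-sided'))

# Two-sample t-test on the raw scores (also not significant, p near 0.14)
print(stats.ttest_ind(intervention, control, equal_var=True))
```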
Notes
 The t or the nonparametric test?
 Statisticians will not always agree, but there are some
guidelines
 Use the nonparametric test if the sample size is small and you have no reason to believe the data are ‘well-behaved’ (normally distributed), or if only the ranks are available
Summary: Educational Intervention Example
 Statistical methods
 10 high school students were randomized to either receive a
two-month health and lifestyle education program or no
program
 Each student was administered a test regarding health and
lifestyle issues prior to randomization and after the two-month
period
 Differences in the two test scores were computed for each
student
 Mean and median test score changes were computed for each
of the two study groups
 A Mann-Whitney rank sum test was used to determine if there
was a statistically significant difference in test score change
between the intervention and control groups at the end of the
two-month study period
Summary: Educational Intervention Example
 Results
 Participants randomized to the educational intervention scored
a median five points higher on the test given at the end of the
two-month study period, as compared to the test administered
prior to the intervention
 Participants randomized to receive no educational intervention
scored a median one point higher on the test given at the end
of the two-month study period
 The difference in test score improvements between the
intervention and control groups was not statistically significant
(p = 0.17)
Next Lecture
 Friday, December 17 in B018 SON from 8:30a - 10:30a
 Topics include
― ANOVA
― Linear Regression
― Chi-square test
― Survival Analysis
― Design of Experiments
References and Citations
Lectures modified from notes provided by John McGready and Johns
Hopkins Bloomberg School of Public Health accessible from the World
Wide Web: http://ocw.jhsph.edu/courses/introbiostats/schedule.cfm