University of Kansas Medical Center

Introduction to Biostatistics
for Clinical and Translational
Researchers
KUMC Departments of Biostatistics & Internal Medicine
University of Kansas Cancer Center
FRONTIERS: The Heartland Institute of Clinical and Translational Research
Course Information
 Jo A. Wick, PhD
 Office Location: 5028 Robinson
 Email: jwick@kumc.edu
 Lectures are recorded and posted at
http://biostatistics.kumc.edu under ‘Events and
Opportunities’
Inferences:
Hypothesis Testing
Experiment
 An experiment is a process whose results are not
known until after it has been performed.
 The range of possible outcomes is known in advance
 We do not know the exact outcome, but would like to
know the chances of its occurrence
 The probability of an outcome E, denoted P(E), is
a numerical measure of the chances of E
occurring.
 0 ≤ P(E) ≤ 1
Probability
 The most common definition of probability is the
relative frequency view:
P(x = a) = (# of times x = a) / (total # of observations of x)
 Probabilities for the outcomes of a random variable x are represented through a probability distribution:
[Figure: probability distribution P(x) of hospital length of stay (days), highlighting the probability that LOS = 6 days]
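A minimal sketch (in Python) of the relative-frequency view: the length-of-stay values below are hypothetical, used only to show how P(LOS = 6 days) and the full empirical distribution would be computed.

from collections import Counter

# Hypothetical lengths of stay (days) for 20 patients -- illustrative only
los = [3, 6, 6, 4, 6, 5, 7, 6, 2, 6, 8, 5, 6, 4, 6, 3, 5, 6, 7, 6]

counts = Counter(los)
n = len(los)

# Relative-frequency estimate: P(x = 6) = (# of times x = 6) / (total # of observations)
print("P(LOS = 6 days) =", counts[6] / n)

# The full empirical probability distribution of x
for value in sorted(counts):
    print(value, counts[value] / n)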
Population Parameters
 Most often our research questions involve
unknown population parameters:
What is the average BMI among 5th graders?
What proportion of hospital patients acquire a hospital-based infection?
 To determine these values exactly would require a
census.
 However, due to a prohibitively large population (or
other considerations) a sample is taken instead.
Sample Statistics
 Statistics describe or summarize sample
observations.
 They vary from sample to sample, making them
random variables.
 We use statistics generated from samples to make
inferences about the parameters that describe
populations.
Sampling Variability
[Figure: a single population with parameters μ and σ; repeated samples yield different statistics (x̄ = 0, s = 1; x̄ = 0.15, s = 1.1; x̄ = 0.1, s = 0.98), which together form the sampling distribution of x̄]
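A minimal simulation sketch of the idea above, assuming a normal population with μ = 0 and σ = 1 (matching the figure); each new sample of the same size gives a slightly different x̄ and s.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 0, 1, 30          # population parameters; the sample size n is assumed

for i in range(3):
    sample = rng.normal(mu, sigma, size=n)
    # Each sample produces its own statistics, which vary around mu and sigma
    print(f"sample {i + 1}: x-bar = {sample.mean():.2f}, s = {sample.std(ddof=1):.2f}")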
Recall: Hypotheses
 Null hypothesis “H0”: statement of no difference or no association between variables
 This is the hypothesis we test—the first step in the
‘recipe’ for hypothesis testing is to assume H0 is true
 Alternative hypothesis “H1”: statement of
differences or association between variables
 This is what we are (usually) trying to prove
Hypothesis Testing
 One-tailed hypothesis: outcome is expected in a
single direction (e.g., administration of
experimental drug will result in a decrease in
systolic BP)
 H1 includes ‘<‘ or ‘>’
 Two-tailed hypothesis: the direction of the effect is
unknown (e.g., experimental therapy will result in a
different response rate than that of current
standard of care)
 H1 includes ‘≠‘
Hypothesis Testing
 The statistical hypotheses are statements
concerning characteristics of the population(s) of
interest:
 Population mean: μ
 Population variability: σ
 Population rate (or proportion): π
 Population correlation: ρ
 Example: It is hypothesized that the response rate
for the experimental therapy is greater than that of
the current standard of care.
 πExp > πSOC ← This is H1.
Recall: Decisions
 Type I Error (α): a true H0 is incorrectly rejected
 “An innocent man is proven GUILTY in a court of law”
 Commonly accepted rate is α = 0.05
 Type II Error (β): failing to reject a false H0
 “A guilty man is proven NOT GUILTY in a court of law”
 Commonly accepted rate is β = 0.2
 Power (1 – β): correctly rejecting a false H0
 “Justice has been served”
 Commonly accepted rate is 1 – β = 0.8
Decisions
                       Truth
Conclusion      H1                 H0
H1              Correct: Power     Type I Error
H0              Type II Error      Correct
Basic Recipe for Hypothesis
Testing
1. State H0 and H1
2. Assume H0 is true ← Fundamental assumption!!
3. Collect the evidence—from the sample data,
compute the appropriate sample statistic and the
test statistic

Test statistics quantify the level of evidence within the
sample—they also provide us with the information for
computing a p-value (e.g., t, chi-square, F)
4. Determine if the test statistic is large enough to
meet the a priori determined level of evidence
necessary to reject H0 (. . . or, is p < α?)
Example: Carbon Monoxide
 An experiment is undertaken to determine the
concentration of carbon monoxide in air.
 It is a concern that the actual concentration is
significantly greater than 10 mg/m3.
 Eighteen air samples are obtained and the
concentration for each sample is measured.
 The outcome x is carbon monoxide concentration in
samples.
 The characteristic (parameter) of interest is μ—the true
average concentration of carbon monoxide in air.
Step 1: State H0 & H1
 H1: μ > 10 mg/m3 ← We suspect!
 H0: μ ≤ 10 mg/m3 ← We assume in order to test!
Step 2: Assume μ = 10
[Figure: the distribution of x̄ assumed under H0, centered at μ = 10]
Step 3: Evidence
Air sample concentrations (mg/m3): 10.25, 10.37, 10.66, 10.47, 10.56, 10.22, 10.44, 10.38, 10.63, 10.40, 10.39, 10.26, 10.32, 10.35, 10.54, 10.33, 10.48, 10.68
Sample statistic: x̄ = 10.43
Test statistic: t = (x̄ − μ0) / (s / √n) = (10.43 − 10) / (1.02 / √18) = 1.79
What does 1.79 mean? How do we use it?
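A minimal sketch of Steps 3 and 4 in Python, using the summary statistics quoted on the slide (x̄ = 10.43, s = 1.02, n = 18); the upper-tail p-value uses a t distribution with n − 1 degrees of freedom, which the next slides develop.

import math
from scipy import stats

xbar, mu0, s, n = 10.43, 10, 1.02, 18        # values as quoted on the slide

t_stat = (xbar - mu0) / (s / math.sqrt(n))   # test statistic, about 1.79
p_value = stats.t.sf(t_stat, df=n - 1)       # P(t >= t_stat) with 17 df
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.4f}")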
Student’s t Distribution
 Remember when we assumed H0 was true?
[Figure: the Step 2 assumption again: the distribution of x̄ under H0, centered at μ = 10]
Student’s t Distribution
 What we were actually doing was setting up this theoretical Student’s t distribution from which the p-value can be calculated:
t = (x̄ − μ0) / (s / √n) = (10 − 10) / (1.02 / √18) = 0
[Figure: Student’s t distribution centered at t = 0]
Student’s t Distribution
 Assuming the true air concentration of carbon monoxide is actually 10 mg/m3, how likely is it that we should get evidence in the form of a sample mean equal to 10.43?
P(x̄ ≥ 10.43) = ?
[Figure: the distribution of x̄ under H0, centered at μ = 10, with the observed x̄ = 10.43 marked in the upper tail]
Student’s t Distribution
 We can say how likely by framing the statement in terms of the probability of an outcome:
t = (x̄ − μ0) / (s / √n)
p = P(t ≥ 1.79) = 0.0456
[Figure: Student’s t distribution centered at t = 0, with the observed t = 1.79 marked and the upper-tail area shaded]
Step 4: Make a Decision
 Decision rule: if p ≤ α, the chances of getting the
actual collected evidence from our sample given
the null hypothesis is true are very small.
 The observed data conflicts with the null ‘theory.’
 The observed data supports the alternative ‘theory.’
 Since the evidence (data) was actually observed and our
theory (H0) is unobservable, we choose to believe that
our evidence is the more accurate portrayal of reality and
reject H0 in favor of H1.
Step 4: Make a Decision
 What if our evidence had not been in as great of
degree of conflict with our theory?
 p > α: the chances of getting the actual collected evidence from our sample given the null hypothesis is true are pretty high
P(x̄ ≥ 10.1) = ?
 We fail to reject H0.
[Figure: the distribution of x̄ under H0, centered at μ = 10, with a sample mean of x̄ = 10.1 falling near the center]
Decision
 How do we know if the decision we made was the
correct one?
 We don’t!
 If α = 0.05, the chances of our decision being an incorrect rejection of a true H0 are no greater than 5%.
 We have no way of knowing whether we made this kind
of error—we only know that our chances of making it
in this setting are relatively small.
Which test do I use?
 What kind of outcome do you have?
 Nominal? Ordinal? Interval? Ratio?
 How many samples do you have?
 Are they related or independent?
Types of Tests
One Sample
 Nominal: parameter = proportion π; hypotheses H0: π = π0 vs. H1: π ≠ π0; sample statistic p = x/n; method: binomial test, or z test (if np > 10 & nq > 10)
 Ordinal: parameter = median M; hypotheses H0: M = M0 vs. H1: M ≠ M0; sample statistic m = p50; method: Wilcoxon signed-rank test
 Interval: parameter = mean μ; hypotheses H0: μ = μ0 vs. H1: μ ≠ μ0; sample statistic x̄; method: Student’s t, or Wilcoxon (if non-normal or small n)
 Ratio: parameter = mean μ; hypotheses H0: μ = μ0 vs. H1: μ ≠ μ0; sample statistic x̄; method: Student’s t, or Wilcoxon (if non-normal or small n)
Types of Tests
 Parametric methods: make assumptions about
the distribution of the data (e.g., normally
distributed) and are suited for sample sizes large
enough to assess whether the distributional
assumption is met
 Nonparametric methods: make no assumptions
about the distribution of the data and are suitable
for small sample sizes or large samples where
parametric assumptions are violated
 Use ranks of the data values rather than actual data
values themselves
 Loss of power when parametric test is appropriate
Types of Tests
Two Independent Samples
 Nominal: parameters π1, π2; hypotheses H0: π1 = π2 vs. H1: π1 ≠ π2; sample statistics p1 = x1/n1, p2 = x2/n2; method: Fisher’s exact, or chi-square (if cell counts > 5)
 Ordinal: parameters M1, M2; hypotheses H0: M1 = M2 vs. H1: M1 ≠ M2; sample statistics m1, m2; method: median test
 Interval: parameters μ1, μ2; hypotheses H0: μ1 = μ2 vs. H1: μ1 ≠ μ2; sample statistics x̄1, x̄2; method: Student’s t, or Mann-Whitney (if non-normal, unequal variances, or small n)
 Ratio: parameters μ1, μ2; hypotheses H0: μ1 = μ2 vs. H1: μ1 ≠ μ2; sample statistics x̄1, x̄2; method: Student’s t, or Mann-Whitney (if non-normal, unequal variances, or small n)
Comparing Central Tendency
2 groups
 Normal or large n: independent samples → 2-sample t; dependent samples → paired t
 Non-normal or small n: independent samples → Wilcoxon rank-sum; dependent samples → Wilcoxon signed-rank
>2 groups
 Normal or large n: independent samples → ANOVA; dependent samples → 2-way ANOVA
 Non-normal or small n: independent samples → Kruskal-Wallis; dependent samples → Friedman’s
Two-Sample Test of Means
 Clotting times (minutes) of blood for subjects given
one of two different drugs:
Drug B: 8.8, 8.4, 7.9, 8.7, 9.1, 9.6 (x̄1 = 8.75)
Drug G: 9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5 (x̄2 = 9.74)
 It is hypothesized that the two drugs will result in
different blood-clotting times.
 H1: μB ≠ μG
 H0: μB = μG
Two-Sample Test of Means
 What we’re actually hypothesizing: H0: μB − μG = 0
Evidence: x̄1 − x̄2 = −0.99
[Figure: the sampling distribution of x̄1 − x̄2 under H0, centered at μB − μG = 0]
P(x̄1 − x̄2 ≤ −0.99) = ?  P(x̄1 − x̄2 ≥ 0.99) = ?
Two-Sample Test of Means
 What we’re actually hypothesizing: H0: μB − μG = 0
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2) = (8.75 − 9.74) / 0.40 = −2.475
p = P(|t| > 2.475) = 0.03
***Two-sided tests detect ANY evidence in EITHER direction that the null difference is unlikely!
[Figure: t distribution centered at t = 0, with both tails beyond t = ±2.48 shaded]
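A minimal sketch of this two-sample test in Python; scipy's ttest_ind pools the variances by default, which reproduces a t of about −2.48 (negative because Drug B is listed first).

from scipy import stats

drug_b = [8.8, 8.4, 7.9, 8.7, 9.1, 9.6]
drug_g = [9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5]

# Pooled two-sample t test of H0: mu_B = mu_G against a two-sided H1
t_stat, p_value = stats.ttest_ind(drug_b, drug_g)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")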
Assumptions of t
 In order to use the parametric Student’s t test, we
have a few assumptions that need to be met:
 Approximate normality of the observations
 In the case of two samples, approximate equality of the
sample variances
Assumption Checking
 To assess the assumption of normality, a simple
histogram would show any issues with skewness
or outliers:
Assumption Checking
 Skewness
Assumption Checking
 Other graphical assessments include the QQ plot:
Assumption Checking
 Violation of normality:
Assumption Checking
 To assess the assumption of equal variances
(when groups = 2), simple boxplots would show
any issues with heteroscedasticity:
Assumption Checking
 Rule of thumb: if
the larger variance is
more than 2 times
the smaller, the
assumption has been
violated
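A minimal sketch of these checks in Python, assuming two hypothetical numeric samples; a Shapiro-Wilk test stands in for the histogram/Q-Q look at normality, and the variance ratio implements the 2x rule of thumb.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(8.7, 0.6, size=15)   # hypothetical data
group2 = rng.normal(9.7, 0.8, size=15)

# Normality check: small p-values suggest departure from normality
w1, p1 = stats.shapiro(group1)
w2, p2 = stats.shapiro(group2)
print(f"Shapiro-Wilk p-values: {p1:.2f}, {p2:.2f}")

# Equal-variance rule of thumb: larger sample variance no more than 2x the smaller
v1, v2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
print("variance ratio:", round(max(v1, v2) / min(v1, v2), 2))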
Now what?
 If you have enough observations (20? 30?) to be
able to determine that the assumptions are
feasible, check them.
 If violated:
• Try a transformation to correct the violated assumptions (natural
log) and reassess; proceed with the t-test if fixed
• If a transformation doesn’t work, proceed with a non-parametric test
• Skip the transformation altogether and proceed to the nonparametric test
 If okay, proceed with t-test.
Now what?
 If you have too small a sample to adequately
assess the assumptions, perform the nonparametric test instead.
 For the one-sample t, we typically substitute the Wilcoxon
signed-rank test
 For the two-sample t, we typically substitute the Mann-Whitney test
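A minimal sketch of both substitutions in Python, reusing the clotting-time data from earlier purely for illustration: the Mann-Whitney test replaces the two-sample t, and the Wilcoxon signed-rank test replaces the one-sample t by testing deviations from a hypothesized value (here 9, an assumed value).

from scipy import stats

drug_b = [8.8, 8.4, 7.9, 8.7, 9.1, 9.6]
drug_g = [9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5]

# Two-sample substitute: Mann-Whitney U test
u_stat, p_mw = stats.mannwhitneyu(drug_b, drug_g, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat}, p = {p_mw:.3f}")

# One-sample substitute: Wilcoxon signed-rank on deviations from a hypothesized median of 9
deviations = [x - 9 for x in drug_b]
w_stat, p_w = stats.wilcoxon(deviations)
print(f"Wilcoxon W = {w_stat}, p = {p_w:.3f}")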
Consequences of Nonparametric
Testing
 Robust!
 Less powerful because they are based on ranks
which do not contain the full level of information
contained in the raw data
 When in doubt, use the nonparametric test—it will
be less likely to give you a ‘false positive’ result.
Speaking of Power
 “How many subjects do we need?”
 Statistical methods can be used to determine the
required number of patients to meet the trial’s
principal scientific objectives.
 Other considerations that must be accounted for
include availability of patients and resources and
the ethical need to prevent any patient from
receiving inferior treatment.
 We want the minimum number of patients required to
achieve our principal scientific objective.
The Size of a Clinical Trial
 For the chosen level of significance (type I error
rate, α), a clinically meaningful difference (Δ)
between two groups can be detected with a
minimally acceptable power (1 – β) with n
subjects.
Example: Detecting a Difference
 Primary objective: To compare pain improvement
in knee OA for new treatment A compared to
standard treatment S.
 Primary outcome: Change in pain score from
baseline to 24 weeks (continuous).
 Data analysis: Comparison of mean change in
pain score of patients on treatment A (μ1) versus
standard (μ2) using a two-sided t-test at the α =
0.05 level of significance.
Example: Detecting a Difference
 Difference to detect (Δ): It has been determined
that a difference of 10 on this pain scale is clinically
meaningful.
 If standard therapy results in a 5 point decrease, our new
therapy would need to show a decrease of at least 15 (5
+ 10) to be declared clinically different from the
standard.
 We would like to be 80% sure that we detect this
difference as statistically significant.
Example: Detecting a Difference
 What usually occurs on the standard?
 This is important information because it tells us about the
behavior of the outcome (pain scale) in these patients.
 If the pain scale has great variability, it may be difficult to
detect small to moderate changes (signal-to-noise)!
[Figures: boxplots of change in pain from baseline for standard (S) and new treatment (A); both panels show a mean difference of 20, but the second panel has much larger variability, illustrating ‘signal-to-noise’]
Example: Detecting a Difference
 We have:
 H0: μ1 = μ2 (Δ = 0) versus H1: μ1 ≠ μ2
 α = 0.05
 1 – β = 0.80
 Δ = 10
 For continuous outcomes we need to determine
what difference would be clinically meaningful, but
specified in the form of an effect size which takes
into account the variability of the data.
Example: Detecting a Difference
 Effect size is the difference in the means divided
by the standard deviation, usually of the control or
comparison group, or the pooled standard
deviation of the two groups
d = (μ1 − μ2) / σ, where σ is the comparison-group or pooled standard deviation (computed from σ1², σ2² and the group sizes n1, n2)
Example: Detecting a Difference
 Power Calculations: an interactive web-based tool can show the relationship between power and the sample size, variability, and difference to detect.
 A decrease in the variability of the data results in
an increase in power for a given sample size.
 An increase in the effect size results in a decrease
in the required sample size to achieve a given
power.
 Decreasing α results in an increase in the required sample size to achieve a given power.
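A minimal sketch of such a calculation in Python with statsmodels, assuming a clinically meaningful difference of 10 and a hypothetical standard deviation of 20 (so effect size d = 0.5); the standard deviation is an assumption for illustration, not a value from the lecture.

from statsmodels.stats.power import TTestIndPower

delta = 10          # clinically meaningful difference
sd = 20             # assumed standard deviation (illustrative)
effect_size = delta / sd

# Solve for the number of subjects per group for a two-sided two-sample t test
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"about {round(n_per_group)} subjects per group")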
Inferences on Two Means
 Example: Smoking cessation
 Two types of therapy: x = {behavioral therapy, literature}
 Dependent variable: y = % decrease in number of
cigarettes smoked per day after six months of therapy
Behavioral Therapy: 10, 20, 65, 0, 30
Literature Only: 6, 2, 0, 12, 4
Smoking Cessation
 Research question: Is behavioral therapy in
addition to education better than education alone
in getting smokers to quit?
 H0: μ1 = μ2 versus H1: μ1 ≠ μ2
 Two independent samples t-test IF:
 the change is approximately normal OR can be
transformed to an approximate normal distribution (e.g.,
natural log)
 the variability within each group is approximately the
same (ROT: no more than 2x difference)
Smoking Cessation
Reject H0: μ1 = μ2
 Conclusion: Adding behavioral therapy to
cessation education results in—on average—a
greater reduction in cigarettes smoked per day at
six months post-therapy when compared to
education alone (t30.9 = 2.87, p < 0.01).
Smoking Cessation
 The 95% confidence interval is:
1.42 ≤ μ1 − μ2 ≤ 8.39
 Interpretation: On average, behavioral therapy
resulted in an additional reduction of 4.9% (95%CI:
1.42%, 8.39%) relative to control.
Confidence Intervals
 What exactly do confidence intervals represent?
 Remember that theoretical sampling distribution
concept?
 It doesn’t actually exist, it’s only mathematical.
 What would we see if we took sample after sample after
sample and did the same test on each . . .
Confidence Intervals
 Suppose we actually took sample after sample . . .
 100 of them, to be exact
 Every time we take a different sample and compute the
confidence interval, we will likely get a slightly different
result simply due to sampling variability.
Confidence Intervals
 Suppose we actually took sample after sample . . .
 100 of them, to be exact
 95% confident means: “In 95 of the 100 samples, our
interval will contain the true unknown value of the
parameter. However, in 5 of the 100 it will not.”
Confidence Intervals
 Suppose we actually took sample after sample . . .
 100 of them, to be exact
 Our “confidence” is in the procedure that produces the
interval—i.e., it performs well most of the time.
 Our “confidence” is not directly related to our particular
interval—we cannot say “The probability that the mean
difference is between (1.4,8.4) is 0.95.”
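A minimal simulation sketch of this interpretation in Python: 100 samples are drawn from a population with a known mean, a 95% confidence interval is built from each, and the intervals that cover the true mean are counted (typically about 95). All values are hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mu, sigma, n = 5.0, 2.0, 25     # hypothetical population and sample size

covered = 0
for _ in range(100):
    sample = rng.normal(true_mu, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    margin = stats.t.ppf(0.975, df=n - 1) * se
    lower, upper = sample.mean() - margin, sample.mean() + margin
    covered += lower <= true_mu <= upper

print(covered, "of 100 intervals contain the true mean")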
Inferences on More Than Two
Means
 Example: Smoking cessation
 Three types of therapy: x = {pharmaceutical therapy,
behavioral therapy, literature}
 Dependent variable: y = % decrease in number of
cigarettes smoked per day after six months of therapy
Pharmaceutical Therapy: 10, 30, 60, 32, 65
Behavioral Therapy: 10, 0, 6, 0, 30
Literature Only: 6, 20, 0, 12, 4
Smoking Cessation
 Research question: Is therapy in addition to
education better than education alone in getting
smokers to quit? If so, is one therapy more
effective?
 H0: μ1 = μ2 = μ3 versus H1: At least one μ is different
 More than 2 independent samples requires an
ANOVA:
 the change is approximately normal OR can be
transformed to an approximate normal distribution (e.g.,
natural log)
 the variability within each group is approximately the
same (ROT: no more than 2x difference)
Smoking Cessation
 ANOVA produces a table:
 One-way ANOVA indicates you have a single
categorical factor x (e.g., treatment) and a single
continuous response y and your interest is in
comparing the mean response μ across the levels
of the categorical factor.
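A minimal sketch of a one-way ANOVA in Python, using the small excerpt of data shown above (read row by row from the table); the slide's own ANOVA table presumably comes from a larger dataset, so these numbers will not reproduce it exactly.

from scipy import stats

pharm = [10, 30, 60, 32, 65]
behav = [10, 0, 6, 0, 30]
lit = [6, 20, 0, 12, 4]

# F = between-group mean square / within-group mean square
f_stat, p_value = stats.f_oneway(pharm, behav, lit)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")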
Wait . . .
 Why is ANOVA using variances when we’re
hypothesizing about means?
 Between-groups mean square: a variance
 Within-groups mean square: also a variance
 F: a ratio of variances—F = MSBG/MSWG
What’s the Rationale?
 In the simplest case of the one-way ANOVA, the
variation in the response y is broken down into
parts: variation in response attributed to the
treatment (group/sample) and variation in
response attributed to error (subject
characteristics + everything else not controlled for)
 The variation in the treatment (group/sample) means is
compared to the variation within a treatment
(group/sample) using a ratio—this is the F test statistic!
 If the between treatment variation is a lot bigger than the
within treatment variation, that suggests there are some
different effects among the treatments.
Rationale
[Figure: boxplots of the response for three groups under scenarios 1, 2, and 3]
Rationale
 There is an obvious difference between scenarios
1 and 2. What is it?
 Just looking at the boxplots, which of the two
scenarios (1 or 2) do you think would provide more
evidence that at least one of the populations is
different from the others? Why?
Rationale
[Figure: the same three scenarios (1, 2, 3), shown again as boxplots]
F Statistic
F = (variation between the sample means) / (natural variation within the samples)
 Case A: If all the sample means were exactly the
same, what would be the value of the numerator of
the F statistic?
 Case B: If all the sample means were spread out
and very different, how would the variation
between sample means compare to the value in
A?
F Statistic
F = (variation between the sample means) / (natural variation within the samples)
 So what values could the F statistic take on?
 Could you get an F that is negative?
 What type of values of F would lead you to believe
the null hypothesis—that there is no difference in
group means—is not accurate?
Smoking Cessation
 ANOVA produces a table:
 Conclusion: Reject H0: μ1 = μ2 = μ3. Some
difference in the number of cigarettes smoked per
day exists between subjects receiving the three
types of therapy.
Smoking Cessation
 ANOVA produces a table:
 But where is the difference? Are the two
experimental therapies different? Or is it that each
are different from the control?
Smoking Cessation
 Reject H0: μ1 = μ3 and μ1 = μ2. Both pharmaceutical
and behavioral therapy are significantly different
from the literature only control group, but the two
therapies are not different from each other.
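The slides do not name the follow-up procedure used for these pairwise comparisons; one common choice is Tukey's HSD, sketched here in Python with statsmodels on the same excerpt data (illustrative only, so the results will not match the lecture's p-values).

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

values = np.array([10, 30, 60, 32, 65,    # pharmaceutical
                   10, 0, 6, 0, 30,       # behavioral
                   6, 20, 0, 12, 4])      # literature only
groups = np.array(["pharm"] * 5 + ["behav"] * 5 + ["lit"] * 5)

# All pairwise mean comparisons with a family-wise error rate of 0.05
print(pairwise_tukeyhsd(values, groups, alpha=0.05))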
Smoking Cessation
 Conclusion: Adding either behavioral (p = 0.015) or
pharmaceutical therapy (p < 0.01) to cessation
education results in—on average—significantly greater
decreases in cigarettes smoked per day at six months
post-therapy when compared to education alone.
Inferences on Means
 Concerns a continuous response y
 One or two groups: t
 More than two groups: ANOVA
 Remember, this (and the two-sample case) is essentially
looking at the association between an x and a y, where x
is categorical (nominal or ordinal) and y is continuous
(interval or ratio).
 Check assumptions!
 Normality of y
 Equal group variances
ANOVA Models
 There are many . . .
Randomized designs with one treatment
A. Subjects not subdivided on any basis other than randomization prior to assignment to treatment
levels; no restriction on random assignment other than the option of assigning the same number of
subjects to each treatment level
1. Completely randomized or one factor design
B. Subjects subdivided on some nonrandom basis or one or more restrictions on random assignment
other than assigning the same number of subjects to each treatment level
1. Balanced incomplete block design
2. Crossover design
3. Generalized randomized block design
4. Graeco-Latin square design
5. Hyper-Graeco-Latin square design
6. Latin square design
7. Partially balanced incomplete block design
8. Randomized block design
9. Youden square design
Randomized designs with two or more treatments
A. Factorial experiments: designs in which all treatment levels are crossed
1. Designs without confounding
a. Completely randomized factorial design
b. Generalized randomized factorial design
c. Randomized block factorial design
2. Design with group-treatment confounding
a. Split-plot factorial design
3. Designs with group-interaction confounding
a. Latin square confounded factorial design
b. Randomized block completely confounded factorial design
c. Randomized block partially confounded factorial design
4. Designs with treatment-interaction confounding
a. Completely randomized fractional factorial design
Inferences on Proportions (k = 2)
 Example: plant genetics
 Two phenotypes: x = {yellow-flowered plants, green-flowered plants}
 Dependent variable: y = proportion of plants out of 100
progeny that express each phenotype
Phenotype (one observation per plant): Yellow, Yellow, Green, Yellow, Green
y = x / n
Plant Genetics
 The plant geneticist hypothesizes that his crossed
progeny will result in a 3:1 phenotypic ratio of
yellow-flowered to green-flowered plants.
 H0: The population contains 75% yellow-flowered
plants versus H1: The population does not contain
75% yellow-flowered plants.
 H0: πy = 0.75 versus H1: πy ≠ 0.75
 This particular type of test is referred to as the chi-
square goodness of fit test for k = 2.
Plant Genetics
 Chi-square statistics compute deviations between
what is expected (under H0) and what is actually
observed in the data:
χ² = Σ (O − E)² / E
 DF = k − 1, where k is the number of categories of x
Plant Genetics
 Suppose the researcher actually observed in his
sample of 100 plants this breakdown of phenotype:
Phenotype: f (%)
 Yellow-flowered: 84 (84%)
 Green-flowered: 16 (16%)
 Does it appear that this type of sample could have
come from a population where the true proportion
of yellow-flowered plants is 75%?
Plant Genetics
Yellow-flowered: 84 (84%); Green-flowered: 16 (16%)
χ² = (84 − 75)²/75 + (16 − 25)²/25 = 4.32 (DF = 1)
 Conclusion: Reject H0: πy = 0.75—it does not
appear that the geneticist’s hypothesis about the
population phenotypic ratio is correct (p = 0.038).
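A minimal sketch of this goodness-of-fit test in Python: the observed counts of 84 and 16 are compared with the expected counts of 75 and 25 implied by the 3:1 ratio.

from scipy import stats

observed = [84, 16]
expected = [75, 25]          # 3:1 ratio applied to n = 100

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")   # about 4.32 and 0.038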
Inferences on Proportions (k > 2)
 Example: plant genetics
 Four phenotypes: x = {yellow-smooth flowered, yellow-wrinkled flowered, green-smooth flowered, green-wrinkled flowered}
 Dependent variable: y = proportion of plants out of 250
progeny that express each phenotype
Phenotype (one observation per plant): Yellow smooth, Yellow smooth, Green wrinkled, Yellow wrinkled
y = x / n
Plant Genetics
 The plant geneticist hypothesizes that his crossed
progeny will result in a 9:3:3:1 phenotypic ratio of
YS:YW:GS:GW plants.
 Actual numeric hypothesis is H0: π1 = 0.5625, π2 =
0.1875, π3 = 0.1875, π4 = 0.0625
 This particular type of test is referred to as the chi-square goodness of fit test for k = 4.
Plant Genetics
 Chi-square statistics compute deviations between
what is expected (under H0) and what is actually
observed in the data:
χ² = Σ (O − E)² / E
 DF = k − 1, where k is the number of categories of x
Plant Genetics
 Suppose the researcher actually observed in his
sample of 250 plants this breakdown of phenotype:
Phenotype: f (%)
 YS: 152 (60.8%)
 YW: 39 (15.6%)
 GS: 53 (21.2%)
 GW: 6 (2.4%)
 Does it appear that this type of sample could have
come from a population where the true phenotypic
ratio is as the geneticist hypothesized?
Plant Genetics
YS: 152 (60.8%); YW: 39 (15.6%); GS: 53 (21.2%); GW: 6 (2.4%)
χ² = 8.972 (DF = 3)
 Conclusion: Reject H0—it does not appear that the
geneticist’s hypothesis about the population
phenotypic ratio is correct (p = 0.03).
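The same test in Python for the 9:3:3:1 hypothesis; the expected counts are the hypothesized proportions multiplied by n = 250.

from scipy import stats

observed = [152, 39, 53, 6]
expected = [250 * p for p in (0.5625, 0.1875, 0.1875, 0.0625)]

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")   # about 8.97 and 0.03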
Inferences on Proportions
 Concerns a categorical response y
 Regardless of the number of groups, a chi-square
test may be used
 Remember, this is essentially looking at the association
between an x and a y, where x is categorical (nominal or
ordinal) and y is categorical (nominal or ordinal).
 Assumptions?
 ROT: No expected frequency should be less than 5 (i.e., nπ ≥ 5 for every category)
 If not met, use the binomial (k = 2) or multinomial (k > 2)
test
Inferences on Proportions
 What do we do when we have nominal data on
more than one factor x?
 Gender and hair color
 Menopausal status and disease stage at diagnosis
 ‘Handedness’ and gender
 We still use chi-square!
 These types of tests are looking at whether two
categorical variables are independent of one
another—thus, tests of this type are often referred
to as chi-square tests of independence.
Inferences on Proportions
 Example: Hair color and Gender
 Gender: x1 = {M, F}
 Hair Color: x2 = {Black, Brown, Blonde, Red}
          Black        Brown        Blonde      Red        Total
Male      32 (32%)     43 (43%)     16 (16%)    9 (9%)     100
Female    55 (27.5%)   65 (32.5%)   64 (32%)    16 (8%)    200
Total     87           108          80          25         N = 300
What the data should look like
in the actual dataset:
Gender    Hair Color
Male      Black
Female    Red
Female    Blonde
Hair Color and Gender
 The researcher hypothesizes that hair color is not
independent of sex.
 H0: Hair color is independent of gender (i.e., the
phenotypic ratio is the same within each gender).
 H1: Hair color is not independent of gender (i.e.,
the phenotypic ratio is different between genders).
Hair Color and Gender
 Chi-square statistics compute deviations between
what is expected (under H0) and what is actually
observed in the data:
χ² = Σ (O − E)² / E
 DF = (r − 1)(c − 1), where r is the number of rows and c is the number of columns
Hair Color and Gender
 Does it appear that this type of sample could have
come from a population where the different hair
colors occur with the same frequency within each
gender?
 OR does it appear that the distribution of hair color
is different between men and women?
          Black        Brown        Blonde      Red        Total
Male      32 (32%)     43 (43%)     16 (16%)    9 (9%)     100
Female    55 (27.5%)   65 (32.5%)   64 (32%)    16 (8%)    200
Total     87           108          80          25         N = 300
Hair Color and Gender
          Black        Brown        Blonde      Red        Total
Male      32 (32%)     43 (43%)     16 (16%)    9 (9%)     100
Female    55 (27.5%)   65 (32.5%)   64 (32%)    16 (8%)    200
Total     87           108          80          25         N = 300
χ² = 8.99 (DF = 3), which exceeds the α = 0.05 critical value of 7.815
 Conclusion: Reject H0: Gender and Hair Color are
independent. It appears that the researcher’s
hypothesis that the population phenotypic ratio is
different between genders is correct (p = 0.029).
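A minimal sketch of the chi-square test of independence in Python, applied to the observed counts from the hair color by gender table.

import numpy as np
from scipy import stats

#                    Black  Brown  Blonde  Red
observed = np.array([[32,   43,    16,     9],     # Male   (n = 100)
                     [55,   65,    64,     16]])   # Female (n = 200)

chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")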
Inferences on Proportions
 Special case: when you have a 2X2 contingency
table, you are actually testing a hypothesis
concerning two population proportions: H0: π1 = π2
(i.e., the proportion of males who are blonde is the
same as the proportion of females who are
blonde).
          Blonde       Non-blonde    Total
Male      16 (16%)     84 (84%)      100
Female    64 (32%)     136 (68%)     200
Total     80 (26.7%)   220 (73.3%)   N = 300
Inferences on Proportions
 When you have a single proportion and have a
small sample, substitute the Binomial test which
provides exact results.
 The nonparametric Fisher exact test can always be used in place of the chi-square test when you have contingency table-like data (i.e., two categorical factors whose association is of interest)—it should be substituted for the chi-square test of independence when ‘cell’ sizes are small.
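A minimal sketch of the 2x2 special case in Python, comparing the proportion of blondes among males and females with both the chi-square test and Fisher's exact test.

import numpy as np
from scipy import stats

table = np.array([[16, 84],      # Male:   blonde, non-blonde
                  [64, 136]])    # Female: blonde, non-blonde

chi2, p_chi2, dof, _ = stats.chi2_contingency(table, correction=False)
odds_ratio, p_fisher = stats.fisher_exact(table)
print(f"chi-square p = {p_chi2:.4f}, Fisher exact p = {p_fisher:.4f}")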
Next Time
 Linear Regression and Correlation
 Survival Analysis
 Final Thoughts