DOEshortcourse - LISA

BASICS OF
DESIGNING
EXPERIMENTS
Thursday, October 24, 2013
5:00pm - 7:00pm
GLC Room G
About Me
• Graduate student in Virginia Tech Department of
Statistics
• Enrolled in Master’s program
• Expected Graduation Date: December 2013
• Future: Job in industry
• Lead Collaborator in LISA
• On-campus consulting group, led by Dr. Eric Vance and Dr.
Chris Franck, with administrative specialist Tonya Pruitt
About
• What?
• Laboratory for Interdisciplinary Statistical Analysis
• Why?
• Mission: to provide statistical advice, analysis, and education to
Virginia Tech researchers
• How?
• Collaboration requests, Walk-in Consulting, Short Courses
• Where?
• Walk-in Consulting in GLC and various other locations
• Collaboration meetings typically held in Sandy 312
• Who?
• Graduate students and faculty members in VT statistics department
Requesting a LISA Meeting
• Go to www.lisa.stat.vt.edu
• Click link for “Collaboration Request Form”
• Sign into the website using VT PID and password
• Enter your information (email, college, etc.)
• Describe your project (project title, research goals,
specific research questions, if you have already collected
data, special requests, etc.)
• Contact assigned LISA collaborators as soon as possible
to schedule a meeting
Agenda
• Introduction to Designing Experiments
• 3 Main Principles
• Randomization
• Replication
• Blocking (Local Control of Error)
• EX: Nozzle design and water jet
performance
• EX: Treatment and leukemia cell gene
expression
• Factorial experiments
INTRODUCTION
to Experimental Design
Why is Experimental Design important?
• MAXIMIZE…
• Probability of having a successful experiment
• Information gain from results of an experiment
• MINIMIZE…
• Unwanted effects from other sources of
variation
• Cost of experiment if resources are limited
Experiment vs. Observational
• OBSERVATIONAL STUDY
• Researcher observes the response of interest
under natural conditions
• EX: Surveys, weather patterns
• EXPERIMENT
• Researcher controls variables that have a potential
effect on the response of interest
Which one helps establish cause-and-effect
relationships better?
Correlation ≠ Causation
EXAMPLE: Impact of Exercise Intensity on
Resting Heart Rate
• Researcher surveys a sample of individuals to
glean information about their intensity of
exercise each week and their resting heart rate
• What type of study is this?
Participant   Reported Intensity of Exercise each week   Resting Heart Rate
Ptp 1
Ptp 2
Ptp 3
…
EXAMPLE: Impact of Exercise Intensity on
Resting Heart Rate
• Researcher finds a sample of individuals,
enrolls groups in exercise programs of different
intensity levels, and then measures before/after
heart rates
Participant   Treatment   Baseline RHR   Post Program RHR
Ptp 1
Ptp 2
Ptp 3
…
EXAMPLE: Impact of Exercise Intensity on
Resting Heart Rate
• What are some factors the experimental
study can account for that the observational
study cannot?
Sources of variation
• Sources of variation are anything that
could cause an observation to be different
from another observation
• What are some reasons that
measurements of resting heart rate could
differ from person to person?
Sources of variation
• There are two main types:
• Gender and age are what are known as
nuisance factors: we are not interested in their
effects on RHR, but they are hard to control
• What we are interested in is the effect of the
intensity of exercise: this source is known as a
treatment factor
Sources of variation
• Good rule of thumb: list major and minor
sources of variation before collecting data
• We want our design to minimize the impact of
minor sources of variation, and to be able to
separate effects of nuisance factors from
treatment factors
• We want the majority of the variability of the
data to be explained by the treatment factors
Designing the experiment:
The Bare Minimum
• Response: Resting heart rate (beats per
minute)
• Treatment: Exercise Program
• Low intensity
• Moderate Intensity
• High Intensity
Designing the experiment:
The Bare Minimum
• Some assumptions
• We will be monitoring the participants’ diet and exercise
throughout the study (not relying on self-reporting)
• We will only enroll participants with high (i.e. unhealthy)
resting heart rates so that there is ample room for
improvement
• Participants’ resting heart rate is all measured in the
same manner, at the same time (upon waking up)
Designing the experiment:
The Bare Minimum
• Basic Design
• 36 participants: 24 males, 12 females
• Every person is assigned to one of the three 8-
week exercise programs
• Resting heart rate is measured at the beginning
and end of the 8 weeks
What other considerations should we make
in designing the experiment?
THREE BASIC PRINCIPLES
OF DOE: Randomization
Randomization
• What?
• Random assignment of experimental treatments and
order of runs
• Why?
• Often we assume an independent, random distribution
of observations and errors – randomization validates
this assumption
• Averages out the effects of extraneous/lurking variables
• Reduces bias and accusations of bias
• How?
• Depends on the type of experiment
Exercise Example
• 36 participants are randomly assigned
to one of the three programs
• 12 in low intensity, 12 in moderate intensity,
12 in high intensity
• Like drawing names from a hat to fall into
each group
• Oftentimes computer programs can
randomize participants for an experiment
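A minimal sketch of the "names from a hat" idea in Python. The participant labels P01 to P36, the function name, and the seed are assumptions made for illustration; they are not part of the course materials. The seed only makes the assignment reproducible.

```python
import random

def randomize_crd(units, treatments, seed=None):
    """Randomly assign units to treatments in equal-sized groups (balanced CRD)."""
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # the "hat": a random order of all units
    per_group = len(shuffled) // len(treatments)
    assignment = {}
    for i, trt in enumerate(treatments):
        for unit in shuffled[i * per_group:(i + 1) * per_group]:
            assignment[unit] = trt
    return assignment

participants = [f"P{i:02d}" for i in range(1, 37)]  # hypothetical participant IDs
programs = ["low", "moderate", "high"]
assignment = randomize_crd(participants, programs, seed=2013)

for trt in programs:
    group = sorted(u for u, t in assignment.items() if t == trt)
    print(trt, len(group), group)
```

Shuffling the whole list once and then cutting it into thirds guarantees 12 participants per program, which a separate coin flip per participant would not.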
Exercise Example
• What if we did not randomize?
• Suppose there is some reason behind who comes to
volunteer for the study first versus later
• If we assigned the first third to one intensity, the second
third to another, and so forth, it would be hard to separate
the effect of being an “early volunteer” from the effect of
the assigned intensity level
Run   1  2  3  4  5  6  7  8  …
EX1   1  1  1  1  1  1  1  1  …
EX2   1  3  2  3  1  2  1  3  …
Completely Randomized Design (CRD)
• What we just came up with is called a completely
randomized design
• Note that in our case, treatments were assigned
randomly, but in experiments where a sequence of runs
is performed, the order of runs needs to be randomized
as well
Summary
• Randomizing the assignment of treatments
and/or order of runs accounts for known and
unknown differences between subjects
• It does not matter if the result does not “look
random” (i.e. appears to have some pattern), as
long as the order was generated using a proper
randomization device
THREE BASIC PRINCIPLES
OF DOE: Replication
Replication
• What?
• Independent repeat runs of each treatment
• Why?
• Improves precision of effect estimation
• Allows for estimation of error variation and
background noise
• Check against aberrant results that could result in
misleading conclusions
• EX: One person for each treatment. What could go
wrong?
Experimental Units (EUs)
• We now introduce the term “Experimental Unit” (EU)
• EU is the “material” to which treatment factors are assigned
• In our case, each person is an EU
• This is different from an “Observational Unit” (OU)
• OU is part of an EU that is measured
• Multiple OUs within an EU here would be if we took each
person’s pulse at his/her neck, at the wrist, etc. and reported
these observations
Replication Extension to EU
• Thus, a treatment is only replicated if it is assigned to
a new EU
• Taking multiple observations on one EU (i.e. creating
more OUs) does not count as replication – this is
known as subsampling
• Note that treating subsampling as replication increases the
chance of incorrect conclusions (pseudoreplication)
• Variability in multiple measurements is measurement error,
rather than experimental error
PTP         1   2   3   4   5   6   7   8   …
Wrist RHR   80  69  93  88  77  89  74  79  …
Neck RHR    84  65  92  86  81  86  77  74  …
Consequences of Pseudoreplication
• Is it bad to take multiple OUs on each EU then?
• No, often the solution here is to average the measurements
from the OUs and treat the average as one observation
• What if we don’t do this?
• We severely underestimate error
• We potentially exaggerate the true treatment differences
• What if measurement error is high?
• Try to improve measurement process
• Revisit the experiment and assess the homogeneity of the
EUs, thinking of potential covariates
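The averaging step can be sketched in Python using the wrist and neck readings for participants 1 to 8 from the subsampling table. The variable names are illustrative assumptions.

```python
# Wrist and neck RHR for participants 1-8 (two OUs per EU)
wrist = [80, 69, 93, 88, 77, 89, 74, 79]
neck = [84, 65, 92, 86, 81, 86, 77, 74]

# One value per experimental unit: average that person's observational units
rhr_per_person = [(w + n) / 2 for w, n in zip(wrist, neck)]
print(rhr_per_person)

# The analysis then uses 8 observations (one per EU), not 16
n_for_analysis = len(rhr_per_person)
print("observations used in the analysis:", n_for_analysis)
```

Treating all 16 readings as independent observations would double the apparent sample size without adding any new participants, which is how the error gets underestimated.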
Exercise Example
• Use formula:
# Reps = # EUs / # Treatments
• 36 participants, 3 treatments
• → 36/3 = 12 replications per treatment in the
balanced case
• The balanced case is preferred because:
• Power of test to detect a significant effect of our treatment
on the response is maximized with equal sample size
Exercise Example
• Unbalanced consequences?
• Suppose the following:
Treatment        Low      Moderate   High
# Participants   9 reps   9 reps     18 reps
• This would lead to more precise estimation of the
high intensity treatment mean than of the other two
• Thus if you have equal interest in estimating the
treatments, try to assign each treatment to the same
number of EUs
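A rough way to see the cost of imbalance is to compare standard errors of differences between treatment means. The sketch below assumes independent observations with a common standard deviation sigma (set to 1 here as a placeholder). The function name is hypothetical.

```python
import math

def se_difference(n_i, n_j, sigma=1.0):
    """Standard error of the difference between two treatment means,
    assuming independent observations with common standard deviation sigma."""
    return sigma * math.sqrt(1.0 / n_i + 1.0 / n_j)

balanced = {"low": 12, "moderate": 12, "high": 12}
unbalanced = {"low": 9, "moderate": 9, "high": 18}
pairs = [("low", "moderate"), ("low", "high"), ("moderate", "high")]

for name, design in (("balanced", balanced), ("unbalanced", unbalanced)):
    print(name)
    for a, b in pairs:
        se = se_difference(design[a], design[b])
        print(f"  SE({a} - {b}) = {se:.3f} x sigma")
```

Under these assumptions, the unbalanced design compares low with moderate less precisely than the balanced design does, while its comparisons involving the high group are no more precise than in the balanced design.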
Summary
• The number of replications is the number of
experimental units to which a treatment is
assigned
• Replicating in an experiment helps us
decrease variance and increase precision in
estimating treatment effects
THREE BASIC PRINCIPLES
OF DOE: Blocking
(or Local Control of Error)
Local Control of Error
• What?
• Any means of improving accuracy of measuring
treatment effects in design
• Why?
• Removes sources of nuisance experimental
variability
• Improves precision with which comparisons among
factors are made
• How?
• Often through use of blocking (or ANCOVA)
Blocking
• What?
• A block is a set of relatively homogeneous
experimental conditions
• EX: block on time, proximity of experimental units, or
characteristics of experimental units
• How?
• Separate randomizations for each block
• Account for differences in blocks and then compare
the treatments
Exercise Example
• Block on gender?
• This assumes that males and females have different
responses to exercise intensity
• Would have the following (balanced) design:
BLOCK 1: 24 MALES    BLOCK 2: 12 FEMALES
8 low                4 low
8 moderate           4 moderate
8 high               4 high
• Here, after the participants are blocked into male/female
groups, they are then randomly assigned into one of
three treatment conditions
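Randomizing separately inside each block can be sketched as follows. The participant IDs (M01 to M24, F01 to F12), the function name, and the seed are assumptions for illustration only.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Run a separate balanced randomization inside each block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        if len(units) % len(treatments) != 0:
            raise ValueError(f"block {block_name!r} cannot be split evenly")
        shuffled = list(units)
        rng.shuffle(shuffled)  # shuffle only the units in this block
        per_group = len(shuffled) // len(treatments)
        for i, trt in enumerate(treatments):
            for unit in shuffled[i * per_group:(i + 1) * per_group]:
                assignment[unit] = (block_name, trt)
    return assignment

blocks = {
    "male": [f"M{i:02d}" for i in range(1, 25)],    # 24 hypothetical IDs
    "female": [f"F{i:02d}" for i in range(1, 13)],  # 12 hypothetical IDs
}
programs = ["low", "moderate", "high"]
plan = randomize_within_blocks(blocks, programs, seed=24)

for block_name in blocks:
    for trt in programs:
        n = sum(1 for b, t in plan.values() if (b, t) == (block_name, trt))
        print(block_name, trt, n)
```

Because each block is shuffled on its own, every program receives 8 males and 4 females, matching the balanced layout in the table.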
Exercise Example
• Block on age?
• This assumes that age may influence the effect exercise
intensity has on resting heart rate
• Would have the following (balanced) design:
BLOCK 1: 18-24 years (24 ptps)   BLOCK 2: 24-35 years (6 ptps)   BLOCK 3: 35-50 years (6 ptps)
8 low                            2 low                           2 low
8 moderate                       2 moderate                      2 moderate
8 high                           2 high                          2 high
• Here, after the participants are blocked into respective
age groups, they are then randomly assigned into one
of three treatment conditions
Randomized Complete Block Design
(RCBD)
• This design is called Generalized RCBD
• Generalized merely means there are replications
involved
• Here, each treatment appears in each block an equal
number of times
• Benefits of RCBD
• We can compare the performance of the three
treatments (exercise programs)
• We can account for the variability in gender that might
otherwise obscure the treatment effects
Summary
• Blocking is separating EUs into groups with
similar characteristics
• It allows us to remove a source of nuisance
variability, and increase our ability to detect
treatment differences
• Randomization is conducted within each block
Note that we cannot make causal inferences
about blocks– only treatment effects!
EXAMPLE: Gene Expression
in Leukemia Cells
Leukemia Cells Background
• Suppose we are interested in how different treatment
groups affect gene expression in human leukemia
cells
• There are three treatment groups:
• MP only
• MP with low dose MTX
• MP with high dose MTX
• Each treatment group has 10 obs
What type of design is this?
CRD Assumptions and Background
• The simplest design assumes that all the EUs
are similar and the only major source of
variation is the treatments
• Recall: A CRD randomizes all treatment-EU
assignments for the specified number of treatment
replications
• Recall: We want to aim to have a balanced
experiment, i.e. equal replications of each
treatment
Leukemia Cells
• As before, we want to randomize which subjects
receive which of the three treatments
• The data looks as follows:
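Treatments: MP ONLY, MP + HDMTX, MP + LDMTX
Observations (30 values; the row alignment of the original table
did not survive extraction, so they are listed in extracted order):
334.5, 31.6, 701, 919.4, 404.2, 1024, 108.4, 26.1, 41.2, 61.2,
69.6, 54.1, 62.8, 671.6, 882.1, 354.2, 321.9, 91.1, 240.8, 191.1,
69.7, 67.5, 242.8, 62.7, 66.6, 120.7, 881.9, 396.9, 23.6, 290.4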
Leukemia Cells – Pre randomization
These EUs
should be
similar
MP only
MP + LDMTX
MP + HDMTX
Leukemia Cells – Post randomization
Leukemia Cells in JMP
• We want to enter the data such that each
response has its own row, with the
corresponding treatment type
• We then choose Analyze → Fit Y by X
Leukemia Cells in JMP
• Choose
“GeneExp” for
Y, Response
• Choose
“Treatment” for
X, factor
Leukemia Cells Visual Analysis
What do you see
from this graph (to
the left) here?
• General comments
• Treatment 3 has a smaller spread of data than the other two
• Treatment 2 has the highest average “gene expression”,
followed by Treatment 1, then Treatment 3
• Are the differences substantial?
Leukemia Cells Summary of Fit
• R-square is a measure of fit.
• If it is close to 1, a good model
is indicated.
• If it is close to 0, a poor model
is indicated
• In more technical terms, it is the percent of variation
in response (gene expression) that can be explained
by our predictor (treatment group).
Based on this first glance at the summary of fit,
what would you conclude?
Leukemia Cells ANOVA
• Null hypothesis: The treatments have the same means
• Test: Is there at least one treatment effect that is
different from the rest?
SSTotal = SSTrt + SSError
• SSTotal: variation of all observations around the mean of
all the data
• SSTrt: variation of treatment means around the overall mean
• SSError: variation of observations around their respective
treatment means
Leukemia Cells ANOVA
• SSTotal is the variation visualized in this plot
• Each of these groups has its own mean. SSTrt compares
these means to the overall mean
• SSError compares each observation to its treatment mean
Leukemia Cells: ANOVA
• If the treatments have a similar effect, then SSTrt will be
small (since treatment means are close to overall mean)
• If the treatments are different, then SSTrt will be large
(since more of SSTotal comes from SSTrt, i.e. treatment
differences are explaining the variance)
• ANOVA Table calculates these values and gives us a test
statistic (F Ratio) to test for treatment effects
Leukemia Cells: ANOVA
• Under our null hypothesis, F= MSTrt/MSError follows an
F-distribution; from this we obtain our p-value
• Here Prob > F = 0.0544, which is just over the typical
α=0.05 cutoff
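The decomposition and the F ratio can be reproduced by hand. The sketch below uses small hypothetical responses (NOT the leukemia data) and a hypothetical function name. It stops at the F ratio, since the p-value JMP reports comes from comparing that ratio to an F-distribution with (df_trt, df_error) degrees of freedom.

```python
def one_way_anova(groups):
    """Sums of squares, degrees of freedom, F ratio and R-square for a
    one-way layout. groups maps a treatment label to its observations."""
    all_obs = [y for obs in groups.values() for y in obs]
    n_total = len(all_obs)
    grand_mean = sum(all_obs) / n_total
    means = {g: sum(obs) / len(obs) for g, obs in groups.items()}

    ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
    ss_trt = sum(len(obs) * (means[g] - grand_mean) ** 2
                 for g, obs in groups.items())
    ss_error = sum((y - means[g]) ** 2
                   for g, obs in groups.items() for y in obs)

    df_trt = len(groups) - 1
    df_error = n_total - len(groups)
    f_ratio = (ss_trt / df_trt) / (ss_error / df_error)
    return {
        "ss_total": ss_total, "ss_trt": ss_trt, "ss_error": ss_error,
        "df_trt": df_trt, "df_error": df_error,
        "f_ratio": f_ratio, "r_square": ss_trt / ss_total,
    }

# Hypothetical responses for three treatments, for illustration only
example = {"T1": [10, 12, 14], "T2": [20, 22, 24], "T3": [12, 14, 16]}
result = one_way_anova(example)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

A large F ratio means the treatment means vary much more than the within-treatment noise would suggest. In the leukemia example the within-treatment variation was large enough to keep the ratio just short of the usual cutoff.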
Summary of Leukemia Example
• Our ANOVA test failed to reject the null hypothesis that
the treatment means are the same (p-value =0.0544)
• It seems that although the treatment means appeared to
be very different (237.58, 478.62, and 165.20 for
treatments 1, 2, and 3 respectively), the variation of
observations from their respective treatment means was
so large that not enough of the variation in SSTotal could
be attributed to treatment differences
Take a 10 minute break!
EXAMPLE: Nozzle Designs
and Shape Factor
Nozzles & Shapes Background
• Suppose we are interested in how nozzle design (5
types here) affects the shape factor in the
performance of turbulent water jets
• However, the jet efflux velocity has been known to
influence the shape factor in a way that is hard to
control.
• What is this called?
• What can we do to account
for this source of variation?
Nozzles & Shapes Runs
• Suppose we only have five nozzles total, one of each
type of design. Here is a case where we would
randomize run order (rather than treatment)
Nozzle design run order, by jet efflux velocity (m/s) block:
Block 1   Block 2   Block 3   Block 4   Block 5   Block 6
2         3         2         1         5         4
3         5         1         4         4         1
1         2         4         5         1         3
4         1         5         3         2         2
5         4         3         2         3         5
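A run-order plan of this kind can be generated with one independent shuffle per block. This is a sketch only; the seed and block labels are arbitrary assumptions.

```python
import random

rng = random.Random(5)  # arbitrary seed so the plan can be reproduced
nozzles = [1, 2, 3, 4, 5]

run_plan = {}
for block in range(1, 7):
    order = nozzles[:]   # every nozzle is run once in each block
    rng.shuffle(order)   # fresh, independent shuffle for this block
    run_plan[f"Block {block}"] = order

for block, order in run_plan.items():
    print(block, "-> run nozzles in order", order)
```

Each block ends up with all five nozzles exactly once, and only the order in which they are run changes from block to block.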
Nozzles & Shapes Data
• The data looks as follows:
Nozzle    Jet Efflux Velocity (m/s)
Design    11.73   14.37   16.59   20.43   23.46   28.74
1         0.78    0.80    0.81    0.75    0.77    0.78
2         0.85    0.85    0.92    0.86    0.81    0.83
3         0.93    0.92    0.95    0.89    0.89    0.83
4         1.14    0.97    0.98    0.88    0.86    0.83
5         0.97    0.86    0.78    0.76    0.76    0.75
• How many replicates do we have per treatment?
• What type of design is this?
RCBD Background
• Given t treatments and b blocks, a RCBD has one
observation per treatment in each block
• If we have multiple observations per treatment in each
block (replicates), this is a generalized RCBD
• Is our nozzle example a RCBD or GRCBD?
• In a (G)RCBD,
• Blocks represent a restriction on randomization
• We want to randomize treatment order within each block
We are essentially running separate CRDs
for each block!
Nozzles & Shapes in JMP
• To analyze in JMP, we want
to enter the data such that
each response is lined up
in a different row, with its
associated characteristics
in the same row
Note: make sure Nozzle
Design and Jet Efflux are
listed as nominal variables!
Again, choose Analyze → Fit Y by X
Nozzles & Shapes in JMP
• Choose “Shape
Factor” for Y,
Response
• Choose “Nozzle
Design” for X,
factor
• Choose “Jet
Efflux” for Block
Nozzles & Shapes Visual Analysis
• Always look for
a visual pattern
first. What do
you see in this
graph of shape
factor against
nozzle design?
It appears that nozzle design 4 has the highest shape
factor, followed by design 3, design 2, design 5, then
design 1.
Nozzles & Shapes Analysis
• From our earlier
discussion on R-square
values and ANOVA tests,
what is your first intuition
here?
Nozzles & Shapes ANOVA
• The p-value of Nozzle Design is significant. What does
that mean?
• We have an additional Sum of Squares here. What is
it? Are we interested in its effects?
• The p-value of Jet Efflux is significant. What does that
mean?
Nozzles & Shapes ANOVA
• The p-value of Jet Efflux indicates how much we
reduced experimental error → this means that blocking
was a good idea!
• Can we do a CRD analysis if we find out blocking was
a bad idea? (i.e. p-value of Jet Efflux is high?)
• No. Because we did not design the experiment using CRD
protocol we cannot conduct the analysis this way.
• What do you think our next steps should be?
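For readers without JMP, the sums of squares behind the blocked ANOVA can be computed directly from the nozzle data. This is a sketch assuming the usual additive RCBD model (treatment plus block, one observation per cell). It reports F ratios only; the p-values come from the corresponding F-distributions.

```python
# Shape factor by nozzle design (rows) and jet efflux velocity block (columns)
velocities = [11.73, 14.37, 16.59, 20.43, 23.46, 28.74]
shape = {
    1: [0.78, 0.80, 0.81, 0.75, 0.77, 0.78],
    2: [0.85, 0.85, 0.92, 0.86, 0.81, 0.83],
    3: [0.93, 0.92, 0.95, 0.89, 0.89, 0.83],
    4: [1.14, 0.97, 0.98, 0.88, 0.86, 0.83],
    5: [0.97, 0.86, 0.78, 0.76, 0.76, 0.75],
}

t = len(shape)       # number of treatments (nozzle designs)
b = len(velocities)  # number of blocks (velocities)
grand = sum(sum(row) for row in shape.values()) / (t * b)
trt_mean = {k: sum(row) / b for k, row in shape.items()}
blk_mean = [sum(shape[k][j] for k in shape) / t for j in range(b)]

ss_total = sum((y - grand) ** 2 for row in shape.values() for y in row)
ss_trt = b * sum((m - grand) ** 2 for m in trt_mean.values())
ss_block = t * sum((m - grand) ** 2 for m in blk_mean)
# Residual after removing the treatment and block effects from each cell
ss_error = sum(
    (shape[k][j] - trt_mean[k] - blk_mean[j] + grand) ** 2
    for k in shape for j in range(b)
)

df_trt, df_block = t - 1, b - 1
df_error = df_trt * df_block
f_trt = (ss_trt / df_trt) / (ss_error / df_error)
f_block = (ss_block / df_block) / (ss_error / df_error)

print(f"SSTrt={ss_trt:.4f}  SSBlock={ss_block:.4f}  SSError={ss_error:.4f}")
print(f"F(nozzle)={f_trt:.2f}  F(jet efflux)={f_block:.2f}")
```

The block sum of squares is variation that a CRD analysis would have left inside SSError, which is why a large block effect signals that blocking paid off.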
Contrasts
• Given v treatments and the treatment means τ1…τv :
• Note: Here, we have 5 treatments, so we would just
have our five treatment means τ1, τ2, τ3, τ4 and τ5
• A contrast is a specific linear combination of these
means
• For example, if we were comparing treatments 1 and
2, we would have contrast :
(1) τ1 + (-1) τ2 + (0) τ3 + (0)τ4 + (0) τ5
τ1 - τ2
Contrasts
• The most important contrasts include:
• Pairwise treatment comparisons
• Group average comparisons
Nozzles & Shapes Means Comparison
• ANOVA only tells us if there is at least one pair of nozzle
means that differ → conduct pairwise comparisons
τ1 – τ2
τ1 – τ3
τ1 – τ4
τ1 – τ5
τ2 – τ3
τ2 – τ4
τ2 – τ5
τ3 – τ4
τ3 – τ5
τ4 – τ5
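The ten comparisons listed above are all the pairs that can be formed from five treatments. A short sketch that enumerates them as contrast coefficient vectors (the function name is hypothetical):

```python
from itertools import combinations

def pairwise_contrasts(v):
    """Return ((i, j), coefficients) for every pair i < j of v treatments."""
    result = []
    for i, j in combinations(range(1, v + 1), 2):
        coeffs = [0] * v
        coeffs[i - 1] = 1    # +1 on the first treatment of the pair
        coeffs[j - 1] = -1   # -1 on the second treatment of the pair
        result.append(((i, j), coeffs))
    return result

contrasts = pairwise_contrasts(5)  # 5 nozzle designs
for (i, j), coeffs in contrasts:
    print(f"tau{i} - tau{j}: {coeffs}")
print("number of pairwise comparisons:", len(contrasts))
```

Every coefficient vector sums to zero, which is what makes each of these linear combinations a contrast.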
Nozzles & Shapes Means Comparison
• We find Nozzle 4 has
a higher shape factor
than Nozzles 5 and 1
• Nozzle 3 has a higher
shape factor than only
Nozzle 1
Summary of Nozzles & Shapes
• In this case, blocking was key in reducing
experimental error, allowing us to better distinguish
whether at least one of the nozzle designs differed
from another (ANOVA Test)
• This means differences in jet efflux velocity were
causing significant variation in shape factor responses
• Tukey’s pairwise comparisons test allowed us to see
which specific nozzle designs differed. We found that:
Shape(nozzle 4) > Shape(nozzle 1), Shape(nozzle 5)
Shape(nozzle 3) > Shape(nozzle 1)
Introduction to:
Factorial Designs
CRD Extension: More than one factor
• Suppose we have two or more factors, each with 2+
levels/settings, that we want to investigate to see how
they affect the response
• What are some ways we can conduct an experiment?
• “Best guess” experiments: researchers have practical and
theoretical knowledge they use to “set levels”
• OFAT experiments: vary each factor individually while
holding other factors constant at baseline levels
• Factorial experiments: Factors are varied together, and
response of interest is observed at each combination of
levels
What are the PROs and CONs of each method?
CRD Extension: More than one factor
• Best Guess Experiments
• PRO: experimenters have a good idea of what might work
• CON: Can lead to guessing for a long time without
guarantee of success
• OFAT Experiments
• PRO: easy to interpret, and used extensively in practice
• CON: Can be inefficient, may not reach optimum solution,
and fails to consider interactions (will discuss later)
• Factorial Experiment
• PRO: Efficient, can detect interactions
CRD Extension: Factorial Experiments
• In a factorial experiment, treatments are a combination of
multiple factors with different levels (i.e. settings)
• There can be as few as two (common) and as many as
desired (though this severely complicates the design)
• EX: In the Leukemia example, we could alter the
experiment to low and high doses of MP and MTX so that
there are now four “treatments”
                   MTX level
MP level           Low        High
Low
High
Leukemia Cells – Factorial Design
MP Low
MTX Low
MP High
MTX Low
MP Low
MTX High
MP High
MTX High
Remember to still randomize these treatments
across participants!
Leukemia Cells – Factorial Design
• Data is collected as follows:
Factor A (MP)   Factor B (MTX)   Treatment Combo   Rep I   Rep II   Rep III
-               -                A low, B low      88.6    122      91.2
+               -                A high, B low     145.2   171.8    178.9
-               +                A low, B high     163     200.2    169.3
+               +                A high, B high    460.4   492.3    483.1
How do we analyze a Factorial Experiment?
Factorial ANOVA
• Given two factors, A and B, with varying number of
levels, what do we want to examine to see how A and
B affect the response?
• Overall mean (of all the data)
• Cell Means (mean for each treatment combo)
• Factor A and B level means
• We use the same ANOVA approach, but further
decompose SSTrt into pieces for different factors
SSTrt=SSA+SSB+SSAB
Factorial ANOVA
• Visualize in contingency table:
                   MTX level
MP level           Low                   High
Low                Cell mean             Cell mean              MP Low Factor Mean
High               Cell mean             Cell mean              MP High Factor Mean
                   MTX Low Factor Mean   MTX High Factor Mean   Overall Mean
Factorial ANOVA
• Visualize in contingency table:
                   MTX level
MP level           Low       High
Low                100.6     177.5     139.05
High               165.3     478.6     321.95
                   132.95    328.05    230.5
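The entries of this table can be recomputed from the replicate data on the earlier data slide. A minimal sketch, with keys written as (MP level, MTX level) and variable names chosen for illustration:

```python
# Replicates for each (MP level, MTX level) combination
data = {
    ("low", "low"): [88.6, 122.0, 91.2],
    ("high", "low"): [145.2, 171.8, 178.9],
    ("low", "high"): [163.0, 200.2, 169.3],
    ("high", "high"): [460.4, 492.3, 483.1],
}
levels = ["low", "high"]

def mean(xs):
    return sum(xs) / len(xs)

cell_mean = {combo: mean(obs) for combo, obs in data.items()}
# Factor-level means average over every observation at that level
mp_mean = {a: mean([y for (mp, _), obs in data.items() if mp == a for y in obs])
           for a in levels}
mtx_mean = {b: mean([y for (_, mtx), obs in data.items() if mtx == b for y in obs])
            for b in levels}
overall = mean([y for obs in data.values() for y in obs])

for combo, m in cell_mean.items():
    print("cell", combo, round(m, 2))
print("MP level means:", {k: round(v, 2) for k, v in mp_mean.items()})
print("MTX level means:", {k: round(v, 2) for k, v in mtx_mean.items()})
print("overall mean:", round(overall, 2))
```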
Factorial ANOVA
• Let’s break down SSTrt into its respective pieces:
SSTrt = SSA + SSB + SSAB
• SSTrt: Compares cell means to overall mean
• SSA: Compares A level means to overall mean
• SSB: Compares B level means to overall mean
• SSA and SSB test for main effects of factors A and B
• Main effect: average effect of changing from one level
of the factor to another, averaging over all levels of the
other factors
Factorial ANOVA
• Let’s break down SSTrt into its respective pieces:
SSTrt = SSA + SSB + SSAB
• SSAB: Tests for interaction between A and B
• Interaction: When the effect of factor A on the
response depends on the level of factor B
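The split of SSTrt into SSA, SSB and SSAB can be checked numerically on the factorial leukemia data. The sketch assumes a balanced 2x2 design with 3 replicates per cell, as in the data slide. Variable names are illustrative.

```python
# Replicates keyed by (A = MP level, B = MTX level)
data = {
    ("low", "low"): [88.6, 122.0, 91.2],
    ("high", "low"): [145.2, 171.8, 178.9],
    ("low", "high"): [163.0, 200.2, 169.3],
    ("high", "high"): [460.4, 492.3, 483.1],
}
levels = ["low", "high"]
n = 3  # replicates per cell

def mean(xs):
    return sum(xs) / len(xs)

cell = {k: mean(v) for k, v in data.items()}
a_mean = {a: mean([cell[(a, b)] for b in levels]) for a in levels}  # MP level means
b_mean = {b: mean([cell[(a, b)] for a in levels]) for b in levels}  # MTX level means
overall = mean(list(cell.values()))

ss_trt = n * sum((m - overall) ** 2 for m in cell.values())
ss_a = n * len(levels) * sum((m - overall) ** 2 for m in a_mean.values())
ss_b = n * len(levels) * sum((m - overall) ** 2 for m in b_mean.values())
# Interaction: what is left in each cell after removing both main effects
ss_ab = n * sum(
    (cell[(a, b)] - a_mean[a] - b_mean[b] + overall) ** 2
    for a in levels for b in levels
)

print(f"SSA={ss_a:.2f}  SSB={ss_b:.2f}  SSAB={ss_ab:.2f}  SSTrt={ss_trt:.2f}")
```

Averaging cell means to get the level means is valid here only because every cell has the same number of replicates.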
Interaction
• How to determine an interaction?
• Look at behavior of the means as the levels vary
Main Effects & Interaction
• Main effects and interactions are specific types of
important contrasts
• Recall from our discussion of contrasts that group
average contrasts are common:
• Let’s suppose the treatment means are as such:
                   MTX level
MP level           Low     High
Low                τ1      τ2
High               τ3      τ4
Main Effects & Interaction
                   MTX level
MP level           Low     High
Low                τ1      τ2
High               τ3      τ4
Main Effects
MP: ½ (τ3 + τ4 – τ1 – τ2)
MTX: ½ (τ2 + τ4 – τ1 – τ3)
Interaction
MP*MTX: ½ (τ4 + τ1 – τ2 – τ3)
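Plugging the cell means from the factorial example into these contrasts gives numerical estimates. This sketch takes the cell means as the estimates of τ1 to τ4, laid out as in the table above.

```python
# Cell means from the factorial example, labelled as in the table:
# tau1 = MP low / MTX low,   tau2 = MP low / MTX high,
# tau3 = MP high / MTX low,  tau4 = MP high / MTX high
tau = [100.6, 177.5, 165.3, 478.6]

contrasts = {
    "MP main effect": [-0.5, -0.5, 0.5, 0.5],
    "MTX main effect": [-0.5, 0.5, -0.5, 0.5],
    "MP*MTX interaction": [0.5, -0.5, -0.5, 0.5],
}

estimates = {
    name: sum(c * t for c, t in zip(coeffs, tau))
    for name, coeffs in contrasts.items()
}
for name, value in estimates.items():
    print(f"{name}: {value:.1f}")
```

The two main-effect estimates equal the differences between the level means in the table of means (321.95 - 139.05 for MP and 328.05 - 132.95 for MTX).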
Factorial Design: Summary
• One treatment is combination of multiple factors
• Efficient way to test effect of multiple treatment
factors
• We may extend to more than two factors, but the
number of EUs necessarily grows rapidly!
• Use an interaction plot to help visualize effects
• Main effects and interactions can be represented
through group average contrasts
Wrap-Up: Conclusions &
Questions
Summary of the Short Course
• Remember to randomize!
• Randomize run order, and treatments
• Remember to replicate!
• Use multiple EUs for each treatment– it will help you be more
accurate in estimating your effects
• Remember to block!
• In the case where you suspect some inherent quality of your
experimental units may be causing variation in your response,
arrange your experimental units into groups based on similarity
in that quality
• Remember to contact LISA!
• For short questions, attend our Walk-in Consulting hours
• For research, come before you collect your data for design help
The End!
References
• Cheok, M. H., Yang, W., Pui, C. H., Downing, J. R., Cheng, C., Naeve, C. W.,
. . . Evans, W. E. (2003). Treatment-specific changes in gene expression
discriminate in vivo drug response in human leukemia cells. Nature Genetics,
34(1).
• Theobald, C. (1981). The effect of nozzle design on the stability and
performance of turbulent water jets. Fire Safety Journal, 4(1).
• http://www.floiminter.net/?page_id=286
• http://io9.com/5928595/researchers-identify-the-kinds-of-exercise-that-helpyou-live-longer
• http://www.webmd.com/heart/taking-a-pulse-heart-rate
• http://www.docofdiets.com/exercise-3.htm
• http://en.wikipedia.org/wiki/Leukemia
• http://www.wash-safe.com/wash_safe_en/bulls-eye-garden-hose-nozzle
• http://www.rgbstock.com/photo/n3eIwlU/
• http://www.operationsports.com/forums/madden-nfl-football-sliders/573238m13-killer-sliders-offline-ccm-guaranteed-cpu-run-game-92.html
• http://www.123rf.com/photo_4819916_heart-rate-vector-design-element.html