University of Cape Town
Department of Statistical Sciences
Design of Experiments
Course Notes for STA2005S
J. Juritz, F. Little, B. Erni
August 31, 2022
Contents

1 Introduction
   1.1 Observational and Experimental Studies
   1.2 Definitions
   1.3 Why do we need experimental design?
   1.4 Replication, Randomisation and Blocking
   1.5 Experimental Design
       Treatment Structure
       Blocking structure
       Three basic designs
   1.6 Design for Observational Studies / Sampling Designs
   1.7 Methods of Randomisation
       Using random numbers for a CRD (completely randomised design)
       Using random numbers for a RBD
       Randomising time or order
   1.8 Summary: How to design an experiment

2 Single Factor Experiments: Three Basic Designs
   2.1 The Completely Randomised Design (CRD)
   2.2 Randomised Block Design (RBD)
   2.3 The Latin Square Design
       Randomisation
   2.4 An Example

3 The Linear Model for Single-Factor Completely Randomized Design Experiments
   3.1 The ANOVA linear model
   3.2 Least Squares Parameter Estimates
       Linear model in matrix form
       Speed-reading Example
       The effect of different constraints on the solution to the normal equations
       Design matrices of less than full rank
       Parameter estimates for the single-factor completely randomised design
   3.3 Standard Errors and Confidence Intervals
       Important estimates and their standard errors
       Speed-reading Example
   3.4 Analysis of Variance (ANOVA)
       Decomposition of Sums of Squares
       Distributions for Sums of Squares
       F test of H0 : α1 = α2 = · · · = αa = 0
       ANOVA table
       Some Notes
       Speed-reading Example
   3.5 Randomization Test for H0 : α1 = . . . = αa = 0
   3.6 A Likelihood Ratio Test for H0 : α1 = . . . = αa = 0
   3.7 Kruskal-Wallis Test

4 Comparing Means: Contrasts and Multiple Comparisons
   4.1 Contrasts
       Point estimates of Contrasts and their variances
   4.2 Orthogonal Contrasts
       Calculating sums of squares for orthogonal contrasts
   4.3 Multiple Comparisons: The Problem
   4.4 To control or not to control the experiment-wise Type I error rate
       Exploratory vs confirmatory studies
   4.5 Bonferroni, Tukey and Scheffé
       Bonferroni Correction
       Tukey's Method
       Scheffé's Method
       Example: Strength of Welds
   4.6 Multiple Comparison Procedures: The Practical Solution
   4.7 Summary
   4.8 Orthogonal Polynomials
   4.9 References

5 Randomised Block and Latin Square Designs
       Model
       Sums of Squares and ANOVA
       Example
   5.1 The Analysis of the RBD
       Estimation of µ, αi (i = 1, 2, . . . , a) and βj (j = 1, 2, . . . , b)
       Analysis of Variance for the Randomised Block Design
       Example: Timing of Nitrogen Fertilization for Wheat
   5.2 Missing values – unbalanced data
       Missing Values
   5.3 Randomization Tests for Randomized Block Designs
   5.4 Friedman Test
   5.5 The Latin Square Design
       Model for a Latin Square Design Experiment
       Example: Rocket Propellant
       Other uses of Latin Squares
       Blocking for 3 factors - Graeco-Latin Squares

6 Power and Sample Size in Experimental Design
   6.1 Introduction
   6.2 Two-way ANOVA model

7 Factorial Experiments
   7.1 Introduction
   7.2 Basic Definitions
   7.3 Design of Factorial Experiments
       Example: Effect of selling price and type of promotional campaign on number of items sold
       Interactions
       Sums of Squares
   7.4 The Design of Factorial Experiments
       Replication and Randomisation
       Why are factorial experiments better than experimenting with one factor at a time?
       Performing Factorial Experiments
   7.5 Interaction
   7.6 Interpretation of results of factorial experiments
   7.7 Analysis of a 2-factor experiment
   7.8 Testing Hypotheses
       Test of the hypothesis H0 : γ1 = γ2 = . . . = γp = 0
   7.9 Power analysis and sample size
   7.10 Multiple Comparisons for Factorial Experiments
   7.11 Higher Way Layouts
   7.12 Examples

8 Some other experimental designs and their models
   8.1 Fixed and Random effects
   8.2 The Random Effects Model
       Testing H0 : σa² = 0 versus Ha : σa² > 0
       Variance Components
       An ice cream experiment
   8.3 Nested Designs
       Calculation of the ANOVA table
       Estimates of parameters of the nested design
   8.4 Repeated Measures
1 Introduction
1.1 Observational and Experimental Studies
There are two fundamental ways to obtain information in research:
by observation or by experimentation. In an observational study the
observer watches and records information about the subject of
interest. In an experiment the experimenter actively manipulates the
variables believed to affect the response. Contrast the two great
branches of science, Astronomy in which the universe is observed by
the astronomer, and Physics, where knowledge is gained through
the physicist changing the conditions under which the phenomena
are observed.
In the biological world, an ecologist may record the plant species
that grow in a certain area and also the rainfall and the soil type,
then relate the condition of the plants to the rainfall and soil type.
This is an observational study. Contrast this with an experimental
study in which the biologist grows the plants in a greenhouse in
various soils and with differing amounts of water. He decides on
the conditions under which the plants are grown and observes the
effect of this manipulation of conditions on the response.
Both observational and experimental studies give us information
about the world around us, but it is only by experimentation that
we can infer causality; at least it is a lot more difficult to infer causal
relationships with data from observational studies. In a carefully
planned experiment, if a change in variable A, say, results in a
change in the response Y, then we can be sure that A caused this
change, because all other factors were controlled and held constant.
In an observational study if we note that as variable A changes Y
changes, we can say that A is associated with a change in Y but we
cannot be certain that A itself was the cause of the change.
Both observational and experimental studies need careful planning to be effective. In this course we concentrate on the design of
experimental studies.
Clinical Trials¹: Historically, medical advances were based on anecdotal data; a doctor would examine six patients and from this write a paper and publish it. Medical practitioners started to become aware of the biases resulting from these kinds of anecdotal studies. They started to develop the randomized double-blind clinical trial, which has become the gold standard for approval of any new product, medical device, or procedure.

¹ https://en.wikipedia.org/wiki/Clinical_trial
1.2 Definitions
For the most part the experiments we consider aim to compare the
effects of a number of treatments. The treatments are carefully chosen and controlled by the experimenter.
1. The factors/variables that are investigated, controlled, manipulated in the experiment, are called treatment factors. Usually,
for each treatment factor, the experimenter chooses some specific levels of interest, e.g. the factor ‘water level’ can have levels
1cm, 5cm, 10cm; the factor ‘background music’ can have levels
‘Classical’, ‘Jazz’, ‘Silence’.
2. In a single-factor experiment2 the treatments will correspond to
the levels of the treatment factor (e.g. for the water level experiment the treatments will be 1cm, 5cm, 10cm).
With more than one treatment factor, the treatments can be constructed by crossing all factors: every possible combination of
the levels of factor A and the levels of factor B is a treatment.
Experiments with crossed treatment factors are called factorial experiments3 . More rarely in true experiments, factors can be nested
(see Section 1.5).
3. Experimental unit: this is the entity to which a treatment is assigned. The experimental unit may differ from the observational
or sampling unit, which is the entity from which a measurement
is taken. For example, one may apply the treatment of ‘high
temperature and low water level’ to a pot of plants containing 5
individual plants. Then we measure the growth of each of these
plants. The experimental unit is the pot (this is where the treatment is applied), the observational units are the plants (this is
the unit on which the measurement is taken). This distinction
is very important because it is the experimental units which determine the (experimental) error variance, not the observational
units. This is because we are interested in what happens if we
independently repeat the treatment. For comparing treatments, we
obtain one (response) value per pot (average growth in the pot),
one value per experimental unit.
² Single-factor experiment: only a single treatment factor is under investigation.

³ Factorial experiments are experiments with at least two treatment factors, and the treatments are constructed as the different combinations of the levels of the individual treatment factors.

[Figure 1.1: Example of a 2 × 2 × 2 = 2³ factorial experiment with three treatment factors (A, B and C), each with two levels (low and high, coded as − and +), resulting in eight treatments (vertices of a cube) (Montgomery Chapter 6).]

4. Experimental units which are roughly similar prior to the experiment are said to be homogeneous. The more homogeneous the experimental units are, the smaller the experimental error variance (variation between observations which have received the
same treatment = variance that cannot be explained by known
factors) will be. It is generally desirable to have homogeneous
experimental units for experiments, because this allows us to
detect the differences between treatments more clearly.
5. If the experimental units are not homogeneous, but heterogeneous, we can group sets of homogeneous experimental units
and thereby account for differences between these groups. This
is called blocking. For example, farms of similar size and in the
same region could be considered homogeneous / more similar
to each other than to farms in a different region. Farms (experimental units) in different regions will differ because of regional
factors such as vegetation and climate. If we suspect that these
differences will affect the response, we should block by region:
similar farms (experimental units) are grouped into blocks.
6. In the experiments we will consider, each experimental unit
receives only one treatment. But each treatment can consist of
a combination of factor levels, e.g. high temperature combined
with low water level can be one treatment, high temperature
with high water level another treatment.
7. The treatments are applied at random to the experimental units
in such a way that each unit is equally likely to receive a given
treatment. The process of assigning treatments to the experimental units in this way is called randomisation.
8. A plan for assigning treatments to experimental units is called an
experimental design.
9. If a treatment is applied independently to more than one experimental unit it is said to be replicated. Treatments must be
replicated! Making more than one observation on the same experimental unit is not replication. If the measurements on the
same experimental unit are taken over time there are methods
for repeated measures, longitudinal data, see Ch.8. If the measurements are all taken at the same time, as in the pot with 5
plants example above, this is just pseudoreplication. Pseudoreplication is a common problem (Hurlbert 1984), and will invalidate
the experiment.
10. We are mainly interested in the effects of the different treatments:
by how much does the response change with treatment i relative
to the overall mean response.
1.3
Why do we need experimental design?
An experiment is almost the only way in which one can control all
factors to such an extent as to eliminate any other possible explanation for a change in response other than the treatment factor of
concern. This then allows one to infer causality. To achieve this,
experiments need to adhere to a few important principles discussed
in the next section.
Experiments are frequently used to find optimal levels of settings
(treatment factors) which will maximise (or minimise) the response
(especially in engineering). Such experiments can save enormous
amounts of time and money.
1.4 Replication, Randomisation and Blocking
There are three fundamental principles when planning experiments. These will help to ensure the validity of the analysis and to
increase power:
1. Replication: each treatment must be applied independently to
several experimental units. This ensures that we can separate
error variance from differences between treatments.
True, independent replication demands that the treatment is set
up anew for each experimental unit, one should not set up the
experiment for a specific treatment and then run all experimental
units under that setting at the same time. This would result in
pseudo-replication, where effectively the treatment is applied
only once.
For example, if we were interested in the effect of temperature on
the taste of bread, we should not bake all the 180°C loaves in the same
oven at the same time, but prepare and bake each loaf separately.
Otherwise we would not be able to say whether the particular batch or
particular time of day, or oven setting, or the temperature was
responsible for the improved taste.
2. Randomisation: a method for allocating treatments to experimental units which ensures that:
• there is no bias on the part of the experimenter, either conscious or unconscious, in the assignment of the treatments to
the experimental units;
• possible differences between experimental units are equally
distributed amongst the treatments, thereby reducing or eliminating confounding.
Randomisation helps to prevent confounding with underlying,
possibly unknown, variables (e.g. changes over time). Randomisation allows us to assume independence between observations.
Both the allocation of treatments to the experimental material
and the order in which the individual runs or trials of the experiment are to be performed must be randomly determined!
Confounding: When we cannot attribute the change in response to a
specific factor but several factors could
have contributed to this change, we
call this confounding.
3. Blocking refers to the grouping of experimental units into homogeneous sets, called blocks. This can reduce the unexplained (error) variance, resulting in increased power for comparing treatments. Variation in the response may be caused by variation in
the experimental units, or by external factors that might change
systematically over the course of the experiment (e.g. if the experiment is conducted on different days). Such nuisance factors
should be blocked for whenever possible (else randomised).
Examples of factors for which one would block: time, age, sex,
litter of animals, batch of material, spatial location, size of a city.
Blocking also offers the opportunity to test treatments over a
wider range of conditions, e.g. if I only use people of one age
(e.g. students) I cannot generalise my results to older people.
However if I use different blocks (each an age category) I will
be able to tell whether the treatments have similar effects in all
age groups or not. If there are known groups in the experimental
units, blocking guards against unfortunate randomisations.
Blocking aims to reduce (or control) any variation in the experimental material, where possible, with the intention to increase
power (sensitivity)4 .
Another way to reduce error variance is to keep all factors not
of interest as constant as possible. This principle will affect how
experimental material is chosen.
The three principles above are sometimes called the three R’s
of experimental design (randomisation, replication, reducing
unexplained variation).
1.5 Experimental Design
The design that will be chosen for a particular experiment depends
on the treatment structure (determined by the research question)
and the blocking structure (determined by the experimental units
available).
Treatment Structure
Single (treatment) factor experiments are fairly straightforward.
One needs to decide on which levels of the single treatment factor
to choose. If the treatment factor is continuous, e.g. temperature, it
may be wise to choose equally spaced levels, e.g. 50, 100, 150, 200.
This will simplify analysis when you want to fit a polynomial curve
through this, i.e. investigate the form of the relationship between
temperature and the response.
If there is more than one treatment factor, these can be crossed,
giving rise to a factorial experiment, or nested.
⁴ When we later fit our model (a special type of regression model), we will add the blocking variable and hope that it will explain some of the total variation in our response.
Often, factorial experiments are illustrated by a graph such as
shown in Figure 1.2. This quickly summarizes which factors, factor
levels and which combinations are used in an experiment.
One important advantage of factorial experiments over onefactor-at-a-time experiments, is that one can investigate interactions.
If two factors interact, it means that the effect of the one depends
on the level of the other factor, e.g. the change in response when
changing from level a1 to a2 (of factor A) depends on what level
of B is being used. Often, the interesting research questions are
concerned with interaction effects. Interaction plots are very helpful
when trying to understand interactions. As an example, the success
of factor A (with levels a1 and a2) may depend on whether factor
B is present (b2) or absent (b1) (RHS of Figure 1.3). On the LHS of
this figure, the success of A does not depend on the level of B. We
can only explore interactions if we explore both factors in the same
experiment, i.e. use a factorial experiment.
[Figure 1.3 (two interaction plots, ‘no interaction’ and ‘A and B interact’): On the left factors A and B do not interact (their effects are additive). On the right A and B interact: the effect of one depends on the level of the other factor. The dots represent the mean response at a certain treatment, plotted against the levels a1 and a2 of factor A; the lines join treatments with the same level of factor B (b1 or b2), for easier reference.]

Factorial Experiments: In factorial experiments the total number of treatments (and experimental units required) increases rapidly, as each factor level combination is included. For example, if we have temperature, soil and water level, each with 2 levels, there are 2 × 2 × 2 = 8 combinations = 8 treatments.
Nested Factors: When factors are nested the levels of one factor,
B, will not be identical across all levels of another factor A. Each
level of factor A will contain different levels of factor B. These designs are common in observational studies; we will briefly look at
their analysis in Chapter 8.
Example of nested factors: In an animal breeding study we could
have two bulls (sires), and six cows (dames). Progeny (offspring) is
nested within dames, and dames are nested within sires.
                       sire 1                          sire 2
           dam 1      dam 2      dam 3      dam 4      dam 5      dam 6
progeny    1  2       3  4       5  6       7  8       9  10      11  12
[Figure 1.2: One way to illustrate a 3 × 2 factorial experiment: factor A (levels low, medium, high) on the horizontal axis and factor B (levels b1, b2) on the vertical axis. The three dots at each treatment illustrate three replicates per treatment.]

Blinding, Placebos and Controls: A control treatment is often
necessary as a benchmark to evaluate the effectiveness of the actual
treatments. For example, how do two new drugs compare, but also
are they any better than the current drug?
Placebo Effect: The physician’s belief in the treatment and the
patient’s faith in the physician exert a mutually reinforcing effect;
the result is a powerful remedy that is almost guaranteed to produce an improvement and sometimes a cure (Follies and Fallacies
in Medicine, Skrabanek & McCormick). The placebo effect is a measurable, observable or felt improvement in health or behaviour not
attributable to a medication or treatment.
A placebo is a control treatment that looks/tastes/feels exactly
like the real treatment (medical procedure or pill) but with the
active ingredient missing. The difference between the placebo and
treatment group is then only due to the active ingredient and not
affected by the placebo effect.
To measure the placebo effect one can use two control treatments: a placebo and a no-treatment control.
If humans are involved as experimental units or as observers,
psychological effects can creep into the results. In order to pre-empt
this, one should blind either or both observer and experimental unit
to the applied treatment (single- or double-blinded studies). The
experimental unit and / or the observer do not know which treatment was assigned to the experimental unit. Blinding the observer
prevents biased recording of results, because expectations could
consciously or unconsciously influence what is recorded.
Blocking structure
The most important aim of blocking is to reduce unexplained variation (error variance), and thereby to obtain more precise parameter
estimates. Here one should look at the experimental units available:
Are there any structures/differences that need to be blocked? Do
I want to include experimental units of different types to make the
results more general? How many experimental units are available
in each block? For the simplest designs covered in this course, the
number of experimental units in each block will correspond to the
total number of treatments. However, in practice this can often not
be achieved.
The grouping of the experimental units into homogeneous sets
called blocks and the subsequent randomisation of the treatments
to the units in a block form the basis of all experimental designs.
We will study three designs which form the basis of other more
complex designs. They are:
Three basic designs
1. Completely Randomised Design
This design is used when the experimental units are all homogeneous. The treatments are randomly assigned to the experimental units.
2. Randomised Block Design
This design is used when the experimental units are not all homogeneous but can be grouped into sets of homogeneous units
called blocks. The treatments are randomly assigned to the units
within each block.
3. Latin Square Design
This design allows blocking for two factors without increasing
the number of experimental units. Each treatment occurs only
once in every row block and once in every column block.
In all of these designs the treatment structure can be a single
factor or factorial (crossed factors).
Completely Randomised Design Example: Longevity of fruitflies depending on sexual activity and thorax length⁵
125 male fruitflies were divided randomly into 5 groups of 25
each. The response was the longevity of the fruitfly in days. One
group was kept solitary, while another was kept individually with
a virgin female each day. Another group was given 8 virgin females
per day. As an additional control the fourth and fifth groups were
kept with one or eight pregnant females per day. Pregnant fruitflies
will not mate. The thorax length of each male was measured as this
was known to affect longevity.
[Figure 1.4: The three basic designs: Completely Randomised Design (left), Randomised Block Design (middle), Latin Square Design (right). In each design each of five treatments (colours) is replicated 5 times. Note how the randomisation was done: CRD: complete randomisation; RBD: randomisation of treatments to experimental units within blocks; LSD: each treatment once in each column, once in each row. The latter two are forms of restricted randomisation (as opposed to complete randomisation).]

⁵ Sexual Activity and the Lifespan of Male Fruitflies. L. Partridge and M. Farquhar. Nature, 1981, 580-581. The data can be found in R package faraway (fruitfly).

Randomised Block Design Example: Executives and Risk

Executives were exposed to one of 3 methods of quantifying the maximum risk premium they would be willing to pay to avoid
uncertainty in a business decision. The three methods are: 1) U:
utility method, 2) W: worry method, 3) C: comparison method.
After using the assigned method, the subjects were asked to state
their degree of confidence in the method on a scale from 0 (no
confidence) to 20 (highest confidence).
Table 1.1: Layout and randomization for the premium risk experiment.

                              Experimental Unit
Block                          1     2     3
1 (oldest executives)          C     W     U
2                              C     U     W
3                              U     W     C
4                              W     U     C
5 (youngest executives)        W     C     U
The experimenters blocked for age of the executives. This is a
reasonable thing to do if they expected, for example, lower confidence in older executives, i.e. different response due to inherent
properties of the experimental units (which here are the executives). The blocking factor is age, the treatment factor is the method
of quantifying risk premium, the response is the confidence in
method. The executives in one block are of a similar age. The three
methods were randomly assigned to the three experimental units in
each block.
Latin Square Design Example: Traffic Light Signal Sequences
A traffic engineer conducted a study to compare the total unused
red light time for five different traffic light signal sequences. The
experiment was conducted with a Latin square design in which the
two blocking factors were (1) five randomly selected intersections
and (2) five time periods.
                                  Time Period
Intersection    1          2          3          4          5          Mean
1               15.2 (A)   33.8 (B)   13.5 (C)   27.4 (D)   29.1 (E)   23.80
2               16.5 (B)   26.5 (C)   19.2 (D)   25.8 (E)   22.7 (A)   22.14
3               12.1 (C)   31.4 (D)   17.0 (E)   31.5 (A)   30.2 (B)   24.44
4               10.7 (D)   34.2 (E)   19.5 (A)   27.2 (B)   21.6 (C)   22.64
5               14.6 (E)   31.7 (A)   16.7 (B)   26.3 (C)   23.8 (D)   22.62
Mean            13.82      31.52      17.18      27.64      25.48      Ȳ... = 23.128

Table 1.2: Traffic light signal sequences. The five signal sequence treatments are shown in parentheses as A, B, C, D, E. The numerical values are the unused red light times in minutes.
1.6 Design for Observational Studies / Sampling Designs
In observational studies, design refers to how the sampling is done
(on the explanatory variables), and is referred to as sampling design.
The aim is, as in experimental studies, to achieve the best possible
estimates of effects. The methods used to analyse data from observational or experimental studies are often the same. The conclusions will differ in that no causality can be inferred in observational
studies.
1.7 Methods of Randomisation
Randomisation refers to the random allocation of treatments to the
experimental units. This can be done using random number tables or
using a computer or calculator to generate random numbers. When
assigning treatments to experimental units, each permutation must
be equally likely, i.e. each possible assignment of treatments to
experimental units must be equally likely.
Randomisation is crucial for conclusions drawn from the experiment to be correct, unambiguous and defensible!
For completely randomised designs the experimental units are
not blocked, so the treatments (and their replicates) are assigned
completely at random to all experimental units available (hence
completely randomised).
If there are blocks, the randomisation of treatments to experimental units occurs in each block.
In Practical 1 you will learn how to use R for randomisation.
Using random numbers for a CRD (completely randomised design)
This method requires a sequence of random numbers (from a calculator or computer; in the old days printed random number tables
were available). To randomly assign 2 treatments (A and B) to 12
experimental units, 6 experimental units per treatment, you can:
1. Decide to let odd numbers ≡ treatment A, and even numbers ≡
treatment B.
Random number   79  76  49  31  93  54  17  36  91  50  11  38  87  79  97  69  22
Treatment       A   B   A   A   A   B   A   B   A   B   -   B   -   -   -   -   B

(Once a treatment has been assigned to six units, further numbers pointing to that treatment are skipped.)
2. or decide to assign treatment A for two-digit numbers 00 - 49,
and treatment B for two-digit numbers 50 - 99.
Random number   67  49  72  48  95  39  03  22  46  87  71  16  70
Treatment       B   A   B   A   B   A   A   A   A   B   B   A   B
Using random numbers for a RBD
Say we wish to randomly assign 12 patients to 2 treatments in 3
blocks of 4 patients each. The different (distinct) orderings of four
patients, two receiving treatment A, two receiving treatment B are:
Ordering     to be chosen if random number between
A A B B      01 - 10
A B B A      11 - 20
A B A B      21 - 30
B A A B      31 - 40
B A B A      41 - 50
B B A A      51 - 60

Ignore numbers 61 to 99.

Random numbers drawn: 96, 09, 58, 89, 23, 71, 38 (numbers above 60 are ignored, and once all three blocks have been assigned the remaining numbers are not needed):

09 → A A B B → block 1
58 → B B A A → block 2
23 → A B A B → block 3
Coins, cards, pieces of paper drawn from a bag can also be used
for randomisation.
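The same randomisations are quick to do in R (as in Practical 1). The sketch below, using base R's sample(), is one way to do it; the object names are our own.

    set.seed(1)   # only so that the allocation can be reproduced

    ## CRD: 2 treatments (A and B), 6 experimental units each
    units <- 1:12
    crd <- sample(rep(c("A", "B"), each = 6))    # random permutation of the treatment labels
    data.frame(unit = units, treatment = crd)

    ## RBD: 2 treatments in 3 blocks of 4 patients (two A's and two B's per block)
    rbd <- replicate(3, sample(rep(c("A", "B"), each = 2)))
    colnames(rbd) <- paste("block", 1:3)
    rbd                                          # each column is the randomisation within one block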
Randomising time or order
To prevent systematic changes over time from influencing results
one must ensure that the order of the treatments over time is random. If a clear time effect is suspected, it might be best to block for
time. In any case, randomisation over time helps to ensure that the
time effect is approximately the same, on average, in each treatment
group, i.e. treatment effects are not confounded with time.
For the same reason one would block spatially arranged experimental units, or if this is not possible, randomise treatments in
space.
1.8 Summary: How to design an experiment
The best design depends on the given situation. To choose an appropriate design, we can start with the following questions:
1. Treatment Structure: What is the research question? What is the
response? What are the treatment factors? What levels for each
treatment factor should I choose? Do I need a control treatment?
Am I interested in interactions?
2. Experimental Units: How many replicates (per treatment) do I
need? How many experimental units can I get, afford?
3. Blocking: Do I need to block the experimental units? Do I need
to control other unwanted sources of variation? Which factors
should be kept constant, etc.?
4. Other considerations: ethical, time, cost. Will I have enough
power to find the effects I am interested in?
The treatment structure, blocking factors, and number of replicates required are the most important determinants of the appropriate design. Lastly, we need to randomise treatments to experimental units according to this design.
2 Single Factor Experiments: Three Basic Designs
This chapter gives a brief overview of the three basic designs.
2.1 The Completely Randomised Design (CRD)
This design is used when the experimental units are homogeneous.
The experimental units will of course differ, but not so that they
can be split into clear groups, i.e. no blocks seem necessary. This
is before the treatments are applied. Each treatment is randomly
assigned to r experimental units. Each unit is equally likely to
receive any of the a treatments. There are N = r × a experimental
units.
Some advantages of completely randomized designs are:
1. Easy to lay out.
2. Simple analysis even when there are unequal numbers of replicates of some treatments.
3. Maximum degrees of freedom for error.
An example of a CRD with 4 treatments, A, B, C and D randomly applied to 12 homogeneous experimental units:
Units        1   2   3   4   5   6   7   8   9   10   11   12
Treatments   B   C   A   A   C   A   B   D   C   D    B    D
2.2 Randomised Block Design (RBD)
This design is used if the experimental material is not homogeneous but can be divided into blocks of homogeneous material.
Before the treatments are applied there are no known differences
between the units within a block, but there may be very large differences between units from different blocks. Treatments are assigned
at random to units within a block.
In a complete block design each treatment occurs at least once
in each block (randomised complete block design). If there are
not sufficient units within a block to allow all the treatments to be
applied an Incomplete Block Design can be used (not covered here,
see Hicks & Turner (1999) for details).
Randomised block designs are easy to design and analyse. Usually, the number of experimental units in each block is the same
as the number of treatments. Blocking allows more sensitive comparisons of treatment effects. On the other hand, missing data can
cause problems in the analysis.
Any known variability in the experimental procedure or the
experimental units can be controlled for by blocking. A block could
be:
• A day's output of a machine.
• A litter of animals.
• A single subject.
• A single leaf on a plant.
• Time of day or weekday.

Table 2.1: Example of a randomised layout for 4 treatments applied in 3 blocks.

Block 1   C   A   B   D
Block 2   B   C   A   D
Block 3   D   B   C   A
2.3 The Latin Square Design
A Latin Square Design allows blocking for two sources of variation,
without having to increase the number of experimental units. Call
these sources, row variation and column variation. The p2 experimental units are grouped by their row and column position. The p
treatments are assigned so that they occur exactly once in each row
and in each column.
Table 2.2: A 4 × 4 Latin Square Design.

      C1   C2   C3   C4
R1    A    B    C    D
R2    B    C    D    A
R3    C    D    A    B
R4    D    A    B    C
Randomisation
The Latin Square is chosen at random from the set of standard
Latin squares of order p. Then a random permutation of rows is
chosen, a random permutation of columns is chosen, and finally,
the letters A, B, C, ..., are randomly assigned to the treatments.
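A sketch of this randomisation in base R is given below. For simplicity it starts from one standard (cyclic) Latin square rather than sampling among all standard squares of order p; the object names are our own.

    set.seed(42)
    p <- 4

    # a standard cyclic Latin square of order p: entry (i, j) is ((i + j) mod p) + 1
    std <- outer(0:(p - 1), 0:(p - 1), function(i, j) (i + j) %% p) + 1

    # randomly permute the rows and the columns (this preserves the Latin square property)
    sq <- std[sample(p), sample(p)]

    # randomly assign the treatment labels A, B, C, ... to the p symbols
    treatments <- sample(LETTERS[1:p])
    design <- matrix(treatments[sq], nrow = p)
    rownames(design) <- paste("row block", 1:p)
    colnames(design) <- paste("column block", 1:p)
    design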
Latin square designs are efficient in the number of experimental units used when there are two blocking factors. However, the
number of treatments must equal the number of row blocks and the
number of column blocks. One experimental unit must be available at every combination of the two blocking factors. Also, the assumption of no interactions between treatment and blocking factors should hold.

Table 2.3: Latin square designs can be used to block for time periods and order of presentation of treatments.

           Order
Period 1   A B C D
Period 2   B C D A
Period 3   C D A B
Period 4   D A B C
2.4 An Example
This example gives a brief overview of how the chosen design will
affect analysis, and conclusions. The ANOVA tables look similar to
the regression ANOVA tables you are used to, and are interpreted
in the same way. The only difference is that we have a row for each
treatment factor and for each blocking factor.
An experiment is conducted to compare 4 methods of treating
motor car tyres. The treatments (methods), labelled A, B, C and D
are assigned to 16 tyres, four tyres receiving A, four others receiving B, etc.. Four cars are available, treated tyres are placed on each
car and the tread loss after 20 000 km is measured.
Consider design 1 in Table 2.4.
This design is terrible! Apparent treatment differences could also
be car differences: Treatment and car effects are confounded.
We could use a Completely Randomized Design (CRD). We
would assign the treated tyres randomly to the cars hoping that
differences between the cars will average out. Table 2.6 is one such
randomisation.
To test for differences between treatments, an analysis of variance (ANOVA) is used. We will present these tables here, but only
as a demonstration of what happens to the mean squared error
(MSE) when we change the design, or account for variation between
blocks.
Table 2.7 shows the ANOVA table for testing the hypothesis of
no difference between treatments, H0 : µ A = µ B = µC = µ D . There
is no evidence for differences between the tyre brands.
Is the Completely Randomised Design the best we can do? Note
that A is never used on Car 3, and B is never used on Car 1. Any
variation in A may reflect variation in Cars 1, 2 and 4. The same
remarks apply to B and Cars 2, 3 and 4. The error sum of squares
will contain this variation. Can we remove it? Yes - by blocking for
cars.
Even though we randomized, there is still a bit of confounding
(between cars and treatments) left. To remove this problem we
should block for car, and use every treatment once per car, i.e. use
a Randomised Block Design. Differences between the responses to
the treatments within a car will reflect the effect of the treatments.
Table 2.4: Car design 1.

Car 1   Car 2   Car 3   Car 4
A       B       C       D
A       B       C       D
A       B       C       D
A       B       C       D

Table 2.5: Car design 2: Completely Randomised Design. The numbers in brackets are the observed tread losses.

Car 1    Car 2    Car 3    Car 4
C(12)    A(14)    D(10)    A(13)
A(17)    A(13)    C(10)    D(9)
D(13)    B(14)    B(14)    B(8)
D(11)    C(12)    B(13)    C(9)

Table 2.6: Car design 2 (CRD), with rearranged observations.

Treatment   Tread loss
A           17   14   13   13
B           14   14   13    8
C           12   12   10    9
D           13   11   10    9

Table 2.7: ANOVA table for car design 2, CRD.

Source   df   SS   Mean Square   F stat
Brands    3   33   11.00         2.59
Error    12   51    4.25

Table 2.8: Car design 3: Randomised Block Design.

Car 1    Car 2    Car 3    Car 4
B(14)    D(11)    A(13)    C(9)
C(12)    C(12)    B(13)    D(9)
A(17)    B(14)    D(10)    B(8)
D(13)    A(14)    C(10)    A(13)
The treatment sum of squares from the RBD is the same as in
the CRD. The error sum of squares is reduced from 51 to 11.5 with
the loss of three degrees of freedom. The F-test for treatment effects
now shows evidence for differences between the tyre brands (Table
2.10).
Another source of variation would be from the wheels on which
a treated tyre was placed. To have a tyre of each type on each wheel
position on each car would mean that we would need 64 tyres for
the experiment, rather expensive! Using a Latin Square Design
makes it possible to put a treated tyre in each wheel position and
use all four treatments on each car (Table 2.11).
Within this arrangement A appears in each car and in each wheel
position, and the same applies to B and C and D, but we have not
had to increase the number of tyres needed.
Blocking for cars and wheel position has reduced the error sum
of squares to 6.00 with the loss of 3 degrees of freedom (Table 2.12).
The above example illustrates how the design can change results. In reality one cannot change the analysis after the experiment
was run. The design determines the model and all the above considerations, whether one should block by car and wheel position
have to be carefully thought through at the planning stage of the
experiment.
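The ANOVA tables above can be reproduced in R from the tread-loss data in Table 2.9. Below is a sketch (our own variable names) that fits the CRD analysis and then the RBD analysis, showing how the error sum of squares drops from 51 to 11.5 once cars are blocked for.

    loss <- c(17, 14, 13, 13,    # treatment A on cars 1-4
              14, 14, 13,  8,    # treatment B
              12, 12, 10,  9,    # treatment C
              13, 11, 10,  9)    # treatment D
    tyre <- factor(rep(c("A", "B", "C", "D"), each = 4))
    car  <- factor(rep(1:4, times = 4))

    summary(aov(loss ~ tyre))        # CRD analysis: SS(tyre) = 33, SS(error) = 51
    summary(aov(loss ~ tyre + car))  # RBD analysis: SS(car) = 39.5, SS(error) = 11.5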
References
1. Hicks CR, Turner Jr KV. (1999). Fundamental Concepts in the Design of Experiments. 5th edition. Oxford University Press.
Table 2.9: Rearranged data for car design 3, RBD.

            Tread loss
Treatment   Car 1   Car 2   Car 3   Car 4
A           17      14      13      13
B           14      14      13       8
C           12      12      10       9
D           13      11      10       9

Table 2.10: ANOVA table for car design 3, RBD.

Source   df   SS     Mean Square   F stat
Tyres     3   33     11.00          8.59
Cars      3   39.5   13.17         10.28
Error     9   11.5    1.28

Table 2.11: Car design 4: Latin Square Design.

Wheel position   Car 1   Car 2   Car 3   Car 4
1                A       B       C       D
2                B       C       D       A
3                C       D       A       B
4                D       A       B       C

Table 2.12: ANOVA table for car design 4, LSD.

Source   df   SS     Mean Square   F stat
Tyres     3   33     11.00         11.00
Cars      3   39.5   13.17         13.17
Wheels    3    5.5    1.83
Error     6    6.0    1.00
3 The Linear Model for Single-Factor Completely Randomized Design Experiments
3.1 The ANOVA linear model
A single-factor completely randomised design experiment results
in groups of observations, with (possibly) different means. In a
regression context, one could write a linear model for such data as
yi = β0 + β1 L1i + β2 L2i + . . . + ei
with L1, L2, etc. dummy variables indicating whether response i
belongs to group j or not.
However, when dealing with only categorical explanatory variables, as is typical in experimental data, it is more common to write
the above model in the following form:
Yij = µ + αi + eij                (3.1)
The dummy variables are implicit but not written. The two models are equivalent in the sense that they make exactly the same
assumptions and describe exactly the same structure of the data.
Model 3.1 is sometimes referred to as an ANOVA model, as opposed to a regression model.
Both models can be written in matrix notation as Y = Xβ + e (see
next section).
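In R these dummy variables never have to be typed in: passing a factor to lm() or aov() is enough, and model.matrix() shows the design matrix that is implicitly used. A minimal sketch with an arbitrary three-level factor (our own labels):

    group <- factor(rep(c("g1", "g2", "g3"), times = c(4, 4, 5)))
    model.matrix(~ group)   # an intercept column plus 0/1 dummy columns for g2 and g3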
3.2 Least Squares Parameter Estimates
Example: Three different methods of instruction in speed-reading
were to be compared. The comparison is made by testing the comprehension of a subject at the end of one week’s training in the
given method. Thirteen students volunteered to take part. Four
were randomly assigned to Method 1, 4 to Method 2 and 5 to
Method 3.
After one week’s training all students were asked to read an
identical passage on a film, which was delivered at a rate of 300
words per minute. Students were then asked to answer questions
on the passage read and their marks were recorded. They were as
follows:
            Method 1   Method 2   Method 3
            82         71         91
            80         79         93
            81         78         84
            83         74         90
                                  88
Mean        81.5       75.5       89.2
Std. Dev.   1.29       3.70       3.42
We want to know whether comprehension is higher for some
of these methods of speed-reading, and if so, which methods work
better.
When we have models where all explanatory variables (factors)
are categorical (as in experiments), it is common to write them as
follows. You will see later why this parameterisation is convenient
for such studies.
Yij = µ + αi + eij
where Yij is the jth observation with the ith method
µ is the overall or general mean
αi is the effect of the ith method / treatment
Here αi = µi − µ, i.e. the change in mean response with treatment i relative to the overall mean. µi is the mean response with
treatment i: µi = µ + αi .
By effect we mean here: the change in response with the particular
treatment compared to the overall mean. For categorical variables, effect
in general refers to a change in response relative to a baseline category or an overall mean. For continuous explanatory variables, e.g.
in regression models, we also talk about effects, and then mostly
mean the change in mean response per unit increase in x, the explanatory variable.
Note that we need 2 subscripts on the Y - one to identify the
group and the other to identify the subject within the group. Then:
Y1j = µ + α1 + e1j
Y2j = µ + α2 + e2j
Y3j = µ + α3 + e3j
Note that there are 4 parameters, but only 3 groups or treatments.
Linear model in matrix form
To put our model and data into matrix form we string out the data
for the groups into an N × 1 vector, where the first n1 elements are
the observations on Group 1, the next n2 elements the observations on Group 2, etc. Then
the linear model, Y = Xβ + e, has the form

    [ Y11 ]   [ 1 1 0 0 ]             [ e11 ]
    [ Y12 ]   [ 1 1 0 0 ]             [ e12 ]
    [ Y13 ]   [ 1 1 0 0 ]             [ e13 ]
    [ Y14 ]   [ 1 1 0 0 ]   [ µ  ]    [ e14 ]
    [ Y21 ]   [ 1 0 1 0 ]   [ α1 ]    [ e21 ]
Y = [ Y22 ] = [ 1 0 1 0 ] × [ α2 ]  + [ e22 ]
    [ Y23 ]   [ 1 0 1 0 ]   [ α3 ]    [ e23 ]
    [ Y24 ]   [ 1 0 1 0 ]             [ e24 ]
    [ Y31 ]   [ 1 0 0 1 ]             [ e31 ]
    [ Y32 ]   [ 1 0 0 1 ]             [ e32 ]
    [ Y33 ]   [ 1 0 0 1 ]             [ e33 ]
    [ Y34 ]   [ 1 0 0 1 ]             [ e34 ]
    [ Y35 ]   [ 1 0 0 1 ]             [ e35 ]
Note that
1. The entries of X are either 0 or 1 (because here all terms in the
structural part of the model are categorical). X is often called the
design matrix because it describes the design of the study, i.e. it
describes which of the factors in the model contributed to each
of the response values.
2. The sum of the last three columns of X adds up to the first column. Thus X is a 13 × 4 matrix with column rank of 3. The
matrix X′ X will be a 4 × 4 matrix of rank 3.
3. From row 1: Y11 = µ + α1 + e11 . What is Y32 ?
To find estimates for the parameters we can use the method of
least squares or maximum likelihood, as for regression. The least
squares estimates minimise the error sum of squares:
SSE = (Y − Xβ)′(Y − Xβ) = ∑i ∑j (Yij − µ − αi)²
β′ = (µ, α1 , α2 , α3 ) and the estimates are given by the solution to
the normal equations
X ′ Xβ = X ′ Y
Since the sum of the last three columns of X is equal to the first
column there is a linear dependency between columns of X, and
X ′ X is a singular matrix, so we cannot write
β̂ = ( X ′ X )−1 X ′ Y
The set of equations X′Xβ = X′Y is consistent, but has an
infinite number of solutions. Note that we could have used only
3 parameters µ1 , µ2 , µ3 , and we actually only have enough information to estimate these 3 parameters, because we only have 3
group means. Instead we have used 4 parameters, because the
parametrization using the effects αi is more convenient in the analysis of variance, especially when calculating treatment sums of
squares (see later). However, we also know that
Nµ = n1 µ1 + n2 µ2 + n3 µ3 = n1 µ + n1 α1 + n2 µ + n2 α2 + n3 µ + n3 α3 .
The RHS becomes (n1 + n2 + n3 )µ + ∑i ni αi = Nµ + ∑i ni αi . From
this it follows that ∑i ni αi = 0. The normal equations don't know
this so we add this additional equation (to calculate the fourth
parameter from the other three) as a constraint in order to get the
unique solution.
In other words, if we have ∑i ni αi = 0 then the αi ’s have exactly
the meaning intended above: they measure the difference in mean
response with treatment i compared to the overall mean; µi =
µ + αi .
We could define the αi ’s differently, by using a different constraint, e.g.
Yij = µ + αi + eij ,
α1 = 0
Here the mean for treatment 1 is used as a reference category
and equals µ. Then α2 and α3 measure the difference in mean between group 2 and group 1 and between group 3 and group 1
respectively. This parametrization is the one most common in regression, e.g. when you add a categorical variable in a regression
model the β estimates are defined like this: as differences relative to
the first/baseline/reference category.
Now, back to a solution for the normal equations:
1. A constraint must be applied to obtain a particular solution β̂.
2. The constraint must remove the linear dependency, so it cannot
be any linear combination of the rows of X. Denote the constraint
by Cβ = 0.
3. The estimate of β subject to the given constraint is unique. For
this reason the constraint should be specified as part of the
model. So we write
Yij = µ + αi + eij
∑ ni αi = 0
or in matrix notation
Y = Xβ + e
Cβ = 0
where no linear combination of the rows of C is a linear combination of the rows of X.
4. Although the estimates of β depend on the constraints used, the
following quantities are unique.
(a) The fitted values Ŷ = X β̂.
(b) The Regression or Treatment Sum of Squares.
(c) The Error Sum of Squares, (Y − X β̂)′ (Y − X β̂).
(d) All linear combinations, ℓ′ β̂, where ℓ = L′ X (predictions fall
under this category).
These quantities are called estimable functions of the parameters.
Speed-reading Example

    [ 82 ]   [ 1 1 0 0 ]             [ e11 ]
    [ 80 ]   [ 1 1 0 0 ]             [ e12 ]
    [ 81 ]   [ 1 1 0 0 ]             [ e13 ]
    [ 83 ]   [ 1 1 0 0 ]   [ µ  ]    [ e14 ]
    [ 71 ]   [ 1 0 1 0 ]   [ α1 ]    [ e21 ]
Y = [ 79 ] = [ 1 0 1 0 ] × [ α2 ]  + [ e22 ]  = Xβ + e
    [ 78 ]   [ 1 0 1 0 ]   [ α3 ]    [ e23 ]
    [ 74 ]   [ 1 0 1 0 ]             [ e24 ]
    [ 91 ]   [ 1 0 0 1 ]             [ e31 ]
    [ 93 ]   [ 1 0 0 1 ]             [ e32 ]
    [ 84 ]   [ 1 0 0 1 ]             [ e33 ]
    [ 90 ]   [ 1 0 0 1 ]             [ e34 ]
    [ 88 ]   [ 1 0 0 1 ]             [ e35 ]
The normal equations, X′Xβ = X′Y, involve

      [ 1 1 1 1 1 1 1 1 1 1 1 1 1 ]
X′ =  [ 1 1 1 1 0 0 0 0 0 0 0 0 0 ]   and   Y′ = ( 82, 80, 81, 83, 71, 79, 78, 74, 91, 93, 84, 90, 88 ).
      [ 0 0 0 0 1 1 1 1 0 0 0 0 0 ]
      [ 0 0 0 0 0 0 0 0 1 1 1 1 1 ]

Multiplying out X′Xβ = X′Y gives

[ 13  4  4  5 ] [ µ  ]   [ 1074 ]   [ ∑ij Yij ]
[  4  4  0  0 ] [ α1 ] = [  326 ] = [ ∑j Y1j  ]
[  4  0  4  0 ] [ α2 ]   [  302 ]   [ ∑j Y2j  ]
[  5  0  0  5 ] [ α3 ]   [  446 ]   [ ∑j Y3j  ]

The sum of the last three columns of X′X equals column 1, hence the columns are linearly dependent. So:

• X′X is a 4 × 4 matrix with rank 3.
• X′X is singular.
• (X′X)⁻¹ does not exist.
• There are an infinite number of solutions that satisfy the equations! To find the particular solution we require, we add the constraint, which defines how the parameters are related to each other.
The effect of different constraints on the solution to the normal equations
We illustrate the effect of different sets of constraints on the least squares estimates using the speed-reading example. The normal equations are:

         [ 13  4  4  5 ] [ µ  ]   [ 1074 ]
X′Xβ  =  [  4  4  0  0 ] [ α1 ] = [  326 ]  = X′Y
         [  4  0  4  0 ] [ α2 ]   [  302 ]
         [  5  0  0  5 ] [ α3 ]   [  446 ]
The sum-to-zero constraint ∑ ni αi = 0:
This constraint can be written as
Cβ = 0µ + 4α1 + 4α2 + 5α3 = 0
Using this constraint the normal equations become

[ 13  0  0  0 ] [ µ  ]   [ 1074 ]
[  4  4  0  0 ] [ α1 ] = [  326 ]
[  4  0  4  0 ] [ α2 ]   [  302 ]
[  5  0  0  5 ] [ α3 ]   [  446 ]
and their solution is
µ̂ = 82.62
α̂1 = −1.12
α̂2 = −7.12
α̂3 = 6.58

β̂′X′Y = ( 82.62  −1.12  −7.12  6.58 ) ( 1074, 326, 302, 446 )′ = 89153.2

β̂′X′Y = SSmean + SStreatment = ∑i ∑j (Ȳ·· − 0)² + ∑i ∑j (Ȳi· − Ȳ··)². This assumes that the total sum of squares was calculated as ∑i Yi′Yi (uncorrected). β̂′X′Y is used here because it is easier to calculate by hand than the usual treatment sum of squares β̂′X′Y − (1/N)(∑ij Yij)².
The error sum of squares is (Y − X β̂)′ (Y − X β̂) = 92.8.
The fitted values are
Ŷ1j = µ̂ + α̂1 = 81.5 in Group 1
Ŷ2j = µ̂ + α̂2 = 75.5 in Group 2
Ŷ3j = µ̂ + α̂3 = 89.2 in Group 3
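These estimates can be checked directly in R, since under the constraint ∑ ni αi = 0 they are just the group means measured relative to the overall mean (a sketch, with our own object names):

    score  <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
    method <- factor(rep(1:3, times = c(4, 4, 5)))

    mu_hat    <- mean(score)                            # 82.62
    alpha_hat <- tapply(score, method, mean) - mu_hat   # -1.12, -7.12, 6.58
    mu_hat + alpha_hat                                  # fitted values 81.5, 75.5, 89.2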
The corner-point constraint α1 = 0:
This constraint is important as it is the one used most frequently
for regression models with dummy or categorical variables, e.g.
regression models fitted in R.
Now

Cβ = ( 0  1  0  0 ) ( µ, α1, α2, α3 )′ = α1 = 0
This constraint is equivalent to removing the α1 equation from the model, so we strike out the row and column of X′X corresponding to α1, and the normal equations become

[ 13  4  5 ] [ µ  ]   [ 1074 ]
[  4  4  0 ] [ α2 ] = [  302 ]
[  5  0  5 ] [ α3 ]   [  446 ]

with solution
µ̂ = 81.5
α̂1 = 0
α̂2 = −6.0
α̂3 = 7.7

β̂′X′Y = ( 81.5  0  −6  7.7 ) ( 1074, 326, 302, 446 )′ = 89153.2
The error sum of squares is (Y − X β̂)′ (Y − X β̂) = 92.8
The fitted values are
Ŷ1j = µ̂ = 81.5 in Group 1
Ŷ2j = µ̂ + α̂2 = 75.5 in Group 2
Ŷ3j = µ̂ + α̂3 = 89.2 in Group 3
which are the same as previously. However, the interpretation
of the parameter estimates is different. µ is the mean of treatment
1, α2 is the difference in means between treatment 2 and treatment
1, etc.. Treatment 1 is the baseline or reference category. This is the
parametrization typically used when fitting regression models, e.g.
in R, which calls it ’treatment contrasts’.
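A quick check in R: the default treatment-contrast parameterisation of lm() reproduces exactly these corner-point estimates (sketch, with our own object names):

    score  <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
    method <- factor(rep(1:3, times = c(4, 4, 5)))

    fit <- lm(score ~ method)
    coef(fit)           # (Intercept) 81.5, method2 -6.0, method3 7.7
    sum(resid(fit)^2)   # error sum of squares = 92.8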
The constraint µ = 0 will result in the cell means model: yij = αi + eij, i.e. yij = µi + eij.
We summarise the effects of using different constraints in the
table below:
Model    Constraint     µ̂      α̂1     α̂2     α̂3     Ŷ1j    Ŷ2j    Ŷ3j    β̂′X′Y     Error SS
µ + αi   ∑ ni αi = 0    82.6   -1.1   -7.1    6.6    81.5   75.5   89.2   89153.2   92.8
µ + αi   α1 = 0         81.5    0     -6      7.7    81.5   75.5   89.2   89153.2   92.8
αi       µ = 0           0     81.5   75.5   89.2    81.5   75.5   89.2   89153.2   92.8
We will be using almost exclusively the sum-to-zero constraint
as this has a convenient interpretation and connection to sums of
squares, and the analysis of variance.
Design matrices of less than full rank
If the design matrix X has rank r less than p (number of parameters), there is not a unique solution for β. There are three ways to
find a solution:
1. Reducing the model to one of full rank.
2. Finding a generalized inverse ( X ′ X )− .
3. Imposing identifiability constraints.
To reduce the model to one of full rank we would reduce the
parameters to µ, α2, α3, . . ., with α1 implicitly set to zero¹.

¹ This is what R uses by default in its lm() function (corner-point constraint).

We won't deal with generalized inverses in this course.
To impose the identifiability constraints we write the constraint
as Hβ = 0. And then solve the augmented normal equations:
X′Xβ = X′Y   and   H′Hβ = 0

[ X′X ]        [ X′Y ]
[     ]  β  =  [     ]
[ H′H ]        [  0  ]

(X′X + H′H) β = X′Y

β̂ = (X′X + H′H)⁻¹ X′Y
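For the speed-reading data this recipe can be checked numerically in R. The sketch below builds X and H by hand (our own object names) and solves (X′X + H′H)β = X′Y, recovering the sum-to-zero estimates obtained earlier.

    score <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
    X <- cbind(1,
               rep(c(1, 0, 0), times = c(4, 4, 5)),
               rep(c(0, 1, 0), times = c(4, 4, 5)),
               rep(c(0, 0, 1), times = c(4, 4, 5)))
    H <- matrix(c(0, 4, 4, 5), nrow = 1)   # the constraint 4*alpha1 + 4*alpha2 + 5*alpha3 = 0

    beta <- solve(t(X) %*% X + t(H) %*% H, t(X) %*% score)
    round(beta, 2)   # 82.62, -1.12, -7.12, 6.58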
Parameter estimates for the single-factor completely randomised design
Suppose an experiment has been conducted as a completely randomised design: N subjects were randomly assigned to a treatments, where the ith treatment has ni subjects, with ∑ ni = N, and
Yij = jth observation in ith treatment group. The data have the form:
             Group
             I      II     ...    a
             Y11    Y21    ...    Ya1
             Y12    Y22    ...    Ya2
             ...    ...           ...
             Y1n1   Y2n2   ...    Yana

Means        Ȳ1·    Ȳ2·    ...    Ȳa·      Ȳ··
Totals       Y1·    Y2·    ...    Ya·      Y··
Variances    s1²    s2²    ...    sa²
The first subscript is for the treatment group, the second for
the replication. The group totals and means are expressed in the
following dot notation:
group total     Yi· = ∑_{j=1}^{ni} Yij
group mean      Ȳi· = Yi· / ni

and

overall total   Y·· = ∑i ∑j Yij
overall mean    Ȳ·· = Y·· / N
Let Yij = jth observation in ith group. The model is:
Yij = µ + αi + eij
∑ ni αi = 0
where
µ    = general mean
αi   = effect of the ith level of treatment factor A
eij  = random error distributed as N(0, σ²).
The model can be written in matrix notation as

Y = Xβ + e,   with   e ∼ N(0, σ² I)   and   Cβ = 0

where

      [ Y11  ]     [ 1  1  0  ...  0 ]              [ e11  ]
      [ Y12  ]     [ 1  1  0  ...  0 ]              [ e12  ]
      [  .   ]     [ .  .  .       . ]   [ µ  ]     [  .   ]
      [ Y1n1 ]     [ 1  1  0  ...  0 ]   [ α1 ]     [ e1n1 ]
      [ Y21  ]     [ 1  0  1  ...  0 ]   [ α2 ]     [ e21  ]
Y  =  [  .   ]  =  [ .  .  .       . ] × [ .  ]  +  [  .   ]  = Xβ + e
      [ Y2n2 ]     [ 1  0  1  ...  0 ]   [ .  ]     [ e2n2 ]
      [  .   ]     [ .  .  .       . ]   [ αa ]     [  .   ]
      [ Ya1  ]     [ 1  0  0  ...  1 ]              [ ea1  ]
      [  .   ]     [ .  .  .       . ]              [  .   ]
      [ Yana ]     [ 1  0  0  ...  1 ]              [ eana ]

and the constraint as

Cβ = ( 0  n1  n2  ...  na ) ( µ, α1, α2, ..., αa )′ = 0
There are a + 1 parameters subject to 1 constraint. To estimate
the parameters we minimize the residual/error sum of squares
S = (Y − Xβ)′(Y − Xβ) = ∑ij (Yij − µ − αi)²
where ∑ij = ∑i ∑ j . Let’s put numbers to all of this and assume
a = 3, n1 = 4, n2 = 4 and n3 = 5. Then
    [ Y11 ]   [ 1 1 0 0 ]
    [ Y12 ]   [ 1 1 0 0 ]
    [ Y13 ]   [ 1 1 0 0 ]
    [ Y14 ]   [ 1 1 0 0 ]   [ µ  ]
    [ Y21 ]   [ 1 0 1 0 ]   [ α1 ]
    [ Y22 ] = [ 1 0 1 0 ] × [ α2 ]  +  e
    [ Y23 ]   [ 1 0 1 0 ]   [ α3 ]
    [ Y24 ]   [ 1 0 1 0 ]
    [ Y31 ]   [ 1 0 0 1 ]
    [ Y32 ]   [ 1 0 0 1 ]
    [ Y33 ]   [ 1 0 0 1 ]
    [ Y34 ]   [ 1 0 0 1 ]
    [ Y35 ]   [ 1 0 0 1 ]

subject to Cβ = 0, where

Cβ = ( 0  4  4  5 ) ( µ, α1, α2, α3 )′ = 0.

Then

        [ 13  4  4  5 ] [ µ  ]   [ 13µ + 4α1 + 4α2 + 5α3 ]
X′X β = [  4  4  0  0 ] [ α1 ] = [  4µ + 4α1             ]
        [  4  0  4  0 ] [ α2 ]   [  4µ + 4α2             ]
        [  5  0  0  5 ] [ α3 ]   [  5µ + 5α3             ]

and

       [ ∑ij Yij ]   [ N Ȳ··  ]
X′Y =  [ ∑j Y1j  ] = [ n1 Ȳ1· ]
       [ ∑j Y2j  ]   [ n2 Ȳ2· ]
       [ ∑j Y3j  ]   [ n3 Ȳ3· ]

which results in the normal equations:

13µ + 4α1 + 4α2 + 5α3 = 13Ȳ··
 4µ + 4α1             =  4Ȳ1·
 4µ + 4α2             =  4Ȳ2·
 5µ + 5α3             =  5Ȳ3·

The constraint says that

0µ + 4α1 + 4α2 + 5α3 = 0,

which implies that 13µ = 13Ȳ·· , i.e. µ̂ = Ȳ·· , and that α̂i = Ȳi· − Ȳ·· .
So to summarise, the normal equations are X′Xβ = X′Y, where

        [ N   n1  n2  ...  na ] [ µ  ]          [ N Ȳ··  ]
        [ n1  n1  0   ...  0  ] [ α1 ]          [ n1 Ȳ1· ]
X′Xβ =  [ n2  0   n2  ...  0  ] [ .  ]  = X′Y = [ n2 Ȳ2· ]
        [ .   .   .        .  ] [ .  ]          [  .     ]
        [ na  0   0   ...  na ] [ αa ]          [ na Ȳa· ]

Using the constraint

Cβ = ( 0  n1  n2  ...  na ) ( µ, α1, ..., αa )′ = 0

the set of normal equations becomes

Nµ            = N Ȳ··
n1µ + n1α1    = n1 Ȳ1·
  ...
naµ + naαa    = na Ȳa·
Solving these equations gives the least squares estimators

µ̂ = Ȳ··,    µ̂i = Ȳi·,    and    α̂i = Ȳi· − Ȳ··
for i = 1, . . . , a. Parameter estimation for many of the standard
experimental designs is straightforward! From general theory we
know that the above are unbiased estimators of µ and the αi ’s. An
unbiased estimator of σ2 is found by using the minimum value of
the residual sum of squares, SSE, and dividing by its degrees of
freedom.
min(SSE) = ∑_ij (Yij − µ̂ − α̂i)² = ∑_ij (Yij − Ȳi·)²

For a single observation (taking the balanced case ni = n),

E(Yij − Ȳi·)² = Var(Yij − Ȳi·) = σ²(1 − 1/n)

[Hint: Cov(Yij, (1/n)∑_j Yij) = σ²/n.]

Then

E[SSE] = E[ ∑_i ∑_j (Yij − Ȳi·)² ] = a n σ²(1 − 1/n) = a(n − 1)σ²

E[MSE] = E[ SSE/(N − a) ] = σ²

So

s² = (1/(N − a)) ∑_ij (Yij − Ȳi·)²

is an unbiased estimator for σ², with (N − a) degrees of freedom since we have N observations and (a + 1) parameters subject to 1 constraint. This quantity is also called the Mean Square for Error or MSE.
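The algebra above can be checked numerically. The following R sketch uses made-up response values (only the group sizes 4, 4 and 5 match the worked example); it builds the design matrix, verifies that the closed-form estimates satisfy both the normal equations and the constraint, and computes the MSE.

# Hypothetical data; only the layout (a = 3 groups of sizes 4, 4, 5) matches the example above
y     <- c(12, 14, 13, 15,  18, 17, 19, 20,  25, 23, 24, 26, 22)
group <- factor(rep(1:3, times = c(4, 4, 5)))
ni    <- tabulate(group); N <- length(y)

X   <- cbind(mu = 1, model.matrix(~ group - 1))   # 13 x 4 design matrix (mu, alpha1, alpha2, alpha3)
XtX <- crossprod(X)                               # X'X (rank 3, so singular)
XtY <- crossprod(X, y)                            # X'Y

mu.hat    <- mean(y)                              # Y-bar..
alpha.hat <- tapply(y, group, mean) - mu.hat      # Y-bar_i. minus Y-bar..
beta.hat  <- c(mu.hat, alpha.hat)

all.equal(as.vector(XtX %*% beta.hat), as.vector(XtY))  # normal equations hold
sum(ni * alpha.hat)                                     # constraint: sum of n_i alpha_i = 0

s2 <- sum((y - ave(y, group))^2) / (N - 3)              # MSE, the unbiased estimate of sigma^2

Fitting lm(y ~ group) uses a different (full-rank) parameterisation of the same model; the fitted group means and the residual mean square should agree with the quantities above.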
3.3 Standard Errors and Confidence Intervals
Mostly, the estimates we are interested in are linear combinations
of treatment means. In such cases it is relatively straightforward to
calculate the corresponding variances (of the estimates):
Var(µ̂) = Var(Ȳ··) = Var( ∑_i ∑_j Yij / N )
       = (1/N²) ∑_i ∑_j Var(Yij)
       = (1/N²) Nσ²
       = σ²/N

The estimated variance is then s²/N, where s² is the mean square for error (least squares estimate, see above).

Var(α̂i) = Var(Ȳi· − Ȳ··) = Var(Ȳi·) + Var(Ȳ··) − 2 Cov(Ȳi·, Ȳ··)

Consider

Cov(Ȳi·, Ȳ··) = Cov( Ȳi·, ∑_k (nk/N) Ȳk· ) = ∑_k (nk/N) Cov(Ȳi·, Ȳk·)

But since the groups are independent, Cov(Ȳi·, Ȳk·) is zero if i ≠ k. If i = k, then Cov(Ȳi·, Ȳk·) = Var(Ȳi·) = σ²/ni. Using this result and summing, we find Cov(Ȳi·, Ȳ··) = σ²/N. Hence

Var(α̂i) = σ²/ni + σ²/N − 2σ²/N = (N − ni)σ²/(ni N) = σ²/ni − σ²/N
Important estimates and their standard errors
A standard error is the (usually estimated) standard deviation of
an estimated quantity. It is the square root of the variance of the
sampling distribution of this estimator and is an estimate of its
precision or uncertainty.
Parameter                                    Estimate                            Standard Error
Overall mean µ                               Ȳ··                                 s/√N
Experimental error variance σ²               s² = (1/(N−a)) ∑_ij (Yij − Ȳi·)²
Effect of the ith treatment αi               Ȳi· − Ȳ··                           √( (N − ni)s² / (ni N) )
Difference between two treatments α1 − α2    Ȳ1· − Ȳ2·                           √( s²(1/n1 + 1/n2) ) = sed
Treatment mean µi = µ + αi                   Ȳi·                                 √( s²/ni )

How do we estimate σ²?

s² = (1/(N − a)) ∑_ij (Yij − Ȳi·)²
   = Mean Square for Error = MSE
   = Within Sum of Squares / (Degrees of Freedom)
   = SSresidual / dfresidual

Assuming (approximate) normality of an estimator, a confidence interval for the corresponding population parameter has the form

estimate ± t_ν^{α/2} × standard error

where t_ν^{α/2} is the upper α/2 critical value of Student's t distribution with ν degrees of freedom. The degrees of freedom of t are the degrees of freedom of s².

Speed-reading Example

Estimates, standard errors and confidence intervals:
                      Estimate         Standard Error    95% Confidence Interval
Effect (αi)
  Method I            α̂1 = −1.12       1.27
  Method II           α̂2 = −7.12       1.27
  Method III          α̂3 = 6.58        1.25
Mean (µi)
  Method I            µ̂1 =
  Method II           µ̂2 =
  Method III          µ̂3 =
Overall mean          µ̂ = 82.62        0.85
3.4 Analysis of Variance (ANOVA)
The next step is testing the hypothesis of no treatment effect: H0 :
α1 = α2 = . . . = 0. This is done by a method called Analysis of
Variance, even though we are actually comparing means.
Note that so far we have not used the assumption of eij ∼
N (0, σ2 ). The least squares estimates do not require the assumption of normal errors! However, to construct a test for the above
hypothesis we need the normality assumption. In what follows,
we assume that the errors are identically and independently distributed as N (0, σ2 ). Consequently the observations are normally
distributed, though not identically. We must check this assumption
of independent, normally distributed errors, else our test of the
above hypothesis could give a very misleading result.
Decomposition of Sums of Squares
Let’s assume Yij are data obtained from a CRD, and we are assuming model 3.1:
Yij = µ + αi + eij
In statistics, sums of squares refer to squared deviations (from a mean or expected value); e.g. the residual sum of squares is the sum of squared deviations of observed from fitted values. Let's rewrite the above model by substituting observed values and rewriting the terms as deviations from means:

Yij − Ȳ·· = (Ȳi· − Ȳ··) + (Yij − Ȳi·)
Make sure you agree with the above. Now square both sides and
sum over all N observations:
∑_i ∑_j (Yij − Ȳ··)² = ∑_i ∑_j (Yij − Ȳi·)² + ∑_i ∑_j (Ȳi· − Ȳ··)² + 2 ∑_i ∑_j (Yij − Ȳi·)(Ȳi· − Ȳ··)

The crossproduct term is zero after summation over j, since it can be written as

2 ∑_i (Ȳi· − Ȳ··) ∑_j (Yij − Ȳi·)

The second sum is the sum of the deviations of the observations in the ith group about their mean value, so this sum is zero for each i. Hence

∑_i ∑_j (Yij − Ȳ··)² = ∑_i ∑_j (Ȳi· − Ȳ··)² + ∑_i ∑_j (Yij − Ȳi·)²
So the total sum of squares partitions into two components: (1)
squared deviations of the treatment means from the overall mean,
and (2) squared deviations of the observations from the treatment
means. The latter is the residual sum of squares (as in regression,
the treatment means are the fitted values). The first sum of squares
is the part of the variation that can be explained by deviations of
the treatment means from the overall mean. We can write this as
SStotal = SStreatment + SSerror
The analysis of variance is based on this identity. The total sum
of squares equals the sum of squares between groups plus the sum
of squares within groups.
Distributions for Sums of Squares
Each of the sums of squares above can be written as a quadratic
form:
Source         SS                        df
treatment A    SS_A = Y′(H − (1/n)J)Y    a − 1
residual       SSE  = Y′(I − H)Y         N − a
total          SST  = Y′(I − (1/n)J)Y    N − 1

where J is the n × n matrix of ones, and H is the hat matrix X(X′X)⁻¹X′.
Q: From your regression notes, what does that imply for the
distributions of these sums of squares?
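As a numerical sketch of these quadratic forms (hypothetical data; J is taken here as the N × N matrix of ones, so (1/N)J averages the observations, and H is computed from a full-rank parameterisation of X so that (X′X)⁻¹ exists):

y     <- c(12, 14, 13, 15, 18, 17, 19, 20, 25, 23, 24, 26, 22)   # hypothetical responses
group <- factor(rep(1:3, times = c(4, 4, 5)))
N     <- length(y)

X <- model.matrix(~ group)                   # full-rank version of the design matrix
H <- X %*% solve(crossprod(X)) %*% t(X)      # hat matrix X(X'X)^{-1}X'
J <- matrix(1, N, N)                         # matrix of ones

SSA <- drop(t(y) %*% (H - J/N) %*% y)        # treatment SS
SSE <- drop(t(y) %*% (diag(N) - H) %*% y)    # residual SS
SST <- drop(t(y) %*% (diag(N) - J/N) %*% y)  # total SS
all.equal(SST, SSA + SSE)                    # the decomposition holds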
Cochran’s Theorem
Let Zi ∼ iidN (0, 1) with i = 1, . . . , v
v
∑ Zi2 = Q1 + Q2 + . . . + Qs
i =1
s ≤ v and Qi has vi d.f.
then Q1 , . . . , Qs are independent χ2 random variables with v1 , . . . , vs
d.f., respectively, if and only if
v = v1 + v2 + . . . + v s
Expected Mean Squares
LEMMA A. Let Xi, i = 1, . . . , n, be independent random variables with E(Xi) = µi and Var(Xi) = σ². Then

E(Xi − X̄)² = (µi − µ̄)² + ((n − 1)/n) σ²

where µ̄ = (1/n) ∑_{i=1}^n µi.

[Hint: E(U²) = [E(U)]² + Var(U). Take U = Xi − X̄.]

THEOREM A. Under the assumptions for the model Yij = µ + αi + eij, and assuming all ni = n,

E(SSerror) = ∑_i ∑_j E(Yij − Ȳi·)² = (N − a)σ²

E(SStreatment) = ∑_i ∑_j E(Ȳi· − Ȳ··)² = ∑_i ni E(Ȳi· − Ȳ··)² = ∑_i ni αi² + (a − 1)σ²

MSE = SSerror/(N − a) may be used as an estimate for σ². It is an unbiased estimator. If all the αi are equal to zero, then the expectation of SStreatment/(a − 1) is also σ²!
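A small simulation sketch (all numbers hypothetical) illustrating Theorem A: when all αi = 0 both mean squares have expectation σ², whereas a non-zero treatment effect inflates only MStreatment.

set.seed(1)
a <- 4; n <- 6; sigma <- 2
alpha <- c(0, 0, 0, 0)                    # change to e.g. c(-2, 0, 0, 2) to see MSA inflate
group <- factor(rep(1:a, each = n))

one.run <- function() {
  y   <- 10 + alpha[as.integer(group)] + rnorm(a * n, 0, sigma)
  msa <- sum(n * (tapply(y, group, mean) - mean(y))^2) / (a - 1)
  mse <- sum((y - ave(y, group))^2) / (a * n - a)
  c(MSA = msa, MSE = mse)
}
rowMeans(replicate(5000, one.run()))      # both close to sigma^2 = 4 when all alpha_i = 0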
F test of H0 : α1 = α2 = · · · = α a = 0
THEOREM B. If the errors are independent and normally distributed with means 0 and variances σ², then SSerror/σ² follows a chi-square distribution with (N − a) degrees of freedom. If, additionally, the αi are all equal to zero, then SStreatment/σ² follows a chi-square distribution with a − 1 degrees of freedom and is independent of SSerror.
Proof. We first consider SSerror. From STA2004F,

(1/σ²) ∑_{j=1}^{ni} (Yij − Ȳi·)²

follows a chi-square distribution with ni − 1 degrees of freedom. There are a such sums in SSerror, and they are independent of each other since the observations are independent. The sum of a independent chi-square random variables that each have n − 1 degrees of freedom follows a chi-square distribution with N − a degrees of freedom. The same reasoning can be applied to SStreatment, noting that Var(Ȳi·) = σ²/ni.
We next prove that the two sums of squares are independent of
each other. SSerror is a function of the vector U, which has elements
Yij − Ȳi. , where i = 1, . . . , a and j = 1, . . . , n. SStreatment is a function
of the vector V, whose elements are Ȳi. . Thus, it is sufficient to
show that these two vectors are independent of each other. First, if
i ̸= i′ , Yij − Ȳi. and Ȳi′ . are independent since they are functions of
different observations. Second, Yij − Ȳi. and Ȳi. are independent (by
another Theorem from STA2004F). This completes the proof of the
theorem.
■
Under H0 : α1 = α2 = · · · = αa = 0,

F = [ SStreatment/(a − 1) ] / [ SSerror/(N − a) ] = MStreatment / MSE

has a central F distribution:

F ∼ F_{a−1, N−a},    E[F] = d2/(d2 − 2)

where d2 is the denominator degrees of freedom. If H0 is false, F has a non-central F distribution with non-centrality parameter λ = ∑_i ni αi² / σ², and the statistic will tend to be larger than 1. Large values of F therefore provide evidence against H0.
This is always a one-sided test. Why?
THEOREM C. Under the assumption that the errors are normally distributed, the null distribution of F is the F distribution
with ( a − 1) and ( N − a) degrees of freedom.
Proof. The proof follows from the definition of the F distribution, as the ratio of two independent chi-square random variables
divided by their degrees of freedom.
■
ANOVA table
These results can be summarised in an analysis of variance (ANOVA)
table.
Source             SS                       df       Mean Square      F            EMS
treatment A        ∑_i ni (Ȳi· − Ȳ··)²      a − 1    SS_A/(a − 1)     MS_A/MSE     σ² + ∑ ni αi²/(a − 1)
residual (error)   ∑_i ∑_j (Yij − Ȳi·)²     N − a    SSE/(N − a)                   σ²
total              ∑_i ∑_j (Yij − Ȳ··)²     N − 1

This is a one-way analysis of variance. The 'one-way' refers to there only being one factor in the model and thus in the ANOVA table. Note that the ANOVA table is still based on the model 3.1, and will have one SS for each term in the model (except the mean), but see the table below.

To test H0 : α1 = α2 = · · · = αa = 0, we use F = MS_A/MSE ∼ F_{a−1, N−a}.
Some Notes
1. An alternative, less common, form for the ANOVA table is
Source             SS                       df       Mean Square      F
mean µ             N Ȳ··²                   1
treatment          ∑_i ni (Ȳi· − Ȳ··)²      a − 1    SS_A/(a − 1)     MS_A/MSE
residual (error)   ∑_i ∑_j (Yij − Ȳi·)²     N − a    SSE/(N − a)
total              ∑_i ∑_j Yij² = Y′Y       N
There is an extra term due to the mean with 1 degree of freedom. The error and treatment SS, and the F test, are the same as
previously.
In the table above we have split the total variation, SStot =
∑i ∑ j Yij2 with N degrees of freedom into three parts, namely
SStot = SSµ + SS A + SSE
with degrees of freedom
N = 1 + ( a − 1) + ( N − a )
respectively. Each SS can be identified with a term in the model
Yij = µ + αi + eij,    i = 1, . . . , a;   j = 1, . . . , ni
2. We have closed form expressions for each of the Sums of Squares.
This is in contrast to multiple regression, where usually explicit
expressions cannot be given for the individual regression sum of
squares. Furthermore, subject to the constraints, we have closed
form expressions for the parameter estimates as well.
3. The error sum of squares can be written as

   ∑_{i=1}^a (ni − 1) si²

   where si² is the variance of the ith group, si² = ∑_j (Yij − Ȳi·)²/(ni − 1). So the Mean Square for Error, MSE, is a pooled estimate of σ²,

   s² = [ (n1 − 1)s1² + (n2 − 1)s2² + . . . + (na − 1)sa² ] / (n1 + n2 + . . . + na − a)
For a = 2, this is the estimate of σ2 we use in the two-sample
t-test (assuming equal variances).
4. The treatment sum of squares could also be used to estimate σ2
if we assume H0 is true. Recall that for any mean X̄, Var ( X̄ ) =
σ2 /n, so SS A = ∑i ni (Y i· − Y ·· )2 measures the variation in the
group means about the overall mean. If the means do not differ,
i.e. H0 is true, then SS A /( a − 1) should also estimate σ2 . So the
test of

H0 : α1 = α2 = · · · = αa = 0

made using

MS_A / MSE = [ SS_A/(a − 1) ] / [ SSE/(N − a) ] ∼ F_{a−1, N−a}

is an F-test comparing variances. This is the origin of the term Analysis of Variance.
5. The Analysis of Variance for comparing the means of a number of groups is equivalent to a regression analysis. However,
in ANOVA, the emphasis is slightly different. In regression
analysis, we test if an arbitrary subset of the parameters is zero
[H0 : β(2) = 0]. In ANOVA we are interested in testing if a
particular subset, namely α1 , α2 , . . . α a , are zero.
6. MSE, the mean square for error, MSE = SSE/(N − a), is an unbiased estimator of σ², provided the model used is the correct one.

   MSE = σ̂² = (1/(N − a)) ∑_i ∑_j (Yij − Ȳi·)² = s²

   Note that (N − a) = ∑_{i=1}^a (ni − 1). This estimate of σ² is used for all comparisons of the treatment/group means.
Speed-reading Example
METHOD            I      II     III
                  82     71     91
                  80     79     93
                  81     78     84
                  83     74     90
                                88
Mean              81.5   75.5   89.2
Std Deviation     1.3    3.7    3.4
ni                4      4      5
Sums of Squares

N Ȳ··²       = 88729
∑ Yij²       = 89246
∑ ni Ȳi·²    = 89153

SStotal  = 89246 − 89729 + 1000
From this table we would conclude that: There is strong evidence
(p < 0.001) that reading speed differs between teaching methods
(F = 22.8 ∼ F2,10 ).
Now that we have found evidence that the teaching methods
differ, we can finally answer the really important question: Which
is the best method? How much better is it? For this we need to
compare the three methods (means) amongst each other. We do this
in Chapter 4. We could also skip the ANOVA table and jump to the
real questions of interest immediately. But very often, an ANOVA is
performed to obtain a summary of which factors are responsible for
most of the variation in the data.
3.5 Randomization Test for H0 : α1 = . . . = α a = 0
Randomization tests can be used for data from properly randomized experiments. In fact, the ONLY assumption required for randomization tests to be valid and give exact p-values is that treatments were randomly assigned to experimental units according to
the rules of the particular experimental design. No assumptions
about normality, equal variance, random samples, independence
are needed. Therefore, some call randomisation tests the ultimate
nonparametric tests (Edgington, 2007). Fisher (1936) used this fact as
one of his strong arguments for the requirement of randomisation
in experiments and said about the F and t tests mostly used for the
analysis of experiments: “conclusions have no justification beyond
the fact that they agree with those which could have been arrived
at by this elementary method”, the elementary method being the
randomization tests.
The randomisation test is a statistical test in which the distribution of the test statistic under the null hypothesis is obtained by
calculating the test statistic under all possible rearrangements of the
observed data points.
The idea is as follows: if the null hypothesis is true, the treatment has no effect, and the observed values merely reflect natural
variation in the experimental units. Calculating the test statistic
(e.g. difference between means) under all possible randomizations
of treatments to experimental units then gives an idea of the distribution of the difference under H0 . Comparing our observed test
statistic to this null or reference distribution can tell us how likely
the observed statistic is relative to what we would expect under H0 ,
expressed as a p-value. The p-value will tell us how often we would
expect a difference this extreme, under the null hypothesis that the
treatments have no effect.
Example: Suppose there are 6 subjects, 2 treatments, 3 subjects
randomly assigned to each treatment. The response measured is the
reaction time (in seconds).
The null hypothesis will state that the mean (or median) reaction
time is the same for each treatment. The alternative hypothesis
states that at least one subject would have provided a different reaction
time under a different treatment.
The actual/observed/realised randomisation which resulted in
the observed responses, is only 1 of 6!/(3! 3!) = 20 possible randomisations of the 6 subjects to the 2 treatments. Each of these 20 possible randomisations was equally likely. Under H0 the treatment has no
effect, i.e. the observed values are differences between the subjects
not due to treatments, and thus the observed values would stay the
same with a different randomisation.
We can now construct a reference distribution (under H0 ) from
the 20 test statistics obtained for the different possible randomisations. The observed test statistic is compared to this reference
distribution and the p-value calculated as the proportion of values ≥ observed test statistic (or ≤, depending on the alternative
hypothesis).
Example: Tomato Plants
This is an experiment whose objective was to discover whether
a change in the fertilizer mixture applied to tomato plants would
result in an increased yield.
Eleven plants in a single row were randomly assigned so that 5 were given the standard fertilizer mixture A and 6 were fed a supposedly improved mixture B. The gardener took 11 playing cards, 5 red and 6 black, thoroughly shuffled these and then dealt them to give a sequence of red (A) and black (B) cards.
Table 3.1: Tomato Plant Data

Position              1     2     3     4     5     6     7     8     9     10    11
Fertilizer            A     A     B     B     A     B     B     B     A     A     B
Pounds of Tomatoes    29.9  11.4  26.6  23.7  25.3  28.5  14.2  17.9  16.5  21.1  24.3

standard fertilizer A:   nA = 5,   ∑yA = 104.2,   ȳA = 20.84
modified fertilizer B:   nB = 6,   ∑yB = 135.2,   ȳB = 22.53
difference in means (modified minus standard) ȳ B − ȳ A = 1.69
• H0 : modifying the fertilizer mixture has no effect on the results
and therefore, in particular, no effect on the mean. H0 : µ B −
µA = 0
• H1 : the modified fertilizer (B) gives a higher mean. H1 : µ B −
µA > 0
• There are theoretically 11!/(5! 6!) = 462 possible ways of allocating 5 A's and 6 B's to the 11 plants.
The given experimental arrangement is just one of the 462, any
one of which could equally well have been chosen. To calculate the
randomisation distribution appropriate to the H0 that modification
is without effect (i.e. that µ A = µ B ), we need to calculate all 462
differences in the averages obtained from the 462 possible arrangements.
The table above shows one such arrangement with its corresponding difference in means = 1.69. Another arrangement could
have been:
Position in row       1     2     3     4     5     6     7     8     9     10    11
Fertilizer            A     B     B     A     A     B     A     B     B     A     B
Pounds of Tomatoes    29.9  11.4  26.6  23.7  25.3  28.5  14.2  17.9  16.5  21.1  24.3

standard fertilizer A:   nA = 5,   ∑yA = 114.2,   ȳA = 22.84
modified fertilizer B:   nB = 6,   ∑yB = 125.2,   ȳB = 20.87

ȳB − ȳA = −1.97

There are 460 more such arrangements with resulting differences in means. These 462 differences are summarised by the histogram below.
[Figure 3.1: Randomisation distribution for tomato plant data. A histogram (density scale) of the 462 possible differences in means, ranging from about −10 to 10. The red cross indicates the observed difference of 1.69 (ȳB − ȳA).]
The observed difference of 1.69 is indicated with a cross. We find that in this example, 154 of the possible 462 arrangements yield differences greater than or equal to 1.69: p = 154/462 = 0.33.
The p-value of 0.33 suggests that the observed difference in
means is likely under H0 , and therefore that we cannot conclude
that fertilizer B resulted in a higher mean.
As a comparison, the p-value from the two-sample t-test (assuming equal variance) is 0.34, very close. The reason is that in this
example the assumptions required for the t-test are met. However,
note that in the case where these assumptions are not met (e.g.
skew distributions), the randomisation test p-value will give us a
much better p-value (as a measure of how extreme the observed
statistic is relative to the null hypothesis).
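The full randomisation distribution for the tomato data can be enumerated directly in R; a sketch (combn lists all 462 ways of choosing the 6 plants that receive B):

yield <- c(29.9, 11.4, 26.6, 23.7, 25.3, 28.5, 14.2, 17.9, 16.5, 21.1, 24.3)
obs.B <- c(3, 4, 6, 7, 8, 11)                          # positions that actually received B
obs.diff <- mean(yield[obs.B]) - mean(yield[-obs.B])   # 1.69

allocations <- combn(11, 6)                            # all 462 possible allocations of B
diffs <- apply(allocations, 2, function(idx) mean(yield[idx]) - mean(yield[-idx]))

hist(diffs, xlab = "difference in means", main = "Randomisation distribution")
mean(diffs >= obs.diff)                                # one-sided p-value, about 154/462 = 0.33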
3.6 A Likelihood Ratio Test for H0 : α1 = . . . = α a = 0
In general, likelihood ratio tests compare two nested models, by
comparing their likelihoods (in the form of a ratio). The likelihood
ratio compares the relative support for the two models based on the
information in the data.
For the hypothesis test H0 : α1 = . . . = α a = 0 we will compare
the following two models: let model Ω assume that there are differences between the treatments, and let model ω assume that the
treatments have no effect, i.e. a model that corresponds to H0 being
true.
(a) Model Ω is

Yij = µ + αi + eij,    eij ∼ N(0, σ²)

or equivalently Yij ∼ N(µ + αi, σ²).

(b) Model ω (also called the restricted or the null model) is

Yij = µ + eij,    eij ∼ N(0, σ²)

or equivalently Yij ∼ N(µ, σ²).
A likelihood ratio test for H0 can be constructed using
λ = L(ω̂) / L(Ω̂)        (3.2)
where L(ω̂ ) is the maximized likelihood if H0 is true and L(Ω̂)
is the maximum value of the likelihood when the parameters are
unrestricted. Essentially, we need to fit two models.
To obtain the likelihood for model Ω, we are assuming independent observations, and can therefore multiply the probabilities of
the observations:
L(µ, αi, σ²) = ∏_ij (2πσ²)^{-1/2} exp{ −(1/(2σ²)) (Yij − µ − αi)² }        (3.3)
The log-likelihood is

l(µ, αi, σ²) = −(N/2) log 2π − (N/2) log σ² − (1/(2σ²)) ∑_{i=1}^a ∑_{j=1}^{ni} (Yij − µ − αi)²        (3.4)
where N = ∑_{i=1}^a ni. For fixed σ² this is maximised when the last
term is a minimum. But this term is exactly the sum of squares
that was minimized when finding the least squares estimate! So the
least squares estimates are the same as the maximum likelihood
estimates. (Note that this is only true for normal models, normal
errors).
(Note: we are actually approximating the probability of each observation by its density; probability = density × constant, and we ignore the constant because multiplicative constants have no effect on maximum likelihood estimates.)

Let

RSS(Ω) = ∑_i ∑_j (Yij − µ̂ − α̂i)²        (3.5)
then the maximized log-likelihood for fixed σ² is

ℓ(Ω̂) = c − RSS(Ω̂)/(2σ²)        (3.6)

where

c = −(N/2) log(2πσ²)
Repeating the same argument for model ω, assuming α1 = α2 = . . . = αa = 0:

L(ω) = ∏_ij (2πσ²)^{-1/2} exp{ −(1/(2σ²)) (Yij − µ)² }

l(ω) = −(N/2) log(2π) − (N/2) log(σ²) − (1/(2σ²)) ∑_i ∑_j (Yij − µ)²        (3.7)
For fixed σ² this is maximised when

RSS(ω) = ∑_i ∑_j (Yij − µ)²        (3.8)

is a minimum, and this occurs when µ is the least squares estimate, so RSS(ω̂) = ∑_i ∑_j (Yij − µ̂)², where µ̂ = Ȳ··. Then the maximum of (3.7) is (for fixed σ²)

ℓ(ω̂) = c − RSS(ω̂)/(2σ²)        (3.9)
We now take minus twice the difference of the log-likelihoods (corresponding to minus twice the log of the likelihood ratio)

λ = L(ω̂)/L(Ω̂)        (3.10)

−2 log λ = [ RSS(ω̂) − RSS(Ω̂) ] / σ²        (3.11)

This has, for large samples, a chi-squared distribution with (N − 1) − (N − a) = a − 1 degrees of freedom. Note that this criterion
looks at the reduction in the residual sum of squares and compares
this to the error variance. One remaining problem is that σ2 is not
known. In practice we estimate σ2 from the residual sum of squares
of the larger model.
σ̂² = RSS(Ω̂)/(N − a),        σ̂²(N − a)/σ² ∼ χ²_{N−a}
Under the assumption of normality the likelihood ratio statistic
−2logλ has an exact chi-square distribution when σ2 is known.
When σ2 is estimated we use
F = [ (RSS(ω̂) − RSS(Ω̂))/(a − 1) ] / [ RSS(Ω̂)/(N − a) ] ∼ F_{a−1, N−a}
Again note that the F test depends on the normality assumption.
Verify that this is equivalent to the F test found in the sums of
squares derivation above.
For normal data, least squares and maximum likelihood will
result in the same solution (parameter estimates). Actually, even the
problem is the same: that of minimizing the error sum of squares.
When the data are not normally distributed, least squares still
provides estimates that minimize the squared deviations of the observed values from the estimated. They provide a best fit. However,
other methods such as maximum likelihood may result in better
out-of-sample prediction error.
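In R this F statistic is exactly what a nested-model comparison reports. A sketch, reusing the speed data frame from the earlier example:

fit.null <- lm(words ~ 1, data = speed)       # model omega: treatments have no effect
fit.full <- lm(words ~ method, data = speed)  # model Omega

anova(fit.null, fit.full)                     # F test comparing the two models

# the same statistic assembled by hand
rss0 <- sum(resid(fit.null)^2)                # RSS(omega-hat)
rss1 <- sum(resid(fit.full)^2)                # RSS(Omega-hat)
Fstat <- ((rss0 - rss1) / 2) / (rss1 / 10)    # (a - 1) = 2 and (N - a) = 10 here
pf(Fstat, 2, 10, lower.tail = FALSE)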
3.7 Kruskal-Wallis Test
This is a nonparametric test to compare more than two independent populations. It is the nonparametric version of the one-way
ANOVA F-test which relies on normality of populations.
For the Kruskal-Wallis test, the assumptions are that we have k
independent random samples of sizes n1 , n2 , . . . , nk (k ≥ 3), independent observations within samples, that the k populations are
identical except possibly with respect to location, and the data must
be at least ordinal.
Hypotheses:
H0 : the k populations are identical (the k medians are equal)
H1 : the k populations differ with respect to location (at least one median differs)
To calculate the test statistic we rank all observations from 1 to N (N = ∑_{i=1}^k ni); for ties, assign the mean of the tied ranks to each observation. The test statistic is based on comparing each group's mean rank with the mean of all ranks (weighted by sample size). For each group i calculate Ri = sum of the ranks in group i. Then

H = (12 / (N(N + 1))) ∑_{i=1}^k ni ( Ri/ni − (N + 1)/2 )²
  = (12 / (N(N + 1))) ∑_{i=1}^k Ri²/ni − 3(N + 1)

For large sample sizes the distribution of the Kruskal-Wallis test statistic can be approximated by the χ²-distribution with k − 1 degrees of freedom:

H ≈ χ²_{k−1}
When sampling from a normal distribution, the power of the
Kruskal-Wallis test is almost equal to that of a classical F-test. When
outliers are present, the Kruskal-Wallis test is much more reliable
than the F-test.
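In R the test is kruskal.test(); a sketch using the speed-reading data frame from earlier, together with the rank formula assembled by hand (there are no ties in these data, so the two agree exactly):

kruskal.test(words ~ method, data = speed)

r  <- rank(speed$words)
Ri <- tapply(r, speed$method, sum)            # rank sums per group
ni <- tabulate(speed$method)
N  <- length(r)
H  <- 12 / (N * (N + 1)) * sum(Ri^2 / ni) - 3 * (N + 1)
pchisq(H, df = nlevels(speed$method) - 1, lower.tail = FALSE)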
4
Comparing Means: Contrasts and Multiple Comparisons
This chapter goes more deeply into methods for exploring WHICH
treatments differ, are best or worst, and by how much they differ.
We do this using contrasts. There are statistical problems that occur
when doing many tests (multiple comparisons or testing), and
you will learn about methods to reduce the problems of multiple
testing. It is important to be aware of these problems in order to be
able to interpret and understand results from experiments or other
types of analysis correctly.
In most experiments we want to compare a number of treatments.
For example, a new method of communicating with employees
is compared to the current system. We want to know 1) whether
there is an improvement (or change) in employee happiness, and,
more importantly, 2) how large this change is. For the latter we need
estimates and standard errors or confidence intervals.
The analysis of variance table tells us how much evidence there
is that the means differ. In this chapter we consider what the next
steps in the analysis are. If there is no evidence for differences
between the means, technically speaking the analysis is complete
at this stage. However, one should remember that there are two
possible situations/reasons that could both lead to this outcome of
no evidence that the means differ.
1. There is truly no difference between the means (or it is so small
that it is not interesting).
2. There is a difference, but the F-test did not have enough power.
Technically, the power of a test is the probability of rejecting H0
if false. Reasons for lack of power are:
(a) Too few observations on each mean.
(b) The variation, σ2 , is too large (relative to the differences between means). If this is the case, reducing σ2 by controlling
for extraneous factors should be considered.
Both of these are design issues. Therefore, it is crucial to think
about power at the design stage of the experiment (see Chapter
6).
Suppose however, that we have found enough evidence to warrant further investigation into which means differ, which don’t and
by how much they differ. To do this we contrast one group of means
with another, i.e. we compare groups of means to find out where the
differences between treatments are. We can do this in two ways,
either by constructing a test of the form
H0 : µ A − µ B = 0
or by constructing a confidence interval for the difference of the
form:
est(diff) ± tν × SE(diff)
The confidence interval is much more informative than the result
from a hypothesis test.
4.1 Contrasts
Consider the model Yij = µ + αi + eij with constraint ∑ αi = 0.
We will be assuming that all ni are equal. (It is possible to construct
contrasts with unequal ni , but this gets very complicated. This is
one reason to design a CRD experiment with an equal number of
experimental units per treatment.)
Definition: A linear combination of the parameters αi, L = ∑_{i=1}^a hi αi, such that ∑ hi = 0, is called a contrast.
Point estimates of Contrasts and their variances
The maximum likelihood (and least squares) estimate of L is

L̂ = ∑_{i=1}^a hi α̂i = ∑_{i=1}^a hi (Ȳi· − Ȳ··) = ∑_{i=1}^a hi Ȳi·        since ∑ hi Ȳ·· = Ȳ·· ∑ hi = 0

Var(L̂) = ∑_{i=1}^a hi² Var(Ȳi·) = σ² ∑_{i=1}^a hi²/ni

V̂ar(L̂) = s² ∑ hi²/ni

where s² is the mean square for error with ν degrees of freedom, e.g. ν = N − a in a CRD.
Examples
1. α1 − α2 is a contrast with h1 = 1, h2 = −1, h3 = · · · = ha = 0. Its estimate is Ȳ1· − Ȳ2· with variance s²(1/n1 + 1/n2).
2. α3 + α4 + α5 is not a contrast - why? Define a contrast to compare means 3 and 4 and 5.
3. I might want to compare average salaries in small companies to those in medium and large companies, maybe to answer the question whether one earns less in small companies. To do this I would construct a contrast: µsmall − µ(med or large) = µsmall − ½(µmed + µlarge) (assuming that the number of companies in each group was the same), i.e. I am comparing/contrasting two average salaries. The coefficients hi sum to zero: 1 − ½ − ½ = 0.
Sometimes contrasts are defined in terms of treatment totals, e.g.
Yi· , instead of treatment means. We will mainly compare/contrast
treatment means.
Comments on Contrasts
1. Although we have defined contrasts in terms of the α’s, contrasts estimate differences between means, so they are very often
simply called contrasts of means. Essentially they estimate the
differences between groups of means.
2. In designed experiments the means are usually based on the
same number of observations, n. However contrasts can be defined if there are unequal numbers of observations.
3. The estimate for σ2 is given by the Mean Square for Error,
s2 = MSE. Its degrees of freedom depend on the design of
the experiment, and the number of replicates.
4. Many multiple comparison methods depend on Student’s t distribution. The degrees of freedom of t = degrees of freedom of
s2 .
5. The standard error (SE) of a contrast is a measure of its precision
or uncertainty, how well were we able to estimate the difference
in means? It is the standard deviation of the sampling distribution of the contrast.
6. An important contrast is that of the difference between two
means Ȳ1· − Ȳ2· . Its standard error is called the standard error
of the difference s.e.d.
s.e.d. = √( s²(1/n + 1/n) ) = s √(2/n)
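A small sketch, with made-up numbers, of how a contrast estimate, its standard error and a confidence interval are computed from the treatment means and the MSE:

ybar <- c(20.1, 23.4, 25.0)        # hypothetical treatment means, a = 3, n = 5 each
n    <- 5
s2   <- 4.2                        # hypothetical MSE with nu = N - a = 12 df
h    <- c(1, -1, 0)                # contrast mu1 - mu2

L.hat <- sum(h * ybar)                          # estimate of the contrast
se.L  <- sqrt(s2 * sum(h^2 / n))                # standard error (here the s.e.d.)
L.hat + c(-1, 1) * qt(0.975, df = 12) * se.L    # 95% confidence interval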
4.2 Orthogonal Contrasts
Orthogonal contrasts are sets of contrasts with specific mathematical properties:
Two contrasts, L1 and L2 where L1 = ∑ h1i µi and L2 = ∑ h2i µi ,
are orthogonal if
∑i h1i h2i = 0.
Orthogonality implies that they are independent, i.e. that they
summarize different dimensions of the data.
Example
An experiment is conducted to determine the wearing quality of
paint. The paint was tested under four conditions:
1. Hard wood dry climate µ1 .
2. Hard wood wet climate µ2 .
3. Soft wood dry climate µ3 .
4. Soft wood wet climate µ4 .
What can you say about the treatment structure? How many
factors are involved? We might want to ask the following questions:
1. Is the average life on hard wood the same as on soft wood?
2. Is the average life in a dry climate the same as in a wet climate?
3. Does the difference between wet and dry climates depend on
whether or not the wood was hard or soft?
These questions can be formulated before we see the results of
the experiment. To answer them we would want to test the following hypothesis:
1. H0 : ½(µ1 + µ2) = ½(µ3 + µ4), or equivalently H0 : ½µ1 + ½µ2 − ½µ3 − ½µ4 = 0

2. H0 : ½(µ1 + µ3) = ½(µ2 + µ4) ≡ H0 : ½µ1 + ½µ3 − ½µ2 − ½µ4 = 0

3. H0 : ½(µ1 − µ2) = ½(µ3 − µ4) ≡ H0 : ½µ1 − ½µ2 − ½µ3 + ½µ4 = 0

This last contrast is testing for an interaction between type of wood and climate, i.e. does the effect of climate depend on the type of wood (see Chapter 7).
Clearing these contrasts of fractions we can write the coefficients
hki in a table, where hki is the ith coefficient of the kth contrast.
                                            h1    h2    h3    h4
1. Hard vs soft wood                         1     1    −1    −1
2. Dry vs wet climate                        1    −1     1    −1
3. Climate effect depends on wood type       1    −1    −1     1
Although it is easier to manipulate contrasts that don't contain fractions, and hypothesis tests will lead to the same results with or without fractions, confidence intervals will differ. Keeping the ½'s will lead to confidence intervals for the difference in means. Without the fractions, we obtain a confidence interval for 2× the difference in means. As we do want to understand what these values tell us, the first (with the ½'s) is much more useful.
Note that ∑_{i=1}^4 h1i = ∑_{i=1}^4 h2i = ∑_{i=1}^4 h3i = 0 by definition of a contrast. But also ∑_{i=1}^4 h1i h2i = 0, i.e. contrasts 1 and 2 are orthogonal. This means that their estimates will be statistically independent under normal theory (or uncorrelated if non-normal). You can verify that contrasts 2 and 3, and 1 and 3, are also orthogonal.
From the four means we have found three mutually orthogonal
(independent) contrasts. In general, given p means we can find
( p − 1) orthogonal contrasts. There are many sets of ( p − 1)
orthogonal contrasts - we can select a set that is convenient to us.
If it so happens that the questions of interest result in a set of
orthogonal contrasts, this will simplify the interpretation of results.
However, it is more important to ask the right questions than to
obtain a set of orthogonal contrasts.
In some cases it is convenient to test contrasts in the context of
analysis of variance. The treatment sums of squares, SS A say, have
( a − 1) degrees of freedom, when a treatments are compared. This
sum of squares can be split into ( a − 1) mutually orthogonal
(independent) sums of squares, each with 1 degree of freedom, each
corresponding to a contrast, so that
SS A = SS1 + SS2 + . . . + SSa−1
We can test for these ( a − 1) orthogonal contrasts simultaneously
within the ANOVA table.
Calculating sums of squares for orthogonal contrasts
For convenience we assume the treatments are equally replicated
(i.e. the same number of observations on each treatment). Let
a = number of treatments,
n = number of replicates per treatment,
Ȳi. = mean for treatment i
SS_A = ∑_{i=1}^a ∑_{j=1}^n (Ȳi· − Ȳ··)², the treatment SS with (a − 1) df.
Then:
1. L = h1 Ȳ1· + h2 Ȳ2· + . . . + ha Ȳa· is a contrast if ∑_{i=1}^a hi = 0.

2. Var(L) = (s²/n) ∑_i hi², where s² = MSE.

3. L1 and L2 are orthogonal if ∑ h1i h2i = 0.

4. The sum of squares for L is

   SS_L = n L² / ∑ hi²

   and has one degree of freedom.

5. If L1 and L2 are orthogonal then SS2 = n L2² / ∑ h2i² is a component of SS_A − SS1.

6. If L1, L2, . . . , L_{a−1} are (a − 1) mutually orthogonal contrasts then the treatment sum of squares, SS_A, can be partitioned as

   SS_A = SS1 + SS2 + . . . + SS_{a−1}

   where each SSi has 1 degree of freedom.

7. The hypothesis H0 : Li = 0 versus H1 : Li ≠ 0 can be tested using F = SSi/MSE with 1 and ν degrees of freedom, where ν = degrees of freedom of MSE.

8. Orthogonal contrasts can be defined if there are unequal numbers of replications in each group, but the simplicity of the interpretation breaks down. With n1, n2, . . . , na observations in each group,

   L = h1 Ȳ1· + h2 Ȳ2· + . . . + ha Ȳa·
is a contrast iff
n1 h1 + n2 h2 + . . . + n a h a = 0
and L1 and L2 are orthogonal iff
n1 h11 h21 + n2 h12 h22 + . . . + n a h1a h2a = 0
An equal number of replicates for each treatment ensures that
we have meaningful sets of orthogonal contrasts, each of which
will explain some aspect of the experiment independently of any
others. This gives a very clear interpretation of the results. If we
have unequal numbers of replications of the treatments the
different aspects cannot be completely separated.
9. The word “orthogonal" is used in the same sense as in
mechanics where two orthogonal forces ↑→ act independently of
each other. In an a dimensional space these can be seen as
( a − 1) perpendicular vectors.
Example
To compare the durability of different methods of finishing a piece
of mass produced furniture, the production manager set up an
experiment. There were two types of paint available (called A and
B), and two methods of applying it: by brush or spray. Six pieces of
furniture were randomly assigned to each treatment and the
durability of each was measured.
The treatments were:
1. Paint A with brush.
2. Paint B with brush.
3. Paint A with spray.
4. Paint B with spray.
The experimenter is interested in comparing:
1. Paint A with Paint B.
2. Brush Method with Spray Method.
3. How methods compare within the two paints, i.e. is the
difference between brush and spray the same for both paints?
The treatment means were:
Treatment:    1      2      3      4
Mean:         100    120    40     70
The ANOVA table is

Source        SS       df    MS      F
Treatments    22050    3     7350    50.69
Error         2900     20    145
The treatment sum of squares can be split into three mutually orthogonal contrasts, as shown in the table. The treatments correspond to the application methods and paints as follows, with treatment means:

Application Method    Brush          Spray
Paint                 A      B       A      B
Treatment Means       100    120     40     70

                                h1    h2    h3    h4     Li = ∑ hi Ȳi·    SSi = n Li² / ∑ hij²    F = SSi/MSE
1. Paints A and B               +1    −1    +1    −1     −50              3750                    25.86
2. Brush versus Spray           +1    +1    −1    −1     110              18150                   125.17
3. Methods within Paints        +1    −1    −1    +1     10               150                     1.03
Treatment SS                                                              22050

n = 6,   ∑ hij² = 4,   MSE = 145 with 20 degrees of freedom.
Performing the F-tests we see that there is evidence for a difference in durability between paints (F = 25.86 ∼ F1,20, p < 0.001), and between the two application methods (F = 125.17 ∼ F1,20, p < 0.001). There is no evidence that the effect of the application method on durability differs between the two paints (F = 1.03 ∼ F1,20, p = 0.32), i.e. there is no evidence that application method and paint interact (see factorial experiments).
Mean Durability for Paint A = ½(100 + 40) = 70.00
Mean Durability for Paint B = ½(120 + 70) = 95.00

A 95% confidence interval for the difference in durability between paint B and A:

25 ± t20 √(2 × 145 / 12) = [14.7; 35.3]

So, paint B is estimated to last, on average, between 14.7 and 35.3 months longer than paint A. Note the 12 in the denominator when calculating the standard error: the mean for paint A is based on 12 observations, and so is the mean for paint B.
Mean Durability using Brush = ½(100 + 120) = 110.00
Mean Durability using Spray = ½(40 + 70) = 55.00

Exercise: Construct confidence intervals for the brush and interaction contrasts.
Overall the above information suggests that the brush method gives
a more durable surface, and that the best combination for durability
is paint B applied with a brush. The brush method is preferable to
the spray method irrespective of which paint is used.
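These calculations can be reproduced with a few lines of R, using only the treatment means, n = 6 and MSE = 145 given above (a sketch; the contrast labels are mine):

means <- c(100, 120, 40, 70)      # treatments 1-4: A brush, B brush, A spray, B spray
n <- 6; MSE <- 145; df.err <- 20
H <- rbind(paints      = c(1, -1,  1, -1),
           brush.spray = c(1,  1, -1, -1),
           interaction = c(1, -1, -1,  1))

L  <- as.vector(H %*% means)               # -50, 110, 10
SS <- n * L^2 / rowSums(H^2)               # 3750, 18150, 150 (these add up to 22050)
Fs <- SS / MSE
data.frame(L, SS, F = Fs, p = pf(Fs, 1, df.err, lower.tail = FALSE))

# 95% CI for paint B minus paint A, using the 1/2 coefficients
h <- c(-0.5, 0.5, -0.5, 0.5)
sum(h * means) + c(-1, 1) * qt(0.975, df.err) * sqrt(MSE * sum(h^2) / n)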
4.3 Multiple Comparisons: The Problem
We have so far avoided the Neyman-Pearson paradigm for
statistical hypothesis testing. However, for discussing the problem
of multiple comparisons, it can be useful to temporarily revert to
thinking in terms of making a decision based on some
predetermined cut-off level, i.e. reject, or fail to reject H0 .
We know that when we make a statistical test, we have a small
probability α of rejecting the null hypothesis when true (α = Type I
error). In the completely randomised design, the means of the
groups fall naturally into a family, and statements we make will be
made in relation to the family, or experiment as a whole, i.e. we
cannot ignore what other tests have been performed when
interpreting the outcome of any single test. We would like to be
able to control the overall Type I error, also called the
experiment-wise Type I error rate, i.e. the probability of rejecting at
least one hypothesis that is true. Controlling the Type II error
(accepting at least one false hypothesis) is more difficult, as for this
we would need to know the true differences between the means.
What is the overall Type I error? We cannot say exactly but we can
place an upper bound on it.
Consider a family of k tests. Let Ei be the event {the ith hypothesis
is rejected when true}, i = 1, . . . , k, i.e. the event that we make a
type I error in test i. Then for test i, P(Ei ) = αi , the significance level
of the ith test.
∪_{i=1}^k Ei = {at least one hypothesis rejected when true}

and

P(∪_{i=1}^k Ei)

measures the overall probability of a Type I error. Extending the result P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2) to k events, we have that

P(∪_{i=1}^k Ei) = ∑_{i=1}^k P(Ei) − ∑_{i<j} P(Ei ∩ Ej) + ∑_{i<j<k} P(Ei ∩ Ej ∩ Ek) − . . . + (−1)^{k+1} P(E1 ∩ . . . ∩ Ek)

An upper bound for this probability can be obtained:

P(∪ Ei) ≤ ∑_{i=1}^k P(Ei) = kα    if all P(Ei) = α        (4.1)

This is called the Bonferroni inequality. The Bonferroni inequality implies that when conducting k tests, the overall probability of a Type I error can be as bad as k × α. For example, when conducting 10 tests, each with a 5% significance level, the probability of one or more Type I errors (wrong decisions) can be as high as 50%.

In the rare case of independent tests, we find

P(∪_{i=1}^k Ei) = 1 − P(∩_{i=1}^k Ei^c)
               = 1 − ∏_{i=1}^k P(Ei^c)        (the Ei's are independent)
               = 1 − ∏_{i=1}^k (1 − αi)
               = 1 − (1 − α)^k    if P(Ei) = α for all i
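A one-line sketch of how the Bonferroni bound kα and the exact probability for independent tests grow with the number of tests k (taking α = 0.05):

k <- 1:10
data.frame(k, bonferroni.bound = pmin(k * 0.05, 1), if.independent = 1 - (1 - 0.05)^k)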
4.4 To control or not to control the experiment-wise Type I error rate
In experiments, we conduct a large number of tests/comparisons
when comparing treatments. These comparisons can often be
phrased as contrasts. Data, and statistical quantities calculated from
these data (estimates, tests statistics or confidence intervals) are
random outcomes.
Suppose we want to compare several treatments, either in the form
of null hypothesis tests or confidence intervals. Each of these tests,
if the null hypothesis is true, has a small chance of resulting in a
type I error (because of the random nature of data). If I conduct
many tests, a small proportion of the results will be Type I errors.
Type I errors lead to false claims and recommendations in health,
science and business, and therefore should be avoided (we can’t
avoid them completely, but we can control / reduce the probability
of making type I errors.)
For example, if I conduct a completely randomised design
experiment with 20 treatments (including control treatments) for
repelling mosquitoes from a room, and I conduct multiple
comparisons to find out which are more effective than doing
nothing (one of the control treatments), and compare them to each
other1 , I will find treatments that seem to be more effective than
others. Some of these findings may be real, some will be
flukes/spurious results. The problem is: how do I know? And the
answer to that is: I can’t be certain, unless I accumulate more
evidence for or against the particular null hypotheses.
We have a dilemma. If we control (using Bonferroni, Tukey, Scheffé or any other of the many methods available), the Type II error rate increases, i.e. power decreases, meaning we could be missing some interesting differences. If we do NOT control, we could end up with a large Type I error rate (detect differences that are not real).
Exploratory vs confirmatory studies
The following summarizes my personal opinion on how one can
approach the above dilemma.
Studies (observational or experimental) can often be classified as
exploratory, or confirmatory.
In a confirmatory study you will know exactly what you are
looking for, will have a-priori hypotheses that you want to test, you
are collecting more evidence towards a particular goal, e.g. do
violent video games result in more aggressive behaviour?
Confirmatory studies will mostly have only a few a-priori
hypotheses, and therefore not such a large experiment-wise Type I
error rate. Also, often, the null hypothesis might not be true
(because we are looking to confirm a difference/effect). If the null
hypothesis is false, it is not possible to make a Type I error, and any
method to control for Type I errors would just increase the Type II
error rate.
On the other hand, in exploratory studies we have no clear
expectations, we are just looking for patterns, relationships. Here
we don’t need to make a decision (reject or not a null hypothesis).
But these are often the studies that generate a very large number of
tests, and hence a potentially very large number of Type I errors. If
we control Type I errors, we might be missing some of the
interesting patterns.

[Footnote 1: If I conduct all pairwise comparisons, there are (20 choose 2) = 190 tests. More, if I add other contrasts.]

A solution here is to treat the results with
suspicion/caution: be aware that many of the small p-values may
not be repeatable, and need to be confirmed before you can become
more confident that an effect/difference is present. Use these
studies as hypothesis generating, and follow up with confirmatory
studies. Use p-values to measure the strength of evidence, avoid the
Neyman-Pearson approach.
Declare whether your study is exploratory or confirmatory. This
makes it clear to the reader how to interpret your results.
A big problem is of course if one wants to show that something
DOES NOT HAVE AN EFFECT, e.g. that violent video games do
not increase aggression. Remember that a large p-value would not
mean that the null hypothesis is true or even likely. It could mean
that your power is too low to detect the effect. In this case it is
important to make sure that power is sufficient to detect an effect,
AND remember that large p-values DO NOT mean that there is NO
effect. Here we really want to compare evidence for H0 vs evidence
for H1. This is not possible with p-values or the classical statistical
hypothesis tests. One solution to this is to calculate a likelihood
ratio (related to Bayes factors): likelihood under null hypothesis, vs
likelihood under the alternative hypothesis, as a measure of how
much support the null vs the alternative hypothesis has. This of
course would also need to be repeated, as the likelihood ratio is just
as prone to spurious outcomes as any other statistic.
To find out if a result was a type I error or not (whether an effect is
real or not), one can follow up with other studies. Secondly, it is
important to admit how many tests were conducted, and which of
these were post-hoc, so that it is clear what the chances of type I
errors are. Also consider how plausible the null hypothesis is in the
first place, based on other information and context. If the null
hypothesis is likely (e.g. you would never expect this particular
gene to be linked to a particular disease, e.g. cancer), but your
statistical analyses have just indicated a small p-value, you should
be very suspicious, i.e. suspect a type I error.
Also if we don’t use the Neyman-Pearson approach (i.e. don’t make
a decision), but just report p-values (collecting evidence), we need
to be aware that small p-values can occur even if the null
hypothesis is true (e.g. P( p < 0.05| H0 ) = 0.05).
Each observation can be used either for exploratory analysis or for
confirmatory analysis. Not for both. If we use the same data to
generate and check a hypothesis, the results will be over-optimistic,
they will just confirm what we have already seen in the exploratory
study, and won’t give us an independent test of the hypothesis.
This is the same principle we used in regression when we split our
data into training and testing sets. In experimental studies we
usually don’t have the luxury to split data sets (too small), but we
need to be very careful about how we set up tests, and how we
interpret p-values.
4.5 Bonferroni, Tukey and Scheffé
Having said all of the above, and having recommended not to correct, you should still know that in many fields controlling the Type I error rate is expected. Bonferroni's, Tukey's and Scheffé's methods are the most commonly used methods to control the Type I error rate, although there are many, many more.
For Bonferroni’s method we need to know exactly how many tests
were performed. Therefore, this method is mainly used for a-priori
hypotheses. Also, it is very conservative (see Bonferroni’s
inequality, it corrects for the worst-case scenario). If the number of
tests exceeds approximately 10, the correction is so severe, that only
the very large effects will still be picked up.
Tukey’s method is used to control the experiment-wise Type I error
rate when all pairwise comparisons between the treatments are
performed.
Scheffé's method is used when many contrasts, not all of which are pairwise, are performed.
Many special methods have been devised to control Type I error
rates, also sometimes referred to as false positives. We discuss some
methods for Type I error rate adjustment that are commonly used,
but many others are available for special problems, such as picking
the largest mean, or comparing all treatments to a control (see the
list of references at the end of this chapter).
The methods we will consider here are:
• Bonferroni’s correction for multiple testing
• Tukey’s honestly significant difference (HSD)
• Scheffé’s method
These methods control the experiment-wise type I error rate.
Bonferroni Correction
Bonferroni’s method of Type I error rate adjustment can be quite
conservative (heavy). Therefore, it is only used in cases for which
there is a small number of planned comparisons. The correction is
based on the Bonferroni inequality.
1. Specify m contrasts that are of interest before looking at the data.

2. Adjust the percentile of the t-distribution: use the (α/2m)th percentile instead of the (α/2)th for each statement (Appendix Table 7). In other words, the significance level used for each individual test/contrast/confidence interval is αC = αE/m, where αE is the experiment-wise type I error rate, or the maximum Type I error rate we are willing to allow over all tests. Often, αE = 0.05.

The Bonferroni inequality ensures that the probability that all m intervals cover the true parameters is at least (1 − α), i.e. the probability of no Type I error.
Confidence intervals:

Given m statements of the form ∑ hi αi, the confidence intervals have the form

∑_{i=1}^a hi Ȳi· ± t_ν^{(α/2m)} [ s² ∑_{i=1}^a hi²/ni ]^{1/2}

where t^{(α/2m)} is the (α/2m)th percentile of t_ν, and ν = degrees of freedom for s² (MSE, the mean square for error).

For example, if we have decided on five comparisons (a-priori), we would make each comparison at the (two-sided) 1% level. Then the probability that all five confidence intervals cover the true parameter values is at least 95%.
Hypothesis tests:

H0 : ∑ hi αi = 0
HA : ∑ hi αi ≠ 0

There are different approaches:

1. Reject H0 if

   | ∑ hi Ȳi· | / [ s² ∑ hi²/ni ]^{1/2} > t_ν^{(α/2m)}
2. or, equivalently, calculate each p-value as usual, but then reject
only if p < α E /m, where α E is the experiment-wise type I error
we are willing to allow.
3. adjusted p-value = min( p-value × m, 1)
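Approach 3 is what R's p.adjust() implements; a sketch with hypothetical unadjusted p-values, plus the adjusted critical t value used for the intervals:

p.raw <- c(0.026, 0.004, 0.41)             # hypothetical unadjusted p-values, m = 3 tests
p.adjust(p.raw, method = "bonferroni")     # each multiplied by 3 and capped at 1

qt(1 - 0.05 / (2 * 3), df = 16)            # two-sided critical value for m = 3, nu = 16 (about 2.67)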
Tukey’s Method
Tukey was an American mathematician, known for developing the
box plot, the Fast Fourier Transform algorithm and for coining the
term ‘bit’ (binary digit).
Tukey’s method is based on the Studentised Range which is defined
as follows and is another approach to control the overall type I
error rate when performing multiple comparisons.
Let X1, . . . , Xk be independent N(µ, σ²) variables. Let s² be an unbiased estimator of σ² with ν degrees of freedom such that νs²/σ² ∼ χ²_ν. Let R = max(Xi) − min(Xi), the range of the {Xi}. The Studentised range is the distribution of q = R/s. The parameters of the Studentised range are k (the number of values Xi) and ν (the degrees of freedom of s²). The upper α point of q is denoted by q_{k,ν}^α, i.e. P(q ≥ q_{k,ν}^α) = α (see Tables 2 and 3).
Tukey’s method is used when we want to make pairwise
comparisons between a treatment means (µi − µ j ). It corrects
confidence intervals and hypothesis tests for all these possible
pairwise comparisons.
Let’s say we are comparing a total of a treatment means. We
assume that we have the same number of observations per
treatment/group,
n, say. The appropriate standard error will be
q
s2
n,
the SE for a treatment mean. Under the null hypothesis of no
differences between means
P
√
R
s2 /n
≥ qαa,ν
=α
and, by implication, the probability that any difference, under H0 ,
exceeds this threshold is at most α, i.e. at most α × 100% of any
pairwise differences are expected to exceed the α threshold of the
studentised range distribution.
To construct confidence intervals, let ∑ hi αi be a contrast of the form L = Ȳi· − Ȳj· (with the sum of the positive hi's = 1 and ∑ hi = 0). Then a confidence interval adjusted for all possible pairwise comparisons is

L̂ ± q_{a,ν}^α s/√n

The overall experiment-wise type I error will be α (see Tables 2 and 3 in Appendix). Here s² = MSE with ν degrees of freedom and q_{a,ν}^α is the α percentage point of the Studentised range distribution.

The (Tukey-adjusted) p-value is calculated as

P( q_{a,ν} > L̂ / (s/√n) )
1. Using Tukey’s method we can construct as many intervals as we
please, either before or after looking at the data. The method
allows for all possible contrasts to be examined.
2. All intervals have the same length, which is 2 q_{a,ν}^α s/√n.
3. Tukey’s method gives shorter intervals for pairwise comparisons
(compared to Scheffé’s method), i.e. it has more power, and is
thus used almost exclusively for pairwise comparisons.
4. Tukey’s method is not as robust against non-normality as
Scheffé’s method.
5. Tukey’s method requires an equal number of observations per
group. For unequal numbers see Spötvoll et al. (1973).
6. Under the Neyman-Pearson approach, the hypothesis H0 : ∑ hi αi = 0 is rejected if

   | ∑ hi Ȳi· | / (s/√n) > q_{a,ν}^α

   or, equivalently, we can define Tukey's Honestly Significant Difference as

   HSD = q_{a,ν}^α s/√n.

   Then any two means that differ by more than this HSD are said to be significantly different.
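In R, qtukey() gives quantiles of the Studentised range, and TukeyHSD() applies the method to a fitted aov object. A sketch, reusing the speed-reading fit from earlier (note that the group sizes there are not all equal, so the intervals are then only approximate):

qtukey(0.95, nmeans = 4, df = 20)          # q for a = 4 means and nu = 20, used in HSD = q * s / sqrt(n)

fit <- aov(words ~ method, data = speed)   # speed-reading data from the earlier sketch
TukeyHSD(fit)                              # all pairwise differences with adjusted 95% intervals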
Scheffé’s Method
Scheffé’s method of correcting for multiple comparisons is based on
the F-distribution. It can be used for all contrasts ∑ hi αi . In practice
Scheffé’s method is used if many comparisons are to be done, not
all pairwise. The intervals are
∑_{i=1}^a hi Ȳi· ± ( (a − 1) F_{a−1,ν}^α )^{1/2} ( s² ∑_i hi²/ni )^{1/2}
The p-value is calculated as

P( S > L̂ / SE(L̂) )

where S is the Scheffé-adjusted reference distribution with a − 1 and ν degrees of freedom.

(Margin note: The interval is of the form L̂ ± c × SE(L̂), where c is the new critical constant and the last part is the standard error of the contrast. Also note how the factor (a − 1), with which the F-quantile is multiplied, stretches the reference distribution to the right, making the observed values less extreme.)
sta2005s: design and analysis of experiments
1. Scheffé’s method is better than Tukey’s method for general
contrasts, but intervals are longer for pairwise comparisons.
2. Intervals are longer than Bonferroni intervals, but we do not
have to specify them in advance.
3. Scheffé’s method covers all possible contrasts, but this makes the
intervals longer because protection is given for many cases of no
practical interest.
4. Robust against non-normality of data.
5. Can be used in multiple regression as well as ANOVA, in fact
anywhere where the F-test is used.
6. When the hypothesis of equal treatment means was rejected in
the ANOVA, there will be at least one significant difference
among all possible contrasts. No other method has this property.
7. Under the Neyman-Pearson paradigm, to test H0 : ∑ hi αi = 0, reject if

   | ∑ hi Ȳi· | / ( s² ∑ hi²/ni )^{1/2} > ( (a − 1) F_{a−1,ν} )^{1/2}
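A sketch of a Scheffé-adjusted interval computed by hand in R (all numbers hypothetical: a = 4 treatments, n = 6 per group, MSE = 4.5 on ν = 20 df, contrasting treatment 1 with the average of the other three):

means <- c(12.0, 15.1, 10.8, 13.9)              # hypothetical treatment means
a <- 4; n <- 6; MSE <- 4.5; nu <- 20
h <- c(1, -1/3, -1/3, -1/3)                     # treatment 1 vs average of 2, 3, 4

L.hat <- sum(h * means)
se.L  <- sqrt(MSE * sum(h^2) / n)
crit  <- sqrt((a - 1) * qf(0.95, a - 1, nu))    # Scheffe critical constant
L.hat + c(-1, 1) * crit * se.L                  # Scheffe-adjusted 95% interval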
Example: Strength of Welds
In this experiment the strengths of welds produced by four
different welding techniques (A, B, C, D) were compared. Each
welding technique was used to weld five pairs of metal plates in a
completely randomized design. The average strengths were:
Technique:    A    B    C    D
Mean:        69   83   75   71
The estimate of experimental error variance for the experiment was
MSE = 15 with 16 degrees of freedom.
We are going to use all three methods to control the Type I error
rate on this example, although in practice one would probably only
use one of them, depending on the type of contrasts.
Suppose we had planned 3 a-priori contrasts: compare every
technique to C, maybe because C is the cheapest welding technique,
and we will only start using another technique if it yields
considerably stronger welds.
These are pairwise comparisons, but we can still use Bonferroni’s
method. We are going to assume that the number of replicates for
each treatment was the same, i.e. 5 (check this).
We could approach this by constructing a confidence interval for
each contrast/difference. For example, a 95% confidence interval
for the difference between techniques A and C:
69 − 75 ± t_16^{1−α/(2×3)} × √(2 × 15/5) = −6 ± 2.67 × √6 = [−12.54, 0.54]
Check that you agree with the standard error, and that you can find
the critical value in table A.8 (appendix to notes).
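Here is a minimal R sketch (not part of the original notes) of the same
calculation, using only the summary quantities given above (MSE = 15 on
16 df, n = 5 replicates, 3 planned contrasts):

----------------------------------------------
mse <- 15; df <- 16; n <- 5; k <- 3       # MSE, error df, replicates, number of planned contrasts
se    <- sqrt(2 * mse / n)                # standard error of a difference between two means
tcrit <- qt(1 - 0.05 / (2 * k), df)       # Bonferroni-adjusted critical value, approx. 2.67
(69 - 75) + c(-1, 1) * tcrit * se         # approx. [-12.54, 0.54]
----------------------------------------------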
The above confidence interval tells us that we estimate the true
difference in average weld strength to lie between -12.54 and 0.54
units. Most of the interval is on the negative side, indicating that
there is some evidence that A produces weaker welds, but there is a
lot of uncertainty, and we cannot exclude the possibility that there
is not actually a difference.
Confidence intervals provide much more information than
hypothesis tests, and you should always prefer presenting
information as a confidence interval rather than as a hypothesis test
if possible. But, as an exercise, let’s do the hypothesis test also, first
using the Neyman-Pearson approach (bad) and then Fisher’s
approach (p-value, much better).
H0 : µ A = µC
H1 : µ A ̸= µC although, in this case we might want to test
H1 : µ A > µC (we would get the critical value from the t-tables:
t0.05/3 = t0.017 = 2.318 (qt(0.017, 16, lower.tail = F))).
tobs = −6 / √(2 × 15/5) = −2.449
For the two-sided test we reject if |tobs | > t0.05/(2×3) = 2.67 (Table
A.8).
Here, we cannot reject the null hypothesis, i.e. there is no evidence
that techniques A and C differ in terms of average weld strengths.
But note that this does NOT mean that they are the same in
strength. We just don’t know, and might need an experiment with
greater power to find out.
To find the p-value we would use R: 2 * pt(-2.449, 16) =
2 × 0.013 = 0.026. This is the uncorrected p-value. To correct we
multiply by 3 (Bonferroni correction): p = 0.078. Now that we have
the exact (adjusted) p-value, it seems that there is possibly a
difference in weld strength between techniques A and C (little, but
not no evidence). This example shows what happens when
adjusting p-values (controlling the Type I error rate). What you can
conclude about techniques A and C really depends on whether this
was an exploratory or confirmatory study, and what you already
know about these two techniques. In any case, the p-value gives
you a much better understanding of the situation than the
Neyman-Pearson approach above.
Bonferroni’s correction ensures that the overall probability of a
Type I error, over the 3 tests conducted, is at most 0.05, but note that
the probability of a Type I error for each individual test (or
confidence interval) is much smaller, namely 0.05/3 = 0.017.
If we had wanted to make all pairwise comparisons between the
treatment means, and wanted to control the experiment-wise Type I
error rate, we would use Tukey’s method. Note that we could also
make all pairwise comparisons without controlling Type I error,
and therefore not use Tukey’s method, i.e. Tukey’s method just
refers to a way of controlling Type I error rates for the particular
case of pairwise comparisons.
Tukey’s method is based on the distribution of the maximum
difference under the null hypothesis that all means come from the
SAME normal distribution, i.e. that there are no differences
between treatments. It uses the studentized range distribution,
which gives the density of the maximum difference (studentized
range). In other words, it defines how often the maximum
difference (under H0) will exceed a certain threshold. For example,
if the maximum difference exceeds a value c only with a 5%
probability, that tells us that 95% of the time the maximum
observed difference (in a sample of size a, under H0) should be less
than c, and hence the probability that ANY difference should be
less than c is 0.95, and, voila, we have fixed the experiment-wise
type I error at a maximum of 5%.
The weld example had 4 treatments. This makes (4 choose 2) = 6 possible
different pairwise comparisons. There is a trick to do this, which
unfortunately only works for the Neyman-Pearson approach: Sort
the means from smallest to largest; calculate the HSD (honestly
significant difference); any difference between any two means
which exceeds the HSD is declared ‘significant’.
Technique:    A    D    C    B
Mean:        69   71   75   83

HSD = q^{0.05}_{4,16} × √(15/5) = 4.05 × √3 = 7.01
Note: The √(15/5) part here is NOT the standard error of a
difference, but refers to the standard deviation of a mean (in the
studentized range distribution, the standard deviation of the values
being compared). The critical value can be found in Table A.3 (see
the notes below Table A.3 to help you understand what the rows
and columns refer to).
We now go back to our sorted means, and draw a line under means
which are NOT significantly different:
Technique:    A    D    C    B
Mean:        69   71   75   83
             ______________
no sign. diff. between A and D, and A and C;
B is sign. diff. from C, and hence from all others.
This can be interpreted as follows: There is evidence that B is
stronger than A, D, and C, but there is no evidence that A, D and C
differ in mean weld strength. It is unlikely (< 0.05) that even the
largest difference in a sample of means would exceed 7.01 under
H0, so we now have fairly strong evidence that B produces stronger
welds than the other 3 welding techniques.
If our means had been slightly different (C = 77), this whole
procedure would change to:
Technique:    A    D    C    B
Mean:        69   71   77   83
             _______
                  _______
                       _______
This can then be interpreted as follows: There is evidence that B is
stronger than A and D, but not C, C is stronger than A, but not D,
and no evidence that A and D differ in strength.
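For completeness, here is a minimal R sketch (not part of the original
notes) of the Tukey calculation, using only the summary quantities given
above; the data frame 'welds' in the comments is hypothetical:

----------------------------------------------
mse <- 15; df <- 16; n <- 5; a <- 4
hsd <- qtukey(0.95, nmeans = a, df = df) * sqrt(mse / n)   # approx. 4.05 * sqrt(3) = 7.01
means <- c(A = 69, D = 71, C = 75, B = 83)
abs(outer(means, means, "-")) > hsd      # TRUE marks pairs differing by more than the HSD
# with the raw data, TukeyHSD(aov(strength ~ technique, data = welds)) would give
# adjusted intervals and p-values directly
----------------------------------------------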
Lastly, we will briefly use Scheffé’s method to construct a
confidence interval for the difference between B and the other 3
techniques. Scheffé’s method for controlling Type I error rates is
used when we have lots of contrasts (explicit or implicit), and they
are not only the pairwise comparisons, but may be more
complicated, as the one above. We wouldn’t use Scheffé’s method if
the contrast above was the only contrast we were going to test, but
usually, once you have invested time and money into conducting an
experiment, you might as well try and get all the information you
can from it (even if you only use the results to generate hypotheses
for a future experiment).
Scheffé's method is very similar to a t-test, except that the critical
value is not from a t-distribution but is

√( (a − 1) F^α_{a−1,ν} )
where a is the total number of treatment means in the experiment,
and ν is the error degrees of freedom. The factor ( a − 1) has the
effect of shifting the critical region to the right, i.e. reducing type I
error of the individual test, but also reducing power of the
individual test.
A 95% confidence interval for the difference between B and the
average of the other 3 techniques is found as follows:
L̂ ± c × SE(L̂)

Let's first find the standard error, the square root of the variance of the contrast:

Var(L̂) = Var( µ̂_B − (µ̂_A + µ̂_D + µ̂_C)/3 ) = 15/5 + (1/9)(3 × 15/5) = 4

Then a 95% confidence interval:

(83 − (69 + 71 + 75)/3) ± √( (4 − 1) F^{0.05}_{4−1,16} ) × √4
= 11.33 ± √(3 × 3.24) × 2
= [5.095; 17.57]
The F-value is from the usual F table (Table A.6). a is the total
number of treatment means involved in the contrasts, and ν is the
error degrees of freedom (from the ANOVA table).
The above confidence interval tells us that mean weld strength with
technique B is estimated to be between 5.1 and 17.6 units stronger
than the average strength of the other 3 techniques. From this
confidence interval we can learn how big the difference is, and how
much uncertainty there is about this estimate. The whole range of
the confidence interval is positive, which indicates that B results in stronger
welds than the other 3 techniques on average. This confidence
interval gives us the same conclusion as pairwise comparisons did
above, it is just answering a slightly different question. In other
words, the contrast should depend predominantly on the question
you want to answer.
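A short R sketch (not part of the original notes) of the same Scheffé
interval, using the summary quantities above:

----------------------------------------------
mse <- 15; df <- 16; n <- 5; a <- 4
h     <- c(A = -1/3, B = 1, C = -1/3, D = -1/3)   # contrast: B versus the average of A, C, D
means <- c(A = 69,   B = 83, C = 75,  D = 71)
L    <- sum(h * means)                            # 11.33
seL  <- sqrt(mse * sum(h^2) / n)                  # 2
crit <- sqrt((a - 1) * qf(0.95, a - 1, df))       # sqrt(3 * 3.24)
L + c(-1, 1) * crit * seL                         # approx. [5.10, 17.57]
----------------------------------------------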
4.6 Multiple Comparison Procedures: The Practical Solution
The three methods discussed above all control for the significance
level, α, and protect against overall Type I error. Controlling the
significance level causes some reduction in power and increases the
Type II error (accepting at least one false hypothesis). Very little
work has been done on multiple comparison methods that protect
against Type II errors. A
practical way to increase power (i.e. lower the probability of
making a Type II error) is by raising the significance level, e.g. use
α = 10%.
The paper by Saville (Saville, DJ. 1990. Multiple Comparison
Procedures: The Practical Solution. The American Statistician 44:
174–180. http://www.jstor.org/stable/2684163), provides a good
overview of the issues in multiple and unplanned comparisons, and
gives practical advice on how to proceed. We recommend Saville’s
approach for this course and also for your future work as a
statistician.
The problem of multiple testing does not only occur when
comparing means from experimental data. It also occurs in
stepwise procedures when fitting regression models, or whenever
comparing a record to many others, e.g. DNA records in
criminology, medical screening for a disease, and many other areas.
Behind all of this is uncertainty, randomness in data. Humans want
certainty, but this can never be achieved from a single experiment,
or a single data point or even a large single data set, because of the
inherent nature of data; we can approach certainty only by
accumulating more and more evidence.
Using a Neyman-Pearson approach to hypothesis testing can make
the problem worse: By being forced to make a decision, one
automatically makes an error every now and then, and these errors
accumulate as more decisions are made.
One should take a sceptical approach and never treat results from
a single experiment or data set as conclusive evidence. Always keep
in the back of your mind how many tests were done and that some
of the small p-values will be spurious results that happened
because of some chance outcome in the particular data set. Discuss
your results in this light, including the above warning. Especially in
the case where we look for patterns or interesting differences
(unplanned comparisons), i.e. outcomes we didn’t expect, we
should remain sceptical until further experiments confirm the same
effect. In other words, interesting results found in an exploratory
study, can at most generate hypotheses that need to be corroborated
by replicating results.
On the other hand, when we have an a-priori hypothesis that we
can test with our data, we take a much greater step in the process
towards knowing if an effect is real or not: therefore the importance
of a-priori hypotheses (planned before having seen the data).
4.7 Summary
You now know how to answer specific questions about how
treatments compare, and how to interpret these results considering
the problem of multiple comparisons and the plausibility of the
null hypotheses you are testing. This is practically the most
important part of the experiment: HOW do the treatments differ,
and what can you actually learn and say about the treatments
(considering all the uncertainty involved in data and statistical tests
and estimates).
The methods in this chapter do not only apply to completely
randomized designs, but to all designs. The aim of most
experiments is to compare treatments.
One situation where you would not compare treatments in the way
we have done above, is where the levels of the treatments are levels
of a continuous treatment factor, e.g. you have measured the
response at temperature levels 20, 40, 60, 80 degrees Celsius. It
would not make sense to do all pairwise comparisons here. Rather
you would want to know how the response changes as a
(continuous) function of temperature. This is done using special
contrasts called orthogonal polynomials.
4.8 Orthogonal Polynomials
In experiments, the treatments are often levels of a quantitative
variable, e.g. temperature, or amount of fertilizer. In such a case
one might be interested in how the response changes with
increasing X, similarly as we do in linear regression. Suppose the
levels are equally spaced, such as 10◦ , 20◦ , 30◦ in the case of
temperatures, and that there is an equal number of replications for
each treatment. The mean response Y may be plotted against X, and
we may wish to test if there is a linear or quadratic or cubic
relationship between them. These relationships can be described by
polynomials (linear, quadratic, third order, etc.). For the analysis of
experiments this is often done using orthogonal polynomials.
In regression you will have come across using polynomial terms to
account for non-linear relationships. Orthogonal polynomials have
the same purpose as polynomial regression terms. They have the
advantage of being orthogonal, which means that the terms are
independent. This avoids problems of collinearity, and allows us to
identify exactly which component(s) of the polynomial are
important in describing the relationship.
The hi coefficients to construct these orthogonal polynomials can be
found in Table 4.1. They are used to define linear, quadratic, cubic
polynomial contrasts in the treatment means. We can test for the
presence of each of these contrasts using an F-test, similarly as for
orthogonal contrasts. In effect they create orthogonal contrasts
which test for the presence of a linear component, a quadratic
component, etc..
If the treatment levels are not equally spaced or the number
of observations differs between treatment levels, the table cannot be
used, but there is a regression approach that achieves exactly the
same: splitting the relationship into linear, quadratic, etc. components.
The main objective in using orthogonal polynomials is to find the
lowest possible order polynomial which adequately describes the
relationship between the treatment factor and the response.
Table 4.1: Orthogonal Polynomial Coefficients.
(λ is the factor used to convert the coded coefficients to regression
coefficients; we will not be using it.)

No. of                   Ordered Treatment Number
Levels  Order    1    2    3    4    5    6    7    8   Divisor     λ
  3       1     -1    0   +1                               2        1
          2     +1   -2   +1                               6        3
  4       1     -3   -1   +1   +3                         20        2
          2     +1   -1   -1   +1                          4        1
          3     -1   +3   -3   +1                         20      10/3
  5       1     -2   -1    0   +1   +2                    10        1
          2     +2   -1   -2   -1   +2                    14        1
          3     -1   +2    0   -2   +1                    10       5/6
          4     +1   -4   +6   -4   +1                    70     35/12
  6       1     -5   -3   -1   +1   +3   +5               70        2
          2     +5   -1   -4   -4   -1   +5               84       3/2
          3     -5   +7   +4   -4   -7   +5              180       5/3
          4     +1   -3   +2   +2   -3   +1               28      7/12
  7       1     -3   -2   -1    0   +1   +2   +3          28        1
          2     +5    0   -3   -4   -3    0   +5          84        1
          3     -1   +1   +1    0   -1   -1   +1           6       1/6
          4     +3   -7   +1   +6   +1   -7   +3         154      7/12
  8       1     -7   -5   -3   -1   +1   +3   +5   +7    168        2
          2     +7   +1   -3   -5   -5   -3   +1   +7    168        1
          3     -7   +5   +7   +3   -3   -7   -5   +7    264       2/3
          4     +7  -13   -3   +9   +9   -3  -13   +7    616      7/12
Example: If we have 4 treatments we can construct 3 orthogonal
polynomials, and can thus test for the presence of a linear,
quadratic and cubic effect. We would construct 3 orthogonal
contrasts as follows:
L1 = −3Ȳ1. − 1Ȳ2. + 1Ȳ3. + 3Ȳ4.
L2 = +1Ȳ1. − 1Ȳ2. − 1Ȳ3. + 1Ȳ4.
L3 = −1Ȳ1. + 3Ȳ2. − 3Ȳ3. + 1Ȳ4.
L1 is used to test for a linear relationship, L2 for a quadratic effect,
etc. (Order 1 = linear; order 2 = quadratic; order 3 = cubic; order 4 =
quartic).
For calculating the corresponding sums of squares: Denote the
coefficients by k_ij (obtained from the table above). The divisor for
calculating sums of squares is D_i = ∑_{j=1}^{p} k_ij² (the dot product of
the coefficients with themselves, also given in the table). SS_i = n L_i² / D_i
is the sum of squares associated with the ith order term, with n the
number of observations per treatment. The test statistic is F = SS_i / MSE.
This tests H0 : Li = 0.
Figure 4.1: The first six orthogonal polynomial basis functions
(relative weight plotted against x).
Suppose we want to test for the presence of a linear effect. Then
H0 : Llinear = 0, i.e. no linear effect is present. A large L̂linear
together with a small p-value suggests the presence of a linear effect.
Often, we want to find the lowest order polynomial required to
adequately describe the relationship. This is a principle in statistical
modelling called parsimony: find the simplest model with the fewest
variables and assumptions which has the greatest explanatory
power.
For this we investigate ‘lack of fit’ after fitting each term
sequentially, to check whether we need any more higher-order
terms.
1. Compute SSlin (for the linear effect)
2. Compute SS LOF = SS A − SSlin with ( a − 1) − 1 df. LOF = lack of
fit. SS LOF measures the unexplained variation (lack of fit) after
having fitted the linear term.
3. Compare SS_LOF to MSE:

   F = [ SS_LOF / (a − 2) ] / MSE ∼ F_{a−2,ν}
If there is evidence of lack of fit (large F, small p), we need more
terms (add another term in the polynomial). If not, stop.
4. Compute SSquad .
5. Compute SS LOF = SS A − SSlin − SSquad with ( a − 1) − 2 df.
6. Test lack of fit etc.
Example
The following data are from an experiment to test the tensile
strength of a cotton fibre used to manufacture men’s shirts. Five
different qualities of fibre are available with percentage cotton
contents of 15%, 20%, 25%, 30% and 35% and five measurements
were obtained from each type of fibre.
The treatment means and ANOVA table are given below:
cotton %     15     20     25     30     35
mean        9.8   15.4   17.6   21.6   10.8
---------------------------------------------
             Df  Sum Sq  Mean Sq  F value   Pr(>F)
cotton        4     476    118.9     14.8  9.1e-06
Residuals    20     161      8.1
---------------------------------------------
Percentage cotton has a significant effect on strength. We now
partition the treatment sum of squares into linear, quadratic and
cubic effects using the coefficients table:
                        % Cotton
           15     20     25     30     35      D      Li     SSi       F        p
mean      9.8   15.4   17.6   21.6   10.8
.L         -2     -1      0      1      2     10     8.2    33.6    4.15    0.055
.Q          2     -1     -2     -1      2     14     -31     343   42.35  < 0.001
.C         -1      2      0     -2      1     10   -11.4      65    8.02    0.010
.4          1     -4      6     -4      1     70   -21.8    33.9    4.19    0.054
                                                        sum:    476
The null hypothesis for the .L line (linear effect) is H0 : there is no
linear component in the relationship between % cotton and
strength.
1. H0 : no linear effect of % cotton, using F = SS_lin / MSE = 4.15.
Comparing with F_{1,20}, we conclude that there is some evidence for a
linear effect of % cotton (p = 0.055).
2. To test whether we need any higher order terms other than a
linear term, we do a lack of fit test:
H0 : All higher order terms are zero:
we find
SS LOF = SS A − SSlin = 476 − 33.6 = 442.4 with 3 df
These are the unexplained sums of squares, all that the linear
term cannot explain. The F-statistic is

F = MS_LOF / MSE = (442.4 / 3) / 8.1 = 18.21 ∼ F_{3,20}

giving p < 0.001. There is strong evidence for lack of fit. So
we add a quadratic term, and repeat.
Here is a table that summarises the lack-of-fit tests. The null
hypothesis tested in each line is that there is no lack of fit after
having added the corresponding term (and all preceding), i.e. all
higher order contrasts equal 0.
-----------------------------------------
                SS.lof  df.lof   F.lof  p.lof
linear LOF     442.140       3  18.285  0.000
quadratic LOF   98.926       2   6.137  0.008
cubic LOF       33.946       1   4.212  0.053
quartic LOF      0.000       0     NaN    NaN
-----------------------------------------
3. We need a linear, quadratic and cubic term. We always keep all
lower-order terms in the model! There is little evidence that the
quartic effect improves the fit, so for simplicity (parsimony) we
prefer the simpler cubic model. Note that the cubic relationship
(regression equation) is a linear combination of the intercept,
linear, quadratic and cubic term. To describe the relationship
between tensile strength and percentage cotton content we use a
cubic polynomial (see fitted curve in Figure 4.2).
Figure 4.2: Tensile strength of cotton fibre with respect to percentage
cotton. Dots are observations. The line is a fitted cubic polynomial curve.
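The same partitioning can be obtained directly in R. The sketch below is
not from the notes and assumes the raw data sit in a (hypothetical) data
frame 'fibre' with columns 'strength' and 'cotton' (five replicates at
each of 15, 20, 25, 30, 35 % cotton):

----------------------------------------------
fibre$cotton <- factor(fibre$cotton)         # 5 equally spaced, ordered levels
contrasts(fibre$cotton) <- contr.poly(5)     # orthogonal polynomial contrasts
fit <- aov(strength ~ cotton, data = fibre)
# split the 4-df treatment SS into linear, quadratic, cubic and quartic components
summary(fit, split = list(cotton = list(L = 1, Q = 2, C = 3, quartic = 4)))
----------------------------------------------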
4.9 References
1. Abdi, H. and Williams, L. 2010. Contrast Analysis. In: Salkind
N. (Ed.), Encyclopedia of Research Design. Sage. (This is a very
good overview of contrasts).
https://www.utd.edu/~herve/abdi-contrasts2010-pretty.pdf.
2. Miller, R.G (Jnr.) (1981). Simultaneous Statistical Inference. 2nd
edition. Springer.
3. Dunn, O.J. and Clark, V. (1987). Applied Statistics: Analysis of
Variance and Regression. Wiley.
4. O'Neill, R. and Wetherill, G.B. (1971). The Present State of
Multiple Comparisons. Journal of the Royal Statistical Society
Series B, 33, 218–244.
5. Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison
Procedures. John Wiley and Sons.
6. Petersen, R. (1986). Design and Analysis of Experiments. Marcel
Dekker.
7. Spjøtvoll, E. and M. R. Stoline (1973). An Extension of the
T-Method of Multiple Comparison to Include the Cases with
Unequal Sample Sizes. Journal of the American Statistical
Association 68, 975–978.
8. Ruxton, G. D. and G. Beauchamp (2008). Time for some a priori
thinking about post hoc testing. Behavioral Ecology 19, 690–693.
9. Tukey, J. W. (1949). Comparing individual means in the analysis
of variance. Biometrics 5, 99–114.
10. Scheffé, H. (1953). A method for judging all contrasts in the
analysis of variance. Biometrika 40, 87–104.
11. Scheffé, H. (1959) The Analysis of Variance. Wiley.
12. Saville, D.J. (1990) Multiple comparison procedures: the practical
solution. The American Statistician 44, 174–180.
5
Randomised Block and Latin Square Designs
We have seen how to analyse data from a single factor completely
randomised design, using a one-way ANOVA. Completely
randomized designs are used when the experimental units are
homogeneous or similar. In this chapter we will look more closely
at designs which have used blocking factors (one or two), but still a
single treatment factor. Recall that blocking is done to reduce
experimental error variance. This is done by separating the
experimental units into blocks of similar (homogeneous) units. This
makes it possible to account for the differences between the blocks,
thereby reducing the experimental (remaining) error variance. Any
differences in experimental units which are not blocked for or
measured will end up in the error variance.
Natural blocks may be:
1. A day’s output on a machine, a batch of experimental material.
2. Age or sex of subjects in the experiment.
3. Animals from the same litter; people from same town.
4. Times at which observations are made.
5. The positions of the experimental units when they occur along a
gradient in spatial settings (e.g. from light to dark, or from lots
of nutrients to few nutrients).
The experimental units within a block should be homogeneous so
that ideally the only thing that can affect the response will be the
different treatments. The treatments are assigned at random to the
units within each block so that a given unit is equally likely to
receive any of the treatments. Randomization minimizes the effects
of other factors that may influence the result, but which may not
have been blocked out. One does not usually test for block
differences - if the blocking was successful, the F statistic will be
greater than 1.
Typical blocking factors include age, sex, material from the same
batch, time (e.g. week day, year), spatial gradients.
Ideally (easiest to analyse and interpret results), we would have
each treatment once in every block. This means that each block has
the same number of experimental units. Often it is worth choosing
the experimental units in such a way that we can have such a
complete randomized block design.
Randomization is not complete but restricted to each block. This
means we randomly assign the a treatments to the a experimental
units in block 1, then randomize in block 2, etc..
The main purpose of blocking is to reduce experimental error
variance (unexplained variation). This increases power and
precision of estimates. If there is lots of variation between blocks,
this variation, that would otherwise end up in the experimental
error variance, is now absorbed into the block effects (variation due
to blocks). Therefore, experimental error variance can be reduced
considerably if there are large differences between the blocks.
If there are only small differences between blocks, error variance
will not decrease very much. Additionally we will lose error
degrees of freedom, and may end up with less power. So it is
important to consider carefully at the design stage whether blocks
are necessary or not.
Randomized block designs are used when the experimental units
are not very homogeneous (similar). Similar experimental units are
grouped together in blocks.

Here is an example that could represent an agricultural experiment.
Typically in agriculture one needs to use fields that are not
homogeneous. For example, one side is higher up, has fewer
nutrients, or is less water-logged. Agricultural experiments almost
always use designs with blocks.

The experimental units on the light blue side of the field are more
similar and thus grouped into a block, the experimental units
(plots) on the dark blue side are grouped together. Experimental
units within blocks are similar (homogeneous), but differ between
blocks.

Figure 5.1: Randomized Block Design (plots along a gradient, with
treatments A–E randomised within blocks).

Model

Yij = µ + αi + β j + eij ,   eij ∼ N(0, σ²)

∑ αi = ∑ β j = 0

β j is the effect of block j, i.e. the change in mean response with
block j relative to the overall mean. Again, the (identifiability)
constraints are required because the model is over-parametrized;
they ensure unique estimates.
There are two subscripts: i refers to the treatment, j refers to the
block (essentially the replicate). The block effect is defined exactly
like we defined the treatment effect before: the difference between
the block mean and the overall mean, i.e. the change in
average/expected response when the observation is from block j
relative to the overall mean.
E(Yij) = µ + αi + β j
i.e. the observation from treatment i and block j is made up of an
overall mean, a treatment i effect and a block j effect (and the effect
of treatment i does not depend on what block we have, i.e. the
effects are additive = there is no interaction between the blocking
and the treatment factor).
If we take µ over to the LHS of the equation, the deviation of the
observed value from the overall mean equals the sum of 3 effects
(or deviations from a mean): treatment effect, block effect and
experimental unit effect (or error term).
If we assume that the effect of treatment i is the same in every
block, then the average (over all blocks j) of the Yij − Ȳ.j deviations
will give us an estimate of the effect of treatment i. If we cannot
make this assumption, i.e. the effect of treatment i depends on the
block, or is different in every block (there is an interaction between
the blocking and the treatment factor), then the treatment effect in
block j is confounded with the error term of the observation
(because there is no replicate of treatment i in block j).
The above is the central, crucial idea to understanding how
randomized block designs work, how we get the estimates and why
the no interaction assumption plays a role.
Here is another way to look at it. Assume we have just two blocks
and 2 treatments.
Assume the effect of treatment 2 is the same in every block (here it
slightly increases the response). Then we can estimate α2 by taking the
average effect of treatment 2 (relative to the block means). It turns out
that we could also get α̂2 from Ȳ2· − Ȳ·· , because ∑ β j = 0.
Also look at a sketch of a data table:
If you sum the values in the first column (treatment 1), you would
be estimating (µ + α1 + β 1 ) + . . . + (µ + α1 + β b ) = bµ + bα1 + 0
(because ∑ β j = 0, see the identifiability constraint in the model).
And the column mean would be estimating µ + α1 . Therefore, we could
estimate αi by Ȳi· − Ȳ·· , as before. Try the same for row one.
Table 5.1: Sketch of data table for randomized block design.

                      treatment
block        1     2     3    ...     a    | mean
1                                          | Ȳ.1
2                                          | Ȳ.2
...                                        |
b                                          | Ȳ.b
mean       Ȳ1.   Ȳ2.         ...    Ȳa.    | Ȳ..
In this we are assuming that αi is the same in every block. αi
essentially gives us an average estimate of how the response
changes with treatment i. With no replication of treatments within
blocks, this is all we can do. We CANNOT estimate a separate
effect of treatment i for every block.
So, when we use a randomized block design, we need to make an
important assumption, namely that the effect of treatment i, αi , is
the same in every block. Technically we refer to this as there is no
interaction between the treatment and the blocking factors, or the effects
are additive.
What happens if block and treatment factors DO interact? The
model would still need to make the assumption of no interaction,
our estimates would still calculate average effects, but this might
not be very meaningful, or not a very useful description of what
happens if we use treatment i in block j. Also, the residuals and
thus the experimental error variance might become quite large,
because the observations deviate quite a bit from the average
values.
Additivity of block and treatment effects is therefore another thing
we need to check, in addition to the previous normal, equal
variance residual checks. A good way to check for this is through
an interaction plot.
Sums of Squares and ANOVA
Yij = µ + αi + β j + eij
Again start with the model. Take µ to the left hand side. Then all
terms on the RHS are deviations from a mean: treatment means
around overall mean, block means around overall mean,
observations around µ + αi + β j .
As we did for CRD, we can substitute observed values, square both
sides and sum over all observations to obtain
SStotal = SStreatment + SSblocks + SSerror
with
ab − 1 = ( a − 1) + (b − 1) + ( a − 1)(b − 1)
degrees of freedom, respectively. The error degrees of freedom can
just be calculated from the rest: ( ab − 1) − ( a − 1) − (b − 1).
Note that SSblocks = ∑ ∑(Ȳ.j − Ȳ.. )2 , where Ȳ.j denotes the mean in
block j.
Check that you can see why
rij = Yij − α̂i − β̂ j − µ̂
and
SSE =
∑ ∑(Yij − Ȳ·j − Ȳi· + Ȳ·· )²
When we have data from a randomized block design we do
not have a choice about which terms need to be in the model: the
design dictates them.
But just to illustrate how the blocks reduce the SSE, compare the
model with block effects
SStotal = SStreatment + SSblocks + SSerror
to a model we would use for a single-factor CRD (essentially
ignoring that we actually had blocks):
SStotal = SStreatment + SSerror
The SStotal and SStreatment will be exactly the same in both models.
If we add block effects the SSE is reduced, but so are its degrees of
freedom. MSE will only become smaller if the reduction in SSE is
large relative to the number of degrees of freedom lost.
Usually we are not interested in officially testing for block effects.
Actually the F-test is not quite valid for block effects because the
blocks were not randomly assigned to experimental units. If we want to
test for differences between blocks, we can use the F-test, but
remember that we cannot make causal inference about blocks. If we
are only interested in whether blocks have reduced the MSE, we
can look at the F-value for blocks: blocking has reduced the MSE iff
F > 1 (iff = if and only if).
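As an illustration, one could fit both models in R and compare them. This
sketch is not from the notes; 'dat', 'y', 'block' and 'treatment' are
hypothetical names:

----------------------------------------------
m.rbd <- aov(y ~ block + treatment, data = dat)   # analysis with block effects
m.crd <- aov(y ~ treatment, data = dat)           # same data, blocks ignored
summary(m.rbd)   # SS(treatment) is identical in both; SSE and its df are smaller here
summary(m.crd)   # blocking paid off iff the MSE from m.rbd is smaller than from m.crd
----------------------------------------------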
Example
Executives were exposed to one of 3 methods of quantifying the
maximum risk premium they would be willing to pay to avoid
uncertainty in a business decision. The three methods are: 1) U:
utility method, 2) W: worry method, 3) C: comparison method.
After using the assigned method, the subjects were asked to state
their degree of confidence in the method on a scale from 0 (no
confidence) to 20 (highest confidence).
Table 5.2: Layout and randomization for premium risk experiment.

                              Experimental Unit
Block                           1    2    3
1 (oldest executives)           C    W    U
2                               C    U    W
3                               U    W    C
4                               W    U    C
5 (youngest executives)         W    C    U
You can see that the experimenters blocked for age of the
executives. This would have been a reasonable thing to do if they
expected, for example, lower confidence in older executives, i.e.
different response due to inherent properties of the experimental
units (which here are the executives).
We have a randomized block design with blocking factor age,
treatment factor method of quantifying risk premium, response =
confidence in method. The executives in one block are of a similar
age. If the experiment was conducted correctly, the three methods
were randomly assigned to the three experimental units in each
block.
Here is the ANOVA table. NOTE: one source of variation (SS) for
every term in the model!
------------------------------------------------
> m1 <- aov(rate ~ block + method)
> summary(m1)
            Df   Sum Sq  Mean Sq  F value     Pr(>F)
block        4  171.333    42.83    14.37  0.0010081
method       2  202.800   101.40    34.03  0.0001229
Residuals    8   23.867     2.98
------------------------------------------------
We are mainly interested in the treatment factor method. We can use
the ANOVA table to test whether method of quantifying risk affects
confidence. For this test we set up H0 : α1 = α2 = α3 = 0 (method
has no effect on confidence). The result of this test suggests that
average confidence differs with different methods (p = 0.0001).
What about the block effects (age)? There is evidence for differences
in confidence between the age groups (p = 0.001). And because the
F-value is much larger than 1, we know that we haven’t wasted
degrees of freedom, i.e. by blocking for age we have been able to
reduce experimental error variance, and thus to increase power to
detect the treatment effects. When interpreting block effects we are
only allowed to talk about differences, not about causal effects! This
is because we have not randomly assigned age to the executives,
and age could be confounded with many other unknown factors.
We can talk about association, but not causation.
Is it reasonable to assume that block and treatment effects are
additive? The interaction plot can give some indication of how
acceptable this assumption is. Usually we plot the treatments on
the x-axis and each block is represented by a line or trace. On the
y-axis we show the mean response for treatment i and block j.
Because in RBDs there is mostly only a single observation for
treatment i and block j, the points shown here just represent the
observations.
If block and treatment do not interact, i.e. method and age do not
interact, the lines should be roughly parallel. Because no interaction
means that the effect of method i is the same in every block, and
also when changing from method 1 to 2 the change in mean
response should be roughly the same. HOWEVER, remember that
here the points are based on single observations, and we still expect
some variation between executives, as part of natural variation
between experimental units.
Figure 5.2: Interaction plot for the risk premium data (confidence
rating against method, one trace per block). In R:
interaction.plot(method, block, rate, cex.axis = 1.5, cex.lab = 1.5,
lwd = 2, ylab = "confidence rating"): first the factor that goes on
the x-axis, then the trace factor, then the response.
Even though the lines are not parallel here, there is no indication
that the effect of method is very different in the different blocks. We
also did not (and could not!) show that the residuals ARE normal,
we are only worried if they are drastically non-normal. Here we are
only worried if there are clear indications of interactions. There are
not, and averaging the effects over blocks would give us a
reasonable indication of what happens when the different methods
of risk assessment are used.
Moving beyond the ANOVA, we might now want to compare the
treatment means directly to find out which method results in
highest confidence, and to find out HOW BIG the differences are.
We can do this exactly as we did for CRDs. For example if we
compare two treatment means we need the standard error of a
treatment mean, and the standard error of a difference between
treatment means.
SE(Ȳi· ) = √Var(Ȳi· ) = √(2.98/5) = 0.77
This is a measure of the uncertainty of a specific mean (how close it
is to the true treatment mean, or how well it estimates the true
treatment mean). The variance of repeated observations from this
particular treatment is estimated by MSE. Each treatment mean is
based on 5 observations (one from each block). The standard error
for the difference between two means:
r
SED =
2 × MSE
=
5
r
2 × 2.98
= 1.09
5
5.1 The Analysis of the RBD
Suppose we wish to compare a treatments and have N experimental
units arranged in b blocks each containing a homogeneous
experimental units: N = ab. The a treatments, A1 , A2 , . . . A a say are
assigned to the units in the jth block at random.
Let Yij be the response to the ith treatment in the jth block. The
linear model for the RBD is:
Yij = µ + αi + β j + eij ,   i = 1 . . . a,   j = 1 . . . b
where
In general, a statistical model should
be as simple as possible, and only as
complicated as necessary (Occam’s
razor, parsimony). In ED however,
the structural part of the model is
dictated by the design, and not much
can be added or changed (apart from
interaction terms).
∑_{i=1}^{a} αi = ∑_{j=1}^{b} β j = 0

µ     overall mean
αi    effect of the ith treatment
β j   effect of the jth block
eij   random error of observation, eij ∼ N(0, σ²) and independent
This model says that the response depends on a treatment effect, a
block effect and the overall mean. It also says that these effects are
additive. In other words we now have a × b
distributions/populations corresponding to the a treatments in each
of the b blocks. The means of these a × b populations are given by:
µ + αi + β j ≡ the property of additivity (no interaction)
The variance is assumed equal in all of these populations.
What do we mean by additive effects? It means that we are
assuming that the effect of the ith treatment on the response is the
same (αi ) regardless of the block in which the treatment is used.
Similarly, the effect of the jth block is the same (β j ) regardless of the
treatment.
If the additivity assumption is not valid, the effect of treatment i
will differ depending on block. The response can then not be
described as in the model above but we need another term, the
interaction effect, which describes the difference in effect of
treatment i in block j, compared to the additive model. To be able
to estimate these interaction effects we need at least 2 replications
of each treatment in each block. In general, for randomised block
designs, we make the assumption of additivity, but then need to
check this.
Estimation of µ, αi (i = 1, 2, . . . a) and β j (j = 1, 2 . . . b)
When assuming a normally distributed error term, the maximum
likelihood and least squares estimates of the parameters are the
same and are found by minimizing
S = ∑i ∑ j (Yij − µ − αi − β j )2
Differentiate with respect to µ, αi and β j and set equal to 0:
∂S/∂µ   = −2 ∑ij (Yij − µ − αi − β j )              = 0
∂S/∂αi  = −2 ∑_{j=1}^{b} (Yij − µ − αi − β j )     = 0 ,   i = 1, . . . , a
∂S/∂β j = −2 ∑_{i=1}^{a} (Yij − µ − αi − β j )     = 0 ,   j = 1, . . . , b

Note the limits of the summation. Using the constraints we find the
a + b + 1 normal equations

abµ        = Y··
bαi + bµ   = Yi·
aβ j + aµ  = Y·j
whose unique solution is
µ̂   = Ȳ··
α̂i  = Ȳi· − Ȳ·· ,   i = 1 . . . a
β̂ j = Ȳ·j − Ȳ·· ,   j = 1 . . . b
The unbiased estimate of σ2 is found by substituting these estimates
into SSE to give
SSresidual = ∑ij (Yij − µ̂ − α̂i − β̂ j )²
           = ∑ij (Yij − Ȳ·· − (Ȳi· − Ȳ·· ) − (Ȳ·j − Ȳ·· ))²
           = ∑ij (Yij − Ȳi· − Ȳ·j + Ȳ·· )²

and

σ̂² = SSresidual / ((a − 1)(b − 1)) = ∑ij (Yij − Ȳi· − Ȳ·j + Ȳ·· )² / ((a − 1)(b − 1))
Parameter                    Point Estimate        Variance
µ                            Ȳ··                   σ²/(ab)
αi                           Ȳi· − Ȳ··             σ²(a − 1)/(ab)
β j                          Ȳ·j − Ȳ··             σ²(b − 1)/(ab)
αi − αi′                     Ȳi· − Ȳi′·            2σ²/b
∑ hi αi (with ∑ hi = 0)      ∑i hi Ȳi·             (σ²/b) ∑i hi²
µ + αi + β j                 Ȳi· + Ȳ·j − Ȳ··       σ²(a + b − 1)/(ab)
σ²                           s²
Analysis of Variance for the Randomised Block Design
The model is
Yij = µ + αi + β j + eij
so
Yij − µ = αi + β j + eij
Replacing the parameters by their estimates gives
Yij − Ȳ··
= (Ȳi· − Ȳ·· ) + (Ȳ· j − Ȳ·· ) + (Yij − Ȳi· − Ȳ· j + Ȳ·· )
Squaring and summing over i and j gives
∑ij (Yij − Ȳ·· )2
= b ∑i (Ȳi· − Ȳ·· )2 + a ∑ j (Ȳ· j − Ȳ·· )2 + ∑ij (Yij − Ȳi· − Ȳ· j + Ȳ·· )2
since the cross products vanish when summed. This can be written
symbolically as
SStotal = SS A + SSB + SSe
with degrees of freedom
( ab − 1) = ( a − 1) + (b − 1) + ( a − 1)(b − 1)
Thus the total sums of squares can be split into three sums of
squares for treatments, blocks and error respectively. Using the
theory of quadratic forms, the sums of squares are independent and
each has a χ2 distribution (Cochran’s Theorem).
E(Ȳi· − Ȳ·· ) = αi
Then
E(MStreat) = E[ SStreat / (a − 1) ]
           = E[ (b/(a − 1)) ∑i (Ȳi· − Ȳ·· )² ]
           = σ² + (b/(a − 1)) ∑ αi²
as for the CRD, except that now blocks are the replicates.
Also

E(MSblocks) = (a/(b − 1)) ∑j β j² + σ²    and    E(MSE) = σ²

So

F = MS_A / MSE ∼ F_{(a−1),(a−1)(b−1)}

If H0 : α1 = α2 = . . . = αa = 0, then reject H0 if F > F^α_{(a−1),(a−1)(b−1)}.

If H0 is false, MS_A / MSE has a non-central F distribution with
non-centrality parameter

λ = b ∑ αi² / σ²
and ( a − 1) and ( a − 1)(b − 1) degrees of freedom. This distribution
can be used to find the power of the F-test and to determine the
number of blocks needed to guarantee a specific power (see
Chapter 6.)
Table 5.3: Analysis of Variance Table for the Randomised Block Design
with model Yij = µ + αi + β j + eij

Source         SS                                     df               MS                     F                EMS
Treatments A   SS_A = b ∑i (Ȳi· − Ȳ·· )²              a − 1            SS_A/(a − 1)           MS_A/MSE         σ² + b ∑ αi²/(a − 1)
Blocks B       SS_B = a ∑j (Ȳ·j − Ȳ·· )²              b − 1            SS_B/(b − 1)           MS_blocks/MSE    σ² + a ∑ β j²/(b − 1)
Error          SSE = ∑ij (Yij − Ȳi· − Ȳ·j + Ȳ·· )²    (a − 1)(b − 1)   SSE/((a − 1)(b − 1))                    σ²
Total          SS_total = ∑ij (Yij − Ȳ·· )²           ab − 1
The hypothesis of interest here is H0 : α1 = . . . = α a = 0. If we find
differences between the treatments, they are further investigated by
looking at specific contrasts (Chapter 4).
1. planned comparisons: t-tests and confidence intervals.
2. orthogonal contrasts
3. orthogonal polynomials if the treatments are ordered and
equally spaced
4. unplanned comparisons
Computing Formulae

C       = (∑ Yij)² / (ab) = ab Ȳ··²
SS_tot  = ∑ Yij² − C
SS_A    = b ∑ Ȳi·² − C
SS_B    = a ∑ Ȳ·j² − C
SS_e    = SS_tot − SS_A − SS_B
Estimates

µ̂   = Ȳ··
α̂i  = Ȳi· − Ȳ··
β̂ j = Ȳ·j − Ȳ··
Example: Timing of Nitrogen Fertilization for Wheat
Current recommendations for nitrogen fertilisation were developed
through the use of periodic stem tissue analysis for nitrate content
of the plant. This was thought to be an effective means to monitor
nitrogen content of the crop and a basis for predicting optimum
production. However, stem nitrate tests were found to over-predict
nitrogen amounts. Consequently the researcher wanted to evaluate
the effect of several different fertilization timing schedules on the
stem tissue nitrate amounts and wheat production to refine the
recommendation procedure (Source: Kuehl 2000).
The treatment structure included six different nitrogen application
timing and rate schedules that were thought to provide the range of
conditions necessary to evaluate the process. For comparison, a
control treatment of no nitrogen was included as was the current
standard recommendation.
The experiment was conducted in an irrigated field with a water
gradient along one direction of the experimental plot area as a
result of irrigation. Since plant responses are affected by variability
in the amount of available moisture, the field plots were grouped
into blocks of six plots such that each block occurred in the same
part of the water gradient. Thus, any differences in plant responses
caused by the water gradient could be associated with the blocks.
The resulting experimental design was a randomized (complete)
block design with four blocks of six field plots to which the
nitrogen treatments were randomly allocated.
The layout of the experimental plots in the field is shown in Table
5.4. The observed nitrate content (ppm ×102 ) from a sample of
wheat stems is shown for each plot along with the treatment
numbers, which appear in the small box of each plot.
Table 5.4: Observed nitrate content (ppm ×10²) from samples of wheat
stems from each plot. The number in brackets is the treatment number.
Irrigation gradient ⇓ (blocks ordered along the gradient).

Block 1:  [2] 40.89   [5] 37.99   [4] 37.18   [1] 34.98   [6] 34.89   [3] 42.07
Block 2:  [1] 41.22   [3] 49.42   [4] 45.85   [6] 50.15   [5] 41.99   [2] 46.69
Block 3:  [6] 44.57   [3] 52.68   [5] 37.61   [1] 36.94   [2] 46.65   [4] 40.23
Block 4:  [2] 41.90   [4] 39.20   [6] 43.29   [5] 40.45   [3] 42.91   [1] 39.97
The linear model for this randomized block design is
Yij = µ + αi + β j + eij
where µ is the overall mean, αi is the nitrogen treatment effect, β j is
the block effect, and eij is the experimental error assumed
∼ N (0, σ2 ). Treatment and block effects are assumed to be additive.
Block Means:
-----------------------
     1      2      3      4
 38.00  45.89  43.11  41.29
-----------------------

Treatment Means:
----------------------------------------------
control      2      3      4      5      6
  38.28  44.03  46.77  40.62  39.51  43.23
----------------------------------------------
ANOVA Table:
----------------------------------------------
             Df  Sum Sq  Mean Sq  F value    Pr(>F)
TREATMNT      5  201.32   40.263   5.5917  0.004191
BLOCK         3  197.00   65.668   9.1198  0.001116
Residuals    15  108.01    7.201
----------------------------------------------
The blocked design will markedly improve the precision on the
estimates of the treatment means if the reduction in SSE with
blocking is substantial.
The F statistic to test for differences among the treatment means is
F = 5.59. The p-value is 0.004, suggesting differences between the
nitrogen treatments with respect to stem nitrate. There is usually
little interest in a formal inference about block effects, although we
might be interested in whether blocking increased the efficiency of
the design, which it did if F > 1.
Treatment 4 was the standard fertilizer recommendation for wheat.
We could now compare each of the treatments to treatment 4 to see
if any differ from the current recommended treatment. The control
gives a means of evaluating the nitrogen available without
fertilization.
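A sketch (not from the notes) of how such comparisons could be set up in
R; the data frame 'wheat' and the response name 'nitrate' are
hypothetical, while TREATMNT and BLOCK are the factor names from the
ANOVA table above:

----------------------------------------------
wheat$BLOCK    <- factor(wheat$BLOCK)
wheat$TREATMNT <- relevel(factor(wheat$TREATMNT), ref = "4")   # treatment 4 = baseline
fit <- lm(nitrate ~ BLOCK + TREATMNT, data = wheat)
summary(fit)     # each TREATMNT coefficient estimates a difference from treatment 4
# with 5 such comparisons, a Bonferroni correction would use alpha = 0.05/5 per test
----------------------------------------------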
5.2 Missing values – unbalanced data
Easy analysis of the Randomised Block Design depends on having
an observation in each cell of the two-way table, i.e. each treatment
appears once in each block. We call this a balanced design. Balanced
designs ensure that block and treatment effects can be estimated
independently. This greatly simplifies interpretation of results.
More generally, data or designs are balanced when we have the
same number of observations for all factor level combinations.
Missing observations result in unbalanced data.
What happens if some of the observations in our RBD experiment
are missing? This could happen if an experimental unit runs away
or explodes, or dies or becomes sick during the experiment and can
no longer participate.
Then we no longer have a balanced design. Refer back to the layout
of the RBD (Table 5.1). If we have no missing observations, and we
compare treatment 1 to treatment 2, on average, they don’t differ
with respect to anything except for the treatment (exactly the same
block contributions are made in each treatment, and there are no
interactions, which means that the block effect is the same for each
treatment). Now, what happens if one observation is missing?
The problem is the same we have in regression, where coefficients
are interpreted conditional on the values of all other variables in the
model. There the variables are all more or less correlated. In such a
case it is not entirely possible to extract the effect of a single
predictor variable, the coefficient or effect depends on what other
terms are in the model. The same happens when the data from an
experiment become unbalanced. For example, the treatment i effect
can no longer be estimated by Ȳi· − Ȳ·· , which would give a biased
estimate for αi .
There are two strategies to deal with unbalanced data. The first is
to estimate the value, substitute it back in, but reduce the error
degrees of freedom accordingly. The advantage is that the data
become balanced, and the results are as easy to interpret as before:
we are exactly able to attribute variation caused by differences
between treatments and variation caused by differences between
blocks. The second strategy is to fit a regression model. The least
squares estimates from this model will still give you the best
possible estimate of the treatment effects, provided you have
accounted for blocks, i.e. the blocking factor must be in the model.
Sums of squares can’t be split exactly any more, but we would base
our F-test for treatments on the change in variation explained
relative to a full model except for the treatment term, i.e. change in
variation explained when the treatment factor is added last.
You don’t need to remember the formulae for estimating the
missing values. You would get the same value when fitting a
regression model and from that obtain the estimate for the missing
value; you should know how to go about this in practice (both
strategies).
1. In the case of only one or two observations missing, one could
estimate the value of the missing observation, based on the other
observations. The error degrees of freedom are reduced
accordingly, by the number of estimated observations.
                             Blocks
Treatment      1     2    ···    j    ···    b       Treatment Totals
1              -     -           -           -
2              -     -           -           -
...
i              -     -          Yij          -        T′
...
a              -     -           -           -
Block Totals                     B′                   G′   (N = ab)
Suppose observation Yij is missing. Let u be our estimate of the
missing observation. The least squares estimate of the
observation Yij would be
u = µ̂ + α̂i + β̂ j
Let T ′ be the sum of the (b − 1) observations on the ith treatment.
Let B′ be the sum of the ( a − 1) observations on the jth block.
Let G ′ be the sum of the N − 1 observations on the whole
experiment.
Then

µ̂   = (G′ + u)/N
α̂i  = (T′ + u)/b − (G′ + u)/N
β̂ j = (B′ + u)/a − (G′ + u)/N

So

u = (G′ + u)/N + (T′ + u)/b − (G′ + u)/N + (B′ + u)/a − (G′ + u)/N
  = (T′ + u)/b + (B′ + u)/a − (G′ + u)/N

Hence

u = (aT′ + bB′ − G′) / ((a − 1)(b − 1))
The estimate of the missing value is a linear combination of the
other observations. It can be shown that it is the value u which
minimizes the SSE when ordinary ANOVA is carried out on the
N data points (the ( N − 1) actual observations and u).
Since the missing value is a linear combination of the other
observations it follows that Ȳi· , the ith treatment mean, is correlated
with the other means. If there is a missing observation on the ith
treatment it can be shown that the variance of the estimated
difference between treatment i and any other, i′ is
Var(Ȳi· − Ȳi′· ) = σ² [ 2/b + a / (b(b − 1)(a − 1)) ]
If there are 2 missing values we can repeat the procedure above
and solve the simultaneous equations
u1 = µ̂ + α̂i + β̂ j
u2 = µ̂ + α̂i′ + β̂ j′
One degree of freedom is subtracted from the error degrees of
freedom for each missing value estimated. Thus the degrees of
freedom of s2 (MSE) are ( a − 1)(b − 1) − k, where k is the
number of missing values. The degrees of freedom of the F tests
are adjusted accordingly.
2. Alternatively, one can estimate the parameters using a linear
regression model. But because treatments and blocks are no
longer orthogonal (independent), the order in which the terms
enter the model will become important, and interpretation may
become more difficult.
The estimates obtained from fitting a regression model are ‘last
one in’ estimates. This means they estimate the change in
response after all other variables in the model have explained
what they can, i.e. variation in residuals. So are the t-tests. If we
want to conduct an ANOVA table in a similar way (last one in)
we cannot use R’s aov function, which calculates sequential SS.
Sequential ANOVA tables test change in variance explained
when adding each term, given all previous terms in the model.
The SSs and F-tests will thus change and give different results
depending on the order in which the terms appear in the model.
The Anova function in R’s car package uses Type II sums of
squares, i.e. calculating SSs as last-one-in, as the regression t-tests
do: each SS is calculated as change in SS explained compared to
SS explained given all other terms in the model.
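The following small sketch (hypothetical data, not from the notes)
contrasts the two strategies for a single missing value in a 4-treatment,
3-block RBD:

----------------------------------------------
dat <- expand.grid(treatment = factor(1:4), block = factor(1:3))
set.seed(1)
dat$y <- 10 + as.numeric(dat$treatment) + as.numeric(dat$block) + rnorm(12, sd = 0.5)
dat$y[1] <- NA                                     # treatment 1 in block 1 goes missing

# Strategy 1: estimate the missing value from u = (aT' + bB' - G')/((a-1)(b-1))
a <- 4; b <- 3
Tp <- sum(dat$y[dat$treatment == 1], na.rm = TRUE)   # total of remaining obs on treatment 1
Bp <- sum(dat$y[dat$block == 1],     na.rm = TRUE)   # total of remaining obs in block 1
Gp <- sum(dat$y, na.rm = TRUE)                       # grand total of the N - 1 observations
u  <- (a * Tp + b * Bp - Gp) / ((a - 1) * (b - 1))
# substitute u, run the usual ANOVA, and subtract one error df by hand

# Strategy 2: fit the (now unbalanced) model; test treatments 'last one in'
fit <- lm(y ~ block + treatment, data = dat)         # the NA row is dropped automatically
anova(update(fit, . ~ block), fit)                   # change in SS when treatment enters last
----------------------------------------------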
5.3 Randomization Tests for Randomized Block Designs
Example: Boys’ Shoes
Measurements were made of the amount of wear of the soles of the
shoes worn by 10 boys. The shoe soles were made by two different
synthetic materials, A and B. Each boy wore a pair of special shoes,
the sole of one shoe having been made with A and the sole of the
other with B. The decision as to whether the left or right sole was
made with A or B was made with a flip of a coin. The following
table gives the data. It illustrates that A showed less wear than B,
since (B − A) > 0 for most boys.
Boy   Material A   sideA   Material B   sideB   Difference d = B − A
 1        13.2        L        14.0        R          0.8
 2         8.2        L         8.8        R          0.6
 3        10.9        R        11.2        L          0.3
 4        14.3        L        14.2        R         -0.1
 5        10.7        R        11.8        L          1.1
 6         6.6        L         6.4        R         -0.2
 7         9.5        L         9.8        R          0.3
 8        10.8        L        11.3        R          0.5
 9         8.8        R         9.3        L          0.5
10        13.3        L        13.6        R          0.3

Average difference d̄ = 0.41
Since material A was standard, and B a cheaper substitute, we
wished to test whether B resulted in increased wear. This implies

H0 : µA = µB (no difference in wear between materials A and B)
H1 : µA < µB (increased wear with B)
For matched pairs we instead write the hypotheses as follows:

H0 : µD = 0 (no difference in wear between materials A and B)
H1 : µD > 0 (increased wear with B)

where µD is the mean difference in wear calculated as in the table:
wear with material B − wear with material A.
The observed sequence of tosses leading to the above treatment
allocations was (head implies A worn on right foot):
T  T  H  T  H  T  T  T  H  T
Under H0, A and B are merely labels and could be swopped
without making a difference. Hence boy 1 could have worn A on
the right and B on the left foot, resulting in a difference of
B − A = 13.2 − 14.0 = −0.8. Similarly, boy 6 could have worn A on
the right and B on the left foot, giving B − A = 6.6 − 6.4 = +0.2.
This implies the actual values of wear, and hence the values of the
differences do not change, but the signs associated with these
differences do.
The given sequence of coin tosses is one of 2^10 = 1024 equally
probable outcomes: There exist 2 orderings for each pair, the 10
pairs are independent, hence there are 2 × 2 × 2 × . . . × 2 = 2^10
different orderings.
To test H0, the actually observed average difference of 0.41 may be compared with all possible 1024 average differences that could have occurred as a result of different outcomes of the coin tosses. To obtain these 1024 average differences we need to average the differences for all possible combinations of + and − signs. This is hard work! So let's think about how we can obtain average differences greater than the observed 0.41: only when the positive differences stay the same and one or both of the negative differences become positive (since the 2 negative differences were associated with the smallest absolute values!). This implies 3 possible average differences > 0.41. Four further samples give values of d̄ = 0.41. This implies a p-value of 7/1024 ≈ 0.007.
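Because 2¹⁰ is small, the full randomization distribution can easily be enumerated in R; the sketch below is one way to do this (the object names are made up).

----------------------------------------------------------------------------
# Enumerate all 2^10 equally likely sign assignments of the differences and
# find the proportion of average differences at least as large as 0.41.
d <- c(0.8, 0.6, 0.3, -0.1, 1.1, -0.2, 0.3, 0.5, 0.5, 0.3)      # observed B - A
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), length(d)))) # 1024 x 10
perm.means <- as.vector(signs %*% abs(d)) / length(d)
mean(perm.means >= mean(d) - 1e-10)  # small tolerance for floating-point ties
# should give 7/1024 = 0.0068
----------------------------------------------------------------------------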
Questions
1. What are your conclusions about the shoe sole materials?
2. What parametric test could be used for the above example? What
are its assumptions about the data?
3. How does its p-value compare to the one obtained above?
4. Give the basic idea of randomization tests, how do they work?
5.4 Friedman Test
The Friedman test is a non-parametric test for differences between treatments (k ≥ 3) in a randomized block design (several related samples). It is the non-parametric equivalent of the F-test in a two-way ANOVA, and an extension of the matched-pairs situation to more than two samples. The test can be used on ranked data.
The null and alternative hypotheses are:
H0 : Populations are identical
H1 : Populations are not identical (at least one treatment tends to
yield larger values than at least one other treatment).
The test statistic is calculated as follows:
• rank the observations within each block
• let Rj = sum of the ranks in column (treatment) j, (j = 1, . . . , k): Rj = ∑i R(Xij), summing over the b blocks

T = [12 / (bk(k + 1))] ∑j [Rj − b(k + 1)/2]²  =  [12 / (bk(k + 1))] ∑j Rj² − 3b(k + 1)
If ties are present, the above statistic T needs to be adjusted. In that
case one can use the sums of squares of the ranks to calculate the
following statistic:
T2 = MStreatment / MSE ∼ Fk−1,(b−1)(k−1)
approximately. This is always a one-sided test.
Critical Values
• For small b and k use the Friedman Table.
• For large b and k:
T is approximately χ² distributed with k − 1 degrees of freedom.
Example
Six quality control laboratories are asked to analyze 5 chemicals to
see if they are performing analyses in the same manner. Determine
whether any of the labs are different from any others if the ranked
data are as follows.
            Chemical
Lab      A    B    C    D    E     Ri
 1       1    3    3    1    4     12
 2       5    2    4    4    5     20
 3       3    1    2    6    1     13
 4       2    4    5    3    2     16
 5       4    6    1    2    3     16
 6       6    5    6    5    6     28
H0 : All labs identical
H1 : Labs not identical
k = 6, b = 5
T = [12 / (5 × 6 × 7)] [12² + 20² + 13² + 16² + 16² + 28²] − 3 × 5 × 7 = 9.8

Using T ∼ χ²₅ we obtain a p-value of 0.08. There is little evidence to indicate that the labs are performing the analyses differently.
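The same result can be obtained with R's built-in friedman.test; a minimal sketch for this example (the object name ranks is invented) is:

----------------------------------------------------------------------------
# Friedman test for the labs example: rows are blocks (chemicals A-E),
# columns are treatments (labs 1-6); the entries are the given ranks.
ranks <- matrix(c(1, 5, 3, 2, 4, 6,
                  3, 2, 1, 4, 6, 5,
                  3, 4, 2, 5, 1, 6,
                  1, 4, 6, 3, 2, 5,
                  4, 5, 1, 2, 3, 6),
                nrow = 5, byrow = TRUE,
                dimnames = list(chemical = LETTERS[1:5], lab = 1:6))
friedman.test(ranks)   # should give chi-squared = 9.8, df = 5, p approx 0.08
----------------------------------------------------------------------------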
5.5 The Latin Square Design
A Latin Square of order p is an arrangement of p letters each
repeated p times into a square of side p so that each letter appears
exactly once in each row and once in each column (a bit like
Sudoku).
Two Latin Squares of order 2:

A B        B A
B A        A B

Three Latin Squares of order 3 ('Latin' because the treatments are usually denoted by the Latin letters A, B, C, . . . ):

A B C      C A B      B A C
B C A      B C A      C B A
C A B      A B C      A C B
A Latin Square can be changed into another one of the same order
by interchanging rows and columns. The Latin Square is a useful
way of blocking for 2 factors each with p levels without increasing
the number of treatments. The rows of the square denote one
blocking factor and the columns the other. The entries are the p treatments which are to be compared. We require only p² experimental units. If we had made an observation on each combination of the 2 blocking factors and the p treatments we would have needed p³ experimental units.
Model for a Latin Square Design Experiment
Let Yijk = observation on the kth treatment in the ith row and jth
column of the square.
Then a suitable model is

Yijk = µ + αi + βj + γk + eijk

with identifiability constraints ∑i αi = ∑j βj = ∑k γk = 0, where

µ      general mean
αi     ith row effect
βj     jth column effect
γk     kth treatment effect
eijk   random error of observation; the eijk ∼ N(0, σ²) and are independent
Note that we cannot write i = 1, . . . , p, j = 1, . . . , p, and
k = 1, . . . , p, because not all the triplets i, j and k appear in the
experiment. So we write {ijk} ∈ D where D is the set of all triplets
appearing. When calculating, the set D is obvious - it is only the
derivation that is awkward notationally. We put the subscripts in
brackets to denote that we sum over the ijk’s actually present.
To obtain the least squares estimates we minimize

S = ∑{ijk}∈D (Yijk − µ − αi − βj − γk)²

∂S/∂µ = 0 gives ∑{ijk}∈D Yijk = p²µ, since there are p² observations, so that

µ̂ = Ȳ··· = ∑{ijk} Yijk / p²

∂S/∂γk = −2 ∑ij (Yijk − µ − αi − βj − γk) = 0

Using the constraints we find

pµ + pγk = Y··k    so that    γ̂k = Ȳ··k − Ȳ···

Similarly

α̂i = Ȳi·· − Ȳ···    and    β̂j = Ȳ·j· − Ȳ···

The residual sum of squares is found by substituting µ̂, α̂i, β̂j and γ̂k to give

SSresidual = ∑{ijk}∈D (Yijk − Ȳi·· − Ȳ·j· − Ȳ··k + 2Ȳ···)²

and

σ̂² = SSresidual / ((p − 1)(p − 2))
Test of the hypothesis H0 : γ1 = γ2 = . . . = γ p = 0
As in the other cases, we could derive a likelihood ratio test for H0 ,
by fitting two models - one which contains the γ’s and one that
does not.
We shall use the short method.
Consider

Yijk − µ̂ = α̂i + β̂j + γ̂k + êijk

(Yijk − Ȳ···) = (Ȳi·· − Ȳ···) + (Ȳ·j· − Ȳ···) + (Ȳ··k − Ȳ···) + (Yijk − Ȳi·· − Ȳ·j· − Ȳ··k + 2Ȳ···)

Squaring and summing over the p² (ijk)'s present in the design gives

∑(ijk) (Yijk − Ȳ···)² = p ∑(i) (Ȳi·· − Ȳ···)² + p ∑(j) (Ȳ·j· − Ȳ···)² + p ∑(k) (Ȳ··k − Ȳ···)² + ∑(ijk) (Yijk − Ȳi·· − Ȳ·j· − Ȳ··k + 2Ȳ···)²
and so

SStot = SSrows + SScol + SStreatment + SSE

with df

p² − 1 = (p − 1) + (p − 1) + (p − 1) + (p − 1)(p − 2)
As in previous examples, the SS's are independent and, when divided by σ², each have a χ² distribution with the appropriate degrees of freedom.
The ANOVA table for the Latin square design is:

Source      SS                                                       df               MS         F
Rows        SSrow   = p ∑(i) (Ȳi·· − Ȳ···)²                          p − 1
Columns     SScol   = p ∑(j) (Ȳ·j· − Ȳ···)²                          p − 1
Treatment   SStreat = p ∑(k) (Ȳ··k − Ȳ···)²                          p − 1            MStreat    MStreat / MSe
Error       SSe     = ∑(ijk) (Yijk − Ȳi·· − Ȳ·j· − Ȳ··k + 2Ȳ···)²    (p − 1)(p − 2)   MSe
Total       SStot   = ∑(ijk) (Yijk − Ȳ···)²                          p² − 1
Usually only treatments are tested but row and column differences
can be tested in the usual way. The power of the F test can be found
using the non-central F with ( p − 1) and ( p − 1)( p − 2) df and
non-centrality parameter

λ = p ∑k γk² / σ²
Disadvantages of Latin Squares: Latin squares can only be used
when the number of treatments equals the number of rows equals
the number of columns. The model assumes no interactions
between rows, columns and treatments. If interactions are present
between the rows and columns then the treatment and error sums
of squares are biased.
Other uses of Latin Squares
The Latin Square can also be used in factorial experiments for experimenting with 3 factors, each with p levels, using only (1/p)th of the complete design. For example a 5³ factorial experiment has 125 treatments. If we are willing to assume that there are no interactions we can arrange the 3 factors in a Latin Square with rows for one factor, columns for another and letters for the third. We then use only 25 treatments instead of 125, an 80% reduction.
Missing Values
The analysis of unbalanced data from Latin Square Designs follows
along the same lines as for Randomized Block Designs (see Section
5.2).
Suppose the Y value corresponding to the ith row, jth column and kth letter is missing. It can be estimated by

u = (p Ri′ + p Cj′ + p Tk′ − 2G′) / ((p − 1)(p − 2))

where
Ri′ is the ith row sum without the missing value,
Cj′ is the jth column sum without the missing value,
Tk′ is the kth treatment sum without the missing value,
G′ is the total sum without the missing value.
Example: Rocket Propellant
Suppose that an experimenter is studying the effects of five
different formulations of a rocket propellant used in aircrew escape
systems on the observed burning rate. Each formulation is mixed
from a batch of raw material that is only large enough for five
formulations to be tested. Furthermore, the formulations are
prepared by several operators, and there may be substantial
differences in the skills and experience of the operators.
Source: Montgomery DC. Design and
Analysis of Experiments.
A Latin square is the right design to use here, because we only have
enough material for 5 replicates of every treatment (formulation),
but want to block for batch and for operator. Each operator
prepares one of each formulation, and each batch is used to prepare
one of each formulation.
Table 5.5: Rocket propellant data and layout (each entry gives the formulation and its observed burning rate).

                                      Operator
Batches of Raw Material      1        2        3        4        5
          1                A = 24   B = 20   C = 19   D = 24   E = 24
          2                B = 17   C = 24   D = 30   E = 27   A = 36
          3                C = 18   D = 38   E = 26   A = 27   B = 21
          4                D = 26   E = 31   A = 26   B = 23   C = 22
          5                E = 22   A = 30   B = 20   C = 29   D = 31

Table 5.6: ANOVA table for rocket propellant data.

              Df   Sum Sq   Mean Sq   F value   Pr(>F)
batch          4    68.00     17.00      1.59   0.2391
operator       4   150.00     37.50      3.52   0.0404
formulation    4   330.00     82.50      7.73   0.0025
Residuals     12   128.00     10.67
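A sketch of how this analysis could be reproduced in R; the data frame name rocket and the way the data are entered are assumptions, chosen to match the aov() call shown further below.

----------------------------------------------------------------------------
# Latin square analysis of the rocket propellant data (long format:
# one row per cell, with factors batch, operator, propellant and response rate).
rocket <- data.frame(
  batch      = factor(rep(1:5, times = 5)),
  operator   = factor(rep(1:5, each = 5)),
  propellant = factor(c("A","B","C","D","E",  "B","C","D","E","A",
                        "C","D","E","A","B",  "D","E","A","B","C",
                        "E","A","B","C","D")),
  rate       = c(24,17,18,26,22,  20,24,38,31,30,  19,30,26,26,20,
                 24,27,27,23,29,  24,36,21,22,31)
)
fit <- aov(rate ~ batch + operator + propellant, data = rocket)
summary(fit)                          # should reproduce Table 5.6
TukeyHSD(fit, which = "propellant")   # pairwise comparisons used further below
----------------------------------------------------------------------------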
With Latin square designs we have to assume an additive model
(no interactions between the factors).
The ANOVA table suggests that there are differences between the
types of formulation in terms of burn rate (p = 0.0025), and that
there are differences between operators (p = 0.04). There is no
indication that burn rate differs between batches.
The next question of interest will be which formulations have the
highest burning rates. We have no a-priori hypotheses on these, so
a common approach is to compare all formulations pairwise, and
use Tukey’s method to adjust p-values and confidence intervals.
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = rate ~ batch + operator + propellant, data = rocket)

$propellant
    diff         lwr        upr     p adj
B-A -8.4 -14.9839317 -1.8160683 0.0110827
C-A -6.2 -12.7839317  0.3839317 0.0684350
D-A  1.2  -5.3839317  7.7839317 0.9754380
E-A -2.6  -9.1839317  3.9839317 0.7194121
C-B  2.2  -4.3839317  8.7839317 0.8204614
D-B  9.6   3.0160683 16.1839317 0.0041583
E-B  5.8  -0.7839317 12.3839317 0.0944061
D-C  7.4   0.8160683 13.9839317 0.0254304
E-C  3.6  -2.9839317 10.1839317 0.4461852
E-D -3.8 -10.3839317  2.7839317 0.3966727
[Figure 5.3: Boxplots of burning rate against formulation (A–E) for the rocket propellant data, with a compact letter display above the boxes. On the right are confidence intervals for pairwise differences between means, adjusted using Tukey's method.]

The compact letter display plot (cld in library multcomp) nicely summarises this: treatments with the same letter are not significantly different. Formulation D had the largest mean burning rate, but we can't be sure that this is higher than for formulations A and E; however, the data seem to suggest that its burning rate is higher than for formulations B and C.
Can we replicate the p-values and confidence intervals from R?
Let's try the last line (formulations E and D):

µ̂D − µ̂E = 3.8
SE(mean) = √(10.67/5) = 1.460822
p-value = P(q5,12 ≥ 3.8 / 1.460822) = 0.3968151
    (using ptukey(3.8 / 1.460822, nmeans = 5, df = 12, lower.tail = FALSE))
q5,12 = 4.50771    (using qtukey(0.95, 5, 12))
CI: 3.8 ± 1.460822 × 4.50771 = [−2.784962, 10.38496]

The values agree with the R output to within 2–3 decimal places. That is fine; the small differences arise because the values we have taken from the above output are rounded, whereas R has used the exact values.
Blocking for 3 factors - Graeco-Latin Squares
This section is just added out of interest. Blocking for three factors,
e.g. cars, wheel positions and drivers, can be achieved by using a
Graeco-Latin Square. A Graeco-Latin Square is formed by taking a
Latin Square and superimposing a second square with the
treatments in Greek letters. For example
A B C D        α β γ δ            Aα Bβ Cγ Dδ
B A D C        γ δ α β            Bγ Aδ Dα Cβ
C D A B   and  δ γ β α   gives    Cδ Dγ Aβ Bα
D C B A        β α δ γ            Dβ Cα Bδ Aγ
If the two squares have the property that each Greek letter coincides
with each Latin letter exactly once then the squares are called
orthogonal. Complete sets of (p − 1) mutually orthogonal Latin Squares exist whenever p is a prime or a power of a prime. No Graeco-Latin Square of order p = 6 exists.
6
Power and Sample Size in Experimental Design
6.1 Introduction
An important part of planning an experiment is deciding on the
number of replicates required so that you will have enough power
to detect differences if these are there. Experiments are very time consuming and expensive, so it is always worthwhile to invest a little time in calculating the required sample sizes. On the other hand, these calculations may show that you would need more replications than you can afford, in which case it is better not to start the experiment at all because it is doomed to fail.
Although this is often referred to as sample size calculations, in
experiments we are not really talking about samples, but the
number of replicates needed per treatment.
Questions:
1. How can an experiment fail if the sample size was too small?
2. What is statistical power, in your own words?
3. Which are the 3 key ingredients that will determine statistical
power (in the experimental setting)?
Basically, the smaller the differences (effects) are that you want to
detect, the larger the sample sizes will have to be!
Power is defined as:

1 − β = Pr[reject H0 | H0 false] = Pr[F > Fa−1,N−a | H0 false]
To calculate power we need the distribution of F if H0 is false.
Recall the Expected Mean Squares for treatment. If H0 is false, the
test statistic
F0 = MStreat / MSE

has a noncentral F distribution with a − 1 and N − a degrees of freedom (in the case of a CRD) and noncentrality parameter

λ = r ∑i (µi − µ̄)² / σ²    (summing over the a treatment means)
where r denotes the number of replicates per treatment. If λ = 0 we
have a central F-distribution. The power can be calculated for any
given r. Power is often chosen to lie around 0.8 - 0.9. An estimate
for σ2 can come from knowledge you have or from a pilot study.
Rather than having to specify the size of all effects it will be easier
to specify the smallest difference between any two treatment means
that would be physically meaningful. Suppose we want to detect a
significant difference if any two treatment means differ by
D = µi − µ j . With a larger non-centrality parameter Pr [ F > c]
increases, and power increases. So we want to ensure the smallest λ
in a given situation will lead to a rejection. This will ensure that the
power is at least as specified. The minimum λ when there is a
difference of at least D will then be
λ = rD² / (2σ²)
Where did the noncentral F-distribution come from? Firstly, a noncentral chi-squared distribution results from ∑ Xi²/σ², where Xi ∼ N(µi, σ²); it has noncentrality parameter λ = ∑ µi²/σ². If the null hypothesis of equal treatment means is false, the treatment effects don't have mean 0 but mean αi, and the treatment SS will have a non-central chi-squared distribution. The non-central F-distribution arises as the ratio of a non-central chi-squared distribution and a (central) chi-squared distribution.
Example: In a study to compare the effects of 4 diets on the weight of 20-day-old mice, the experimenter wishes to detect a difference of 10 grams. The experimenter estimates that the
standard deviation σ is no larger than 5 grams. How many
replicates are necessary to have a probability of 0.9 to detect a
difference of 10 grams using the F test with significance level
α = 0.05?
A difference of 10 grams means that the maximum and the
minimum treatment means differ by 10 grams. The noncentrality
parameter is smallest when the other two treatment means are all in
the middle, i.e. the four treatment means are a, a + 5, a + 5, a + 10
for some constant a.
Here is some R code for the above example, followed by the output.
----------------------------------------------------------------------------
a <- 4
D <- 10
sigma <- 5
alpha <- 0.05
df1 <- a - 1
for (r in 2:9)
{ df2 <- a * (r - 1)
  ncp <- r * D^2 / (2 * sigma^2)
  fcrit <- qf(alpha, df1, df2, lower.tail = FALSE)
  # this is the critical value of the F-distribution under H0
  power <- 1 - pf(fcrit, df1, df2, ncp)
  cat(r, power, ncp, "\n")
}
----------------------------------------------------------------------------
r  power      ncp
2  0.1698028   4
3  0.3390584   6
4  0.503705    8
5  0.6442332  10
6  0.7545861  12
7  0.8361289  14
8  0.8935978  16
9  0.9325774  18
----------------------------------------------------------------------------

ncp is the non-centrality parameter. r = 9 replicates will give us a power > 0.90.
6.2 Two-way ANOVA model
Consider factor A with a levels, and a second factor B (a blocking or
a treatment factor) with b levels. The non-centrality parameter will
be

λ = b ∑i (µi − µ̄)² / σ²    (summing over the a levels of A)
and power for detecting differences between the levels of A can be
calculated similarly to above, except that the degrees of freedom in
the F-tests will change.
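As a hedged sketch (assuming a randomised block design with b blocks, so that the error degrees of freedom are (a − 1)(b − 1)), the R code above can be adapted as follows; the numerical values are simply carried over from the CRD example.

----------------------------------------------------------------------------
# Power for detecting a smallest difference D between two levels of A in an
# RBD with b blocks: same idea as the CRD code, with r replaced by b and
# with error df (a - 1)(b - 1).
a <- 4; D <- 10; sigma <- 5; alpha <- 0.05
for (b in 2:9) {
  df1 <- a - 1
  df2 <- (a - 1) * (b - 1)
  ncp <- b * D^2 / (2 * sigma^2)
  fcrit <- qf(alpha, df1, df2, lower.tail = FALSE)
  cat(b, 1 - pf(fcrit, df1, df2, ncp), "\n")
}
----------------------------------------------------------------------------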
These notes refer to power for the special case of the ANOVA F-test.
In general we would need to know the distribution under the
alternative hypothesis in order to calculate power.
References
http://www.stat.purdue.edu/~zhanghao/STAT514/handout/chapter03/PowerSampleSize.pdf
7
Factorial Experiments
7.1 Introduction
Up to now we have developed methods to compare a number of
treatments. We have thought of the treatments as having no special
relationships among themselves and they each influence the
response Y independently. Any other factors which may have
influenced Y were removed by blocking or randomisation so that
we could make more sensitive comparisons of the treatments. In
the language of experimental designs, these are called single factor
experiments - the individual treatments are the levels of the factor.
There are many situations where the behaviour of the response Y
cannot be understood by looking at factors one at a time. We need
to consider the influence of several factors acting simultaneously.
For example:
1. The yield of a crop might depend on the amount of nitrogen and
the amount of potassium in the fertilizer.
2. The response to a drug may depend on the age of the patient and the severity of their illness.
3. The yield of a chemical compound may depend on the pressure
and the temperature at which the chemical reaction takes place.
Factorial experiments allow us to evaluate the effect of each factor
on its own and to study the effect of a number of them working
together or interacting.
Example 1: Bond Strength
Three types of adhesive (glue) are being tested in an adhesive
assembly of glass specimens. A tensile test is performed to
determine the bond strength of the glass to glass assembly. Three
sta2005s: design and analysis of experiments
different types of assembly (cross-lap, square-centre and
round-centre) are tested. The following table shows the bond
strength on 45 specimens.
                                   Assembly
Adhesive    Cross-lap           Square-Centre        Round-Centre
047         16 14 19 18 19      17 23 20 16 14       13 19 14 17 21
00T         23 18 21 20 21      24 20 12 21 17       24 21 25 29 24
001         27 28 14 26 17      14 26 14 28 27       17 18 13 16 18

                          No. of Levels
Adhesive (Factor A)       3    (047, 00T, 001)
Assembly (Factor B)       3    (Cross-lap, Square-Centre, Round-Centre)
Response (Y)                   Bond Strength
Model
Yijk = µ + αi + βj + (αβ)ij + eijk

∑i αi = ∑j βj = ∑i (αβ)ij = ∑j (αβ)ij = 0
7.2 Basic Definitions
Consider an experiment where a response Y is measured a number
of times. Each measurement is called a trial.
1. Factor
Any feature of the experiment that can be changed from trial to
trial is called a factor. Factors can be qualitative or quantitative.
• Examples of qualitative factors are: colour, sex, social class,
severity of disease, residential area. Strictly speaking, sex and
social class are not treatment factors. However, one may still
be interested in differences (between sexes, say). In this case
one analyses such data exactly as a factorial experiment.
However, interpretation can only be about association, not
causality.
• Examples of quantitative factors are: temperature, pressure,
age, income.
Factors are denoted by capital letters, e.g. A, B, C etc..
2. Levels
The various values of the factor examined in the experiment are
called its levels.
Suppose temperature (T) is a factor in the experiment. Then the
levels of T might be chosen as 0◦ C, 10◦ C, 20◦ C. If Colour is a
factor C then the levels of C might be Red, Green, Blue.
Sometimes the levels of a quantitative factor are treated
qualitatively, e.g. the levels of temperature T are cold, warm and
hot. The levels of a factor are denoted by subscripts: T1 , T2 , T3
are the levels of factor T.
3. Treatment
A combination of a single level from each factor in the
experiment is called a treatment.
Example:
Suppose we wish to determine the effects of temperature (T) and
Pressure (P) on the yield of Y of a chemical compound. If T has
two levels 0◦ and 10◦ and P has three levels, Low, Medium and
High, the treatments would be:
0◦ and Low pressure
0◦ and Medium pressure
0◦ and High pressure
10◦ and Low pressure
10◦ and Medium pressure
10◦ and High pressure
There are 2 × 3 = 6 treatments in the experiment and a number of measurements on Y would be made on each of the treatments.
4. Effect of a factor
The change in the response produced by a change in the level of
the factor is called the effect of the factor. There are two types of
effects, main effects and interaction effects.
5. A Main effect is the average change in the response produced by
changing the level of a single factor. It is the average over all the
levels of the other factors in the experiment. Thus in the
experiment above, the main effect of a temperature of 0◦ would
be the average change in yield of the compound averaged over
the three pressures low, medium and high relative to the overall
mean. All the effects we have looked at so far were main effects.
6. Interaction: If the effect of a factor depends on the level of
another factor that is present then the two factors interact. For
example, consider the amount of risk people are willing to take
and the two factors gender and situation. Women might be
willing to take high risks in one situation but very little in
another, while for men this willingness might be directly
opposite. So the response depends not only on situation or only
on gender, but one has to look at the particular combination of factors. Therefore, if interactions are present, it is not very meaningful to interpret the main effects: it is not very informative to know what women risk on average; one has to look at the combination of gender and situation to understand the willingness to take risks.
7. Interaction effect: The interaction effect is the change in
response (compared to the overall mean) over and above the
main effects at a certain combination.
8. Fixed and Random effects: Effects can also be classified as Fixed
or Random. Suppose the experiment were repeated a number of
times. If the levels of the factors are the same each time the
experiment is repeated then the effects are called fixed and the
results only apply to the levels used in the experiments. If the
levels of a factor are chosen at random from a population of
levels each time the experiment is repeated, then the effects are
called random.
Example:
If we are only interested in temperatures of 0◦ and 10◦ the
temperature would be a fixed effect. If the experiment were
repeated we would use 0◦ and 10◦ again. If we were interested
in the range of temperatures from 0◦ to 20◦ say, and each time
we ran the experiment we decided on two Temperatures at
random, then temperature would be a random effect.
The arithmetic of the analysis of variances is exactly the same for
both fixed and random effects but the interpretations of the
results, the expected mean squares and tests are completely
different. We shall deal mainly with fixed effects.
9. In a complete factorial experiment every combination of factor
levels is studied and the number of treatments is the product of
the number of levels of each factor.
Example:
If we examine the effect of 3 factors A, B and C on response Y
and A has two levels, B has 3 and C has 5 then we have
2 × 3 × 5 = 30 treatments. The design is called a 2 × 3 × 5
factorial design.
7.3 Design of Factorial Experiments
Factorial refers to the treatment structure. A factorial experiment is
an experiment in which we have at least 2 treatment factors, and
the treatments are constructed by crossing the treatment factors, i.e.
the treatments are constructed by having every possible combination of factor levels (mathematically, the Cartesian product), at least for full factorial experiments. There are also fractional factorial experiments, where some of the treatment combinations are left out by design, but we will not deal with those in this course.
Very often, more than one factor affects the response. For example,
in a chemical experiment, temperature and pressure affect yield. In
an agricultural experiment, nitrogen and phosphate in the soil
affect yield. In a sports health experiment, improvement in fitness
may not only depend on the physical training program, but also on
the type of motivation offered.
Example: Effect of selling price and type of promotional campaign on
number of items sold
Effect of selling price (R55, R60, R65) and type of promotional
campaign (radio, newspaper, website pop-ups) on the number of
products sold (new type of cell-phone contract). There are two
treatment factors (price and type of promotion). If we are going to
use a factorial treatment structure, there are 3 × 3 = 9 treatments.
The experimental units could be different towns.
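A one-line sketch of this crossing in R, just to show the treatment structure (the names below are made up):

----------------------------------------------------------------------------
# The 3 x 3 factorial treatments are the Cartesian product of the factor levels.
treatments <- expand.grid(price = c(55, 60, 65),
                          campaign = c("radio", "newspaper", "web"))
treatments   # 9 rows, one per treatment combination
----------------------------------------------------------------------------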
[Three design layouts for the price (R55, R60, R65) by type of campaign (radio, web, newspaper) example, with the treatment combinations actually used marked by x; see Figure 7.1.]
If we only experiment with one factor at a time, we need to keep all
other factors constant. But in this way, we can never find out
whether factor A would have influenced the response differently at
another level of B, i.e. we cannot look at interactions (if we did two
separate experiments to repeat all levels of A at another level of B, B
and time would be confounded).
Figure 7.1: Illustration of different ways of investigating two treatment factors. The first two figures are two one-factor-at-a-time experiments. The figure on the RHS illustrates a factorial experiment.
Interactions
We have encountered interactions in the RBD chapter. There we
specifically assumed NO interactions. Interactions are often the
most interesting parts of experiments, and with factorial
experiments we can investigate them.
Factors A and B are said to interact if the effects of factor A depend
on the level of factor B (or the other way around).
[Figure 7.2: Two different scenarios of response in a factorial experiment. On the LHS factors A and B do not interact; on the RHS they do interact. Y is the response, A is a factor with 2 levels (a1 and a2), B is a factor with levels b1 and b2. The points indicate the mean response at a certain treatment (factor level combination).]
Consider Figure 7.2. This is a factorial experiment (although it
could also be illustrating results in a RBD, where B, with levels b1
and b2, would denote a blocking factor instead of a second
treatment factor). Remember, what we mean by effect: the change in
mean response relative to some baseline level. In ANOVA models,
effect usually refers to change in mean response relative to overall
mean. But, to understand interactions, I am here going to use effect
as the change in mean response relative to the baseline level (first
level of the factor).
In the left-hand plot, when changing from a1 to a2 at level b1, the
mean response increases by a certain amount. When changing from
a1 to a2 at level b2, that change in mean response is exactly the
same. In other words the effect of A (when changing from a1 to a2)
on the mean response does not depend on the level of B, i.e. A and
B do not interact.
In the right-hand plot, when changing from a1 to a2 at level b1, the
mean response increases, when changing from a1 to a2 at level b2,
the mean response decreases: the effect of A depends on the level
of B, i.e. A and B interact.
If there is no interaction, the lines will be approximately parallel
(remember random variation).
Note that interaction (between A and B) and independence of A and B are two completely different concepts! We have designed the experiment so that A and B are independent, yet they can interact. The interaction refers to what happens to the response at a certain combination of factor levels, not to whether A and B are correlated.

Constructing interaction plots: 1) if one of the variables is continuous (even if only measured at a few discrete points), it should go on the x-axis; 2) putting the variable with more levels on the x-axis makes the plot easier to interpret.
Interactions are interesting because they will indicate particularly good or bad combinations and how one factor affects the response in the presence of another factor; very often the response to one factor depends on what else you manipulate or keep constant.
[Margin figure: 3 × 3 grids of price (R55, R60, R65) by type of campaign (radio, web, newspaper), with the treatment combinations marked by x.]
If interactions are suspected to be present, factorial experiments are
much more efficient than one-factor-at-a-time experiments.
Consider again the campaign example above. If I investigated first
the one factor, then in a second experiment only the second factor, I
would need at least 12 experimental units (2 × 3 + 2 × 3, 2 replicates
per treatment), and probably twice the amount of time. On the
other hand, I could run a factorial experiment all at once, with a
minimum of 9 experimental units which would allow me to
estimate all main effects. I would need a minimum of 18
experimental units (two replicates for each treatment) to also
estimate interaction effects.
In a factorial experiment one can estimate main effects even if
treatments are not replicated. See Figure 7.3. On the LHS, I can
estimate the average response (number of items sold) with a web
campaign. This will give me the main effect of web campaign when
compared to the overall mean, i.e. on average, what happens with
web campaign. In other words, the main effects measure the
average change in response with the particular level, averaged over
all levels of the other factors, averaged over all levels of price in this
example.
Similarly, I can estimate the main effect of price R55, by taking the
average response with R55 over all levels of campaign type. This is
sometimes called hidden replication: even though the treatments are
not replicated, the levels of each factor are.
Can I estimate the interaction effects when the treatments are not
replicated? In an a × b factorial experiment, there are a × b
interaction effects, one for every treatment. The interaction effect
measures how different the mean response is relative to the sum of
the main effects (µ + αi + β j ).
Consider the RHS plot in Figure 7.3, and the typical model for a
factorial experiment with 2 treatment factors:
Yijk = µ + αi + β j + (αβ)ij + eijk
In order to estimate the interaction effect (αβ)ij , we need to
compare the mean response at this treatment to µ + αi + β j (the sum
of the main effects). But there is only one observation here, and we
need this observation to estimate eijk , i.e. the experimental unit and
interaction effect are confounded here. The only solution to this is
to have replication at that level. For example, if we want to estimate
the interaction effect of newspaper campaign with a price of R65,
we need to have replication at newspaper and R65 (and every other
campaign x price treatment).
One always needs to be able to estimate an error term
(experimental unit effect). If there is only one observation per
treatment, we need to assume that the effect of newspaper is the
same for all price levels. Then we can estimate an average (main)
effect for newspaper. But we cannot test or establish whether the
effect of newspaper is different in the different price levels.
If I want to estimate the effect of a particular combination of factor
levels, over and above the average effects (i.e. the interaction effect),
then I need replication at that combination of treatment levels.
Sums of Squares

Figure 7.3: Breakdown of the total sum of squares in a completely randomized factorial experiment with two treatment factors:

SStotal (abn − 1)
   = SStreatment (ab − 1) + SSerror (ab(n − 1))
SStreatment (ab − 1)
   = SSA (a − 1) + SSB (b − 1) + SSAB ((a − 1)(b − 1))
To understand the sums of squares and the corresponding degrees
of freedom, think in terms of the design that was used. For
example, Figure 7.3 shows the break-down of the total sum of
squares in a completely randomized factorial experiment. Exactly
sta2005s: design and analysis of experiments
like in a CRD, the total sum of squares is split into error and
treatment sums of squares. There are a × b treatments, thus ab − 1
treatment degrees of freedom. There are abn experimental units in
total (ab treatments, each replicated n times), thus abn − 1 error
degrees of freedom. Sums of squares for main effects are as before,
and the interaction degrees of freedom is just the rest (or again the
typical cross-tabulation degrees of freedom that you have come
across in the RBD and in the chi-squared test for independence).
The treatment mean is calculated as the average of the n
observations for treatment ij, as before, and the interaction effect is
estimated as

(αβ̂)ij = Ȳij· − (µ̂ + α̂i + β̂j) = Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···
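For a balanced design fitted with aov(), these estimated effects can be extracted directly; a minimal sketch with simulated data (the names dat, A, B and y are invented) is:

----------------------------------------------------------------------------
# Effects in a balanced two-factor factorial: model.tables() returns the
# estimated mu, alpha_i, beta_j and (alpha beta)_ij defined above.
set.seed(1)
dat <- expand.grid(A = factor(1:3), B = factor(1:2), rep = 1:2)
dat$y <- rnorm(nrow(dat), mean = 10)
fit <- aov(y ~ A * B, data = dat)
model.tables(fit, type = "effects")   # main and interaction effect estimates
model.tables(fit, type = "means")     # cell, row, column and grand means
----------------------------------------------------------------------------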
7.4
The Design of Factorial Experiments
Note that ‘factorial experiment’ refers to the treatment structure
and is not one of the basic experimental designs. Factorial
experiments can be conducted as any of the 3 designs we have seen
in earlier chapters.
Replication and Randomisation
1. To get a proper estimate of σ2 more than one observation must
be taken on each treatment - i.e. we must have replication. In
factorial experiments a replication is a replication of all
treatments (all factor level combinations).
2. The same number of units should be assigned to each treatment
combination. The total sum of squares can then be uniquely split
into components associated with the main effects of the factors
and their interactions. The sums of squares are independent.
This allows us to assess the effect of factor A, say, independently
of factor B, etc.. If unequal numbers of units are assigned to the
treatments, the simplicity of the ANOVA and interpretation
breaks down. There is no longer a unique split of the total sums
of squares into independent sums of squares associated with
each factor. The sum of squares for factor A will depend upon
whether or not factor B has been fitted. The conclusions that can
be drawn are not as clear as they are with a balanced design.
Unbalanced designs are difficult to analyse.
3. Factorial designs can generate a large number of treatments. If
factor A has 2 levels, B has 3 and C has 4, then there are 24
treatments. If three replications are made then 72 experimental
units will be needed. If there are not sufficient homogeneous
experimental units, they can be grouped into blocks. Each
replication could be made in a different block. Incomplete
factorial designs are available, in which only a carefully selected
number of treatments are used. See any of the recommended
texts for details.
Why are factorial experiments better than experimenting with one factor at a time?
Consider the following simple example: Suppose the yield of a
chemical process depends on 2 factors: (T) - the temperature at
which the reaction takes place and (P) - the pressure at which the
reaction takes place.
Suppose T has 2 levels T1 and T2, and P has 2 levels P1 and P2.
Suppose we experiment with one factor at a time. We would need
at least 3 observations to give us information on both factors. Thus
we would observe Y at T1 P1 and T2 P1 which would measure a
change in temperature only because the pressure is kept constant. If
we then observed Y at T1 P2 we could measure the effect of change
on pressure only. Thus we have the following design
                           Change temperature only →
Change pressure only ↓     T1P1    T2P1
                           T1P2

The results of the experiment could be tabulated as:

        T1     T2
P1     (1)    (2)
P2     (3)

where (1) represents the observation.
The effect of change of temperature is given by (2) - (1).
The effect of change of pressure is given by (3) - (1).
To find an estimate of the experimental error we must duplicate all
(1),(2) and (3). We then measure the effects of the factors by the
appropriate averages of the readings and also estimate σ2 by the
standard deviation. Hence for effects and error estimates we need
at least 6 readings.
For a factorial experiment with the above treatments, we consider
every treatment combination.
        T1     T2
P1     (1)    (2)
P2     (3)    (4)
i. The effect of change of temperature at P1 is given by (2) - (1).
ii. The effect of change of temperature at P2 is given by (4) - (3).
iii. The effect of change of pressure at T1 given by (3) - (1).
iv. The effect of change of pressure at T2 given by (4) - (2).
If there is no interaction, i.e. the effect of changing temperature
does not depend on the pressure level, then the estimates (i.) and
(ii.) only differ by experimental error and their average gives the
effect of temperature just as precisely as the duplicate observation
in (1) and (2) we needed in the one factor experiment. The same is
true for the pressure effect. Hence, if there is no interaction of
factors, we can obtain as much information with 4 observations in a
factorial experiment as we did with 6 observations varying only one
at a time. This is because all 4 observations are used to measure
each effect in the factorial experiment, whereas in the one-factor-at-a-time experiment only 2/3 of the observations are used to estimate each effect.
Suppose the factors interact. If we experiment with one factor at a
time we have the situation as shown above. We see that T1 P2 and
T2 P1 give higher yields than T1 P1 . Could we assume that T2 P2
would be better than both? This would be true if factors didn’t
interact. If they interact then T2 P2 might be very much better than
both T1 P2 and T2 P1 or it may be very much worse. The “one factor
at a time" experiments do not tell us this because we do not
experiment at the most favourable (or least favourable)
combination.
Performing Factorial Experiments
Suppose we investigate the effects of 2 factors, A and B where A
has a levels and B has b levels. The a × b treatments are arranged in
a factorial design and the design is replicated n times. The a × b × n
experimental units are assigned to the treatments as in the
completely randomised design with n units assigned to each
treatment.
Layout of the factorial experiment

         B1         B2         ...      Bb
A1     x x̄ x̃      x x̄ x̃      ...     x x̄ x̃
A2     x x̄ x̃      x x̄ x̃      ...     x x̄ x̃
 .        .           .          .         .
Aa     x x̄ x̃      x x̄ x̃      ...     x x̄ x̃

x   = observations at the 1st replicate
x̄   = observations at the 2nd replicate
x̃   = observations at the 3rd replicate, etc.
The entire design should be completed, then a second replicate made, etc. This is relatively easy to do in agricultural
experiments, since the replicates would be made simultaneously on
different plots of land. In chemical, industrial or psychological
experiments there is a tendency for a treatment combination to be
set up and a number of observations made, before passing on to the
next treatment. This is not a replication and if the experiment is
analysed as though it were, the estimate of the error variance will
be too small. Performed this way the observations within a
treatment are correlated and a different analysis should be used.
See Winer (pg. 391–394) for details.
7.5 Interaction
Consider the true means

                       Factor B
Factor A    B1    B2    . . .    Bj    . . .    Bb
A1
A2
 .
Ai                               µij            µi·
 .
Aa
                                 µ·j            µ··

where
µij  = true mean of the (ij)th treatment combination
µ·j  = true mean of the jth level of B
µi·  = true mean of the ith level of A
µ··  = true overall mean
Then

Main effect of the ith level of A                     = µi· − µ··        (7.1)
Effect of the ith level of A at the jth level of B    = µij − µ·j        (7.2)

If there is no interaction, Ai will have the same effect at each level of B. If there is interaction it can be measured by the difference (7.2) − (7.1),

µij − µ·j − µi· + µ··  =  (αβ)ij        (7.3)
The same formula for the AiBj interaction would be found if we started with the main effect for the jth level of B (µ·j − µ··) and compared it with the effect of Bj at the ith level of A, µij − µi·.
From equation (7.3) we see that the interaction involves every cell in the table, so if any cells are empty (i.e. no observations on that treatment) the interaction cannot be found.
In practice we have random error as well. Replacing the true means in (7.3) by their sample means we estimate the interaction by

Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···        (7.4)
Interpretation of results of factorial experiments
No interaction
Very rarely, if we a-priori do not expect the presence of any
interactions, we would fit the following model:
Yijk = µ + αi + β j + eijk
Interpretation is very simple. No interaction means that factors act
on the response independently of each other. Apart from random
variation, the difference between observations corresponding to any
level of A is the same for all levels of B and vice versa. The main
effects summarise the whole experiment.
Interaction present
Most often, interactions are one of the main points of interest in
factorial experiments, and we would fit the following model:
Yijk = µ + αi + β j + (αβ)ij + eijk
The plots of the means of B at each level of A may interweave or
diverge since the mean of B depends on what level of A is present.
If interactions are present, main effects estimate the average effect
(averaged over all levels of the other factor). For example, if α1 > α2 we can only say that, averaged over B, level 1 of A gives a larger response than level 2; for some levels of B the order may be reversed. Interaction plots are very useful when
interpreting results in the presence of interaction effects. These
plots can give a good indication as to patterns and could be used to
answer some of the following or similar questions.
• What is the best/worst combination?
• How does the effect of A change with increasing levels of B?
• Which levels are always better than other levels, regardless of the
level of B?
• What is the effect of A at low/medium/high levels of B?
Some of these questions should be part of your a-priori hypothesis.
In statistical reports, however, plots are not enough. For a report on
any of the questions, we would rephrase them in the form of a
contrast and give a confidence interval to back up our statement.
Sometimes a large interaction indicates a non-linear relationship between the response and the treatment factors. In this case a non-linear transformation of the response variable (e.g. a log-transformation) may produce a model with no interactions. Such transformations are called power transformations.
Power transformations
These are of the form

Z = (y^λ − 1)/λ    for λ ≠ 0
Z = log(y)         for λ = 0

Special cases of these include the square root transformation and the log transformation. A log transformation of the observations means that we really have a multiplicative model

Yijk = (e^µ)(e^αi)(e^βj)(e^eijk)

instead of the linear model Yijk = µ + αi + βj + eijk.
If the data are transformed for analysis, then all inferences such as mean differences and confidence intervals are calculated from the transformed values. Afterwards these quantities are back-transformed and the results expressed in terms of the original data.
The value of λ has to be found by trial and error or can be estimated by maximum likelihood. It can sometimes make the experiment difficult to interpret. Alternatively, if the interaction is very large, the data can be analysed as a one-way layout with (nb) observations per treatment, or as a completely randomised design with (ab) treatments and n observations per treatment. When experimenting with more than 2 factors, higher order interactions
may be present, for example a 3 factor interaction or 4 factor
interaction. Higher order interactions are difficult to interpret. A
direct interpretation in terms of interactions is rarely enlightening.
A good discussion of principles to follow when higher order
interactions are present is given by Cox (1984). If higher order
interactions are present he recommends attempting one or more of
the following approaches:
1. transformation of the response;
2. fitting a non-linear model rather than a linear model;
3. abandoning the factorial representation of the treatments in favour of a few possibly distinctive factor combinations (this will in effect group together certain cells in the table);
4. splitting the factor combinations on the basis of one or more factors, e.g. considering AB for each level of C;
5. adopting a new system of factors for the description of the treatment combinations.
7.7 Analysis of a 2-factor experiment
Let Yijk be the kth observation on the (ij)th treatment combination.
The full model is:
Yijk = µ + αi + βj + (αβ)ij + eijk,    i = 1, . . . , a;  j = 1, . . . , b;  k = 1, . . . , n

∑i αi = ∑j βj = ∑i (αβ)ij = ∑j (αβ)ij = 0

where
µ        = general mean
αi       = main effect of the ith level of A
βj       = main effect of the jth level of B
(αβ)ij   = interaction between the ith level of A and the jth level of B.
Note that (αβ) is a single symbol and does not mean that the
interaction is a product of the two main effects.
The “sum to zero" constraints are defined as part of the model, and ensure that the parameter estimates subject to these constraints are unique. Other commonly used constraints are the “corner-point" constraints α1 = β1 = (αβ)1j = (αβ)i1 = 0. Again these estimates are unique subject to these constraints, but different from those given by the “sum to zero" constraints.
The maximum likelihood/least squares estimates are found by
minimizing
S = ∑ijk (Yijk − µ − αi − βj − (αβ)ij)²        (7.5)
Differentiating with respect to each of the (ab + a + b + 1) parameters and setting the derivatives equal to zero gives

∂S/∂µ       = −2 ∑ijk (Yijk − µ − αi − βj − (αβ)ij) = 0
∂S/∂αi      = −2 ∑jk (Yijk − µ − αi − βj − (αβ)ij) = 0      i = 1, . . . , a
∂S/∂βj      = −2 ∑ik (Yijk − µ − αi − βj − (αβ)ij) = 0      j = 1, . . . , b
∂S/∂(αβ)ij  = −2 ∑k (Yijk − µ − αi − βj − (αβ)ij) = 0       i = 1, . . . , a;  j = 1, . . . , b
Using the side conditions the normal equations are

abnµ                        = Y···
bnµ + bnαi                  = Yi··        i = 1, . . . , a
anµ + anβj                  = Y·j·        j = 1, . . . , b
nµ + nαi + nβj + n(αβ)ij    = Yij·        i = 1, . . . , a;  j = 1, . . . , b
The solutions to these equations are the least squares estimates

µ̂        = Ȳ···
α̂i       = Ȳi·· − Ȳ···                               i = 1, . . . , a
β̂j       = Ȳ·j· − Ȳ···                               j = 1, . . . , b
(αβ̂)ij   = Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···                 i = 1, . . . , a;  j = 1, . . . , b
An unbiased estimator of σ² is given by

s² = ∑ijk (Yijk − µ̂ − α̂i − β̂j − (αβ̂)ij)² / (ab(n − 1)) = ∑ijk (Yijk − Ȳij·)² / (ab(n − 1))        (7.6)

Note that s² is obtained by pooling the within-cell variances and could be written as

s² = ∑ij (n − 1)s²ij / (ab(n − 1))

where s²ij is the estimated variance in the (ij)th cell.
7.8 Testing Hypotheses
When interpreting an ANOVA table, one should always start with
the highest order interactions. If strong interaction effects are
present, interpreting main effects needs to consider this. For
example if there is no evidence for main effects of factor A, this
DOES NOT mean that factor A does not affect the response.
• HAB : (αβ)ij = 0 for all i and j        (Factors A and B do not interact)
• HA : αi = 0,  i = 1, . . . , a          (Factor A has no effect)
• HB : βj = 0,  j = 1, . . . , b          (Factor B has no effect)
The alternative hypothesis is, in each case, at least one of the
parameters in H is non-zero. The F–test for each of these
hypotheses effectively compares the full model to one of the three
reduced models:
1. Yijk = µ + αi + β j + (αβ)ij + eijk
2. Yijk = µ + αi + β j + eijk which is called the additive model.
3. Yijk = µ + αi + eijk which omits effects due to B.
4. Yijk = µ + β j + eijk which omits effects due to A.
The residual sum of squares from the full model provides the sum
of squares for error s2 with ab(n − 1) degrees of freedom. Denote
this sum of squares by SSE. To obtain the appropriate sums of
squares for each of the three hypotheses, we could obtain the
residual sums of squares from each of the models.
To test HAB : (αβ)ij = 0 we find µ̂, α̂i and β̂j to minimize

S = ∑ijk (Yijk − µ − αi − βj)²        (7.7)

Equating the derivatives to zero,

∂S/∂µ = −2 ∑ijk (Yijk − µ − αi − βj) = 0

and using the ‘sum to zero’ constraint gives µ̂ = Ȳ···.
From ∂S/∂αi = −2 ∑jk (Yijk − µ − αi − βj) = 0 for i = 1, . . . , a, we find

α̂i = Ȳi·· − Ȳ···,    µ̂ + α̂i = Ȳi··,    and similarly    β̂j = Ȳ·j· − Ȳ···
Note that the least squares estimates for µ, αi and β j are the same
as under the full model. This is because the X’X matrix is
orthogonal (or equivalently block-diagonal). This will not be the
case if the numbers of observations per treatment differ
(unbalanced designs). The residual sum of squares under HAB is

SSres = ∑i ∑j ∑k (Yijk − µ̂ − α̂i − β̂j)²        (7.8)
The numerator sum of squares for the F test of HAB is given by the difference between (7.8) and the residual sum of squares from the full model. Regrouping the terms of (7.5) as

∑ijk ((Yijk − µ̂ − α̂i − β̂j) − (αβ̂)ij)²        (7.9)

which can be written, after squaring and summing, as

∑ijk (Yijk − µ̂ − α̂i − β̂j)² − n ∑ij (αβ̂)²ij        (7.10)

since the cross-product terms are zero in summation. From (7.8) and (7.10) we see that the numerator sum of squares for the F test is

SSAB = n ∑ij (αβ̂)²ij = n ∑ij (Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···)²

and SSAB has (a − 1)(b − 1) degrees of freedom.
Hence the F test of
HAB0 : all (αβ)ij = 0    versus    HAB1 : at least one interaction is non-zero

is made using

MSAB / MSE ∼ F(a−1)(b−1), ab(n−1)

where MSAB = SSAB / ((a − 1)(b − 1)) and MSE = SSE / (ab(n − 1)).
Similar results can be derived for the test of H A and HB . Because of
the orthogonality of the design, the estimates of the main effects
under the reduced models are the same as under the full model.
Hence we can split the total sum of squares about the grand mean
SStotal = ∑ijk (Yijk − µ̂)² = ∑ijk (Yijk − Ȳ···)²

uniquely as

SStotal = SSA + SSB + SSAB + SSE

and the degrees of freedom as

abn − 1 = (a − 1) + (b − 1) + (a − 1)(b − 1) + ab(n − 1)
The results are summarised in an Analysis of Variance Table (Table 7.1: Analysis of variance table for a two-factor factorial experiment).

Source            SS                                              df               MS      F           Expected Mean Square
A Main Effects    SSA  = nb ∑i (Ȳi·· − Ȳ···)²                     a − 1            MSA     MSA/MSE     σ² + nb/(a − 1) ∑i αi²
B Main Effects    SSB  = na ∑j (Ȳ·j· − Ȳ···)²                     b − 1            MSB     MSB/MSE     σ² + na/(b − 1) ∑j βj²
AB Interactions   SSAB = n ∑ij (Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···)²       (a − 1)(b − 1)   MSAB    MSAB/MSE    σ² + n/((a − 1)(b − 1)) ∑ij (αβ)²ij
Error             SSE  = ∑ijk (Yijk − Ȳij·)²                      ab(n − 1)        MSE                 σ²
Total             SStotal = ∑ijk (Yijk − Ȳ···)²                   abn − 1
The expected mean squares are found by replacing the observations
in the sums of squares by their expected values under the full
model, dividing by the degrees of freedom and adding σ². For example,

SSA = nb ∑i (Ȳi·· − Ȳ···)²

Now (Ȳi·· − Ȳ···) = α̂i and E(α̂i) = αi, since the least squares estimates are unbiased. Hence

E(MSA) = σ² + nb/(a − 1) ∑i αi²
7.9 Power analysis and sample size

The non-centrality parameters for the F-tests are:

HA :   λ = nb ∑i αi² / σ²        and the non-central F has (a − 1) and ab(n − 1) df
HB :   λ = na ∑j βj² / σ²        and the non-central F has (b − 1) and ab(n − 1) df
HAB :  λ = n ∑ij (αβ)²ij / σ²    and the non-central F has (a − 1)(b − 1) and ab(n − 1) df.
The non-centrality parameters can be used to determine the
number of replicates necessary to achieve a given power for certain
specified configurations of the parameters. The error degrees of
freedom is ab(n-1) where a = number of levels of A and b is the
number of levels of B. As a rough rule of thumb, we should aim to
have enough replicates to give about 20 degrees of freedom for
error. In higher-way layouts, where some of the interactions may be
zero, we can allow even fewer degrees of freedom for the error. In
practice, the number of replications possible is often determined by
the amount of experimental material, and the time and resources of
the experimenter. Nonetheless, a few power calculations are
helpful, especially if the F test fails to reject the null hypothesis. The
reason for this could be due to the F test having insufficient power.
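As a rough sketch (with made-up values for a, b, σ and the smallest meaningful difference D between two levels of A), such a calculation can be done exactly as in Chapter 6:

----------------------------------------------------------------------------
# Power of the F test for the main effect of A in an a x b factorial (CRD)
# with n replicates per treatment; all numerical values below are invented.
a <- 3; b <- 2; sigma <- 1.6; D <- 2; alpha <- 0.05
for (n in 2:6) {
  df1 <- a - 1
  df2 <- a * b * (n - 1)
  ncp <- n * b * D^2 / (2 * sigma^2)   # smallest lambda given a difference D
  fcrit <- qf(alpha, df1, df2, lower.tail = FALSE)
  cat(n, 1 - pf(fcrit, df1, df2, ncp), "\n")
}
----------------------------------------------------------------------------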
7.10 Multiple Comparisons for Factorial Experiments
If the interactions were significant it makes sense to compare the ab
cell means (treatment combinations). If no interactions were found,
one can compare the levels of A and the levels of B separately. If
the treatment levels are ordered, it is preferable to test effects using
orthogonal polynomials. Both the main effects and the interactions
can be decomposed into orthogonal polynomial contrasts.
Questions:
1. On how many observations is each of the cell means based?
What is the standard error for the difference between two cell
means?
2. If we compare levels of factor A only, on how many observations
are the means based? What is the standard error for a difference
between two means now?
3. What are the degrees of freedom in the two cases above?
7.11 Higher Way Layouts
For p factors we have

(p choose 1) = p main effects A, B, C, etc.
(p choose 2) 2-factor interactions
(p choose 3) 3-factor interactions
. . .
(p choose p) = 1 p-factor interaction
If there are n > 1 observations per cell we can split SStotal into 2^p − 1 component SS's and an error SS. If p > 4 or 5, factorial experiments are very difficult to carry out.
7.12 Examples
Example 2
A small experiment has been carried out investigating the response
of a particular crop to two nutrients, N and K. A completely
randomised design was used with six treatments arranged in a 3 × 2
factorial structure. N was applied in 0, 4 and 8 units, whilst K was
applied in 0 and 4 units.
The yields were as follows (three replicate yields per N × K combination):

           K0                        K4
N0     10.02  11.74  13.27      10.72  14.08  11.87
N4     20.65  18.88  16.92      19.33  20.77  21.70
N8     19.47  20.06  20.74      21.45  20.92  24.87
            Df  Sum Sq  Mean Sq  F value  Pr(>F)
N            2  298.19   149.09    57.84  0.0000
K            1   10.83    10.83     4.20  0.0629
N:K          2    2.49     1.24     0.48  0.6286
Residuals   12   30.93     2.58
There is no evidence for an interaction between N and K (p = 0.63).
There is strong evidence that different N (nitrogen) levels affect
yield (p < 0.0001), but only weak evidence that yield differed with different levels of K (potassium) (p = 0.063).
Table 7.2: ANOVA table for the nitrogen-potassium factorial experiment.
Note that the levels of nitrogen are equally spaced. As a next step
we could fit orthogonal polynomials to see if the relationship
between yield and nitrogen is quadratic (levels off), as perhaps
suggested by the interaction plot. From the interaction plot it seems
that perhaps the effect of K increases with increasing N, but the
differences are too small to say anything with certainty (about K)
from this experiment.
[Figure 7.4: Interaction plot for the nitrogen and potassium factorial experiment: mean yield against nitrogen level (0, 4, 8), with separate lines for the two potassium levels.]
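A hedged sketch of this analysis in R, assuming the data are entered in long format (the object and variable names npk.dat, yield, N and K are invented):

----------------------------------------------------------------------------
# Two-factor factorial analysis of the nitrogen-potassium example.
npk.dat <- data.frame(
  N = factor(rep(c(0, 4, 8), each = 6)),
  K = factor(rep(rep(c(0, 4), each = 3), times = 3)),
  yield = c(10.02, 11.74, 13.27, 10.72, 14.08, 11.87,
            20.65, 18.88, 16.92, 19.33, 20.77, 21.70,
            19.47, 20.06, 20.74, 21.45, 20.92, 24.87)
)
summary(aov(yield ~ N * K, data = npk.dat))    # should reproduce Table 7.2
with(npk.dat, interaction.plot(N, K, yield))   # interaction plot as in Figure 7.4
----------------------------------------------------------------------------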
Example 1: Bond Strength cont.
Return to the bond strength example from the beginning of the
chapter.
Model

Yijk = µ + αi + βj + (αβ)ij + eijk

∑i αi = ∑j βj = ∑i (αβ)ij = ∑j (αβ)ij = 0
Analysis of Bond Strength of Glass-Glass Assembly
Cell Means and (Standard Deviations)

Adhesive     Cross-lap      Square-Centre     Round-Centre
047          17.2 (2.2)     18.0 (3.5)        16.8 (3.3)
00T          20.6 (1.81)    18.8 (4.6)        24.6 (2.9)
001          22.4 (6.4)     21.8 (7.1)        16.4 (2.0)
Cross-Lap and Square-Centre assembly with adhesive 001 appear to
be more variable than the other treatments — but this is not
significant:
A modern robust test for the homogeneity of variances across
groups is Levene’s test. It is based on absolute deviations from the
group medians. It is available in R as leveneTest from package car.
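A sketch of the call that might produce output like that shown below, assuming the data are in a data frame (here called bond) with response strength and factors adhesive and assembly:

library(car)
# Levene's test on the 9 adhesive x assembly groups, using absolute deviations
# from the group medians
leveneTest(strength ~ adhesive * assembly, data = bond)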
-----------------------------------------------------------
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  8  1.0689  0.406
      36
-----------------------------------------------------------
The null hypothesis is that all variances are equal; with p = 0.41
there is no evidence against this, i.e. no significant differences
between the variances. Here is the ANOVA table:
---------------------------------------------------
                  Df Sum Sq Mean Sq F value  Pr(>F)
adhesive           2 127.51  63.756  3.6432 0.03623
assembly           2   4.98   2.489  0.1422 0.86791
adhesive:assembly  4 196.09  49.022  2.8013 0.04015
Residuals         36 630.00  17.500
---------------------------------------------------
Adhesive and assembly method interact in their effects on strength
(p = 0.04). So we look no further at the main effects but instead
look at the interaction plots (Figure 7.5) to give us an idea of how
these factors interact to influence strength.
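For reference, a minimal sketch of the model fit and interaction plots in R (the data frame and variable names are assumed, as above):

fit <- aov(strength ~ adhesive * assembly, data = bond)
summary(fit)

# Interaction plots of the cell means, traced both ways
with(bond, interaction.plot(adhesive, assembly, strength))   # adhesive on the x-axis
with(bond, interaction.plot(assembly, adhesive, strength))   # assembly on the x-axis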
Figure 7.5: These interaction plots show (a) the mean bond strength
for adhesive at different levels of assembly, and (b) the mean bond
strength for assembly at different levels of adhesive.
The round-centre assembly method works well with adhesive 00T
(best of all combinations), but relatively poorly with the other two
adhesives. The other two assembly methods seem to work best with
adhesive 001, intermediate with 00T and worst with 047. We could
now do some post-hoc tests (with corrections for multiple testing)
to see for example whether the two assembly methods
(square-centre and cross-lap) make any difference.
Questions:
1. What experimental design do you think has been used?
2. Refer back to the ANOVA table, to the assembly line. Does the
large p-value imply that assembly method has no effect on
strength? With the help of the interaction plots briefly explain.
Example 3
A hypothetical experiment from Dean and Voss (1999): In 1994 the
Statistics Department at a university introduced a self-paced
version of a data analysis course that is usually taught by lecture.
Suppose the department is interested in student performance with
each of the 2 methods, and also how student performance is
affected by the particular instructor teaching the course. The
students are randomly assigned to one of the six treatments.
Figure 7.6 shows different hypothetical interaction plots that could
result from this study. The y-axis shows student performance, the
x-axis shows instructor. In which of these do method and instructor
interact?
Figure 7.6: Possible configurations
of effects present for two factors,
Instructor and Teaching Method.
8 Some other experimental designs and their models
8.1 Fixed and Random effects
So far we have assumed that the treatments used in an experiment
are the only ones of interest. We aimed to estimate the treatment
means and to compare differences between treatments. If the
experiment were to be repeated, the same treatments would be
used again. This means that each factor used in defining the
treatments would have the same levels. When this is the case we
say that the treatments or the factors defining them are fixed. The
model is referred to as a fixed effects model. The simplest fixed
effects model is the completely randomised design, or one-way
lay-out, in which a treatments are compared. N experimental units
are randomly assigned to the a treatments, usually with n subjects
per treatment. The jth observation on the ith treatment then has the
structure
Yij = µ + αi + eij,    i = 1, . . . , a;   j = 1, . . . , n,    Σi αi = 0        (8.1)

where
  µ    =  general mean
  αi   =  effect of treatment i
  eij  =  random error such that eij ∼ N(0, σ²)
Example
The Department of Clinical Chemistry was interested in comparing
the measurements of cholesterol made by 4 different laboratories in
the Western Cape. Since the Department was only interested in
these four laboratories, if they decided to repeat the experiment,
sta2005s: design and analysis of experiments
they would send samples to the same four laboratories. Ten
samples of blood were taken and each sample was divided into
four parts, and one part was sent to each laboratory. The
determinations of the cholesterol levels were returned to the
Department and the results analysed using a one-way analysis of
variance (model (8.1)). Here, the parameter αi measured the effect
of the ith laboratory. Significant differences between the αi ’s would
mean that some laboratories tended to return, on average, higher
values of cholesterol than others.
Now consider the other situation. Suppose the Department believes
that the determination of cholesterol levels varies from laboratory
to laboratory about some mean value and they want to measure the
amount of variation. Now they are not interested in any particular
laboratory, so they select 4 laboratories at random from a list of all
laboratories that perform such analyses, and send each laboratory
ten samples as before.
Now if this experiment were repeated, there is very little chance of
the same four laboratories being used so, the effect of the laboratory
is random. We now write a model for the jth determination from the
ith laboratory as:
Yij = µ + ai + eij        (8.2)
Here we assume that ai is a random variable such that E( ai ) = 0
and Var ( ai ) = σa2 , where σa2 measures the component in the
variance of Yij that is due to laboratory. We also assume that ai is
independent of eij . The term eij has E(eij ) = 0 and Var (eij ) = σe2 .
Hence

E(Yij) = µ   and   Var(Yij) = σa² + σe²,    i = 1, . . . , a;   j = 1, . . . , b
This is called the random effects model or the variance components
model. To distinguish between a random or fixed factor, ask
yourself this question: If the experiment were repeated, would I
observe the same levels of A? If the answer is Yes, then A is fixed. If
No, then A is random.
Note that the means of the observations do not have a structure. In
more complex situations it is possible to formulate models in which
some effects are fixed and others are random. In this case a
structure would be defined for the mean. For a two-way
classification with A fixed and B random
Yij = µ + αiA + bj + eij        (8.3)

where αiA is the fixed effect of A, and bj is the value of a random
variable with E(bj) = 0 and Var(bj) = σb², due to the random effect
of B. Model (8.3) is called a mixed effects model.
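As an illustrative sketch only (the data frame and variable names are assumed), the distinction also shows up in how the model would be specified in R: a fixed laboratory effect enters the formula as an ordinary factor, while a random laboratory effect can be declared as a random intercept, e.g. with the lme4 package.

library(lme4)

# Laboratories fixed (model 8.1): interest is in these particular labs
fixed_fit <- aov(cholesterol ~ lab, data = chol)

# Laboratories random (model 8.2): interest is in the variance component sigma_a^2
random_fit <- lmer(cholesterol ~ 1 + (1 | lab), data = chol)
VarCorr(random_fit)   # REML estimates of the lab and error variance components

Note that lmer gives REML estimates, which need not equal the method-of-moments (ANOVA) estimators discussed below.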
8.2 The Random Effects Model
Assume that the levels of factor A are a random sample of size a
from a large population of levels. Assume n observations are made
on each level of A. Let Yij be the jth observation on the ith level of
A, then
Yij = µ + ai + eij

where
  µ             is the general mean
  ai            is a random variable with E(ai) = 0 and Var(ai) = σa²
  eij           is a random variable with E(eij) = 0 and Var(eij) = σe²
  ai and eij    are uncorrelated.
Further it is usually assumed that ai and eij are normally
distributed. The Analysis of Variance table is set up as for the
one-way fixed effects model.
To test hypotheses about σa2 or to calculate confidence intervals, we
need
• an unbiased estimate of σe2
• an unbiased estimate of σa2
The fixed and random effects models are very similar and we can
show that the fixed effects MSE provides an unbiased estimate of
σe², i.e. E(MSE) = σe² (the proof is left as an exercise).
Does the fixed-effects mean square for treatments, MS A , provide an
unbiased estimate for σa2 ? Not quite! However, we can show that
MS A is an unbiased estimator for nσa2 + σe2 , i.e. E( MS A ) = nσa2 + σe2 .
Then

E[ (MSA − MSE) / n ] = σa²
NOTE: this estimator can be negative, even though σa² cannot be
negative. This will happen when MSE > MSA, and is most likely to
happen when σa² is close to zero. If MSE is considerably greater
than MSA, the model should be questioned.
Testing H0 : σa2 = 0 versus Ha : σa2 > 0
Can we use the same test as we used in the fixed effects model for
testing the equality of treatment effects?
If H0 is true, then σa² = 0

=⇒  E(MSA) = 0 + σe² = σe² = E(MSE)

=⇒  E(MSA) / E(MSE) = 1
However, if σa2 is large, the expected value of the numerator is larger
than the expected value of the denominator and the ratio should be
large and positive, which is a similar situation to the fixed effects
case.
Table 8.1: ANOVA for the simple random effects model, with expected mean squares (EMS):

Source                  SS                      df      Mean Square        F          EMS
between groups          Σi ni (Ȳi· − Ȳ··)²      a − 1   MSA = SSA/(a−1)    MSA/MSe    nσa² + σe²
within groups (error)   Σi Σj (Yij − Ȳi·)²      N − a   MSe = SSe/(N−a)               σe²
total                   Σi Σj (Yij − Ȳ··)²      N − 1

Does MSA/MSe ∼ Fa−1; N−a under H0? We can show (see exercise below) that

SSA / (nσa² + σe²) ∼ χ²a−1

and that

SSE / σe² ∼ χ²N−a

and that SSA and SSE are independent (Cochran's Theorem). Under H0, σa² = 0 and

[MSA / (nσa² + σe²)] / [MSe / σe²] = MSA / MSe ∼ Fa−1, N−a

Under H1, MSA/MSE has a non-central F distribution.

Exercise:
1. Show that Cov(Yis, Yit) = σa², where Yis and Yit are two observations in group i.

2. Use the above to show that (for the simple random effects model) Var(Ȳi·) = σa² + σe²/ni. This implies that the observed variance between group means does not directly estimate σa².

3. Hence show that SSA / (nσa² + σe²) ∼ χ²a−1. [Hint: Consider the distribution of (Ȳi· − Ȳ··)².]
Expected Mean Squares for the Random Effects Model

SSE = Σi Σj Yij² − Σi ni Ȳi·²

Now for any random variable X, Var(X) = E(X²) − [E(X)]². Then

E[Yij²] = Var(Yij) + [E(Yij)]² = σa² + σe² + µ²

Ȳi· = µ + ai + (1/ni) Σj eij,   so   E(Ȳi·) = µ,   Var(Ȳi·) = σa² + σe²/ni,   E(Ȳi·²) = σa² + σe²/ni + µ²

E(SSE) = Σi Σj (σa² + σe² + µ²) − Σi ni (σa² + σe²/ni + µ²)
       = Nσe² − νσe²
       = (N − ν) σe²

E(MSE) = σe²

Above, N = Σ ni, and ν is the number of random effect levels.

SSA = Σi ni Ȳi·² − N Ȳ··²
Ȳ·· = µ + (1/N) Σi ni ai + (1/N) Σi Σj eij

E(Ȳ··) = µ,   Var(Ȳ··) = (Σi ni²/N²) σa² + σe²/N

E(Ȳi·) = µ,   Var(Ȳi·) = σa² + σe²/ni

E(SSA) = Σi ni (σa² + σe²/ni + µ²) − N [ (Σi ni²/N²) σa² + σe²/N + µ² ]
       = [ N − (Σi ni²)/N ] σa² + (ν − 1) σe²

E(MSA) = c σa² + σe²,   where c = [ N − (Σi ni²)/N ] / (ν − 1)

Thus

E[ (MSA − MSE) / c ] = σa²

If all ni = n then c = n.
Rather than testing whether or not the variance of the population of
treatment effects is zero, one may want to test whether the variance
is equal to (or less than) some proportion of the error variance, i.e.

H0 : σa²/σe² = θ0,   for some constant θ0.

We can use the same F statistic, but reject H0 if F > (1 + nθ0) Fa−1; N−a; α.
Variance Components

Usually estimation of the variance components is of greater interest
than the tests. We have already shown that

σ̂a² = (MSA − MSE)/n    and    σ̂e² = MSE
An ice cream experiment
To determine whether or not flavours of ice cream melt at different
speeds, a random sample of three flavours were selected from a large
population of flavours. The three flavours of ice cream were stored
in the same freezer in similar-sized containers. For each observation
one teaspoonful was taken from the freezer, transferred to a plate,
and the melting time at room temperature was observed to the
nearest second. Eleven observations were taken on each flavour:
Flavour    Melting Times (sec)
1           924   994   876   960  1150   889  1053   967  1041   838  1037
2          1125   817  1075  1032  1066   844   977   841   886   785  1093
3           891   846   982   840  1041   848  1135   848  1019   832   823
Anova Table:

Source     df    SS             MS            F       p
Flavour     2    173 009.8788   86 504.9394   12.76   0.0001
Error      30    203 456.1818     6 781.8727
Total      32    376 466.0306
An unbiased estimate of σe² is σ̂e² = 6 781.8727 sec². An unbiased
estimate of σa² is

σ̂a² = (MSA − MSE)/n = (86 504.9394 − 6 781.8727)/11 = 7 247.5515 sec²
H0 : σa² = 0 vs Ha : σa² > 0 can be tested using F = 12.76, p = 0.0001,
where the p-value comes from the F2;30 distribution.
In such an experiment there will be a lot of error variability in the
data due to fluctuations of room temperature and the difficulty of
determining the exact time at which the ice cream has melted
completely. Hence variability in melting times of different flavours
(σa2 ) is unlikely to be of interest unless it is larger than the error
variance:
H0 : σa² ≤ σe² vs Ha : σa² > σe²,  equivalently  H0 : σa²/σe² ≤ 1 vs Ha : σa²/σe² > 1.

Again use F = 12.76, but now compare it with (1 + 11 × 1) F2;30 = 12 F2;30
(≈ 12 × 3.32 = 39.8 at the 5% level)
=⇒ there is no evidence that the variation between flavours is larger
than the error variance.
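A small R sketch of these calculations, using only the mean squares already given in the ANOVA table above:

msa <- 86504.9394   # mean square for flavour
mse <- 6781.8727    # error mean square
n   <- 11           # observations per flavour

sigma2_e <- mse                  # unbiased estimate of the error variance
sigma2_a <- (msa - mse) / n      # method-of-moments estimate of the flavour variance

# Test H0: sigma_a^2 <= sigma_e^2 (theta_0 = 1) at the 5% level
F_obs  <- msa / mse                        # 12.76
F_crit <- (1 + n * 1) * qf(0.95, 2, 30)    # (1 + n*theta_0) * F_{2,30; 0.05}, about 39.8
F_obs > F_crit                             # FALSE: no evidence that sigma_a^2 exceeds sigma_e^2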
8.3 Nested Designs
Nested designs are common in sampling designs, and less common
for real experiments. However, many of the principles of analysis
and ANOVA still apply to such carefully designed studies; for
example, the data are still balanced.
Nested designs that are more like real experiments occur in animal
breeding studies, e.g. where a bull is crossed with a number of
cows, but the cows are nested in bull. Also in microbiology, where
you can have daughter clones from mother clones (bacteria or
fungi).
Suppose a study is conducted to compare the domestic
consumption of electricity in three cities. In each city three streets
are selected at random and the annual amount of electricity used in
three randomly selected houses in each street is recorded. This is a
sampling design.
          City 1                     City 2                     City 3
        S1      S2      S3        S1      S2      S3        S1      S2      S3
H1     8876    9141    9785      9483   10049    9975      9990   10023   10451
H2     8745    9827    9585      8461    9720   11230      9218   10777   11094
H3     8601   10420    9009      8106   12080   10877      9472   11839   10287
At first glance these data appear to be a 3-way cross-classification
(analogous to a 3-factor factorial experiment) with factor cities (C);
streets (S) and houses (H). However, note the crucial differences:
Street 1 in city 1 is different from Street 1 in city 2, and from Street
1 in City 3. So even though they have been given the same label,
they are in fact quite different.
To be precise we should really label the streets as a factor with 9
levels, S1, S2, ... , S9 since there are 9 different streets. The same
remarks apply to the houses. We say that the streets are nested in
the cities, since to locate a street we must also state the cities and
likewise, the houses are nested in the streets. We denote nested
factors as S(C). The effects associated with the factor will have 2
subscripts, bij , where i denotes the level of C and j the levels
Clearly we need another model, since the factor S (street) is nested in
the cities, and the factor H (house) is nested in the streets. Also, since
the streets and houses within each city were sampled, if the study were
repeated the same houses and streets would not be selected again
(assuming there is a large number of streets and houses in each
city). So both S and H are random factors. If Yijk is the amount of
electricity consumed by the kth household in the jth street in the ith
city, then

Yijk = µ + αi^city + bij^street + eijk^house
i = 1, 2, 3;    j = 1, 2, 3;    k = 1, 2, 3
where
• αi is the fixed effect of the cities, ∑ αi = 0
• bij is the random effect of the jth street in city i
• E(bij ) = 0, Var (bij ) = σb2
• eijk is the random effect of the kth house in the jth street in the ith
city
• E(eijk ) = 0, Var (eijk ) = σe2
• bij and eijk are independent.
Note that:
E(Yijk ) = µ + αi and Var (Yijk ) = σb2 + σe2
The aim of the analysis is:
1. To estimate the mean consumption in each city, and to compare
the mean consumptions
2. To estimate the variance component due to streets and to
households within streets.
We assume that these components are constant over the three cities.
Ȳ···              estimates µ
Ȳi·· − Ȳ···       estimates αi, i = 1, . . . , a
Ȳij· − Ȳi··       measures the contribution from the jth street in city i
Ȳijk − Ȳij·       measures the contribution from the kth house in the jth street in the ith city
We can construct expressions for the ANOVA table from the
identity
(Yijk − Ȳ··· ) = (Ȳi·· − Ȳ··· ) + (Ȳij· − Ȳi·· ) + (Ȳijk − Ȳij· )
Squaring and summing over ijk, the cross products vanish on
summation and we find the sums of squares are
Σijk (Yijk − Ȳ···)² = nb Σi (Ȳi·· − Ȳ···)² + n Σi Σj (Ȳij· − Ȳi··)² + Σijk (Yijk − Ȳij·)²
We denote these sums of squares as SStotal = SSC + SSS(C) + SSE
and they have abn − 1 = ( a − 1) + a(b − 1) + ab(n − 1) degrees of
freedom.
Here we assume that there are a cities, b streets are sampled in each
city and n houses in each street. Note that the last term SSE should
strictly speaking be written as SS H (S(C)) . However, it is exactly the
same expression as would be evaluated for an error sum of squares,
so it is called SSE. We give the ANOVA table and state the values
for the expected mean squares. A complete derivation is given in
Scheffé (1959).
Source              SS       df          MS                    EMS
Cities (fixed)      SSC      a − 1       SSC/(a − 1)           σe² + nσb² + bn Σ αi²/(a − 1)
Streets (random)    SSS(C)   a(b − 1)    SSS(C)/[a(b − 1)]     σe² + nσb²
Houses (random)     SSE      ab(n − 1)   SSE/[ab(n − 1)]       σe²
To test H0 : α1 = α2 = . . . = αa = 0 versus H1 : some or all of the α's
differ, we refer to the EMS column, and see that if H0 is true, Σ αi² = 0,
so

E(MSC) = σe² + nσb² = E(MSS(C))

So the statistic to test H0 is

F = MSC / MSS(C) ∼ Fa−1, a(b−1)
Note the denominator!
σb² is estimated by (MSS(C) − MSe) / n
σe² is estimated by MSe
These are method of moment estimators. The maximum likelihood
estimates can also be found. See Graybill.
Calculation of the ANOVA table
No special program is needed for a balanced nested design. Any
program that will calculate a factorial ANOVA can be used.
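For instance, in R the nested layout can be fitted with the nesting operator in the model formula (a minimal sketch; the data frame elec and its column names are assumed):

# city/street expands to city + city:street, giving the nested sums of squares
fit <- aov(usage ~ city / street, data = elec)
summary(fit)

# The F test for cities must use the streets-within-cities mean square as
# denominator, F = MS(city) / MS(city:street), on (a-1) and a(b-1) df. This can
# be computed by hand from the summary, or obtained directly from
#   aov(usage ~ city + Error(city:street), data = elec)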
For our example we calculate the ANOVA table as though it were a
three-way cross-classification with factors C, S and H. The ANOVA
table is:
Source      SS       df
Cities       488.4    2
S Streets   1090.6    2
H Houses      49.1    2
CS           142.6    4
CH            32.3    4
SH           592.6    4
CSH          203.3    8
The sum of squares for streets is

SSS(C) = SSS + SSCS = 1090.6 + 142.6 = 1233.3,   with 2 + 4 = 6 df

The sum of squares for houses within streets is

SSH(SC) = SSH + SSCH + SSSH + SSCSH = 49.1 + 32.3 + 592.6 + 203.3 = 877.3,   with 18 df
So the ANOVA table is

Source     SS       df   MS       F
Cities      488.4    2   244.20   1.19
Streets    1233.3    6   205.55   4.22
Houses      877.3   18    48.7
The F test for differences between cities is

F = MSC / MSS(C) = 244.2 / 205.55 = 1.19

This is distributed as F2,6 under H0 and is not significant. We conclude
that there is no evidence of a difference in mean electricity consumption
between the cities.
To test H0 : σs² = 0 use

MSS(C) / MSe = 205.55 / 48.7 = 4.22 ∼ F6,18

Reject H0.
To estimate σs² use (MSS(C) − MSe)/n. Then

σ̂s² = (205.55 − 48.7)/3 = 52.25    and    σ̂e² = 48.7
We note that the variation attributed to streets is about the same
size as the variation attached to houses. Since there is no significant
difference in mean consumption between cities, we estimate the
mean consumption as

Ȳ··· = 9896.85,    Var(Ȳ···) = (σ̂s² + σ̂e²) / (abn) = (52.25 + 48.7) / (3 × 3 × 3) = 3.74
For further reading on these designs see Dunn and Clark (1974),
Applied Statistics: Analysis of Variance and Regression.
The Nested Design
Schematic Representation
Factor B is nested in Factor A
A has a levels, B has b levels, and n observations are taken on each
level of B within A. Levels of A are fixed, levels of B are random (sampled).
a=3 b=2 n=4
            A1                    A2                    A3
        B1        B2          B1        B2          B1        B2
                                       (Y222)
       Ȳ11·      Ȳ12·        Ȳ21·      Ȳ22·        Ȳ31·      Ȳ32·
            Ȳ1··                  Ȳ2··                  Ȳ3··
                                  Ȳ···
Yijk = kth observation at the jth level of B nested in the ith level of A
Yijk = µ + αi + bij + eijk
∑ αi = 0
bij ∼ N (0, σB2 )
eijk ∼ N (0, σe2 )
bij and eijk are independent for all i, j, k.

The main effect of the 2nd level of A is estimated by Ȳ2·· − Ȳ···.
The effect of the 2nd level of B nested in the 2nd level of A, b22, is estimated by Ȳ22· − Ȳ2··.
The error e222 is estimated by Y222 − Ȳ22·.
Factor B can also be regarded as fixed. Instead of estimating an
overall variance for factor B, the contribution of the levels of B
within each level of A is then of interest. The sum of squares for B
(nested in A) is

SSB(A) = n Σi Σj (Ȳij· − Ȳi··)²

and has a(b − 1) degrees of freedom. This can be split into a
component sums of squares, each with (b − 1) degrees of freedom:

SSB(A) = SSB(A1) + SSB(A2) + . . . + SSB(Aa)

where SSB(Ai) = n Σj=1..b (Ȳij· − Ȳi··)²
So tests of the levels of B nested in level i of A can be made using MSB(Ai) / MSe.
Estimates of parameters of the nested design

Parameter         Estimate                 Variance
µ                 Ȳ···                     (σe² + σb²)/(abn)
µ + αi            Ȳi··                     (σe² + nσb²)/(bn)
αi                Ȳi·· − Ȳ···              (σe² + nσb²)(a − 1)/(abn)
α1 − α2, etc.     Ȳ1·· − Ȳ2··              2(σe² + nσb²)/(bn)
Σ hi (µ + αi)     Σi hi Ȳi··               (σe² + nσb²)(Σ hi²)/(bn)
σe²               s² = MSe                 −
σe² + nσb²        MSB(A)                   −
σb²               (MSB(A) − MSe)/n         −
1. Confidence intervals for linear combinations of the means can be
found. The variance will be estimated using MSB(A).

2. We have assumed that the levels of B are sampled from an
infinite (or very large) “population" of levels. If there is not a
large number of possible values of the levels of B, a correction
factor is included in the EMS for A. For example, if the b levels are
drawn from a possible K levels in the population, then

E(MSA) = σe² + n(1 − n/K)σb² + nb Σ αi² / (a − 1).
8.4 Repeated Measures
A form of experimentation often used in medical and psychological
studies is one in which a number of subjects are measured on
several occasions or undergo several different treatments. The aim
of the experiment is to compare treatments or occasions. It is hoped
that by using the same subjects on each treatment more sensitive
comparisons can be made because variation will be reduced. The
treatments are regarded as a fixed effect and the subjects are
assumed to be sampled from a large population of subjects so we
have again a mixed model. More complex experimental set-ups
than the one described here are used, for details see Winer (1971).
The general theory of balanced mixed models is given in Scheffé
(1959). Other methods for such data are given in Hand and
Crowder (1996). Repeated measures data, which is the term to
describe the data from such experiments, can also be analysed by
methods of multivariate analysis. However, for relatively small
numbers of subjects, the ANOVA methods described here are
useful.
Example: A psychologist is studying memory retention. She takes
10 subjects and each subject is asked to learn 50 nonsense words.
She then tests each subject 4 times: at 12, 36, 60 and 84 hours
after learning the words. On each occasion she scores the subject's
performance. The data have the form:
           Subjects
Times     1      2     · · ·    j     · · ·    10
  1      Y11    Y12
  ·
  i                             Yij
  ·
  4                                           Y4,10

where Yij = score of subject j at time i (i = 1, . . . , 4; j = 1, . . . , 10).
Interest centres on comparing the recalls at each time. If the same
subjects had not been tested each time, the design would have been
a completely randomised design. However, it is not because of the
repeated measurement on each subject.
Let

Yij = µ + αi + bj + eij,    i = 1, . . . , a;   j = 1, . . . , b

where
  µ    =  general mean
  αi   =  effect of the ith occasion or treatment
  bj   =  effect of the jth subject
  eij  =  random error

Assume E(bj) = 0 and Var(bj) = σb², E(eij) = 0 and Var(eij) = σe², and
that bj and eij are independent.
From the formulation, we see that we have a 2-way design with one
fixed effect (times) and one random effect (subjects). Formally it
appears to be the same as a randomised block design with subjects
forming the block, but there is one important difference: the times
could not have been randomly assigned in the blocks. Thus we
have, in our example with 10 subjects, exactly 10 experimental units
receiving 4 treatments each. Strictly speaking with a randomised
block design we would have had 40 experimental units, arranged in
10 blocks with four homogenous units in each. The units within a
block would have been assigned at random to the treatments,
giving 40 independent observations. The observations are
independent both within each block and between blocks. With
repeated measurement data, the four observations within a block
all made on the same subject are possibly dependent, even if the
treatments can be randomly assigned. Of course the observations
on different subjects are independent. If the observations made on
the same subject are correlated the data strictly speaking should be
handled as a multivariate normal vector with mean µ and
covariance matrix Σ. Tests of hypothesis about the mean vector of
treatments can be made (Morrison). However, if the covariance
matrix Σ has a pattern such that σii + σjj − 2σij is constant for
all i ≠ j (i.e. Var(Yi − Yj) is the same for every pair of occasions),
then the ANOVA approach used here is valid. It can also be
shown that for a small number of subjects the ANOVA test has
greater power.
We proceed with the calculations in exactly the way as we did for
the Randomised Block Design, and obtain the same ANOVA table
(consult the earlier notes for the exact formulae and calculations).
The ANOVA table is:
Source              df                MS          Expected Mean Square
occasion (fixed)     a − 1            MSo         σe² + σos² + b Σ αi²/(a − 1)
subject (random)     b − 1            MSsub       σe² + σsub²
O × S                (a − 1)(b − 1)   MSo×sub     σe² + σos²
From the Expected Mean Squares column, we see that the
hypothesis H0: no differences between occasions/treatments,
i.e. α1 = α2 = . . . = αa = 0, can be tested using

F = MSo / MSo×sub ∼ F(a−1); (a−1)(b−1)
The test is identical to that of the treatment in the randomised block
design.
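A hedged sketch of the corresponding R call (the data frame and variable names are assumed): the Error(subject) term places occasions in the within-subject stratum, so they are tested against the occasion-by-subject mean square, exactly as above.

# 'recall' assumed to have columns score, occasion (factor, 4 levels) and
# subject (factor, 10 levels)
fit <- aov(score ~ occasion + Error(subject), data = recall)
summary(fit)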
For this and all other more complex repeated measurement designs,
see Winer Chapter 4. For other methods of handling repeated
measurement data see Hand and Crowder (1996).
References
1. Hand, D. and Crowder, M. (1996). Practical Longitudinal Data
Analysis. Chapman and Hall, Texts in Statistical Science.

2. Winer, B.J. (1971). Statistical Principles in Experimental Design.
McGraw-Hill. Gives a detailed account of the ANOVA approach to
repeated measures.