University of Cape Town
Department of Statistical Sciences

Design of Experiments

Course Notes for STA2005S

J. Juritz, F. Little, B. Erni

[Cover figure: a 5 × 5 Latin square.]

August 31, 2022

Contents

1 Introduction
   1.1 Observational and Experimental Studies
   1.2 Definitions
   1.3 Why do we need experimental design?
   1.4 Replication, Randomisation and Blocking
   1.5 Experimental Design (Treatment Structure; Blocking structure; Three basic designs)
   1.6 Design for Observational Studies / Sampling Designs
   1.7 Methods of Randomisation (Using random numbers for a CRD; Using random numbers for a RBD; Randomising time or order)
   1.8 Summary: How to design an experiment

2 Single Factor Experiments: Three Basic Designs
   2.1 The Completely Randomised Design (CRD)
   2.2 Randomised Block Design (RBD)
   2.3 The Latin Square Design (Randomisation)
   2.4 An Example

3 The Linear Model for Single-Factor Completely Randomized Design Experiments
   3.1 The ANOVA linear model
   3.2 Least Squares Parameter Estimates (Linear model in matrix form; Speed-reading Example; The effect of different constraints on the solution to the normal equations; Design matrices of less than full rank; Parameter estimates for the single-factor completely randomised design)
   3.3 Standard Errors and Confidence Intervals
   3.4 Analysis of Variance (ANOVA) (Decomposition of Sums of Squares; Distributions for Sums of Squares; F test; ANOVA table; Some Notes; Speed-reading Example)
   3.5 Randomization Test for H0 : α1 = . . . = αa = 0
   3.6 A Likelihood Ratio Test for H0 : α1 = . . . = αa = 0
   3.7 Kruskal-Wallis Test

4 Comparing Means: Contrasts and Multiple Comparisons
   4.1 Contrasts
   4.2 Orthogonal Contrasts
   4.3 Multiple Comparisons: The Problem
   4.4 To control or not to control the experiment-wise Type I error rate
   4.5 Bonferroni, Tukey and Scheffé (Example: Strength of Welds)
   4.6 Multiple Comparison Procedures: The Practical Solution
   4.7 Summary
   4.8 Orthogonal Polynomials
   4.9 References

5 Randomised Block and Latin Square Designs
   5.1 The Analysis of the RBD (Example: Timing of Nitrogen Fertilization for Wheat)
   5.2 Missing values – unbalanced data
   5.3 Randomization Tests for Randomized Block Designs
   5.4 Friedman Test
   5.5 The Latin Square Design (Missing Values; Example: Rocket Propellant; Other uses of Latin Squares; Blocking for 3 factors - Graeco-Latin Squares)
6 Power and Sample Size in Experimental Design
   6.1 Introduction
   6.2 Two-way ANOVA model

7 Factorial Experiments
   7.1 Introduction
   7.2 Basic Definitions
   7.3 Design of Factorial Experiments (Example: Effect of selling price and type of promotional campaign on number of items sold; Interactions; Sums of Squares)
   7.4 The Design of Factorial Experiments (Replication and Randomisation; Performing Factorial Experiments)
   7.5 Interaction
   7.6 Interpretation of results of factorial experiments
   7.7 Analysis of a 2-factor experiment
   7.8 Testing Hypotheses
   7.9 Power analysis and sample size
   7.10 Multiple Comparisons for Factorial Experiments
   7.11 Higher Way Layouts
   7.12 Examples

8 Some other experimental designs and their models
   8.1 Fixed and Random effects
   8.2 The Random Effects Model (Variance Components; An ice cream experiment)
   8.3 Nested Designs
   8.4 Repeated Measures

1 Introduction

1.1 Observational and Experimental Studies

There are two fundamental ways to obtain information in research: by observation or by experimentation. In an observational study the observer watches and records information about the subject of interest. In an experiment the experimenter actively manipulates the variables believed to affect the response. Contrast the two great branches of science: Astronomy, in which the universe is observed by the astronomer, and Physics, where knowledge is gained through the physicist changing the conditions under which the phenomena are observed.

In the biological world, an ecologist may record the plant species that grow in a certain area, along with the rainfall and the soil type, and then relate the condition of the plants to the rainfall and soil type. This is an observational study. Contrast this with an experimental study in which the biologist grows the plants in a greenhouse in various soils and with differing amounts of water. The biologist decides on the conditions under which the plants are grown and observes the effect of this manipulation of conditions on the response.

Both observational and experimental studies give us information about the world around us, but it is only by experimentation that we can infer causality; at the very least, it is much more difficult to infer causal relationships from observational data. In a carefully planned experiment, if a change in variable A, say, results in a change in the response Y, then we can be sure that A caused this change, because all other factors were controlled and held constant. In an observational study, if we note that Y changes as variable A changes, we can say that A is associated with a change in Y, but we cannot be certain that A itself was the cause of the change.

Both observational and experimental studies need careful planning to be effective. In this course we concentrate on the design of experimental studies.

Clinical Trials (https://en.wikipedia.org/wiki/Clinical_trial): Historically, medical advances were based on anecdotal data; a doctor would examine six patients, write a paper based on these cases, and publish it. Medical practitioners started to become aware of the biases resulting from these kinds of anecdotal studies.
They started to develop the randomized double-blind clinical trial, which has become the gold standard for approval of any new product, medical device, or procedure.

1.2 Definitions

For the most part the experiments we consider aim to compare the effects of a number of treatments. The treatments are carefully chosen and controlled by the experimenter.

1. The factors/variables that are investigated, controlled and manipulated in the experiment are called treatment factors. Usually, for each treatment factor, the experimenter chooses some specific levels of interest, e.g. the factor 'water level' can have levels 1cm, 5cm, 10cm; the factor 'background music' can have levels 'Classical', 'Jazz', 'Silence'.

2. In a single-factor experiment (only a single treatment factor is under investigation) the treatments correspond to the levels of the treatment factor (e.g. for the water level experiment the treatments are 1cm, 5cm, 10cm). With more than one treatment factor, the treatments can be constructed by crossing all factors: every possible combination of the levels of factor A and the levels of factor B is a treatment. Experiments with crossed treatment factors are called factorial experiments: experiments with at least two treatment factors, whose treatments are the different combinations of the levels of the individual treatment factors. More rarely in true experiments, factors can be nested (see Section 1.5).

[Figure 1.1: Example of a 2 × 2 × 2 = 2³ factorial experiment with three treatment factors (A, B and C), each with two levels (low and high, coded as − and +), resulting in eight treatments (the vertices of a cube) (Montgomery, Chapter 6).]

3. Experimental unit: this is the entity to which a treatment is assigned. The experimental unit may differ from the observational or sampling unit, which is the entity from which a measurement is taken. For example, one may apply the treatment of 'high temperature and low water level' to a pot of plants containing 5 individual plants, and then measure the growth of each of these plants. The experimental unit is the pot (this is where the treatment is applied); the observational units are the plants (the units on which the measurement is taken). This distinction is very important, because it is the experimental units which determine the (experimental) error variance, not the observational units. This is because we are interested in what happens if we independently repeat the treatment. For comparing treatments, we obtain one (response) value per pot (average growth in the pot), i.e. one value per experimental unit.

4. Experimental units which are roughly similar prior to the experiment are said to be homogeneous. The more homogeneous the experimental units are, the smaller the experimental error variance (variation between observations which have received the same treatment, i.e. variance that cannot be explained by known factors) will be. It is generally desirable to have homogeneous experimental units, because this allows us to detect the differences between treatments more clearly.

5. If the experimental units are not homogeneous, but heterogeneous, we can group sets of homogeneous experimental units and thereby account for differences between these groups. This is called blocking.
For example, farms of similar size and in the same region could be considered homogeneous, i.e. more similar to each other than to farms in a different region. Farms (experimental units) in different regions will differ because of regional factors such as vegetation and climate. If we suspect that these differences will affect the response, we should block by region: similar farms (experimental units) are grouped into blocks.

6. In the experiments we will consider, each experimental unit receives only one treatment. But each treatment can consist of a combination of factor levels, e.g. high temperature combined with low water level can be one treatment, high temperature with high water level another treatment.

7. The treatments are applied at random to the experimental units in such a way that each unit is equally likely to receive a given treatment. The process of assigning treatments to the experimental units in this way is called randomisation.

8. A plan for assigning treatments to experimental units is called an experimental design.

9. If a treatment is applied independently to more than one experimental unit it is said to be replicated. Treatments must be replicated! Making more than one observation on the same experimental unit is not replication. If the measurements on the same experimental unit are taken over time, there are methods for repeated measures and longitudinal data (see Chapter 8). If the measurements are all taken at the same time, as in the pot with 5 plants example above, this is just pseudoreplication. Pseudoreplication is a common problem (Hurlbert 1984), and will invalidate the experiment.

10. We are mainly interested in the effects of the different treatments: by how much does the response change with treatment i relative to the overall mean response.

1.3 Why do we need experimental design?

An experiment is almost the only way in which one can control all factors to such an extent as to eliminate any other possible explanation for a change in response other than the treatment factor of concern. This then allows one to infer causality. To achieve this, experiments need to adhere to a few important principles, discussed in the next section.

Experiments are frequently used to find optimal levels of settings (treatment factors) which will maximise (or minimise) the response (especially in engineering). Such experiments can save enormous amounts of time and money.

1.4 Replication, Randomisation and Blocking

There are three fundamental principles to consider when planning experiments. These will help to ensure the validity of the analysis and to increase power:

1. Replication: each treatment must be applied independently to several experimental units. This ensures that we can separate error variance from differences between treatments. True, independent replication demands that the treatment is set up anew for each experimental unit; one should not set up the experiment for a specific treatment and then run all experimental units under that setting at the same time. This would result in pseudo-replication, where effectively the treatment is applied only once. For example, if we were interested in the effect of baking temperature on the taste of bread, we should not bake all the 180°C loaves in the same oven at the same time, but prepare and bake each loaf separately. Otherwise we would not be able to say whether the particular batch, the particular time of day, the oven setting, or the temperature was responsible for the improved taste.
2. Randomisation: a method for allocating treatments to experimental units which ensures that:

• there is no bias on the part of the experimenter, either conscious or unconscious, in the assignment of the treatments to the experimental units;

• possible differences between experimental units are equally distributed amongst the treatments, thereby reducing or eliminating confounding. (Confounding: when we cannot attribute the change in response to a specific factor because several factors could have contributed to this change, we call this confounding.)

Randomisation helps to prevent confounding with underlying, possibly unknown, variables (e.g. changes over time). Randomisation allows us to assume independence between observations. Both the allocation of treatments to the experimental material and the order in which the individual runs or trials of the experiment are performed must be randomly determined!

3. Blocking refers to the grouping of experimental units into homogeneous sets, called blocks. This can reduce the unexplained (error) variance, resulting in increased power for comparing treatments. Variation in the response may be caused by variation in the experimental units, or by external factors that might change systematically over the course of the experiment (e.g. if the experiment is conducted on different days). Such nuisance factors should be blocked for whenever possible (else randomised). Examples of factors for which one would block: time, age, sex, litter of animals, batch of material, spatial location, size of a city.

Blocking also offers the opportunity to test treatments over a wider range of conditions: e.g. if I only use people of one age (e.g. students) I cannot generalise my results to older people, but if I use different blocks (each an age category) I will be able to tell whether the treatments have similar effects in all age groups or not. If there are known groups in the experimental units, blocking guards against unfortunate randomisations.

Blocking aims to reduce (or control) any variation in the experimental material, where possible, with the intention to increase power (sensitivity). (When we later fit our model, a special type of regression model, we will add the blocking variable and hope that it will explain some of the total variation in our response.) Another way to reduce error variance is to keep all factors not of interest as constant as possible. This principle will affect how experimental material is chosen.

The three principles above are sometimes called the three R's of experimental design (randomisation, replication, reducing unexplained variation).

1.5 Experimental Design

The design that will be chosen for a particular experiment depends on the treatment structure (determined by the research question) and the blocking structure (determined by the experimental units available).

Treatment Structure

Single (treatment) factor experiments are fairly straightforward. One needs to decide which levels of the single treatment factor to choose. If the treatment factor is continuous, e.g. temperature, it may be wise to choose equally spaced levels, e.g. 50, 100, 150, 200. This will simplify the analysis when you want to fit a polynomial curve, i.e. investigate the form of the relationship between temperature and the response.

If there is more than one treatment factor, these can be crossed, giving rise to a factorial experiment, or nested.
Factorial Experiments: In factorial experiments the total number of treatments (and experimental units required) increases rapidly, as each factor level combination is included. For example, if we have temperature, soil and water level, each with 2 levels, there are 2 × 2 × 2 = 8 combinations = 8 treatments.

Often, factorial experiments are illustrated by a graph such as that shown in Figure 1.2. This quickly summarizes which factors, factor levels and which combinations are used in an experiment.

[Figure 1.2: One way to illustrate a 3 × 2 factorial experiment, with factor A at levels low, medium and high, and factor B at levels b1 and b2. The three dots at each treatment illustrate three replicates per treatment.]

One important advantage of factorial experiments over one-factor-at-a-time experiments is that one can investigate interactions. If two factors interact, it means that the effect of the one depends on the level of the other factor, e.g. the change in response when changing from level a1 to a2 (of factor A) depends on what level of B is being used. Often, the interesting research questions are concerned with interaction effects.

Interaction plots are very helpful when trying to understand interactions. As an example, the success of factor A (with levels a1 and a2) may depend on whether factor B is present (b2) or absent (b1) (RHS of Figure 1.3). On the LHS of this figure, the success of A does not depend on the level of B. We can only explore interactions if we explore both factors in the same experiment, i.e. use a factorial experiment.

[Figure 1.3: On the left, factors A and B do not interact (their effects are additive). On the right, A and B interact: the effect of one depends on the level of the other factor. The dots represent the mean response at a certain treatment. The lines join treatments with the same level of factor B, for easier reference.]
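An interaction plot of this kind is easy to draw yourself. Below is a minimal sketch in base R using made-up data for a hypothetical 2 × 2 factorial; all names and numbers here are illustrative only, not from the notes.

```r
set.seed(1)

# Hypothetical 2 x 2 factorial with 3 replicates per treatment:
# the effect of A is larger when B is at level b2 (an interaction)
dat <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2"), rep = 1:3)
dat$y <- with(dat, 10 + 2 * (A == "a2") + 1 * (B == "b2") +
                4 * (A == "a2" & B == "b2") + rnorm(nrow(dat), sd = 0.5))

# Non-parallel lines suggest an interaction, as on the RHS of Figure 1.3
with(dat, interaction.plot(A, B, y))
```

With the interaction term removed from the simulated means, the two lines would be roughly parallel, as on the LHS of Figure 1.3.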
Nested Factors: When factors are nested, the levels of one factor, B, will not be identical across all levels of another factor A: each level of factor A will contain different levels of factor B. These designs are common in observational studies; we will briefly look at their analysis in Chapter 8.

Example of nested factors: In an animal breeding study we could have two bulls (sires) and six cows (dams). Progeny (offspring) are nested within dams, and dams are nested within sires:

   sire 1:  dam 1 (progeny 1, 2),  dam 2 (progeny 3, 4),  dam 3 (progeny 5, 6)
   sire 2:  dam 4 (progeny 7, 8),  dam 5 (progeny 9, 10), dam 6 (progeny 11, 12)

Blinding, Placebos and Controls: A control treatment is often necessary as a benchmark to evaluate the effectiveness of the actual treatments. For example, how do two new drugs compare, but also, are they any better than the current drug?

Placebo Effect: The physician's belief in the treatment and the patient's faith in the physician exert a mutually reinforcing effect; the result is a powerful remedy that is almost guaranteed to produce an improvement and sometimes a cure (Follies and Fallacies in Medicine, Skrabanek & McCormick). The placebo effect is a measurable, observable or felt improvement in health or behaviour not attributable to a medication or treatment. A placebo is a control treatment that looks/tastes/feels exactly like the real treatment (medical procedure or pill) but with the active ingredient missing. The difference between the placebo and treatment groups is then only due to the active ingredient and not affected by the placebo effect. To measure the placebo effect one can use two control treatments: a placebo and a no-treatment control.

If humans are involved as experimental units or as observers, psychological effects can creep into the results. In order to pre-empt this, one should blind either or both the observer and the experimental unit to the applied treatment (single- or double-blinded studies): the experimental unit and/or the observer do not know which treatment was assigned to the experimental unit. Blinding the observer prevents biased recording of results, because expectations could consciously or unconsciously influence what is recorded.

Blocking structure

The most important aim of blocking is to reduce unexplained variation (error variance), and thereby to obtain more precise parameter estimates. Here one should look at the experimental units available: Are there any structures/differences that need to be blocked? Do I want to include experimental units of different types to make the results more general? How many experimental units are available in each block? For the simplest designs covered in this course, the number of experimental units in each block will correspond to the total number of treatments. In practice, however, this can often not be achieved.

The grouping of the experimental units into homogeneous sets called blocks, and the subsequent randomisation of the treatments to the units in a block, form the basis of all experimental designs. We will study three designs which form the basis of other more complex designs. They are:

Three basic designs

1. Completely Randomised Design. This design is used when the experimental units are all homogeneous. The treatments are randomly assigned to the experimental units.

2. Randomised Block Design. This design is used when the experimental units are not all homogeneous but can be grouped into sets of homogeneous units called blocks. The treatments are randomly assigned to the units within each block.

3. Latin Square Design. This design allows blocking for two factors without increasing the number of experimental units. Each treatment occurs only once in every row block and once in every column block.

In all of these designs the treatment structure can be a single factor or factorial (crossed factors).

[Figure 1.4: The three basic designs: Completely Randomised Design (left), Randomised Block Design (middle), Latin Square Design (right). In each design each of five treatments (colours) is replicated 5 times. Note how the randomisation was done. CRD: complete randomisation; RBD: randomisation of treatments to experimental units within blocks; LSD: each treatment once in each column, once in each row. The latter two are forms of restricted randomisation (as opposed to complete randomisation).]

Completely Randomised Design Example: Longevity of fruitflies depending on sexual activity and thorax length. 125 male fruitflies were divided randomly into 5 groups of 25 each. The response was the longevity of the fruitfly in days. One group was kept solitary, while another was kept individually with a virgin female each day. Another group was given 8 virgin females per day. As an additional control, the fourth and fifth groups were kept with one or eight pregnant females per day; pregnant fruitflies will not mate. The thorax length of each male was measured, as this was known to affect longevity. (Sexual Activity and the Lifespan of Male Fruitflies. L. Partridge and M. Farquhar. Nature, 1981, 580-581. The data can be found in the R package faraway (fruitfly).)
Randomised Block Design Example: Executives and Risk. Executives were exposed to one of 3 methods of quantifying the maximum risk premium they would be willing to pay to avoid uncertainty in a business decision. The three methods are: 1) U: utility method, 2) W: worry method, 3) C: comparison method. After using the assigned method, the subjects were asked to state their degree of confidence in the method on a scale from 0 (no confidence) to 20 (highest confidence).

Table 1.1: Layout and randomization for the risk premium experiment.

   Block                       Experimental Unit
                                1    2    3
   1 (oldest executives)        C    W    U
   2                            C    U    W
   3                            U    W    C
   4                            W    U    C
   5 (youngest executives)      W    C    U

The experimenters blocked for age of the executives. This is a reasonable thing to do if they expected, for example, lower confidence in older executives, i.e. a different response due to inherent properties of the experimental units (which here are the executives). The blocking factor is age, the treatment factor is the method of quantifying the risk premium, and the response is the confidence in the method. The executives in one block are of a similar age. The three methods were randomly assigned to the three experimental units in each block.

Latin Square Design Example: Traffic Light Signal Sequences. A traffic engineer conducted a study to compare the total unused red light time for five different traffic light signal sequences. The experiment was conducted with a Latin square design in which the two blocking factors were (1) five randomly selected intersections and (2) five time periods.

Table 1.2: Traffic light signal sequences. The five signal sequence treatments are shown in parentheses as A, B, C, D, E. The numerical values are the unused red light times in minutes.

                                  Time Period
   Intersection   1         2         3         4         5         Mean
   1              15.2 (A)  33.8 (B)  13.5 (C)  27.4 (D)  29.1 (E)  23.80
   2              16.5 (B)  26.5 (C)  19.2 (D)  25.8 (E)  22.7 (A)  22.14
   3              12.1 (C)  31.4 (D)  17.0 (E)  31.5 (A)  30.2 (B)  24.44
   4              10.7 (D)  34.2 (E)  19.5 (A)  27.2 (B)  21.6 (C)  22.64
   5              14.6 (E)  31.7 (A)  16.7 (B)  26.3 (C)  23.8 (D)  22.62
   Mean           13.82     31.52     17.18     27.64     25.48     Ȳ··· = 23.128

1.6 Design for Observational Studies / Sampling Designs

In observational studies, design refers to how the sampling is done (on the explanatory variables), and is referred to as sampling design. The aim is, as in experimental studies, to achieve the best possible estimates of effects. The methods used to analyse data from observational and experimental studies are often the same. The conclusions differ in that no causality can be inferred in observational studies.

1.7 Methods of Randomisation

Randomisation refers to the random allocation of treatments to the experimental units. This can be done using random number tables or using a computer or calculator to generate random numbers. When assigning treatments to experimental units, each permutation must be equally likely, i.e. each possible assignment of treatments to experimental units must be equally likely. Randomisation is crucial for conclusions drawn from the experiment to be correct, unambiguous and defensible!
For completely randomised designs the experimental units are not blocked, so the treatments (and their replicates) are assigned completely at random to all experimental units available (hence "completely randomised"). If there are blocks, the randomisation of treatments to experimental units occurs within each block. In Practical 1 you will learn how to use R for randomisation.

Using random numbers for a CRD (completely randomised design)

This method requires a sequence of random numbers (from a calculator or computer; in the old days printed random number tables were available). To randomly assign 2 treatments (A and B) to 12 experimental units, 6 experimental units per treatment, you can:

1. Decide to let odd numbers ≡ treatment A and even numbers ≡ treatment B, reading numbers until one treatment is complete:

   79   76   49   31   93   54   17   36   91
   A    B    A    A    A    B    A    B    A

   Treatment A has now been assigned 6 times, so the remaining three units receive treatment B.

2. Or decide to assign treatment A for two-digit numbers 00-49 and treatment B for two-digit numbers 50-99:

   67   49   72   48   95   39   03   22   46
   B    A    B    A    B    A    A    A    A

   Again treatment A is now complete, so the remaining three units receive treatment B.

Using random numbers for a RBD

Say we wish to randomly assign 12 patients to 2 treatments in 3 blocks of 4 patients each. The different (distinct) orderings of four patients, two receiving treatment A and two receiving treatment B, are:

   Ordering     to be chosen if random number between
   A A B B      01 - 10
   A B B A      11 - 20
   A B A B      21 - 30
   B A A B      31 - 40
   B A B A      41 - 50
   B B A A      51 - 60

Ignore numbers 61 to 99. For example, with the random number sequence 96, 09, 58, 89, 23 (ignoring 96 and 89), we obtain:

   09 → AABB (block 1),   58 → BBAA (block 2),   23 → ABAB (block 3)

Coins, cards, or pieces of paper drawn from a bag can also be used for randomisation.

Randomising time or order

To prevent systematic changes over time from influencing results, one must ensure that the order of the treatments over time is random. If a clear time effect is suspected, it might be best to block for time. In any case, randomisation over time helps to ensure that the time effect is approximately the same, on average, in each treatment group, i.e. that treatment effects are not confounded with time. For the same reason one would block spatially arranged experimental units, or, if this is not possible, randomise treatments in space.
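With a computer, the same randomisations are a one-liner. A minimal sketch in R (treatment labels and block sizes as in the examples above):

```r
set.seed(2023)  # for a reproducible randomisation

# CRD: randomly assign treatments A and B to 12 units, 6 each
sample(rep(c("A", "B"), each = 6))

# RBD: within each of 3 blocks of 4 patients, randomise 2 As and 2 Bs
replicate(3, sample(rep(c("A", "B"), each = 2)))
```

Because sample() returns a random permutation of the treatment labels, every possible assignment is equally likely, which is exactly the requirement stated above.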
1.8 Summary: How to design an experiment

The best design depends on the given situation. To choose an appropriate design, we can start with the following questions:

1. Treatment Structure: What is the research question? What is the response? What are the treatment factors? What levels of each treatment factor should I choose? Do I need a control treatment? Am I interested in interactions?

2. Experimental Units: How many replicates (per treatment) do I need? How many experimental units can I get or afford?

3. Blocking: Do I need to block the experimental units? Do I need to control other unwanted sources of variation? Which factors should be kept constant? etc.

4. Other considerations: ethics, time, cost. Will I have enough power to find the effects I am interested in?

The treatment structure, blocking factors, and number of replicates required are the most important determinants of the appropriate design. Lastly, we need to randomise treatments to experimental units according to this design.

2 Single Factor Experiments: Three Basic Designs

This chapter gives a brief overview of the three basic designs.

2.1 The Completely Randomised Design (CRD)

This design is used when the experimental units are homogeneous. The experimental units will of course differ, but not in such a way that they can be split into clear groups, i.e. no blocks seem necessary. This is judged before the treatments are applied.

Each treatment is randomly assigned to r experimental units. Each unit is equally likely to receive any of the a treatments. There are N = r × a experimental units.

Some advantages of completely randomized designs are:

1. Easy to lay out.
2. Simple analysis, even when there are unequal numbers of replicates of some treatments.
3. Maximum degrees of freedom for error.

An example of a CRD with 4 treatments, A, B, C and D, randomly applied to 12 homogeneous experimental units:

   Units        1   2   3   4   5   6   7   8   9   10  11  12
   Treatments   B   C   A   A   C   A   B   D   C   D   B   D

2.2 Randomised Block Design (RBD)

This design is used if the experimental material is not homogeneous but can be divided into blocks of homogeneous material. Before the treatments are applied there are no known differences between the units within a block, but there may be very large differences between units from different blocks. Treatments are assigned at random to units within a block.

In a complete block design each treatment occurs at least once in each block (randomised complete block design). If there are not sufficient units within a block to allow all the treatments to be applied, an Incomplete Block Design can be used (not covered here; see Hicks & Turner (1999) for details).

Table 2.1: Example of a randomised layout for 4 treatments applied in 3 blocks.

   Block 1   C   A   B   D
   Block 2   B   C   A   D
   Block 3   D   B   C   A

Randomised block designs are easy to design and analyse. Usually, the number of experimental units in each block is the same as the number of treatments. Blocking allows more sensitive comparisons of treatment effects. On the other hand, missing data can cause problems in the analysis.

Any known variability in the experimental procedure or the experimental units can be controlled for by blocking. A block could be:

• A day's output of a machine.
• A litter of animals.
• A single subject.
• A single leaf on a plant.
• Time of day or weekday.

2.3 The Latin Square Design

A Latin Square Design allows blocking for two sources of variation without having to increase the number of experimental units. Call these sources row variation and column variation. The p² experimental units are grouped by their row and column position. The p treatments are assigned so that each occurs exactly once in each row and in each column.

Table 2.2: A 4 × 4 Latin Square Design.

        C1   C2   C3   C4
   R1   A    B    C    D
   R2   B    C    D    A
   R3   C    D    A    B
   R4   D    A    B    C

Randomisation

The Latin square is chosen at random from the set of standard Latin squares of order p. Then a random permutation of rows is chosen, a random permutation of columns is chosen, and finally the letters A, B, C, . . . are randomly assigned to the treatments.

Table 2.3: Latin square designs can be used to block for time periods and order of presentation of treatments.

              Order
   Period 1   A  B  C  D
   Period 2   B  C  D  A
   Period 3   C  D  A  B
   Period 4   D  A  B  C

Latin square designs are efficient in the number of experimental units used when there are two blocking factors. However, the number of treatments must equal the number of row blocks and the number of column blocks, and one experimental unit must be available at every combination of the two blocking factors. Also, the assumption of no interactions between treatment and blocking factors should hold.
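The randomisation just described (choose a square, permute rows and columns, relabel the letters) can be sketched in R. This minimal illustration starts from a cyclic standard square rather than a random choice among all standard squares of order p, which is a simplification:

```r
set.seed(42)
p <- 4

# A cyclic standard Latin square of order p: entry (i, j) = (i + j - 2) mod p + 1
std <- outer(1:p, 1:p, function(i, j) (i + j - 2) %% p + 1)

# Randomly permute rows and columns, then randomly assign letters to treatments
permuted <- std[sample(p), sample(p)]
square <- matrix(sample(LETTERS[1:p])[permuted], nrow = p)
square   # each treatment letter appears once in every row and every column
```

Permuting rows and columns and relabelling preserves the Latin property, so the result is always a valid Latin square.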
2.4 An Example

This example gives a brief overview of how the chosen design affects the analysis and the conclusions. The ANOVA tables look similar to the regression ANOVA tables you are used to, and are interpreted in the same way. The only difference is that we have a row for each treatment factor and for each blocking factor.

An experiment is conducted to compare 4 methods of treating motor car tyres. The treatments (methods), labelled A, B, C and D, are assigned to 16 tyres, four tyres receiving A, four others receiving B, etc. Four cars are available; treated tyres are placed on each car and the tread loss after 20 000 km is measured.

Consider design 1 in Table 2.4. This design is terrible! Apparent treatment differences could also be car differences: treatment and car effects are confounded.

Table 2.4: Car design 1.

   Car 1   Car 2   Car 3   Car 4
   A       B       C       D
   A       B       C       D
   A       B       C       D
   A       B       C       D

We could instead use a Completely Randomized Design (CRD): we assign the treated tyres randomly to the cars, hoping that differences between the cars will average out. Table 2.5 shows one such randomisation.

Table 2.5: Car design 2: Completely Randomised Design. The numbers in brackets are the observed tread losses.

   Car 1    Car 2    Car 3    Car 4
   C(12)    A(14)    D(10)    A(13)
   A(17)    A(13)    C(10)    D(9)
   D(13)    B(14)    B(14)    B(8)
   D(11)    C(12)    B(13)    C(9)

Table 2.6: Car design 2 (CRD), with rearranged observations.

   Treatment   Tread loss
   A           17   14   13   13
   B           14   14   13    8
   C           12   12   10    9
   D           13   11   10    9

To test for differences between treatments, an analysis of variance (ANOVA) is used. We present these tables here only as a demonstration of what happens to the mean squared error (MSE) when we change the design, or account for variation between blocks. Table 2.7 shows the ANOVA table for testing the hypothesis of no difference between treatments, H0 : µA = µB = µC = µD. There is no evidence for differences between the tyre brands.

Table 2.7: ANOVA table for car design 2, CRD.

   Source   df   SS   Mean Square   F stat
   Brands    3   33   11.00         2.59
   Error    12   51    4.25

Is the Completely Randomised Design the best we can do? Note that A is never used on Car 3, and B is never used on Car 1. Any variation in A may reflect variation in Cars 1, 2 and 4. The same remarks apply to B and Cars 2, 3 and 4. The error sum of squares will contain this variation. Can we remove it? Yes: by blocking for cars. Even though we randomized, there is still a bit of confounding (between cars and treatments) left.

To remove this problem we should block for car, and use every treatment once per car, i.e. use a Randomised Block Design (Table 2.8). Differences between the responses to the treatments within a car will reflect the effect of the treatments.

Table 2.8: Car design 3: Randomised Block Design.

   Car 1    Car 2    Car 3    Car 4
   B(14)    D(11)    A(13)    C(9)
   C(12)    C(12)    B(13)    D(9)
   A(17)    B(14)    D(10)    B(8)
   D(13)    A(14)    C(10)    A(13)

Table 2.9: Rearranged data for car design 3, RBD.

   Treatment   Car 1   Car 2   Car 3   Car 4
   A           17      14      13      13
   B           14      14      13       8
   C           12      12      10       9
   D           13      11      10       9

The treatment sum of squares from the RBD is the same as in the CRD. The error sum of squares is reduced from 51 to 11.5, with the loss of three degrees of freedom. The F-test for treatment effects now shows evidence for differences between the tyre brands (Table 2.10).

Table 2.10: ANOVA table for car design 3, RBD.

   Source   df   SS     Mean Square   F stat
   Tyres     3   33     11.00          8.59
   Cars      3   39.5   13.17         10.28
   Error     9   11.5    1.28

Another source of variation would be the wheel position on which a treated tyre was placed. To have a tyre of each type in each wheel position on each car would mean that we would need 64 tyres for the experiment, rather expensive! Using a Latin Square Design makes it possible to put a treated tyre in each wheel position and use all four treatments on each car (Table 2.11).
Table 2.11: Car design 4: Latin Square Design.

   Wheel position   Car 1   Car 2   Car 3   Car 4
   1                A       B       C       D
   2                B       C       D       A
   3                C       D       A       B
   4                D       A       B       C

Within this arrangement A appears in each car and in each wheel position, and the same applies to B, C and D, but we have not had to increase the number of tyres needed. Blocking for cars and wheel position has reduced the error sum of squares to 6.00, with the loss of a further 3 degrees of freedom (Table 2.12).

Table 2.12: ANOVA table for car design 4, LSD.

   Source   df   SS     Mean Square   F stat
   Tyres     3   33     11.00         11.00
   Cars      3   39.5   13.17         13.17
   Wheels    3    5.5    1.83
   Error     6    6.0    1.00

The above example illustrates how the design can change results. In reality one cannot change the analysis after the experiment has been run. The design determines the model, and all of the above considerations, such as whether one should block by car and wheel position, have to be carefully thought through at the planning stage of the experiment.
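The three ANOVA tables above can be reproduced in R. Below is a sketch, with the tread losses and factor layout read off Tables 2.6, 2.9 and 2.11 (remembering, as just noted, that in a real study each design would have to be chosen before the data are collected):

```r
# Tread losses for treatments A-D (rows of Table 2.9), cars 1-4 within each row
loss  <- c(17, 14, 13, 13,   # A
           14, 14, 13,  8,   # B
           12, 12, 10,  9,   # C
           13, 11, 10,  9)   # D
tyre  <- factor(rep(c("A", "B", "C", "D"), each = 4))
car   <- factor(rep(1:4, times = 4))
wheel <- factor(c(1, 4, 3, 2,   # wheel positions of A in cars 1-4 (Table 2.11)
                  2, 1, 4, 3,   # B
                  3, 2, 1, 4,   # C
                  4, 3, 2, 1))  # D

summary(aov(loss ~ tyre))                 # CRD analysis: Table 2.7
summary(aov(loss ~ tyre + car))           # RBD analysis: Table 2.10
summary(aov(loss ~ tyre + car + wheel))   # Latin square analysis: Table 2.12
```

Each successive model moves variation out of the error line and into a blocking factor line, which is exactly the pattern seen in Tables 2.7, 2.10 and 2.12.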
References

1. Hicks CR, Turner Jr KV. (1999). Fundamental Concepts in the Design of Experiments. 5th edition. Oxford University Press.

3 The Linear Model for Single-Factor Completely Randomized Design Experiments

3.1 The ANOVA linear model

A single-factor completely randomised design experiment results in groups of observations, with (possibly) different means. In a regression context, one could write a linear model for such data as

   yi = β0 + β1 L1i + β2 L2i + . . . + ei

with L1, L2, etc. dummy variables indicating whether response i belongs to group 1, group 2, etc. However, when dealing with only categorical explanatory variables, as is typical in experimental data, it is more common to write the above model in the following form:

   Yij = µ + αi + eij     (3.1)

The dummy variables are implicit but not written. The two models are equivalent in the sense that they make exactly the same assumptions and describe exactly the same structure of the data. Model 3.1 is sometimes referred to as an ANOVA model, as opposed to a regression model. Both models can be written in matrix notation as Y = Xβ + e (see next section).

3.2 Least Squares Parameter Estimates

Example: Three different methods of instruction in speed-reading were to be compared. The comparison is made by testing the comprehension of a subject at the end of one week's training in the given method. Thirteen students volunteered to take part. Four were randomly assigned to Method 1, 4 to Method 2 and 5 to Method 3. After one week's training all students were asked to read an identical passage on a film, which was delivered at a rate of 300 words per minute. Students were then asked to answer questions on the passage read, and their marks were recorded. They were as follows:

               Method 1   Method 2   Method 3
               82         71         91
               80         79         93
               81         78         84
               83         74         90
                                     88
   Mean        81.5       75.5       89.2
   Std. Dev.    1.29       3.70       3.42

We want to know whether comprehension is higher for some of these methods of speed-reading, and if so, which methods work better.

When we have models where all explanatory variables (factors) are categorical (as in experiments), it is common to write them as follows. You will see later why this parameterisation is convenient for such studies.

   Yij = µ + αi + eij

where Yij is the jth observation with the ith method, µ is the overall or general mean, and αi is the effect of the ith method/treatment. Here αi = µi − µ, i.e. the change in mean response with treatment i relative to the overall mean; µi is the mean response with treatment i: µi = µ + αi.

By effect we mean here: the change in response with the particular treatment compared to the overall mean. For categorical variables, effect in general refers to a change in response relative to a baseline category or an overall mean. For continuous explanatory variables, e.g. in regression models, we also talk about effects, and then mostly mean the change in mean response per unit increase in x, the explanatory variable.

Note that we need 2 subscripts on the Y: one to identify the group and the other to identify the subject within the group. Then:

   Y1j = µ + α1 + e1j
   Y2j = µ + α2 + e2j
   Y3j = µ + α3 + e3j

Note that there are 4 parameters, but only 3 groups or treatments.

Linear model in matrix form

To put our model and data into matrix form we string out the data for the groups into an N × 1 vector, where the first n1 elements are the observations on Group 1, the next n2 the observations on Group 2, etc. Then the linear model, Y = Xβ + e, has the form

$$
\begin{pmatrix} Y_{11}\\ Y_{12}\\ Y_{13}\\ Y_{14}\\ Y_{21}\\ Y_{22}\\ Y_{23}\\ Y_{24}\\ Y_{31}\\ Y_{32}\\ Y_{33}\\ Y_{34}\\ Y_{35} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\
1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\
1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
+
\begin{pmatrix} e_{11}\\ e_{12}\\ \vdots\\ e_{35} \end{pmatrix}
$$

Note that:

1. The entries of X are either 0 or 1 (because here all terms in the structural part of the model are categorical). X is often called the design matrix because it describes the design of the study, i.e. it describes which of the factors in the model contributed to each of the response values.

2. The sum of the last three columns of X adds up to the first column. Thus X is a 13 × 4 matrix with column rank 3. The matrix X′X will be a 4 × 4 matrix of rank 3.

3. From row 1: Y11 = µ + α1 + e11. What is Y32?

To find estimates for the parameters we can use the method of least squares or maximum likelihood, as for regression. The least squares estimates minimise the error sum of squares:

   SSE = (Y − Xβ)′(Y − Xβ) = ∑i ∑j (Yij − µ − αi)²

where β′ = (µ, α1, α2, α3), and the estimates are given by the solution to the normal equations

   X′Xβ = X′Y

Since the sum of the last three columns of X is equal to the first column, there is a linear dependency between the columns of X, and X′X is a singular matrix, so we cannot write β̂ = (X′X)⁻¹X′Y. The set of equations X′Xβ = X′Y is consistent, but has an infinite number of solutions.

Note that we could have used only 3 parameters µ1, µ2, µ3, and we actually only have enough information to estimate these 3 parameters, because we only have 3 group means. Instead we have used 4 parameters, because the parametrization using the effects αi is more convenient in the analysis of variance, especially when calculating treatment sums of squares (see later).

However, we also know that Nµ = n1µ1 + n2µ2 + n3µ3 = n1(µ + α1) + n2(µ + α2) + n3(µ + α3). The RHS equals (n1 + n2 + n3)µ + ∑i niαi = Nµ + ∑i niαi. From this it follows that ∑i niαi = 0. The normal equations don't "know" this, so we add this additional equation (which determines the fourth parameter from the other three) as a constraint in order to get the unique solution.
In other words, if we impose ∑i niαi = 0, then the αi's have exactly the meaning intended above: they measure the difference in mean response with treatment i compared to the overall mean; µi = µ + αi.

We could define the αi's differently, by using a different constraint, e.g.

   Yij = µ + αi + eij,   α1 = 0

Here the mean of treatment 1 is used as a reference category and equals µ. Then α2 and α3 measure the difference in mean between group 2 and group 1, and between group 3 and group 1, respectively. This parametrization is the one most common in regression: when you add a categorical variable to a regression model, the β estimates are defined like this, as differences relative to the first/baseline/reference category.

Now, back to a solution for the normal equations:

1. A constraint must be applied to obtain a particular solution β̂.

2. The constraint must remove the linear dependency, so it cannot be any linear combination of the rows of X. Denote the constraint by Cβ = 0.

3. The estimate of β subject to the given constraint is unique. For this reason the constraint should be specified as part of the model, so we write

   Yij = µ + αi + eij,   ∑ niαi = 0

or in matrix notation

   Y = Xβ + e,   Cβ = 0

where no linear combination of the rows of C is a linear combination of the rows of X.

4. Although the estimates of β depend on the constraints used, the following quantities are unique:

   (a) The fitted values Ŷ = Xβ̂.
   (b) The Regression or Treatment Sum of Squares.
   (c) The Error Sum of Squares, (Y − Xβ̂)′(Y − Xβ̂).
   (d) All linear combinations ℓ′β̂, where ℓ′ = L′X (predictions fall under this category).

These quantities are called estimable functions of the parameters.

Speed-reading Example

For the speed-reading data, Y = Xβ + e is

$$
\begin{pmatrix} 82\\ 80\\ 81\\ 83\\ 71\\ 79\\ 78\\ 74\\ 91\\ 93\\ 84\\ 90\\ 88 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\
1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\
1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
+ e
$$

and the normal equations X′Xβ = X′Y are

$$
\begin{pmatrix} 13 & 4 & 4 & 5\\ 4 & 4 & 0 & 0\\ 4 & 0 & 4 & 0\\ 5 & 0 & 0 & 5 \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
=
\begin{pmatrix} 1074\\ 326\\ 302\\ 446 \end{pmatrix}
=
\begin{pmatrix} \sum_{ij} Y_{ij}\\ \sum_j Y_{1j}\\ \sum_j Y_{2j}\\ \sum_j Y_{3j} \end{pmatrix}
$$

The sum of the last three columns of X′X equals column 1, hence the columns are linearly dependent. So:

• X′X is a 4 × 4 matrix with rank 3.
• X′X is singular.
• (X′X)⁻¹ does not exist.
• There are an infinite number of solutions that satisfy the equations!

To find the particular solution we require, we add the constraint, which defines how the parameters are related to each other.
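This rank deficiency is easy to verify numerically. A minimal sketch in R for the speed-reading design (group sizes 4, 4 and 5):

```r
g   <- rep(1:3, c(4, 4, 5))              # group labels for the 13 students
X   <- cbind(1, outer(g, 1:3, `==`) + 0) # 13 x 4 design matrix of 0s and 1s
XtX <- crossprod(X)                      # X'X, a 4 x 4 matrix

qr(XtX)$rank   # 3, not 4: X'X is singular, so solve(XtX) would fail
```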
The effect of different constraints on the solution to the normal equations

We illustrate the effect of different sets of constraints on the least squares estimates using the speed-reading example. The normal equations X′Xβ = X′Y are as displayed above.

The sum-to-zero constraint ∑ niαi = 0: This constraint can be written as

   Cβ = 0µ + 4α1 + 4α2 + 5α3 = 0

Substituting the constraint into the first equation, the normal equations become

$$
\begin{pmatrix} 13 & 0 & 0 & 0\\ 4 & 4 & 0 & 0\\ 4 & 0 & 4 & 0\\ 5 & 0 & 0 & 5 \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
=
\begin{pmatrix} 1074\\ 326\\ 302\\ 446 \end{pmatrix}
$$

and their solution is

   µ̂ = 82.62,   α̂1 = −1.12,   α̂2 = −7.12,   α̂3 = 6.58

Then

   β̂′X′Y = (82.62, −1.12, −7.12, 6.58)(1074, 326, 302, 446)′ = 89153.2

(Note: β̂′X′Y = SSmean + SStreatment = ∑i ∑j (Ȳ·· − 0)² + ∑i ∑j (Ȳi· − Ȳ··)². This assumes that the total sum of squares is calculated as Y′Y. β̂′X′Y is used here because it is easier to calculate by hand than the usual treatment sum of squares β̂′X′Y − Y··²/N.)

The error sum of squares is (Y − Xβ̂)′(Y − Xβ̂) = 92.8. The fitted values are

   Ŷ1j = µ̂ + α̂1 = 81.5 in Group 1
   Ŷ2j = µ̂ + α̂2 = 75.5 in Group 2
   Ŷ3j = µ̂ + α̂3 = 89.2 in Group 3

The corner-point constraint α1 = 0: This constraint is important as it is the one used most frequently for regression models with dummy or categorical variables, e.g. regression models fitted in R. Now Cβ = 0 with C = (0, 1, 0, 0), i.e. the constraint is α1 = 0. This is equivalent to removing α1 from the model, so we strike out the row and column of X′X corresponding to α1, and the normal equations become

$$
\begin{pmatrix} 13 & 4 & 5\\ 4 & 4 & 0\\ 5 & 0 & 5 \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_2\\ \alpha_3 \end{pmatrix}
=
\begin{pmatrix} 1074\\ 302\\ 446 \end{pmatrix}
$$

with solution

   µ̂ = 81.5,   α̂1 = 0,   α̂2 = −6.0,   α̂3 = 7.7

   β̂′X′Y = (81.5, 0, −6.0, 7.7)(1074, 326, 302, 446)′ = 89153.2

The error sum of squares is again (Y − Xβ̂)′(Y − Xβ̂) = 92.8, and the fitted values are

   Ŷ1j = µ̂ = 81.5 in Group 1
   Ŷ2j = µ̂ + α̂2 = 75.5 in Group 2
   Ŷ3j = µ̂ + α̂3 = 89.2 in Group 3

which are the same as previously. However, the interpretation of the parameter estimates is different: µ is the mean of treatment 1, α2 is the difference in means between treatment 2 and treatment 1, etc. Treatment 1 is the baseline or reference category. This is the parametrization typically used when fitting regression models, e.g. in R, which calls it 'treatment contrasts'.

The constraint µ = 0 results in the cell-means model: Yij = αi + eij, i.e. Yij = µi + eij.

We summarise the effects of using different constraints in the table below:

   Model         µ + αi        µ + αi     αi
   Constraint    ∑ niαi = 0    α1 = 0     µ = 0
   µ̂             82.6          81.5       0
   α̂1            −1.1          0          81.5
   α̂2            −7.1          −6         75.5
   α̂3            6.6           7.7        89.2
   Ŷ1j           81.5          81.5       81.5
   Ŷ2j           75.5          75.5       75.5
   Ŷ3j           89.2          89.2       89.2
   β̂′X′Y         89153.2       89153.2    89153.2
   Error SS      92.8          92.8       92.8

We will use almost exclusively the sum-to-zero constraint, as this has a convenient interpretation and connection to sums of squares and the analysis of variance.

Design matrices of less than full rank

If the design matrix X has rank r less than p (the number of parameters), there is no unique solution for β. There are three ways to find a solution:

1. Reducing the model to one of full rank.
2. Finding a generalized inverse (X′X)⁻.
3. Imposing identifiability constraints.

To reduce the model to one of full rank we would reduce the parameters to µ, α2, α3, . . ., with α1 implicitly set to zero. (This is what R uses by default in its lm() function: the corner-point constraint.) We won't deal with generalized inverses in this course. To impose identifiability constraints we write the constraint as Hβ = 0.
We then solve the normal equations X′Xβ = X′Y jointly with H′Hβ = 0 (note that Hβ = 0 implies H′Hβ = 0). Stacking the two sets of equations,

$$
\begin{pmatrix} X'X \\ H'H \end{pmatrix} \beta = \begin{pmatrix} X'Y \\ 0 \end{pmatrix}
$$

and adding them gives the augmented normal equations

   (X′X + H′H)β = X′Y

so that

   β̂ = (X′X + H′H)⁻¹X′Y

Parameter estimates for the single-factor completely randomised design

Suppose an experiment has been conducted as a completely randomised design: N subjects were randomly assigned to a treatments, where the ith treatment has ni subjects, with ∑ ni = N, and Yij = jth observation in the ith treatment group. The data have the form:

   Group        I       II      ...    a
                Y11     Y21            Ya1
                Y12     Y22            Ya2
                ...     ...            ...
                Y1n1    Y2n2           Yana
   Means        Ȳ1·     Ȳ2·            Ȳa·     Ȳ··
   Totals       Y1·     Y2·            Ya·     Y··
   Variances    s1²     s2²            sa²

The first subscript is for the treatment group, the second for the replication. The group totals and means are expressed in the following dot notation:

   group total Yi· = ∑_{j=1}^{ni} Yij,   group mean Ȳi· = Yi·/ni
   overall total Y·· = ∑i ∑j Yij,   overall mean Ȳ·· = Y··/N

The model is:

   Yij = µ + αi + eij,   ∑ niαi = 0

where µ is the general mean, αi is the effect of the ith level of treatment factor A, and eij is a random error distributed as N(0, σ²). The model can be written in matrix notation as

   Y = Xβ + e with e ∼ N(0, σ²I),   Cβ = 0

where Y = (Y11, . . . , Y1n1, Y21, . . . , Y2n2, . . . , Ya1, . . . , Yana)′, β = (µ, α1, . . . , αa)′, and X is the N × (a + 1) matrix whose first column is all ones and whose remaining columns are the indicators for treatments 1, . . . , a:

$$
X = \begin{pmatrix}
\mathbf{1}_{n_1} & \mathbf{1}_{n_1} & \mathbf{0} & \cdots & \mathbf{0}\\
\mathbf{1}_{n_2} & \mathbf{0} & \mathbf{1}_{n_2} & \cdots & \mathbf{0}\\
\vdots & & & \ddots & \vdots\\
\mathbf{1}_{n_a} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1}_{n_a}
\end{pmatrix}
$$

The constraint is Cβ = (0, n1, n2, . . . , na)β = ∑ niαi = 0. There are a + 1 parameters subject to 1 constraint. To estimate the parameters we minimize the residual/error sum of squares

   S = (Y − Xβ)′(Y − Xβ) = ∑ij (Yij − µ − αi)²

where ∑ij = ∑i ∑j.

Let's put numbers to all of this and assume a = 3, n1 = 4, n2 = 4 and n3 = 5. Then X is the 13 × 4 matrix of 0s and 1s shown earlier, the constraint is Cβ = 0µ + 4α1 + 4α2 + 5α3 = 0, and

$$
X'X\beta = \begin{pmatrix} 13\mu + 4\alpha_1 + 4\alpha_2 + 5\alpha_3\\ 4\mu + 4\alpha_1\\ 4\mu + 4\alpha_2\\ 5\mu + 5\alpha_3 \end{pmatrix},
\qquad
X'Y = \begin{pmatrix} \sum_{ij} Y_{ij}\\ \sum_j Y_{1j}\\ \sum_j Y_{2j}\\ \sum_j Y_{3j} \end{pmatrix}
= \begin{pmatrix} N\bar{Y}_{\cdot\cdot}\\ n_1\bar{Y}_{1\cdot}\\ n_2\bar{Y}_{2\cdot}\\ n_3\bar{Y}_{3\cdot} \end{pmatrix}
$$

which results in the normal equations:

   13µ + 4α1 + 4α2 + 5α3 = 13Ȳ··
   4µ + 4α1 = 4Ȳ1·
   4µ + 4α2 = 4Ȳ2·
   5µ + 5α3 = 5Ȳ3·

The constraint says that 0µ + 4α1 + 4α2 + 5α3 = 0, which, substituted into the first equation, implies that 13µ = 13Ȳ··, i.e. µ̂ = Ȳ··, and then α̂i = Ȳi· − Ȳ··.

To summarize, in general the normal equations X′Xβ = X′Y are

$$
\begin{pmatrix}
N & n_1 & n_2 & \cdots & n_a\\
n_1 & n_1 & 0 & \cdots & 0\\
n_2 & 0 & n_2 & \cdots & 0\\
\vdots & & & \ddots & \vdots\\
n_a & 0 & 0 & \cdots & n_a
\end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \vdots\\ \alpha_a \end{pmatrix}
=
\begin{pmatrix} N\bar{Y}_{\cdot\cdot}\\ n_1\bar{Y}_{1\cdot}\\ \vdots\\ n_a\bar{Y}_{a\cdot} \end{pmatrix}
$$

Using the constraint Cβ = (0, n1, n2, . . . , na)β = 0, the set of normal equations becomes

   Nµ = NȲ··
   n1µ + n1α1 = n1Ȳ1·
   . . .
   naµ + naαa = naȲa·

Solving these equations gives the least squares estimators

   µ̂ = Ȳ··,   µ̂i = Ȳi·,   α̂i = Ȳi· − Ȳ··   for i = 1, . . . , a.
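A sketch in R of the augmented-equations formula β̂ = (X′X + H′H)⁻¹X′Y for the speed-reading data, using the sum-to-zero constraint H = (0, 4, 4, 5):

```r
y <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
g <- rep(1:3, c(4, 4, 5))
X <- cbind(1, outer(g, 1:3, `==`) + 0)   # 13 x 4 design matrix
H <- matrix(c(0, 4, 4, 5), nrow = 1)     # constraint 4*a1 + 4*a2 + 5*a3 = 0

beta <- solve(crossprod(X) + crossprod(H), crossprod(X, y))
round(beta, 2)   # mu = 82.62, a1 = -1.12, a2 = -7.12, a3 = 6.58

# For comparison, lm() uses the corner-point constraint (alpha_1 = 0):
coef(lm(y ~ factor(g)))   # 81.5, -6.0, 7.7
```

Although X′X itself is singular, X′X + H′H is invertible because the constraint row is not in the row space of X, so the solve() call succeeds and returns the constrained solution.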
Parameter estimation for many of the standard experimental designs is straightforward! From general theory we know that the above are unbiased estimators of µ and the αi's.

An unbiased estimator of σ² is found by taking the minimum value of the residual sum of squares, SSE, and dividing by its degrees of freedom:

   min(SSE) = ∑ij (Yij − µ̂ − α̂i)² = ∑ij (Yij − Ȳi·)²

In the balanced case (all ni = n),

   E(Yij − Ȳi·)² = Var(Yij − Ȳi·) = σ²(1 − 1/n)

[Hint: Cov(Yij, (1/n) ∑j Yij) = (1/n)σ².] Then

   E[SSE] = E[∑i ∑j (Yij − Ȳi·)²] = anσ²(1 − 1/n) = a(n − 1)σ²

   E[MSE] = E[SSE/(N − a)] = σ²

So

   s² = (1/(N − a)) ∑ij (Yij − Ȳi·)²

is an unbiased estimator of σ², with (N − a) degrees of freedom, since we have N observations and (a + 1) parameters subject to 1 constraint. This quantity is also called the Mean Square for Error, or MSE.

3.3 Standard Errors and Confidence Intervals

Mostly, the estimates we are interested in are linear combinations of treatment means. In such cases it is relatively straightforward to calculate the corresponding variances (of the estimates):

   Var(µ̂) = Var(Ȳ··) = Var(∑i ∑j Yij / N) = (1/N²) ∑i ∑j Var(Yij) = Nσ²/N² = σ²/N

The estimated variance is then s²/N, where s² is the mean square for error (least squares estimate, see above).

   Var(α̂i) = Var(Ȳi· − Ȳ··) = Var(Ȳi·) + Var(Ȳ··) − 2Cov(Ȳi·, Ȳ··)

Consider Cov(Ȳi·, Ȳ··) = Cov(Ȳi·, ∑k (nk/N)Ȳk·) = ∑k (nk/N)Cov(Ȳi·, Ȳk·). But since the groups are independent, Cov(Ȳi·, Ȳk·) is zero if i ≠ k; if i = k, then Cov(Ȳi·, Ȳk·) = Var(Ȳi·) = σ²/ni. Using this result and summing, we find Cov(Ȳi·, Ȳ··) = (ni/N)(σ²/ni) = σ²/N. Hence

   Var(α̂i) = σ²/ni + σ²/N − 2σ²/N = σ²/ni − σ²/N = (N − ni)σ² / (niN)

Important estimates and their standard errors

A standard error is the (usually estimated) standard deviation of an estimated quantity. It is the square root of the variance of the sampling distribution of this estimator, and is an estimate of its precision or uncertainty.

   Parameter                                    Estimate                         Standard Error
   Overall mean µ                               Ȳ··                              s/√N
   Experimental error variance σ²               s² = (1/(N−a)) ∑ij (Yij − Ȳi·)²
   Effect of the ith treatment αi               Ȳi· − Ȳ··                        √((N − ni)s²/(niN))
   Difference between two treatments α1 − α2    Ȳ1· − Ȳ2·                        √(s²(1/n1 + 1/n2)) = sed
   Treatment mean µi = µ + αi                   Ȳi·                              √(s²/ni)

How do we estimate σ²?

   s² = (1/(N − a)) ∑ij (Yij − Ȳi·)²
      = Mean Square for Error = MSE
      = Within Sum of Squares / (Degrees of Freedom)
      = SSresidual / dfresidual

Assuming (approximate) normality of an estimator, a confidence interval for the corresponding population parameter has the form

   estimate ± t(ν, α/2) × standard error

where t(ν, α/2) is the α/2 percentile of Student's t distribution with ν degrees of freedom. The degrees of freedom of t are the degrees of freedom of s².

Speed-reading Example

Estimates, standard errors and confidence intervals (here s² = 92.8/10 = 9.28):

                    Estimate        Standard Error   95% Confidence Interval
   Effect (αi)
   Method I         α̂1 = −1.12      1.27
   Method II        α̂2 = −7.12      1.27
   Method III       α̂3 = 6.58       1.07
   Mean (µi)
   Method I         µ̂1 = 81.5
   Method II        µ̂2 = 75.5
   Method III       µ̂3 = 89.2
   Overall mean     µ̂ = 82.62       0.85

(The remaining standard errors and the confidence intervals can be filled in as an exercise, using the formulas above.)
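These quantities are quick to verify numerically. A minimal sketch in R for the speed-reading data; the confidence interval shown, for α1, is one of the blanks in the table above:

```r
y <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
g <- rep(1:3, c(4, 4, 5))
N <- length(y); a <- 3; ni <- tabulate(g)

s2 <- sum((y - ave(y, g))^2) / (N - a)        # MSE = 9.28 on 10 df
sqrt(s2 / N)                                  # SE of overall mean: 0.85
se.alpha <- sqrt((N - ni) * s2 / (ni * N))    # SEs of the three effects

# 95% confidence interval for alpha_1 = mu_1 - mu
alpha1 <- mean(y[g == 1]) - mean(y)
alpha1 + c(-1, 1) * qt(0.975, df = N - a) * se.alpha[1]
```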
3.4 Analysis of Variance (ANOVA)

The next step is testing the hypothesis of no treatment effect: H0 : α1 = α2 = … = αa = 0. This is done by a method called Analysis of Variance, even though we are actually comparing means.

Note that so far we have not used the assumption eij ∼ N(0, σ²). The least squares estimates do not require the assumption of normal errors! However, to construct a test for the above hypothesis we need the normality assumption. In what follows, we assume that the errors are identically and independently distributed as N(0, σ²). Consequently the observations are normally distributed, though not identically. We must check this assumption of independent, normally distributed errors, else our test of the above hypothesis could give a very misleading result.

Decomposition of Sums of Squares

Let's assume the Yij are data obtained from a CRD, and we are assuming model 3.1:

    Yij = µ + αi + eij

In statistics, sums of squares refer to squared deviations (from a mean or expected value), e.g. the residual sum of squares is the sum of squared deviations of observed from fitted values. Let's rewrite the above model by substituting observed values and rewriting the terms as deviations from means:

    Yij − Ȳ·· = (Ȳi· − Ȳ··) + (Yij − Ȳi·)

Make sure you agree with the above. Now square both sides and sum over all N observations:

    ∑i ∑j (Yij − Ȳ··)² = ∑i ∑j (Yij − Ȳi·)² + ∑i ∑j (Ȳi· − Ȳ··)² + 2 ∑i ∑j (Yij − Ȳi·)(Ȳi· − Ȳ··)

The cross-product term is zero after summation over j, since it can be written as

    2 ∑i (Ȳi· − Ȳ··) ∑j (Yij − Ȳi·)

The second sum is the sum of the deviations of the observations in the ith group about their mean value, so the sum is zero for each i. Hence

    ∑i ∑j (Yij − Ȳ··)² = ∑i ∑j (Ȳi· − Ȳ··)² + ∑i ∑j (Yij − Ȳi·)²

So the total sum of squares partitions into two components: (1) squared deviations of the treatment means from the overall mean, and (2) squared deviations of the observations from the treatment means. The latter is the residual sum of squares (as in regression, the treatment means are the fitted values). The first sum of squares is the part of the variation that can be explained by deviations of the treatment means from the overall mean. We can write this as

    SStotal = SStreatment + SSerror

The analysis of variance is based on this identity. The total sum of squares equals the sum of squares between groups plus the sum of squares within groups.

Distributions for Sums of Squares

Each of the sums of squares above can be written as a quadratic form:

    Source         SS                        df
    treatment A    SSA = Y′(H − (1/N)J)Y     a − 1
    residual       SSE = Y′(I − H)Y          N − a
    total          SST = Y′(I − (1/N)J)Y     N − 1

where J is the N × N matrix of ones, and H is the hat matrix X(X′X)⁻¹X′.

Q: From your regression notes, what does that imply for the distributions of these sums of squares?

Cochran's Theorem. Let Zi ∼ iid N(0, 1), i = 1, …, ν, and suppose

    ∑i Zi² = Q1 + Q2 + … + Qs,   s ≤ ν,

where Qi has νi degrees of freedom. Then Q1, …, Qs are independent χ² random variables with ν1, …, νs d.f., respectively, if and only if ν = ν1 + ν2 + … + νs.

Expected Mean Squares

LEMMA A. Let Xi, i = 1, …, n, be independent random variables with E(Xi) = µi and Var(Xi) = σ². Then

    E(Xi − X̄)² = (µi − µ̄)² + ((n − 1)/n) σ²,   where µ̄ = (1/n) ∑i µi.

[Hint: E(U²) = [E(U)]² + Var(U). Take U = Xi − X̄.]

THEOREM A. Under the assumptions of the model Yij = µ + αi + eij, and assuming all ni = n,

    E(SSerror) = ∑i ∑j E(Yij − Ȳi·)² = (N − a)σ²

    E(SStreatment) = ∑i ∑j E(Ȳi· − Ȳ··)² = ∑i ni E(Ȳi· − Ȳ··)² = ∑i ni αi² + (a − 1)σ²

MSE = SSerror/(N − a) may be used as an estimate of σ². It is an unbiased estimator. If all the αi are equal to zero, then the expectation of SStreatment/(a − 1) is also σ²!
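The quadratic-form expressions can be checked directly in R; the sketch below uses the speed-reading data of Section 3.4 together with a full-rank parameterisation of the one-way model:

    y <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
    g <- factor(rep(1:3, c(4, 4, 5))); N <- length(y)
    X <- model.matrix(~ g)                        # full-rank design matrix
    Hat <- X %*% solve(t(X) %*% X) %*% t(X)       # hat matrix
    J <- matrix(1, N, N)
    c(SSA = t(y) %*% (Hat - J/N) %*% y,           # 424
      SSE = t(y) %*% (diag(N) - Hat) %*% y,       # 93
      SST = t(y) %*% (diag(N) - J/N) %*% y)       # 517 = 424 + 93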
F test of H0 : α1 = α2 = ⋯ = αa = 0

THEOREM B. If the errors are independent and normally distributed with means 0 and variances σ², then SSerror/σ² follows a chi-square distribution with (N − a) degrees of freedom. If, additionally, the αi are all equal to zero, then SStreatment/σ² follows a chi-square distribution with a − 1 degrees of freedom and is independent of SSerror.

Proof. We first consider SSerror. From STA2004F,

    (1/σ²) ∑j (Yij − Ȳi·)²

follows a chi-square distribution with ni − 1 degrees of freedom. There are a such sums in SSerror, and they are independent of each other since the observations are independent. The sum of a independent chi-square random variables that have ni − 1 degrees of freedom each follows a chi-square distribution with ∑i (ni − 1) = N − a degrees of freedom.

The same reasoning can be applied to SStreatment, noting that Var(Ȳi·) = σ²/ni.

We next prove that the two sums of squares are independent of each other. SSerror is a function of the vector U, which has elements Yij − Ȳi·, i = 1, …, a, j = 1, …, ni. SStreatment is a function of the vector V, whose elements are the Ȳi·. Thus, it is sufficient to show that these two vectors are independent of each other. First, if i ≠ i′, Yij − Ȳi· and Ȳi′· are independent since they are functions of different observations. Second, Yij − Ȳi· and Ȳi· are independent (by another theorem from STA2004F). This completes the proof of the theorem. ∎

Under H0 : α1 = α2 = ⋯ = αa = 0,

    F = [ SStreatment/(a − 1) ] / [ SSerror/(N − a) ] = MStreatment / MSE

has a central F distribution: F ∼ F(a−1, N−a), with

    E[F] = d2/(d2 − 2)

where d2 is the denominator degrees of freedom. If H0 is false, F has a non-central F distribution with non-centrality parameter λ = ∑i ni αi² / σ², and the statistic will tend to be larger than 1. Large values of F therefore provide evidence against H0. This is always a one-sided test. Why?

THEOREM C. Under the assumption that the errors are normally distributed, the null distribution of F is the F distribution with (a − 1) and (N − a) degrees of freedom.

Proof. The proof follows from the definition of the F distribution, as the ratio of two independent chi-square random variables, each divided by its degrees of freedom. ∎

ANOVA table

These results can be summarised in an analysis of variance (ANOVA) table.

    Source             SS                     df       Mean Square    F          EMS
    treatment A        ∑i ni (Ȳi· − Ȳ··)²     a − 1    SSA/(a−1)      MSA/MSE    σ² + ∑ ni αi² / (a−1)
    residual (error)   ∑i ∑j (Yij − Ȳi·)²     N − a    SSE/(N−a)                 σ²
    total              ∑i ∑j (Yij − Ȳ··)²     N − 1

This is a one-way analysis of variance. The 'one-way' refers to there being only one factor in the model and thus in the ANOVA table. Note that the ANOVA table is still based on model 3.1, and will have one SS for each term in the model (except the mean), but see the table below. To test H0 : α1 = α2 = ⋯ = αa = 0, we use F = MSA/MSE ∼ F(a−1, N−a).
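In R the whole table is produced by aov() (or, equivalently, anova(lm(...))). A minimal sketch, again with the speed-reading data of the example below:

    y <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
    method <- factor(rep(c("I", "II", "III"), c(4, 4, 5)))
    summary(aov(y ~ method))   # SS, df, MS, F and p-value for H0: all alpha_i = 0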
Some Notes

1. An alternative, less common, form for the ANOVA table is

    Source             SS                     df       Mean Square    F
    mean µ             N Ȳ··²                 1
    treatment          ∑i ni (Ȳi· − Ȳ··)²     a − 1    SSA/(a−1)      MSA/MSE
    residual (error)   ∑i ∑j (Yij − Ȳi·)²     N − a    SSE/(N−a)
    total              ∑i ∑j Yij² = Y′Y       N

There is an extra term due to the mean, with 1 degree of freedom. The error and treatment SS, and the F test, are the same as previously. In this table we have split the total variation, SStot = ∑i ∑j Yij², with N degrees of freedom, into three parts, namely

    SStot = SSµ + SSA + SSE

with degrees of freedom N = 1 + (a − 1) + (N − a), respectively. Each SS can be identified with a term in the model

    Yij = µ + αi + eij,   i = 1, …, a;  j = 1, …, ni.

2. We have closed form expressions for each of the sums of squares. This is in contrast to multiple regression, where usually explicit expressions cannot be given for the individual regression sums of squares. Furthermore, subject to the constraints, we have closed form expressions for the parameter estimates as well.

3. The error sum of squares can be written as

    SSE = ∑i (ni − 1) si²

where si² is the variance of the ith group, si² = ∑j (Yij − Ȳi·)²/(ni − 1). So the Mean Square for Error, MSE, is a pooled estimate of σ²:

    s² = [ (n1 − 1)s1² + (n2 − 1)s2² + … + (na − 1)sa² ] / (n1 + n2 + … + na − a)

For a = 2, this is the estimate of σ² we use in the two-sample t-test (assuming equal variances).

4. The treatment sum of squares could also be used to estimate σ² if we assume H0 is true. Recall that for any mean X̄, Var(X̄) = σ²/n, so SSA = ∑i ni (Ȳi· − Ȳ··)² measures the variation of the group means about the overall mean. If the means do not differ, i.e. H0 is true, then SSA/(a − 1) should also estimate σ². So the test of H0 : α1 = α2 = ⋯ = αa = 0 made using

    MSA/MSE = [ SSA/(a − 1) ] / [ SSE/(N − a) ] ∼ F(a−1, N−a)

is an F-test comparing variances. This is the origin of the term Analysis of Variance.

5. The Analysis of Variance for comparing the means of a number of groups is equivalent to a regression analysis. However, in ANOVA the emphasis is slightly different. In regression analysis, we test whether an arbitrary subset of the parameters is zero [H0 : β(2) = 0]. In ANOVA we are interested in testing whether a particular subset, namely α1, α2, …, αa, is zero.

6. MSE, the mean square for error,

    MSE = σ̂² = SSE/(N − a) = (1/(N − a)) ∑i ∑j (Yij − Ȳi·)² = s²,

is an unbiased estimator of σ², provided the model used is the correct one. Note that (N − a) = ∑i (ni − 1). This estimate of σ² is used for all comparisons of the treatment/group means.

Speed-reading Example

    METHOD           I      II     III
                     82     71     91
                     80     79     93
                     81     78     84
                     83     74     90
                                   88
    Mean             81.5   75.5   89.2
    Std Deviation    1.3    3.7    3.4
    ni               4      4      5

Sums of Squares

    N Ȳ··²      = 88729
    ∑ Yij²      = 89246
    ∑ ni Ȳi·²   = 89153

    SStotal = 89246 − 88729 = 517         with (N − 1 = 12) df
    SSA     = 89153 − 88729 = 424         with (a − 1 = 2) df
    SSerror = SStotal − SSA = 517 − 424 = 93   with (N − a = 10) df

ANOVA Table

    Source              SS     df    Mean Square    F stat    p-value
    Teaching methods    424    2     212            22.8      0.0001862
    Error               93     10    9.3
    Total               517    12

From this table we would conclude that: There is strong evidence (p < 0.001) that reading speed differs between teaching methods (F = 22.8 ∼ F(2,10)).

Now that we have found evidence that the teaching methods differ, we can finally answer the really important questions: Which is the best method? How much better is it? For this we need to compare the three methods (means) amongst each other. We do this in Chapter 4. We could also skip the ANOVA table and jump to the real questions of interest immediately. But very often, an ANOVA is performed to obtain a summary of which factors are responsible for most of the variation in the data.
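As a quick check of the p-value in the table, the upper-tail F probability can be computed in R:

    pf(22.8, df1 = 2, df2 = 10, lower.tail = FALSE)   # approx 0.00019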
3.5 Randomization Test for H0 : α1 = … = αa = 0

Randomization tests can be used for data from properly randomized experiments. In fact, the ONLY assumption required for randomization tests to be valid and give exact p-values is that treatments were randomly assigned to experimental units according to the rules of the particular experimental design. No assumptions about normality, equal variance, random samples or independence are needed. Therefore, some call randomisation tests the ultimate nonparametric tests (Edgington, 2007). Fisher (1936) used this fact as one of his strong arguments for the requirement of randomisation in experiments, and said about the F and t tests mostly used for the analysis of experiments: "conclusions have no justification beyond the fact that they agree with those which could have been arrived at by this elementary method", the elementary method being the randomization test.

The randomisation test is a statistical test in which the distribution of the test statistic under the null hypothesis is obtained by calculating the test statistic under all possible rearrangements of the observed data points.

The idea is as follows: if the null hypothesis is true, the treatment has no effect, and the observed values merely reflect natural variation in the experimental units. Calculating the test statistic (e.g. difference between means) under all possible randomizations of treatments to experimental units then gives an idea of the distribution of the difference under H0. Comparing our observed test statistic to this null or reference distribution tells us how likely the observed statistic is relative to what we would expect under H0, expressed as a p-value. The p-value tells us how often we would expect a difference this extreme under the null hypothesis that the treatments have no effect.

Example: Suppose there are 6 subjects, 2 treatments, and 3 subjects randomly assigned to each treatment. The response measured is the reaction time (in seconds). The null hypothesis states that the mean (or median) reaction time is the same for each treatment. The alternative hypothesis states that at least one subject would have provided a different reaction time under a different treatment. The actual/observed/realised randomisation which resulted in the observed responses is only 1 of (6 choose 3) = 20 possible randomisations of the 6 subjects to the 2 treatments. Each of these 20 possible randomisations was equally likely. Under H0 the treatment has no effect, i.e. the observed values reflect differences between the subjects not due to treatments, and thus the observed values would stay the same under a different randomisation. We can now construct a reference distribution (under H0) from the 20 test statistics obtained for the different possible randomisations. The observed test statistic is compared to this reference distribution and the p-value calculated as the proportion of values ≥ the observed test statistic (or ≤, depending on the alternative hypothesis).

Example: Tomato Plants

This is an experiment whose objective was to discover whether a change in the fertilizer mixture applied to tomato plants would result in an increased yield. Eleven plants in a single row were randomly assigned so that 5 were given standard fertilizer mixture A and 6 were fed a supposedly improved mixture B. The gardener took 11 playing cards, 5 red and 6 black, thoroughly shuffled these and then dealt them to give the sequence of red (A) and black (B) cards shown in Table 3.1.
    Table 3.1: Tomato Plant Data
    Position             1     2     3     4     5     6     7     8     9     10    11
    Fertilizer           A     A     B     B     A     B     B     B     A     A     B
    Pounds of Tomatoes   29.9  11.4  26.6  23.7  25.3  28.5  14.2  17.9  16.5  21.1  24.3

    standard fertilizer A:   nA = 5,  ∑yA = 104.2,  ȳA = 20.84
    modified fertilizer B:   nB = 6,  ∑yB = 135.2,  ȳB = 22.53

    difference in means (modified minus standard): ȳB − ȳA = 1.69

• H0 : modifying the fertilizer mixture has no effect on the results and therefore, in particular, no effect on the mean. H0 : µB − µA = 0

• H1 : the modified fertilizer (B) gives a higher mean. H1 : µB − µA > 0

• There are theoretically 11!/(5! 6!) = 462 possible ways of allocating 5 A's and 6 B's to the 11 plants.

The given experimental arrangement is just one of the 462, any one of which could equally well have been chosen. To calculate the randomisation distribution appropriate to the H0 that modification is without effect (i.e. that µA = µB), we need to calculate all 462 differences in the averages obtained from the 462 possible arrangements. The table above shows one such arrangement with its corresponding difference in means = 1.69. Another arrangement could have been:

    Position in row      1     2     3     4     5     6     7     8     9     10    11
    Fertilizer           A     B     B     A     A     B     A     B     B     A     B
    Pounds of Tomatoes   29.9  11.4  26.6  23.7  25.3  28.5  14.2  17.9  16.5  21.1  24.3

    standard fertilizer A:   nA = 5,  ∑yA = 114.2,  ȳA = 22.84
    modified fertilizer B:   nB = 6,  ∑yB = 125.2,  ȳB = 20.87

    ȳB − ȳA = −1.97

There are 460 more such arrangements with resulting differences in means. These 462 differences are summarised by the histogram in Figure 3.1.

    [Figure 3.1: Randomisation distribution for the tomato plant data: histogram of the 462 differences in means (x-axis: difference in means; y-axis: density). The cross marks the observed difference ȳB − ȳA = 1.69.]

We find that in this example, 154 of the possible 462 arrangements yield differences greater than or equal to 1.69: p = 154/462 = 0.33. The p-value of 0.33 suggests that the observed difference in means is likely under H0, and therefore that we cannot conclude that fertilizer B resulted in a higher mean.

As a comparison, the p-value from the two-sample t-test (assuming equal variance) is 0.34, very close. The reason is that in this example the assumptions required for the t-test are met. However, in cases where these assumptions are not met (e.g. skew distributions), the randomisation test will give a more reliable p-value (as a measure of how extreme the observed statistic is relative to the null hypothesis).
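Because the number of arrangements is small, the full randomisation distribution can be enumerated. A short R sketch for the tomato data (positions of the A plants as in Table 3.1):

    y <- c(29.9, 11.4, 26.6, 23.7, 25.3, 28.5, 14.2, 17.9, 16.5, 21.1, 24.3)
    A <- c(1, 2, 5, 9, 10)                      # positions that received A
    obs <- mean(y[-A]) - mean(y[A])             # observed B - A difference: 1.69
    perm <- combn(11, 5)                        # all 462 possible A-assignments
    diffs <- apply(perm, 2, function(i) mean(y[-i]) - mean(y[i]))
    mean(diffs >= obs)                          # one-sided p-value: 154/462 = 0.33

With more experimental units full enumeration becomes infeasible, and one instead samples a large number of random rearrangements to approximate the reference distribution.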
3.6 A Likelihood Ratio Test for H0 : α1 = … = αa = 0

In general, likelihood ratio tests compare two nested models by comparing their likelihoods (in the form of a ratio). The likelihood ratio compares the relative support for the two models based on the information in the data. For the hypothesis test H0 : α1 = … = αa = 0 we will compare the following two models: let model Ω assume that there are differences between the treatments, and let model ω assume that the treatments have no effect, i.e. a model that corresponds to H0 being true.

(a) Model Ω is

    Yij = µ + αi + eij,   eij ∼ N(0, σ²),   or equivalently Yij ∼ N(µ + αi, σ²)

(b) Model ω (also called the restricted or null model) is

    Yij = µ + eij,   eij ∼ N(0, σ²),   or equivalently Yij ∼ N(µ, σ²)

A likelihood ratio test for H0 can be constructed using

    λ = L(ω̂) / L(Ω̂)                                                      (3.2)

where L(ω̂) is the maximized likelihood if H0 is true and L(Ω̂) is the maximum value of the likelihood when the parameters are unrestricted. Essentially, we need to fit two models.

To obtain the likelihood for model Ω, we are assuming independent observations, and can therefore multiply the probabilities of the observations. (Actually, we are approximating the probability of each observation by its density: probability = density times a constant. We ignore the constant because multiplicative constants have no effect on maximum likelihood estimates.)

    L(µ, αi, σ²) = ∏ij (2πσ²)^(−1/2) exp{ −(1/2σ²)(Yij − µ − αi)² }       (3.3)

The log-likelihood is

    ℓ(µ, αi, σ²) = −(N/2) log 2π − (N/2) log σ² − (1/2σ²) ∑i ∑j (Yij − µ − αi)²   (3.4)

where N = ∑i ni. For fixed σ² this is maximised when the last term is a minimum. But this term is exactly the sum of squares that was minimized when finding the least squares estimates! So the least squares estimates are the same as the maximum likelihood estimates. (Note that this is only true for normal models, normal errors.) Let

    RSS(Ω̂) = ∑i ∑j (Yij − µ̂ − α̂i)²                                       (3.5)

then the maximized log-likelihood for fixed σ² is

    ℓ(Ω̂) = c − RSS(Ω̂)/(2σ²)                                              (3.6)

where c = −(N/2) log(2πσ²).

Repeating the same argument for model ω, assuming α1 = α2 = … = αa = 0:

    ℓ(ω) = −(N/2) log(2π) − (N/2) log(σ²) − (1/2σ²) ∑i ∑j (Yij − µ)²      (3.7)

For fixed σ² this is maximised when

    RSS(ω) = ∑i ∑j (Yij − µ)²                                             (3.8)

is a minimum, and this occurs when µ is the least squares estimate, so RSS(ω̂) = ∑i ∑j (Yij − µ̂)², where µ̂ = Ȳ··. Then the maximum of (3.7) is (for fixed σ²)

    ℓ(ω̂) = c − RSS(ω̂)/(2σ²)                                              (3.9)

We now take minus twice the difference of the log-likelihoods (corresponding to minus twice the log of the likelihood ratio):

    λ = L(ω̂) / L(Ω̂)                                                      (3.10)

    −2 log λ = [ RSS(ω̂) − RSS(Ω̂) ] / σ²                                  (3.11)

This has, for large samples, a chi-squared distribution with (N − 1) − (N − a) = a − 1 degrees of freedom. Note that this criterion looks at the reduction in the residual sum of squares and compares this to the error variance.

One remaining problem is that σ² is not known. In practice we estimate σ² from the residual sum of squares of the larger model:

    σ̂² = RSS(Ω̂)/(N − a),   with σ̂²(N − a)/σ² ∼ χ²(N−a).

Under the assumption of normality, the likelihood ratio statistic −2 log λ has an exact chi-square distribution when σ² is known. When σ² is estimated we use

    F = { [ RSS(ω̂) − RSS(Ω̂) ] / (a − 1) } / { RSS(Ω̂)/(N − a) } ∼ F(a−1, N−a)

Again note that the F test depends on the normality assumption. Verify that this is equivalent to the F test found in the sums of squares derivation above.

For normal data, least squares and maximum likelihood result in the same solution (parameter estimates). Actually, even the problem is the same: that of minimizing the error sum of squares. When the data are not normally distributed, least squares still provides estimates that minimize the squared deviations of the observed values from the fitted values; they provide a best fit. However, other methods, such as maximum likelihood under a more appropriate distribution, may then result in better out-of-sample prediction error.
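The equivalence with the F test can be seen by fitting the two models in R and comparing their residual sums of squares; anova() does exactly the calculation above (speed-reading data again):

    y <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
    method <- factor(rep(c("I", "II", "III"), c(4, 4, 5)))
    fit0 <- lm(y ~ 1)        # model omega: no treatment effect
    fit1 <- lm(y ~ method)   # model Omega
    anova(fit0, fit1)        # F = [(RSS(omega) - RSS(Omega))/(a-1)] / [RSS(Omega)/(N-a)]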
3.7 Kruskal-Wallis Test

This is a nonparametric test to compare more than two independent populations. It is the nonparametric version of the one-way ANOVA F-test, which relies on normality of populations. For the Kruskal-Wallis test the assumptions are that we have k independent random samples of sizes n1, n2, …, nk (k ≥ 3), independent observations within samples, that the k populations are identical except possibly with respect to location, and that the data are at least ordinal.

Hypotheses:

    H0 : the k populations are identical (the k medians are equal)
    H1 : the k populations are identical except with respect to location (median)

To calculate the test statistic we rank all observations from 1 to N (N = ∑ ni); for ties, assign the mean value of the ranks to each tied observation. The test statistic is based on comparing each group's mean rank with the mean of all ranks (weighted by sample size). For each group i calculate Ri = sum of the ranks in group i, and R̄i = Ri/ni. Then

    H = [ 12/(N(N+1)) ] ∑i ni ( R̄i − (N+1)/2 )²
      = [ 12/(N(N+1)) ] ∑i Ri²/ni − 3(N+1)

For large sample sizes the distribution of the Kruskal-Wallis test statistic can be approximated by the χ²-distribution with k − 1 degrees of freedom: H ≈ χ²(k−1).

When sampling from a normal distribution, the power of the Kruskal-Wallis test is almost equal to that of the classical F-test. When outliers are present, the Kruskal-Wallis test is much more reliable than the F-test.
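In R the test is available as kruskal.test(); a one-line sketch with the speed-reading data:

    y <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
    method <- factor(rep(c("I", "II", "III"), c(4, 4, 5)))
    kruskal.test(y ~ method)   # H is referred to the chi-square with k - 1 df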
4 Comparing Means: Contrasts and Multiple Comparisons

This chapter goes more deeply into methods for exploring WHICH treatments differ, which are best or worst, and by how much they differ. We do this using contrasts. There are statistical problems that occur when doing many tests (multiple comparisons or multiple testing), and you will learn about methods to reduce the problems of multiple testing. It is important to be aware of these problems in order to be able to interpret and understand results from experiments or other types of analysis correctly.

In most experiments we want to compare a number of treatments. For example, a new method of communicating with employees is compared to the current system. We want to know 1) whether there is an improvement (or change) in employee happiness, and, more importantly, 2) how large this change is. For the latter we need estimates and standard errors or confidence intervals.

The analysis of variance table tells us how much evidence there is that the means differ. In this chapter we consider what the next steps in the analysis are. If there is no evidence for differences between the means, technically speaking the analysis is complete at this stage. However, one should remember that there are two possible situations/reasons that could both lead to this outcome of no evidence that the means differ.

1. There is truly no difference between the means (or it is so small that it is not interesting).

2. There is a difference, but the F-test did not have enough power. Technically, the power of a test is the probability of rejecting H0 if false. Reasons for lack of power are:

   (a) Too few observations on each mean.

   (b) The variation, σ², is too large (relative to the differences between means). If this is the case, reducing σ² by controlling for extraneous factors should be considered.

Both of these are design issues. Therefore, it is crucial to think about power at the design stage of the experiment (see Chapter 6).

Suppose, however, that we have found enough evidence to warrant further investigation into which means differ, which don't, and by how much they differ. To do this we contrast one group of means with another, i.e. we compare groups of means to find out where the differences between treatments are. We can do this in two ways: either by constructing a test of the form

    H0 : µA − µB = 0

or by constructing a confidence interval for the difference, of the form

    est(diff) ± tν × SE(diff)

The confidence interval is much more informative than the result from a hypothesis test.

4.1 Contrasts

Consider the model Yij = µ + αi + eij with constraint ∑ αi = 0. We will be assuming that all ni are equal. (It is possible to construct contrasts with unequal ni, but this gets very complicated. This is one reason to design a CRD experiment with an equal number of experimental units per treatment.)

Definition: A linear combination of the parameters αi,

    L = ∑i hi αi   such that ∑ hi = 0,

is called a contrast.

Point estimates of Contrasts and their variances

The maximum likelihood (and least squares) estimate of L is

    L̂ = ∑i hi α̂i = ∑i hi (Ȳi· − Ȳ··) = ∑i hi Ȳi·,   since ∑ hi Ȳ·· = Ȳ·· ∑ hi = 0

    Var(L̂) = ∑i hi² Var(Ȳi·) = σ² ∑i hi²/ni

    V̂ar(L̂) = s² ∑i hi²/ni

where s² is the mean square for error, with ν degrees of freedom, e.g. ν = N − a in a CRD.

Examples

1. α1 − α2 is a contrast with h1 = 1, h2 = −1, h3 = ⋯ = ha = 0. Its estimate is Ȳ1· − Ȳ2·, with variance s²(1/n1 + 1/n2).

2. α3 + α4 + α5 is not a contrast - why? Define a contrast to compare means 3 and 4 and 5.

3. I might want to compare average salaries in small companies to those in medium and large companies, maybe to answer the question whether one earns less in small companies. To do this I would construct the contrast

    µsmall − µmed or large = µsmall − ½(µmed + µlarge)

(assuming that the number of companies in each group was the same), i.e. I am comparing/contrasting two average salaries. The coefficients hi sum to zero: 1 − ½ − ½ = 0.

Sometimes contrasts are defined in terms of treatment totals, e.g. Yi·, instead of treatment means. We will mainly compare/contrast treatment means.

Comments on Contrasts

1. Although we have defined contrasts in terms of the α's, contrasts estimate differences between means, so they are very often simply called contrasts of means. Essentially they estimate differences between groups of means.

2. In designed experiments the means are usually based on the same number of observations, n. However, contrasts can be defined if there are unequal numbers of observations.

3. The estimate for σ² is given by the Mean Square for Error, s² = MSE. Its degrees of freedom depend on the design of the experiment and the number of replicates.

4. Many multiple comparison methods depend on Student's t distribution. The degrees of freedom of t = degrees of freedom of s².

5. The standard error (SE) of a contrast is a measure of its precision or uncertainty: how well were we able to estimate the difference in means? It is the standard deviation of the sampling distribution of the contrast.

6. An important contrast is the difference between two means, Ȳ1· − Ȳ2·. Its standard error is called the standard error of the difference, s.e.d.:

    s.e.d. = √( s²(1/n + 1/n) ) = s √(2/n)
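A contrast and its standard error are easily computed from a fitted model. The sketch below again uses the speed-reading data, with a hypothetical contrast (method III versus the average of methods I and II); the coefficients h are chosen only for illustration:

    y <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
    method <- factor(rep(c("I", "II", "III"), c(4, 4, 5)))
    h  <- c(-1/2, -1/2, 1)                                   # coefficients sum to zero
    ni <- tabulate(method)
    s2 <- sum(resid(lm(y ~ method))^2) / (length(y) - 3)     # MSE on 10 df
    L  <- sum(h * tapply(y, method, mean))                   # estimate of the contrast
    SE <- sqrt(s2 * sum(h^2 / ni))                           # its standard error
    c(estimate = L, SE = SE)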
4.2 Orthogonal Contrasts

Orthogonal contrasts are sets of contrasts with specific mathematical properties: two contrasts, L1 = ∑ h1i µi and L2 = ∑ h2i µi, are orthogonal if ∑i h1i h2i = 0. Orthogonality implies that they are independent, i.e. that they summarize different dimensions of the data.

Example

An experiment is conducted to determine the wearing quality of paint. The paint was tested under four conditions:

1. Hard wood, dry climate: µ1.
2. Hard wood, wet climate: µ2.
3. Soft wood, dry climate: µ3.
4. Soft wood, wet climate: µ4.

What can you say about the treatment structure? How many factors are involved?

We might want to ask the following questions:

1. Is the average life on hard wood the same as on soft wood?
2. Is the average life in a dry climate the same as in a wet climate?
3. Does the difference between wet and dry climates depend on whether or not the wood was hard or soft?

These questions can be formulated before we see the results of the experiment. To answer them we would want to test the following hypotheses:

1. H0 : ½(µ1 + µ2) = ½(µ3 + µ4), or equivalently H0 : ½µ1 + ½µ2 − ½µ3 − ½µ4 = 0
2. H0 : ½(µ1 + µ3) = ½(µ2 + µ4) ≡ H0 : ½µ1 − ½µ2 + ½µ3 − ½µ4 = 0
3. H0 : ½(µ1 − µ2) = ½(µ3 − µ4) ≡ H0 : ½µ1 − ½µ2 − ½µ3 + ½µ4 = 0

This last contrast is testing for an interaction between type of wood and climate, i.e. does the effect of climate depend on type of wood (see Chapter 7).

Clearing these contrasts of fractions, we can write the coefficients hki in a table, where hki is the ith coefficient of the kth contrast (see the sketch below for a quick check of their orthogonality):

                                              h1    h2    h3    h4
    1. Hard vs soft wood                      1     1     -1    -1
    2. Dry vs wet climate                     1     -1    1     -1
    3. Climate effect depends on wood type    1     -1    -1    1

Although it is easier to manipulate contrasts that don't contain fractions, and hypothesis tests will lead to the same results with or without fractions, confidence intervals will differ. Keeping the ½'s will lead to confidence intervals for the difference in means. Without the fractions, we obtain a confidence interval for 2× the difference in means. As we do want to understand what these values tell us, the first version (with the ½'s) is much more useful.

Note that ∑i h1i = ∑i h2i = ∑i h3i = 0, by the definition of a contrast. But also ∑i h1i h2i = 0, i.e. contrasts 1 and 2 are orthogonal. This means that their estimates will be statistically independent under normal theory (or uncorrelated if non-normal). You can verify that contrasts 2 and 3, and 1 and 3, are also orthogonal.

From the four means we have found three mutually orthogonal (independent) contrasts. In general, given p means we can find (p − 1) orthogonal contrasts. There are many sets of (p − 1) orthogonal contrasts - we can select a set that is convenient to us. If it so happens that the questions of interest result in a set of orthogonal contrasts, this will simplify the interpretation of results. However, it is more important to ask the right questions than to obtain a set of orthogonal contrasts.

In some cases it is convenient to test contrasts in the context of analysis of variance. The treatment sum of squares, SSA say, has (a − 1) degrees of freedom when a treatments are compared. This sum of squares can be split into (a − 1) mutually orthogonal (independent) sums of squares, each with 1 degree of freedom, each corresponding to a contrast, so that

    SSA = SS1 + SS2 + … + SSa−1
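The orthogonality checks above amount to inner products of the coefficient rows; a short R sketch for the wood/climate table:

    H <- rbind(wood    = c(1,  1, -1, -1),
               climate = c(1, -1,  1, -1),
               interac = c(1, -1, -1,  1))
    H %*% t(H)    # zero off-diagonal entries confirm pairwise orthogonality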
We can test for these (a − 1) orthogonal contrasts simultaneously within the ANOVA table.

Calculating sums of squares for orthogonal contrasts

For convenience we assume the treatments are equally replicated (i.e. the same number of observations on each treatment). Let a = number of treatments, n = number of replicates per treatment, Ȳi· = mean for treatment i, and SSA = ∑i ∑j (Ȳi· − Ȳ··)², the treatment SS with (a − 1) df. Then:

1. L = h1Ȳ1· + h2Ȳ2· + … + haȲa· is a contrast if ∑i hi = 0.

2. Var(L) = (s²/n) ∑i hi², where s² = MSE.

3. L1 and L2 are orthogonal if ∑ h1i h2i = 0.

4. The sum of squares for L is SSL = nL² / ∑ hi², and it has one degree of freedom.

5. If L1 and L2 are orthogonal, then SS2 = nL2² / ∑ h2i² is a component of SSA − SS1.

6. If L1, L2, …, La−1 are (a − 1) mutually orthogonal contrasts, then the treatment sum of squares, SSA, can be partitioned as SSA = SS1 + SS2 + … + SSa−1, where each SSi has 1 degree of freedom.

7. The hypothesis H0 : Li = 0 versus H1 : Li ≠ 0 can be tested using F = SSi/MSE with 1 and ν degrees of freedom, where ν = degrees of freedom of MSE.

8. Orthogonal contrasts can be defined if there are unequal numbers of replications in each group, but the simplicity of the interpretation breaks down. With n1, n2, …, na observations in each group,

    L = h1Ȳ1· + h2Ȳ2· + … + haȲa·

is a contrast iff n1h1 + n2h2 + … + naha = 0, and L1 and L2 are orthogonal iff n1h11h21 + n2h12h22 + … + nah1ah2a = 0. An equal number of replicates for each treatment ensures that we have meaningful sets of orthogonal contrasts, each of which will explain some aspect of the experiment independently of the others. This gives a very clear interpretation of the results. If we have unequal numbers of replications of the treatments, the different aspects cannot be completely separated.

9. The word "orthogonal" is used in the same sense as in mechanics, where two orthogonal forces ↑→ act independently of each other. In an a-dimensional space the contrasts can be seen as (a − 1) perpendicular vectors.

Example

To compare the durability of different methods of finishing a piece of mass-produced furniture, the production manager set up an experiment. There were two types of paint available (called A and B), and two methods of applying it: by brush or spray. Six pieces of furniture were randomly assigned to each treatment and the durability of each was measured. The treatments were:

1. Paint A with brush.
2. Paint B with brush.
3. Paint A with spray.
4. Paint B with spray.

The experimenter is interested in comparing:

1. Paint A with Paint B.
2. Brush Method with Spray Method.
3. How methods compare within the two paints, i.e. is the difference between brush and spray the same for both paints?

The treatment means were:

    Treatment    1      2      3     4
    Mean         100    120    40    70

The ANOVA table is

    Source        SS       df    MS      F
    Treatments    22050    3     7350    50.69
    Error         2900     20    145

The treatment sum of squares can be split into three mutually orthogonal contrasts, as shown in the table (n = 6, ∑ hij² = 4 for each contrast, MSE = 145 with 20 degrees of freedom):

                                 Brush       Spray
                                 A     B     A     B      Li = ∑ hi Ȳi·   SSi = nLi²/∑hij²   F = SSi/MSE
    Treatment Means              100   120   40    70
    1. Paints A and B            +1    −1    +1    −1     −50             3750               25.86
    2. Brush versus Spray        +1    +1    −1    −1     110             18150              125.17
    3. Methods within Paints     +1    −1    −1    +1     10              150                1.03
    Treatment SS                                                          22050
Performing the F-tests, we see that there is evidence for a difference in durability between paints (F = 25.86 ∼ F(1,20), p < 0.001), and between the two application methods (F = 125.17 ∼ F(1,20), p < 0.001). There is no evidence that the effect of the application method on durability differs between the two paints (F = 1.03 ∼ F(1,20), p = 0.32), i.e. there is no evidence that application method and paint interact (see factorial experiments).

    Mean durability for Paint A = ½(100 + 40) = 70.00
    Mean durability for Paint B = ½(120 + 70) = 95.00

A 95% confidence interval for the difference in durability between paint B and A:

    25 ± t20 × √(2 × 145/12) = [14.7; 35.3]

So, paint B is estimated to last, on average, between 14.7 and 35.3 months longer than paint A. Note the 12 in the denominator when calculating the standard error: the mean for paint A is based on 12 observations, and so is the mean for paint B.

    Mean durability using Brush = ½(100 + 120) = 110.00
    Mean durability using Spray = ½(40 + 70)   = 55.00

Exercise: Construct confidence intervals for the brush-versus-spray and interaction contrasts.

Overall, the above information suggests that the brush method gives a more durable surface, and that the best combination for durability is paint B applied with a brush. The brush method is preferable to the spray method irrespective of which paint is used.
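The contrast table can be reproduced from the treatment means alone, since the raw data are not given in the notes (n = 6, MSE = 145 on 20 df). A sketch in R:

    means <- c(100, 120, 40, 70); n <- 6; MSE <- 145
    H <- rbind(paint  = c(1, -1,  1, -1),
               method = c(1,  1, -1, -1),
               inter  = c(1, -1, -1,  1))
    L  <- drop(H %*% means)                 # -50, 110, 10
    SS <- n * L^2 / rowSums(H^2)            # 3750, 18150, 150 (sum = SS_A = 22050)
    Fs <- SS / MSE
    cbind(L, SS, Fs, p = pf(Fs, 1, 20, lower.tail = FALSE))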
4.3 Multiple Comparisons: The Problem

We have so far avoided the Neyman-Pearson paradigm for statistical hypothesis testing. However, for discussing the problem of multiple comparisons, it can be useful to temporarily revert to thinking in terms of making a decision based on some predetermined cut-off level, i.e. reject, or fail to reject, H0.

We know that when we make a statistical test, we have a small probability α of rejecting the null hypothesis when true (α = Type I error). In the completely randomised design, the means of the groups fall naturally into a family, and statements we make will be made in relation to the family, or experiment, as a whole, i.e. we cannot ignore what other tests have been performed when interpreting the outcome of any single test. We would like to be able to control the overall Type I error, also called the experiment-wise Type I error rate, i.e. the probability of rejecting at least one hypothesis that is true. Controlling the Type II error (accepting at least one false hypothesis) is more difficult, as for this we would need to know the true differences between the means.

What is the overall Type I error? We cannot say exactly, but we can place an upper bound on it. Consider a family of k tests. Let Ei be the event {the ith hypothesis is rejected when true}, i = 1, …, k, i.e. the event that we make a Type I error in test i. Then for test i, P(Ei) = αi, the significance level of the ith test.

    ∪ᵢ Ei = {At least one hypothesis rejected when true}

and P(∪ᵢ Ei) measures the overall probability of a Type I error. Extending the result P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2) to k events, we have that

    P(∪ᵢ Ei) = ∑ᵢ P(Ei) − ∑_{i<j} P(Ei ∩ Ej) + ∑_{i<j<l} P(Ei ∩ Ej ∩ El) − … + (−1)^{k+1} P(E1 ∩ … ∩ Ek)

An upper bound for this probability can be obtained:

    P(∪ᵢ Ei) ≤ ∑ᵢ P(Ei) = kα   if all P(Ei) = α                       (4.1)

This is called the Bonferroni inequality. It implies that when conducting k tests, the overall probability of a Type I error can be as bad as k × α. For example, when conducting 10 tests, each with a 5% significance level, the probability of one or more Type I errors (wrong decisions) can be as high as 50%.

In the rare case of independent tests, we find

    P(∪ᵢ Ei) = 1 − P(∩ᵢ Ēi)
             = 1 − ∏ᵢ P(Ēi)        (the Ei's are independent)
             = 1 − ∏ᵢ (1 − αi)
             = 1 − (1 − α)ᵏ         if P(Ei) = α for all i
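To make the numbers concrete, in R:

    k <- 10; alpha <- 0.05
    k * alpha            # Bonferroni upper bound on the experiment-wise rate: 0.5
    1 - (1 - alpha)^k    # exact rate if the 10 tests were independent: about 0.40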
4.4 To control or not to control the experiment-wise Type I error rate

In experiments, we conduct a large number of tests/comparisons when comparing treatments. These comparisons can often be phrased as contrasts. Data, and statistical quantities calculated from these data (estimates, test statistics or confidence intervals), are random outcomes. Suppose we want to compare several treatments, either in the form of null hypothesis tests or confidence intervals. Each of these tests, if the null hypothesis is true, has a small chance of resulting in a Type I error (because of the random nature of data). If I conduct many tests, a small proportion of the results will be Type I errors. Type I errors lead to false claims and recommendations in health, science and business, and therefore should be avoided (we can't avoid them completely, but we can control / reduce the probability of making Type I errors).

For example, suppose I conduct a completely randomised design experiment with 20 treatments (including control treatments) for repelling mosquitoes from a room, and I conduct multiple comparisons to find out which are more effective than doing nothing (one of the control treatments), and compare them to each other (if I conduct all pairwise comparisons, there are already (20 choose 2) = 190 tests; more, if I add other contrasts). I will find treatments that seem to be more effective than others. Some of these findings may be real, some will be flukes/spurious results. The problem is: how do I know? And the answer to that is: I can't be certain, unless I accumulate more evidence for or against the particular null hypotheses.

We have a dilemma. If we control the Type I error rate (using Bonferroni, Tukey, Scheffé or any other of the many methods available), the Type II error rate increases = power decreases, meaning we could be missing some interesting differences. If we do NOT control, we could end up with a large Type I error rate (detect differences that are not real).

Exploratory vs confirmatory studies

The following summarizes my personal opinion on how one can approach the above dilemma.

Studies (observational or experimental) can often be classified as exploratory or confirmatory. In a confirmatory study you know exactly what you are looking for and have a-priori hypotheses that you want to test; you are collecting more evidence towards a particular goal, e.g. do violent video games result in more aggressive behaviour? Confirmatory studies will mostly have only a few a-priori hypotheses, and therefore not such a large experiment-wise Type I error rate. Also, often, the null hypothesis might not be true (because we are looking to confirm a difference/effect). If the null hypothesis is false, it is not possible to make a Type I error, and any method to control for Type I errors would just increase the Type II error rate.

On the other hand, in exploratory studies we have no clear expectations; we are just looking for patterns and relationships. Here we don't need to make a decision (reject or not reject a null hypothesis). But these are often the studies that generate a very large number of tests, and hence a potentially very large number of Type I errors. If we control Type I errors, we might be missing some of the interesting patterns. A solution here is to treat the results with suspicion/caution: be aware that many of the small p-values may not be repeatable, and need to be confirmed before you can become more confident that an effect/difference is present. Use these studies as hypothesis generating, and follow up with confirmatory studies. Use p-values to measure the strength of evidence; avoid the Neyman-Pearson approach. Declare whether your study is exploratory or confirmatory. This makes it clear to the reader how to interpret your results.

A big problem arises, of course, if one wants to show that something DOES NOT HAVE AN EFFECT, e.g. that violent video games do not increase aggression. Remember that a large p-value does not mean that the null hypothesis is true or even likely. It could mean that your power is too low to detect the effect. In this case it is important to make sure that power is sufficient to detect an effect, AND to remember that large p-values DO NOT mean that there is NO effect. Here we really want to compare evidence for H0 vs evidence for H1. This is not possible with p-values or the classical statistical hypothesis tests. One solution is to calculate a likelihood ratio (related to Bayes factors): likelihood under the null hypothesis vs likelihood under the alternative hypothesis, as a measure of how much support the null vs the alternative hypothesis has. This of course would also need to be repeated, as the likelihood ratio is just as prone to spurious outcomes as any other statistic.

To find out whether a result was a Type I error or not (whether an effect is real or not), one can follow up with other studies. Secondly, it is important to admit how many tests were conducted, and which of these were post-hoc, so that it is clear what the chances of Type I errors are. Also consider how plausible the null hypothesis is in the first place, based on other information and context. If the null hypothesis is likely (e.g. you would never expect this particular gene to be linked to a particular disease, e.g. cancer), but your statistical analyses have just indicated a small p-value, you should be very suspicious, i.e. suspect a Type I error. And even if we don't use the Neyman-Pearson approach (i.e. don't make a decision), but just report p-values (collecting evidence), we need to be aware that small p-values can occur even if the null hypothesis is true (e.g. P(p < 0.05 | H0) = 0.05).

Each observation can be used either for exploratory analysis or for confirmatory analysis, not for both. If we use the same data to generate and check a hypothesis, the results will be over-optimistic: they will just confirm what we have already seen in the exploratory study, and won't give us an independent test of the hypothesis. This is the same principle we used in regression when we split our data into training and testing sets. In experimental studies we usually don't have the luxury of splitting data sets (too small), but we need to be very careful about how we set up tests, and how we interpret p-values.
4.5 Bonferroni, Tukey and Scheffé

Having said all of the above, and having recommended not to correct, you should still know that in many fields controlling the Type I error rate is expected. Bonferroni's, Tukey's and Scheffé's methods are the most commonly used methods to control the Type I error rate, although there are many, many more.

For Bonferroni's method we need to know exactly how many tests were performed. Therefore, this method is mainly used for a-priori hypotheses. Also, it is very conservative (see Bonferroni's inequality; it corrects for the worst-case scenario). If the number of tests exceeds approximately 10, the correction is so severe that only the very large effects will still be picked up. Tukey's method is used to control the experiment-wise Type I error rate when all pairwise comparisons between the treatments are performed. Scheffé's method is used when lots of contrasts, not all of which are pairwise, are performed.

Many special methods have been devised to control Type I error rates, also sometimes referred to as false positives. We discuss some methods for Type I error rate adjustment that are commonly used, but many others are available for special problems, such as picking the largest mean, or comparing all treatments to a control (see the list of references at the end of this chapter). The methods we will consider here are:

• Bonferroni's correction for multiple testing
• Tukey's honestly significant difference (HSD)
• Scheffé's method

These methods control the experiment-wise Type I error rate.

Bonferroni Correction

Bonferroni's method of Type I error rate adjustment can be quite conservative (heavy). Therefore, it is only used in cases for which there is a small number of planned comparisons. The correction is based on the Bonferroni inequality.

1. Specify the m contrasts that are of interest before looking at the data.

2. Adjust the percentile of the t-distribution: use the upper (α/2m) critical value instead of the upper (α/2) value for each statement (Appendix Table 7). In other words, the significance level used for each individual test/contrast/confidence interval is αC = αE/m, where αE is the experiment-wise Type I error rate, i.e. the maximum Type I error rate we are willing to allow over all tests. Often, αE = 0.05.

The Bonferroni inequality ensures that the probability that all m intervals cover the true parameters is at least (1 − α), i.e. the probability of no Type I error.

Confidence intervals: Given m statements of the form ∑ hi αi, the confidence intervals have the form

    ∑i hi Ȳi· ± t(α/2m, ν) × [ s² ∑i hi²/ni ]^(1/2)

where t(α/2m, ν) is the upper (α/2m) critical value of tν, and ν = degrees of freedom of s² (MSE, the mean square for error). For example, if we have decided on five comparisons (a-priori), we would make each comparison at the (two-sided) 1% level. Then the probability that all five confidence intervals cover the true parameter values is at least 95%.

Hypothesis tests: H0 : ∑ hi αi = 0 versus HA : ∑ hi αi ≠ 0. There are different, equivalent, approaches:

1. Reject H0 if

    | ∑ hi Ȳi· | / [ s² ∑ hi²/ni ]^(1/2)  >  t(α/2m, ν)

2. or, equivalently, calculate each p-value as usual, but then reject only if p < αE/m, where αE is the experiment-wise Type I error rate we are willing to allow.

3. adjusted p-value = min(p-value × m, 1).
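In R, the p-value version of the adjustment is available through p.adjust(); a minimal sketch with hypothetical unadjusted p-values from m = 3 planned contrasts:

    p <- c(0.026, 0.004, 0.210)             # hypothetical unadjusted p-values
    p.adjust(p, method = "bonferroni")      # each p multiplied by m = 3, capped at 1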
Tukey's Method

Tukey was an American mathematician, known for developing the box plot, the Fast Fourier Transform algorithm, and for coining the term 'bit' (binary digit). Tukey's method is another approach to controlling the overall Type I error rate when performing multiple comparisons. It is based on the Studentised Range, which is defined as follows.

Let X1, …, Xa be independent N(µ, σ²) variables. Let s² be an unbiased estimator of σ² with ν degrees of freedom such that νs²/σ² ∼ χ²ν. Let R = max(Xi) − min(Xi) = range of the {Xi}. The Studentised range is the distribution of q = R/s. The parameters of the Studentised range are a (the number of values Xi) and ν (the degrees of freedom of s²). The upper α point of q is denoted by qα(a, ν), i.e. P(q ≥ qα(a, ν)) = α (see Tables 2 and 3 in the Appendix).

Tukey's method is used when we want to make all pairwise comparisons between a treatment means (µi − µj). It corrects confidence intervals and hypothesis tests for all these possible pairwise comparisons. Let's say we are comparing a total of a treatment means. We assume that we have the same number of observations per treatment/group, n say. The appropriate standard error will be √(s²/n), the SE of a treatment mean. Under the null hypothesis of no differences between the means,

    P( R/√(s²/n) ≥ qα(a, ν) ) = α

and, by implication, the probability that any difference, under H0, exceeds this threshold is at most α, i.e. at most α × 100% of any pairwise differences are expected to exceed the α threshold of the studentised range distribution.

To construct confidence intervals, let ∑ hi αi be a contrast of the form L = Ȳi· − Ȳj· (with the sum of the positive hi's = 1 and ∑ hi = 0). Then a confidence interval adjusted for all possible pairwise comparisons is

    L̂ ± qα(a, ν) × s/√n

The overall experiment-wise Type I error will be α. Here s² = MSE with ν degrees of freedom, and qα(a, ν) is the upper α point of the Studentised range distribution. The (Tukey-adjusted) p-value is calculated as

    P( q(a, ν) > L̂ / (s/√n) )

Notes:

1. Using Tukey's method we can construct as many intervals as we please, either before or after looking at the data. The method allows for all possible pairwise comparisons to be examined.

2. All intervals have the same length, which is 2 qα(a, ν) s/√n.

3. Tukey's method gives shorter intervals for pairwise comparisons (compared to Scheffé's method), i.e. it has more power, and is thus used almost exclusively for pairwise comparisons.

4. Tukey's method is not as robust against non-normality as Scheffé's method.

5. Tukey's method requires an equal number of observations per group. For unequal numbers see Spjøtvoll and Stoline (1973).

6. Under the Neyman-Pearson approach, the hypothesis H0 : ∑ hi αi = 0 is rejected if

    | ∑ hi Ȳi· | / (s/√n)  >  qα(a, ν)

or, equivalently, we can define Tukey's Honestly Significant Difference as

    HSD = qα(a, ν) × s/√n.

Then any two means that differ by more than this HSD are said to be significantly different.
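In R, Tukey-adjusted pairwise comparisons are produced by TukeyHSD(), and the critical value qα(a, ν) by qtukey(). A sketch with the speed-reading data (the group sizes there are not quite equal; TukeyHSD() incorporates an adjustment for mildly unbalanced designs):

    y <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
    method <- factor(rep(c("I", "II", "III"), c(4, 4, 5)))
    TukeyHSD(aov(y ~ method))            # all pairwise differences with adjusted CIs
    qtukey(0.95, nmeans = 3, df = 10)    # upper 5% point of the studentised range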
Scheffé's Method

Scheffé's method of correcting for multiple comparisons is based on the F-distribution. It can be used for all contrasts ∑ hi αi. In practice Scheffé's method is used if many comparisons are to be done, not all of them pairwise. The intervals are

    ∑i hi Ȳi· ± [ (a − 1) Fα(a−1, ν) ]^(1/2) × [ s² ∑i hi²/ni ]^(1/2)

This is of the form L̂ ± c × SE(L̂), where c is the new critical constant and the last part is the standard error of the contrast. The p-value is calculated as

    P( S > L̂ / SE(L̂) )

where S is the Scheffé-adjusted reference distribution with a − 1 and ν degrees of freedom. Note how the factor (a − 1), with which the F-quantile is multiplied, stretches the reference distribution to the right, making observed values less extreme.

1. Scheffé's method is better than Tukey's method for general contrasts, but the intervals are longer for pairwise comparisons.

2. The intervals are longer than Bonferroni intervals, but we do not have to specify the contrasts in advance.

3. Scheffé's method covers all possible contrasts, but this makes the intervals longer, because protection is given for many cases of no practical interest.

4. It is robust against non-normality of the data.

5. It can be used in multiple regression as well as ANOVA, in fact anywhere where the F-test is used.

6. When the hypothesis of equal treatment means was rejected in the ANOVA, there will be at least one significant difference among all possible contrasts. No other method has this property.

7. Under the Neyman-Pearson paradigm, to test H0 : ∑ hi αi = 0, reject if

    | ∑ hi Ȳi· | / [ s² ∑ hi²/ni ]^(1/2)  >  [ (a − 1) Fα(a−1, ν) ]^(1/2)

Example: Strength of Welds

In this experiment the strengths of welds produced by four different welding techniques (A, B, C, D) were compared. Each welding technique was used to weld five pairs of metal plates in a completely randomized design. The average strengths were:

    Technique:    A     B     C     D
    Mean:         69    83    75    71

The estimate of the experimental error variance for the experiment was MSE = 15 with 16 degrees of freedom. We are going to use all three methods to control the Type I error rate on this example, although in practice one would probably use only one, depending on the type of contrasts.

Suppose we had planned 3 a-priori contrasts: compare every other technique to C, maybe because C is the cheapest welding technique, and we will only start using another technique if it yields considerably stronger welds. These are pairwise comparisons, but we can still use Bonferroni's method. We are going to assume that the number of replicates for each treatment was the same, i.e. 5 (check this).

We could approach this by constructing a confidence interval for each contrast/difference. For example, a 95% confidence interval for the difference between techniques A and C:

    69 − 75 ± t(0.05/(2×3), 16) × √(2 × 15/5) = −6 ± 2.67 × √6 = [−12.54; 0.54]

Check that you agree with the standard error, and that you can find the critical value in Table A.8 (appendix to the notes). The above confidence interval tells us that we estimate the true difference in average weld strength to lie between −12.54 and 0.54 units. Most of the interval is on the negative side, indicating that there is some evidence that A produces weaker welds, but there is a lot of uncertainty, and we cannot exclude the possibility that there is actually no difference. Confidence intervals provide much more information than hypothesis tests, and you should always prefer presenting information as a confidence interval rather than as a hypothesis test if possible.
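The interval can be reproduced in R (qt() replaces Table A.8):

    diff  <- 69 - 75
    sed   <- sqrt(2 * 15 / 5)                 # standard error of the difference
    tcrit <- qt(1 - 0.05/(2 * 3), df = 16)    # Bonferroni-adjusted critical value, 2.67
    diff + c(-1, 1) * tcrit * sed             # [-12.54, 0.54]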
But, as an exercise, let's do the hypothesis test also, first using the Neyman-Pearson approach (bad) and then Fisher's approach (p-value, much better).

    H0 : µA = µC     H1 : µA ≠ µC

although in this case we might want to test H1 : µA < µC one-sidedly (for the one-sided test we would get the critical value from the t-tables: t(0.05/3) = t(0.017) = 2.318, i.e. qt(0.017, 16, lower.tail = F)).

    tobs = −6 / √(2 × 15/5) = −2.449

For the two-sided test we reject if |tobs| > t(0.05/(2×3), 16) = 2.67 (Table A.8). Here, we cannot reject the null hypothesis, i.e. there is no evidence that techniques A and C differ in terms of average weld strength. But note that this does NOT mean that they are the same in strength. We just don't know, and might need an experiment with greater power to find out.

To find the p-value we would use R: 2 * pt(-2.449, 16) = 2 × 0.013 = 0.026. This is the uncorrected p-value. To correct, we multiply by 3 (Bonferroni correction): p = 0.078. Now that we have the exact (adjusted) p-value, it seems that there is possibly a difference in weld strength between techniques A and C (little, but not no, evidence). This example shows what happens when adjusting p-values (controlling the Type I error rate). What you can conclude about techniques A and C really depends on whether this was an exploratory or a confirmatory study, and on what you already know about these two techniques. In any case, the p-value gives you a much better understanding of the situation than the Neyman-Pearson approach above.

Bonferroni's correction ensures that the overall probability of a Type I error, over the 3 tests conducted, is at most 0.05, but note that the probability of a Type I error for each individual test (or confidence interval) is much smaller, namely 0.05/3 = 0.017.

If we had wanted to make all pairwise comparisons between the treatment means, and wanted to control the experiment-wise Type I error rate, we would use Tukey's method. Note that we could also make all pairwise comparisons without controlling the Type I error rate, and therefore not use Tukey's method; i.e. Tukey's method just refers to a way of controlling Type I error rates for the particular case of pairwise comparisons.

Tukey's method is based on the distribution of the maximum difference under the null hypothesis that all means come from the SAME normal distribution, i.e. that there are no differences between treatments. It uses the studentized range distribution, which gives the density of the maximum standardized difference (studentized range). In other words, it defines how often the maximum difference (under H0) will exceed a certain threshold. For example, if the maximum difference exceeds a value c only with 5% probability, that tells us that 95% of the time the maximum observed difference (in a sample of a means, under H0) should be less than c, and hence the probability that ANY difference should be less than c is 0.95, and, voilà, we have fixed the experiment-wise Type I error at a maximum of 5%.

The weld example had 4 treatments. This makes (4 choose 2) = 6 possible different pairwise comparisons. There is a trick to do this, which unfortunately only works for the Neyman-Pearson approach: sort the means from smallest to largest; calculate the HSD (honestly significant difference); any difference between any two means which exceeds the HSD is declared 'significant'.

    Technique:    A     D     C     B
    Mean:         69    71    75    83

    HSD = q(0.05; 4, 16) × √(15/5) = 4.05 × √3 = 7.01

Note: the √(15/5) part here is NOT the standard error of a difference, but the standard deviation of a mean (in the studentized range distribution, the standard deviation of the values being compared). The critical value can be found in Table A.3 (see the notes below Table A.3 to help you understand what the rows and columns refer to).
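The HSD can be computed with qtukey() instead of Table A.3:

    qcrit <- qtukey(0.95, nmeans = 4, df = 16)   # approx 4.05
    HSD   <- qcrit * sqrt(15/5)                  # approx 7.01
    HSD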
This can be interpreted as follows: there is evidence that B is stronger than A, D and C, but there is no evidence that A, D and C differ in mean weld strength. It is unlikely (< 0.05) that even the largest difference in a sample of means would exceed 7.01 under H0, so we now have fairly strong evidence that B produces stronger welds than the other 3 welding techniques.

If our means had been slightly different (C = 77), this whole procedure would change to:

Technique:   A   D   C   B
Mean:       69  71  77  83
            ------
                -------
                    -------

This would then be interpreted as follows: there is evidence that B is stronger than A and D, but not C; C is stronger than A, but not D; and there is no evidence that A and D differ in strength.

Lastly, we will briefly use Scheffé's method to construct a confidence interval for the difference between B and the other 3 techniques. Scheffé's method for controlling Type I error rates is used when we have lots of contrasts (explicit or implicit), and they are not only the pairwise comparisons, but may be more complicated, like the one above. We wouldn't use Scheffé's method if the contrast above were the only contrast we were going to test; but usually, once you have invested time and money in conducting an experiment, you might as well try to get all the information you can from it (even if you only use the results to generate hypotheses for a future experiment).

Scheffé's method is very similar to a t-test, except that the critical value is not from a t-distribution but is

√( (a − 1) F^α_{a−1,ν} )

where a is the total number of treatment means in the experiment, and ν is the error degrees of freedom. The factor (a − 1) has the effect of shifting the critical region to the right, i.e. reducing the Type I error of the individual test, but also reducing the power of the individual test.

A 95% confidence interval for the difference between B and the average of the other 3 techniques is found as follows:

L̂ ± c × SE(L̂)

Let's first find the standard error = √Var of the contrast:

Var(L̂) = Var( µ̂B − (µ̂A + µ̂D + µ̂C)/3 ) = 15/5 + (1/9) × 3 × (15/5) = 4

Then the 95% confidence interval:

( 83 − (69 + 71 + 75)/3 ) ± √( (4 − 1) F^{0.05}_{4−1,16} ) × √4 = 11.33 ± √(3 × 3.24) × 2 = [5.10; 17.57]

The F-value is from the usual F table (Table A.6); a is the total number of treatment means involved in the contrasts, and ν is the error degrees of freedom (from the ANOVA table).

The above confidence interval tells us that mean weld strength with technique B is estimated to be between 5.1 and 17.6 units stronger than the average strength of the other 3 techniques. From this confidence interval we can learn how big the difference is, and how much uncertainty there is about this estimate. The whole range of the confidence interval is positive, which indicates that B results in stronger welds, on average, than the other 3 techniques. This confidence interval leads to the same conclusion as the pairwise comparisons above; it is just answering a slightly different question. In other words, the choice of contrast should depend predominantly on the question you want to answer.
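A quick sketch of the Scheffé interval in R (numbers from the example; object names are ours):

--------------------------------------------------
ybar <- c(A = 69, B = 83, C = 75, D = 71)
mse <- 15; n <- 5; a <- 4; nu <- 16
h <- c(A = -1/3, B = 1, C = -1/3, D = -1/3)    # contrast coefficients, sum to 0
L.hat <- sum(h * ybar)                         # 11.33
se.L <- sqrt(mse * sum(h^2) / n)               # 2
c.crit <- sqrt((a - 1) * qf(0.95, a - 1, nu))  # 3.12
L.hat + c(-1, 1) * c.crit * se.L               # approx [5.10, 17.57]
--------------------------------------------------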
4.6 Multiple Comparison Procedures: The Practical Solution

The three methods discussed above all control the significance level, α, and protect against the overall (experiment-wise) Type I error. Controlling the significance level causes some reduction in power and increases the Type II error (accepting at least one false null hypothesis). Very little work has been done on multiple comparison methods that protect against Type II errors. A practical way to increase power (i.e. to lower the probability of making a Type II error) is to raise the significance level, e.g. use α = 10%.

The paper by Saville (Saville, D.J. 1990. Multiple comparison procedures: the practical solution. The American Statistician 44: 174–180. http://www.jstor.org/stable/2684163) provides a good overview of the issues in multiple and unplanned comparisons, and gives practical advice on how to proceed. We recommend Saville's approach for this course and also for your future work as a statistician.

The problem of multiple testing does not only occur when comparing means from experimental data. It also occurs in stepwise procedures when fitting regression models, or whenever comparing a record to many others, e.g. DNA records in criminology, medical screening for a disease, and many other areas.

Behind all of this is uncertainty, randomness in data. Humans want certainty, but this can never be achieved from a single experiment, a single data point, or even a large single data set, because of the inherent nature of data; we can approach certainty only by accumulating more and more evidence. Using a Neyman-Pearson approach to hypothesis testing can make the problem worse: by being forced to make a decision, one automatically makes an error every now and then, and these errors accumulate as more decisions are made.

One should take a sceptical approach and never treat results from a single experiment or data set as conclusive evidence. Always keep in the back of your mind how many tests were done, and that some of the small p-values will be spurious results that happened because of some chance outcome in the particular data set. Discuss your results in this light, including the above warning. Especially where we look for patterns or interesting differences (unplanned comparisons), i.e. outcomes we didn't expect, we should remain sceptical until further experiments confirm the same effect. In other words, interesting results found in an exploratory study can at most generate hypotheses that need to be corroborated by replicating the results. On the other hand, when we have an a-priori hypothesis that we can test with our data, we take a much greater step towards knowing whether an effect is real or not: hence the importance of a-priori hypotheses (planned before having seen the data).

4.7 Summary

You now know how to answer specific questions about how treatments compare, and how to interpret these results in the light of the problem of multiple comparisons and the plausibility of the null hypotheses you are testing. This is practically the most important part of the experiment: HOW do the treatments differ, and what can you actually learn and say about the treatments (considering all the uncertainty involved in data and in statistical tests and estimates)? The methods in this chapter do not only apply to completely randomized designs, but to all designs.

The aim of most experiments is to compare treatments. One situation where you would not compare treatments in the way we have done above is where the treatment levels are levels of a continuous treatment factor, e.g. you have measured the response at temperatures of 20, 40, 60 and 80 degrees Celsius. It would not make sense to do all pairwise comparisons here. Rather, you would want to know how the response changes as a (continuous) function of temperature. This is done using special contrasts called orthogonal polynomials.
4.8 Orthogonal Polynomials

In experiments, the treatments are often levels of a quantitative variable, e.g. temperature, or amount of fertilizer. In such a case one might be interested in how the response changes with increasing X, much as in linear regression. Suppose the levels are equally spaced, such as temperatures of 10°, 20°, 30°, and that there is an equal number of replications for each treatment. The mean response Y may be plotted against X, and we may wish to test whether there is a linear, quadratic or cubic relationship between them. These relationships can be described by polynomials (linear, quadratic, third order, etc.). In the analysis of experiments this is often done using orthogonal polynomials.

In regression you will have come across polynomial terms used to account for non-linear relationships. Orthogonal polynomials have the same purpose as polynomial regression terms. They have the advantage of being orthogonal, which means that the terms are independent. This avoids problems of collinearity, and allows us to identify exactly which component(s) of the polynomial are important in describing the relationship.

The hi coefficients used to construct these orthogonal polynomials can be found in Table 4.1. They define linear, quadratic, cubic, ... polynomial contrasts in the treatment means. We can test for the presence of each of these components using an F-test, exactly as for other orthogonal contrasts. If the treatment levels are not equally spaced, or the number of observations differs between treatment levels, the table cannot be used, but there is a regression approach that achieves exactly the same: splitting the relationship into linear, quadratic, etc. components. The main objective in using orthogonal polynomials is to find the lowest-order polynomial which adequately describes the relationship between the treatment factor and the response.

Table 4.1: Orthogonal polynomial coefficients (columns: ordered treatment number 1, ..., 8; D = divisor; λ = relative weight).

No. of Levels  Order   Coefficients                        D     λ
3              1       -1   0  +1                           2     1
               2       +1  -2  +1                           6     3
4              1       -3  -1  +1  +3                      20     2
               2       +1  -1  -1  +1                       4     1
               3       -1  +3  -3  +1                      20    10/3
5              1       -2  -1   0  +1  +2                  10     1
               2       +2  -1  -2  -1  +2                  14     1
               3       -1  +2   0  -2  +1                  10    5/6
               4       +1  -4  +6  -4  +1                  70    35/12
6              1       -5  -3  -1  +1  +3  +5              70     2
               2       +5  -1  -4  -4  -1  +5              84    3/2
               3       -5  +7  +4  -4  -7  +5             180    5/3
               4       +1  -3  +2  +2  -3  +1              28    7/12
7              1       -3  -2  -1   0  +1  +2  +3          28     1
               2       +5   0  -3  -4  -3   0  +5          84     1
               3       -1  +1  +1   0  -1  -1  +1           6    1/6
               4       +3  -7  +1  +6  +1  -7  +3         154    7/12
8              1       -7  -5  -3  -1  +1  +3  +5  +7     168     2
               2       +7  +1  -3  -5  -5  -3  +1  +7     168     1
               3       -7  +5  +7  +3  -3  -7  -5  +7     264    2/3
               4       +7 -13  -3  +9  +9  -3 -13  +7     616    7/12

Example: If we have 4 treatments we can construct 3 orthogonal polynomials, and can thus test for the presence of a linear, quadratic and cubic effect. We would construct 3 orthogonal contrasts as follows:

L1 = −3Ȳ1· − 1Ȳ2· + 1Ȳ3· + 3Ȳ4·
L2 = +1Ȳ1· − 1Ȳ2· − 1Ȳ3· + 1Ȳ4·
L3 = −1Ȳ1· + 3Ȳ2· − 3Ȳ3· + 1Ȳ4·

L1 is used to test for a linear component, L2 for a quadratic component, L3 for a cubic component (order 1 = linear; order 2 = quadratic; order 3 = cubic; order 4 = quartic).
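R generates these coefficients, in normalised form, with contr.poly(); each column is proportional to the corresponding row of Table 4.1. A quick check for 4 levels:

--------------------------------------------------
contr.poly(4)   # columns .L, .Q, .C
## the .L column is (-3, -1, 1, 3)/sqrt(20), the .Q column (1, -1, -1, 1)/2, etc.
--------------------------------------------------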
For calculating the corresponding sums of squares, denote the coefficients by kij (obtained from the table above). The divisor is Di = ∑_{j=1}^{p} k²ij (the dot product of the ith coefficient row with itself, also given in the table). Then

SSi = n L²i / Di

is the sum of squares associated with the ith-order term, where n is the number of observations per treatment. The test statistic is F = SSi / MSE, which tests H0 : Li = 0. (The relative weight λ is a factor used to convert the coded coefficients to regression coefficients; we will not be using it.)

Figure 4.1: The first six orthogonal polynomial basis functions.

Suppose we want to test for the presence of a linear effect. Then H0 : Llinear = 0, i.e. no linear effect is present. A large L̂linear together with a small p-value suggests the presence of a linear effect.

Often, we want to find the lowest-order polynomial required to adequately describe the relationship. This is a principle in statistical modelling called parsimony: find the simplest model, with the fewest variables and assumptions, which has the greatest explanatory power. For this we investigate 'lack of fit' after fitting each term sequentially, to check whether any higher-order terms are needed:

1. Compute SSlin (for the linear effect).
2. Compute SSLOF = SSA − SSlin with (a − 1) − 1 df. LOF = lack of fit. SSLOF measures the unexplained variation (lack of fit) after having fitted the linear term.
3. Compare SSLOF to MSE: [SSLOF/(a − 2)] / MSE ∼ Fa−2,ν. If there is evidence of lack of fit (large F, small p), we need more terms (add another term to the polynomial). If not, stop.
4. Compute SSquad.
5. Compute SSLOF = SSA − SSlin − SSquad with (a − 1) − 2 df.
6. Test lack of fit, etc.

Example

The following data are from an experiment to test the tensile strength of a cotton fibre used to manufacture men's shirts. Five different qualities of fibre are available, with percentage cotton contents of 15%, 20%, 25%, 30% and 35%, and five measurements were obtained from each type of fibre. The treatment means and ANOVA table are given below:

cotton %:   15    20    25    30    35
mean:      9.8  15.4  17.6  21.6  10.8

---------------------------------------------
            Df  Sum Sq  Mean Sq  F value   Pr(>F)
cotton       4     476    118.9     14.8  9.1e-06
Residuals   20     161      8.1
---------------------------------------------

Percentage cotton has a significant effect on strength. We now partition the treatment sum of squares into linear, quadratic, cubic and quartic effects using the coefficients table:

               % Cotton
        15    20    25    30    35      D     Li     SSi      F       p
mean   9.8  15.4  17.6  21.6  10.8
.L      -2    -1     0    +1    +2     10    8.2    33.6   4.15   0.055
.Q      +2    -1    -2    -1    +2     14  -31.0   343.0  42.35  <0.001
.C      -1    +2     0    -2    +1     10  -11.4    65.0   8.02   0.010
.4      +1    -4    +6    -4    +1     70  -21.8    33.9   4.19   0.054
                                          sum SSi:   476

The null hypothesis for the .L line (linear effect) is H0: there is no linear component in the relationship between % cotton and strength.

1. H0 : no linear effect of % cotton. We test this using F = SSlin/MSE = 4.15. Comparing with F1,20, we conclude that there is some evidence for a linear effect of % cotton (p = 0.055).

2. To test whether we need any higher-order terms beyond the linear term, we do a lack-of-fit test of H0 : all higher-order terms are zero. We find

SSLOF = SSA − SSlin = 476 − 33.6 = 442.4 with 3 df

This is the unexplained sum of squares, everything that the linear term cannot explain. The F-statistic is

F = MSLOF / MSE = (442.4/3) / 8.1 = 18.21 ∼ F3,20

with p < 0.001. There is strong evidence for lack of fit.
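The lack-of-fit test is easy to reproduce; here is a minimal sketch in R using the sums of squares from the example (object names are ours):

--------------------------------------------------
ss.A <- 476; ss.lin <- 33.6; mse <- 8.1
ss.lof <- ss.A - ss.lin                  # 442.4 on (5 - 1) - 1 = 3 df
F.lof <- (ss.lof / 3) / mse              # 18.2
pf(F.lof, 3, 20, lower.tail = FALSE)     # < 0.001, strong evidence of lack of fit
--------------------------------------------------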
So we add a quadratic term and repeat. Here is a table that summarises the lack-of-fit tests. The null hypothesis tested in each line is that there is no lack of fit after having added the corresponding term (and all preceding terms), i.e. that all higher-order contrasts equal 0.

-----------------------------------------
                SS.lof  df.lof   F.lof  p.lof
linear LOF     442.140       3  18.285  0.000
quadratic LOF   98.926       2   6.137  0.008
cubic LOF       33.946       1   4.212  0.053
quartic LOF      0.000       0     NaN    NaN
-----------------------------------------

3. We need a linear, quadratic and cubic term. We always keep all lower-order terms in the model! There is little evidence that the quartic effect improves the fit, so for simplicity (parsimony) we prefer the simpler cubic model. Note that the cubic relationship (regression equation) is a linear combination of the intercept, linear, quadratic and cubic terms. To describe the relationship between tensile strength and percentage cotton content we use a cubic polynomial (see the fitted curve in Figure 4.2).

Figure 4.2: Tensile strength of cotton fibre against percentage cotton. Dots are observations; the line is a fitted cubic polynomial curve.

4.9 References

1. Abdi, H. and Williams, L. (2010). Contrast analysis. In: Salkind, N. (Ed.), Encyclopedia of Research Design. Sage. (A very good overview of contrasts.) https://www.utd.edu/~herve/abdi-contrasts2010-pretty.pdf
2. Miller, R.G. (Jnr.) (1981). Simultaneous Statistical Inference. 2nd edition. Springer.
3. Dunn, O.J. and Clarke, V. (1987). Applied Statistics: Analysis of Variance and Regression. Wiley.
4. O'Neil, R.O. and Wedderburn, G.B. (1971). The present state of multiple comparisons. Journal of the Royal Statistical Society Series B, 33, 218–244.
5. Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. John Wiley and Sons.
6. Petersen, R. (1986). Design and Analysis of Experiments. Marcel Dekker.
7. Spjøtvoll, E. and Stoline, M.R. (1973). An extension of the T-method of multiple comparison to include the cases with unequal sample sizes. Journal of the American Statistical Association, 68, 975–978.
8. Ruxton, G.D. and Beauchamp, G. (2008). Time for some a priori thinking about post hoc testing. Behavioral Ecology, 19, 690–693.
9. Tukey, J.W. (1949). Comparing individual means in the analysis of variance. Biometrics, 5, 99–114.
10. Scheffé, H. (1953). A method for judging all contrasts in the analysis of variance. Biometrika, 40, 87–104.
11. Scheffé, H. (1959). The Analysis of Variance. Wiley.
12. Saville, D.J. (1990). Multiple comparison procedures: the practical solution. The American Statistician, 44, 174–180.

5 Randomised Block and Latin Square Designs

We have seen how to analyse data from a single-factor completely randomised design, using a one-way ANOVA. Completely randomized designs are used when the experimental units are homogeneous, i.e. similar. In this chapter we look more closely at designs which use blocking factors (one or two), but still a single treatment factor.

Recall that blocking is done to reduce the experimental error variance. This is achieved by separating the experimental units into blocks of similar (homogeneous) units. Doing so makes it possible to account for the differences between the blocks, thereby reducing the experimental (remaining) error variance.
Any differences between experimental units which are not blocked for, or measured, will end up in the error variance. Natural blocks may be:

1. A day's output on a machine; a batch of experimental material.
2. Age or sex of the subjects in the experiment.
3. Animals from the same litter; people from the same town.
4. Times at which observations are made.
5. The positions of the experimental units along a gradient in spatial settings (e.g. from light to dark, or from many nutrients to few).

The experimental units within a block should be homogeneous, so that ideally the only thing that can affect the response is the different treatments. The treatments are assigned at random to the units within each block, so that a given unit is equally likely to receive any of the treatments. Randomization minimizes the effects of other factors that may influence the result but have not been blocked out. One does not usually test for block differences; if the blocking was successful, the F statistic for blocks will be greater than 1.

Typical blocking factors include age, sex, material from the same batch, time (e.g. week day, year), and spatial gradients.

Ideally (easiest to analyse, and to interpret the results), we would have each treatment exactly once in every block, so that each block has the same number of experimental units. It is often worth choosing the experimental units in such a way that we can have such a complete randomized block design.

Randomization is not complete but restricted to each block: we randomly assign the a treatments to the a experimental units in block 1, then randomize in block 2, etc.

The main purpose of blocking is to reduce the experimental error variance (unexplained variation). This increases power and the precision of estimates. If there is a lot of variation between blocks, this variation, which would otherwise end up in the experimental error variance, is absorbed into the block effects (variation due to blocks). Therefore the experimental error variance can be reduced considerably if there are large differences between the blocks. If there are only small differences between blocks, the error variance will not decrease very much; additionally, we lose error degrees of freedom, and may end up with less power. So it is important to consider carefully, at the design stage, whether blocks are necessary or not.

Model

Yij = µ + αi + β j + eij ,   eij ∼ N(0, σ²),   ∑ αi = ∑ β j = 0

β j is the effect of block j, i.e. the change in mean response in block j relative to the overall mean.

Randomized block designs are used when the experimental units are not very homogeneous (similar). Similar experimental units are grouped together in blocks. Here is an example that could represent an agricultural experiment. Typically in agriculture one has to use fields that are not homogeneous: for example, one side is higher up, has fewer nutrients, or is less water-logged. Agricultural experiments almost always use designs with blocks. The experimental units (plots) on the light blue side of the field are more similar and are thus grouped into one block; the plots on the dark blue side are grouped together into another. Experimental units within blocks are similar (homogeneous), but differ between blocks.

Figure 5.1: Randomized Block Design. Five treatments (A–E) randomised within each block along the gradient.
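Randomisation restricted to blocks is straightforward in R; here is a minimal sketch for a = 5 treatments in b = 4 blocks (the layout is hypothetical):

--------------------------------------------------
set.seed(1)                                  # for reproducibility
treatments <- LETTERS[1:5]
layout <- replicate(4, sample(treatments))   # one independent shuffle per block
colnames(layout) <- paste("block", 1:4)
layout
--------------------------------------------------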
Again, the (identifiability) constraints are required because the model is over-parametrized; they ensure unique estimates. There are two subscripts: i refers to the treatment, j refers to the block (essentially the replicate). The block effect is defined exactly like the treatment effect before: the difference between the block mean and the overall mean, i.e. the change in average/expected response when the observation comes from block j, relative to the overall mean.

E(Yij) = µ + αi + β j

i.e. the observation from treatment i and block j is made up of an overall mean, a treatment i effect and a block j effect (and the effect of treatment i does not depend on which block we are in, i.e. the effects are additive: there is no interaction between the blocking and the treatment factor).

If we take µ over to the LHS of the equation, the deviation of the observed value from the overall mean equals the sum of 3 effects (or deviations from a mean): treatment effect, block effect and experimental unit effect (error term). If we assume that the effect of treatment i is the same in every block, then the average (over all blocks j) of the Yij − Ȳ·j deviations gives us an estimate of the effect of treatment i. If we cannot make this assumption, i.e. the effect of treatment i depends on the block, or is different in every block (there is an interaction between the blocking and the treatment factor), then the treatment effect in block j is confounded with the error term of the observation (because there is no replicate of treatment i in block j). This is the central, crucial idea for understanding how randomized block designs work, how we get the estimates, and why the no-interaction assumption plays a role.

Here is another way to look at it. Assume we have just 2 blocks and 2 treatments, and that the effect of treatment 2 is the same in every block (say, it slightly increases the response). Then we can estimate α2 by taking the average effect of treatment 2 (relative to the block means). It turns out that we could also get α2 from Ȳ2· − Ȳ·· , because ∑ β j = 0.

Also look at a sketch of a data table (Table 5.1). If you sum the values in the first column (treatment 1), you would be estimating (µ + α1 + β1) + . . . + (µ + α1 + βb) = bµ + bα1 + 0 (because ∑ β j = 0; see the identifiability constraint in the model), and the column mean would be estimating µ + α1. Therefore we can estimate αi by Ȳi· − Ȳ·· , as before. Try the same for row one.

Table 5.1: Sketch of the data table for a randomized block design.

              treatment
block      1     2     3    ...    a    | mean
1                                       | Ȳ·1
2                                       | Ȳ·2
...                                     | ...
b                                       | Ȳ·b
--------------------------------------- | ----
mean      Ȳ1·   Ȳ2·   ...         Ȳa·   | Ȳ··

In this we are assuming that αi is the same in every block; αi essentially gives us an average estimate of how the response changes with treatment i. With no replication of treatments within blocks, this is all we can do: we CANNOT estimate a separate effect of treatment i for every block. So, when we use a randomized block design, we need to make an important assumption, namely that the effect of treatment i, αi, is the same in every block. Technically we say that there is no interaction between the treatment and the blocking factors, or that the effects are additive.

What happens if block and treatment factors DO interact? The model would still need to make the assumption of no interaction, and our estimates would still be average effects, but these might not be very meaningful, or not a very useful description of what happens when we use treatment i in block j.
Also, the residuals, and thus the experimental error variance, might become quite large, because the observations then deviate quite a bit from the average values. Additivity of block and treatment effects is therefore another assumption we need to check, in addition to the usual normality and equal-variance residual checks. A good way to check it is with an interaction plot.

Sums of Squares and ANOVA

Yij = µ + αi + β j + eij

Again start with the model. Take µ to the left-hand side; then all terms on the RHS are deviations from a mean: treatment means around the overall mean, block means around the overall mean, observations around µ + αi + β j. As we did for the CRD, we can substitute observed values, square both sides, and sum over all observations to obtain

SStotal = SStreatment + SSblocks + SSerror

with ab − 1 = (a − 1) + (b − 1) + (a − 1)(b − 1) degrees of freedom, respectively. The error degrees of freedom can be calculated from the rest: (ab − 1) − (a − 1) − (b − 1). Note that SSblocks = ∑∑ (Ȳ·j − Ȳ··)², where Ȳ·j denotes the mean of block j. Check that you can see why

rij = Yij − µ̂ − α̂i − β̂ j   and   SSE = ∑∑ (Yij − Ȳ·j − Ȳi· + Ȳ··)²

When we have data from a completely randomized design we do not have a choice about which terms need to be in the model. But just to illustrate how the blocks reduce the SSE, compare the model with block effects,

SStotal = SStreatment + SSblocks + SSerror

to the model we would use for a single-factor CRD (essentially ignoring that we actually had blocks):

SStotal = SStreatment + SSerror

SStotal and SStreatment are exactly the same in both models. If we add block effects the SSE is reduced, but so are its degrees of freedom. MSE only becomes smaller if the reduction in SSE is large relative to the number of degrees of freedom lost.

Usually we are not interested in formally testing for block effects. Strictly, the F-test is not quite valid for block effects, because we have not randomized blocks to experimental units. If we do want to test for differences between blocks, we can use the F-test, but remember that we cannot make causal inferences about blocks. If we are only interested in whether blocking has reduced the MSE, we can look at the F-value for blocks: blocking has reduced the MSE iff F > 1 (iff = if and only if).
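To see this effect, it can be instructive to fit both models and compare the ANOVA tables; a minimal sketch with simulated (made-up) data:

--------------------------------------------------
set.seed(8)
dat <- expand.grid(treatment = factor(1:4), block = factor(1:5))
dat$y <- rnorm(20, mean = as.numeric(dat$treatment) + 2 * as.numeric(dat$block))
m.crd <- aov(y ~ treatment, data = dat)          # ignores blocks
m.rbd <- aov(y ~ block + treatment, data = dat)  # accounts for blocks
anova(m.crd)
anova(m.rbd)
## SS.treatment is identical in both tables; the RBD splits the CRD's SSE
## into SS.blocks plus a smaller SSE, on fewer degrees of freedom.
--------------------------------------------------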
Example

Executives were exposed to one of 3 methods of quantifying the maximum risk premium they would be willing to pay to avoid uncertainty in a business decision. The three methods were: 1) U: utility method, 2) W: worry method, 3) C: comparison method. After using the assigned method, the subjects were asked to state their degree of confidence in the method, on a scale from 0 (no confidence) to 20 (highest confidence).

Table 5.2: Layout and randomization for the risk premium experiment.

                              Experimental Unit
Block                          1    2    3
1 (oldest executives)          C    W    U
2                              C    U    W
3                              U    W    C
4                              W    U    C
5 (youngest executives)        W    C    U

You can see that the experimenters blocked for age of the executives. This would have been a reasonable thing to do if they expected, for example, lower confidence in older executives, i.e. a different response due to inherent properties of the experimental units (here, the executives). We have a randomized block design with blocking factor age, treatment factor method of quantifying the risk premium, and response = confidence in the method. The executives in one block are of similar age. If the experiment was conducted correctly, the three methods were randomly assigned to the three experimental units in each block.

Here is the ANOVA table. NOTE: one source of variation (SS) for every term in the model!

------------------------------------------------
> m1 <- aov(rate ~ block + method)
> summary(m1)
           Df   Sum Sq  Mean Sq  F value     Pr(>F)
block       4  171.333    42.83    14.37  0.0010081
method      2  202.800   101.40    34.03  0.0001229
Residuals   8   23.867     2.98
------------------------------------------------

We are mainly interested in the treatment factor, method. We can use the ANOVA table to test whether the method of quantifying risk affects confidence. For this test we set up H0 : α1 = α2 = α3 = 0 (method has no effect on confidence). The result of this test suggests that average confidence differs between methods (p = 0.0001).

What about the block effects (age)? There is evidence for differences in confidence between the age groups (p = 0.001). And because the F-value is much larger than 1, we know that we have not wasted degrees of freedom: by blocking for age we have been able to reduce the experimental error variance, and thus to increase the power to detect treatment effects. When interpreting block effects we are only allowed to talk about differences, not about causal effects! This is because we have not randomly assigned age to the executives, and age could be confounded with many other unknown factors. We can talk about association, but not causation.

Is it reasonable to assume that block and treatment effects are additive? The interaction plot can give some indication of how acceptable this assumption is. Usually we plot the treatments on the x-axis, and each block is represented by a line or trace. On the y-axis we show the mean response for treatment i and block j. Because in an RBD there is usually only a single observation per treatment-block combination, the points shown here are simply the observations. If block and treatment do not interact, i.e. method and age do not interact, the lines should be roughly parallel: no interaction means that the effect of method i is the same in every block, so the change in mean response when moving from method 1 to method 2 should be roughly the same in each block. HOWEVER, remember that here the points are based on single observations, and we still expect some variation between executives as part of the natural variation between experimental units.

Figure 5.2: Interaction plot for the risk premium data. In R: interaction.plot(method, block, rate, cex.axis = 1.5, cex.lab = 1.5, lwd = 2, ylab = "confidence rating") — first the factor that goes on the x-axis, then the trace factor, then the response.

Even though the lines are not exactly parallel here, there is no indication that the effect of method is very different in the different blocks. Just as we did not (and could not!) show that the residuals ARE normal, and only worry if they are drastically non-normal, here we only worry if there are clear indications of interactions. There are not, and averaging the effects over blocks gives a reasonable indication of what happens when the different methods of risk assessment are used.
Moving beyond the ANOVA, we might now want to compare the treatment means directly, to find out which method results in the highest confidence, and to find out HOW BIG the differences are. We can do this exactly as we did for CRDs. For example, to compare two treatment means we need the standard error of a treatment mean and the standard error of a difference between treatment means.

SE(Ȳi·) = √Var(Ȳi·) = √(2.98/5) = 0.77

This is a measure of the uncertainty of a specific mean (how close it is to the true treatment mean, or how well it estimates the true treatment mean). The variance of repeated observations from this particular treatment is estimated by MSE. Each treatment mean is based on 5 observations (one from each block). The standard error of the difference between two means is:

SED = √(2 × MSE/5) = √(2 × 2.98/5) = 1.09
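As a quick check, these standard errors, and a confidence interval for any particular difference, can be computed directly from the ANOVA output; a minimal sketch with the numbers above (object names are ours, and the difference of 3 is purely illustrative):

--------------------------------------------------
mse <- 2.98; b <- 5; df.e <- 8
se.mean <- sqrt(mse / b)      # 0.77
sed <- sqrt(2 * mse / b)      # 1.09
## 95% CI for an (illustrative) difference of 3 between two method means:
3 + c(-1, 1) * qt(0.975, df.e) * sed
--------------------------------------------------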
5.1 The Analysis of the RBD

Suppose we wish to compare a treatments and have N experimental units arranged in b blocks, each containing a homogeneous experimental units: N = ab. The a treatments, say A1, A2, . . . , Aa, are assigned to the units in the jth block at random. Let Yij be the response to the ith treatment in the jth block. The linear model for the RBD is:

Yij = µ + αi + β j + eij ,   i = 1, . . . , a and j = 1, . . . , b

where

∑ᵢ₌₁ᵃ αi = ∑ⱼ₌₁ᵇ β j = 0
µ : overall mean
αi : effect of the ith treatment
β j : effect of the jth block
eij : random error of observation; eij ∼ N(0, σ²) and independent

(In general, a statistical model should be as simple as possible, and only as complicated as necessary — Occam's razor, parsimony. In experimental design, however, the structural part of the model is dictated by the design, and not much can be added or changed, apart from interaction terms.)

This model says that the response depends on a treatment effect, a block effect and the overall mean. It also says that these effects are additive. In other words, we now have a × b distributions/populations, corresponding to the a treatments in each of the b blocks. The means of these a × b populations are given by

µ + αi + β j ≡ the property of additivity (no interaction)

The variance is assumed equal in all of these populations.

What do we mean by additive effects? We are assuming that the effect of the ith treatment on the response is the same (αi) regardless of the block in which the treatment is used. Similarly, the effect of the jth block is the same (β j) regardless of the treatment. If the additivity assumption is not valid, the effect of treatment i will differ depending on the block. The response can then not be described by the model above; we need another term, the interaction effect, which describes the difference between the effect of treatment i in block j and the additive model. To be able to estimate these interaction effects we need at least 2 replicates of each treatment in each block. In general, for randomised block designs, we make the assumption of additivity, but then need to check it.

Estimation of µ, αi (i = 1, 2, . . . , a) and β j (j = 1, 2, . . . , b)

When assuming a normally distributed error term, the maximum likelihood and least squares estimates of the parameters are the same and are found by minimizing

S = ∑i ∑j (Yij − µ − αi − β j)²

Differentiate with respect to µ, αi and β j and set equal to 0:

∂S/∂µ = −2 ∑ij (Yij − µ − αi − β j) = 0
∂S/∂αi = −2 ∑ⱼ₌₁ᵇ (Yij − µ − αi − β j) = 0 ,   i = 1, . . . , a
∂S/∂β j = −2 ∑ᵢ₌₁ᵃ (Yij − µ − αi − β j) = 0 ,   j = 1, . . . , b

Note the limits of the summations. Using the constraints we find the a + b + 1 normal equations

abµ = Y··
bαi + bµ = Yi·
aβ j + aµ = Y·j

whose unique solution is

µ̂ = Ȳ··
α̂i = Ȳi· − Ȳ·· ,   i = 1, . . . , a
β̂ j = Ȳ·j − Ȳ·· ,   j = 1, . . . , b

The unbiased estimate of σ² is found by substituting these estimates into the SSE:

SSresidual = ∑ij (Yij − µ̂ − α̂i − β̂ j)²
           = ∑ij (Yij − Ȳ·· − (Ȳi· − Ȳ··) − (Ȳ·j − Ȳ··))²
           = ∑ij (Yij − Ȳi· − Ȳ·j + Ȳ··)²

and

σ̂² = SSresidual / ((a − 1)(b − 1))

Parameter                 Point Estimate        Variance
µ                         Ȳ··                   σ²/(ab)
αi                        Ȳi· − Ȳ··             σ²(a − 1)/(ab)
β j                       Ȳ·j − Ȳ··             σ²(b − 1)/(ab)
αi − αi′                  Ȳi· − Ȳi′·            2σ²/b
∑ hi αi (with ∑ hi = 0)   ∑i hi Ȳi·             (σ²/b) ∑i h²i
µ + αi + β j              Ȳi· + Ȳ·j − Ȳ··       σ²(a + b − 1)/(ab)
σ²                        s²

Analysis of Variance for the Randomised Block Design

The model is Yij = µ + αi + β j + eij, so Yij − µ = αi + β j + eij. Replacing the parameters by their estimates gives

Yij − Ȳ·· = (Ȳi· − Ȳ··) + (Ȳ·j − Ȳ··) + (Yij − Ȳi· − Ȳ·j + Ȳ··)

Squaring and summing over i and j gives

∑ij (Yij − Ȳ··)² = b ∑i (Ȳi· − Ȳ··)² + a ∑j (Ȳ·j − Ȳ··)² + ∑ij (Yij − Ȳi· − Ȳ·j + Ȳ··)²

since the cross products vanish when summed. This can be written symbolically as

SStotal = SSA + SSB + SSe

with degrees of freedom

(ab − 1) = (a − 1) + (b − 1) + (a − 1)(b − 1)

Thus the total sum of squares splits into three sums of squares, for treatments, blocks and error respectively. Using the theory of quadratic forms, the sums of squares are independent and, when divided by σ², each has a χ² distribution (Cochran's Theorem). Since E(Ȳi· − Ȳ··) = αi,

E(MStreat) = E[ SStreat/(a − 1) ] = E[ (b/(a − 1)) ∑i (Ȳi· − Ȳ··)² ] = σ² + (b/(a − 1)) ∑ α²i

as for the CRD, except that now blocks are the replicates. Also

E(MSblocks) = σ² + (a/(b − 1)) ∑j β²j   and   E(MSE) = σ²

So

F = MSA / MSE ∼ F(a−1),(a−1)(b−1)

To test H0 : α1 = α2 = . . . = αa = 0, reject H0 if F > F^α_{(a−1),(a−1)(b−1)}. If H0 is false, MSA/MSE has a non-central F distribution with (a − 1) and (a − 1)(b − 1) degrees of freedom and non-centrality parameter

λ = b ∑ α²i / σ²

This distribution can be used to find the power of the F-test, and to determine the number of blocks needed to guarantee a specific power (see Chapter 6).

Table 5.3: Analysis of Variance table for the Randomised Block Design with model Yij = µ + αi + β j + eij.

Source        SS                                     df              MS                     F               EMS
Treatments A  SSA = b ∑i (Ȳi· − Ȳ··)²                a − 1           SSA/(a − 1)            MSA/MSE         σ² + b ∑ α²i /(a − 1)
Blocks B      SSB = a ∑j (Ȳ·j − Ȳ··)²                b − 1           SSB/(b − 1)            MSblocks/MSE    σ² + a ∑ β²j /(b − 1)
Error         SSE = ∑ij (Yij − Ȳi· − Ȳ·j + Ȳ··)²     (a − 1)(b − 1)  SSE/((a − 1)(b − 1))
Total         SStotal = ∑ (Yij − Ȳ··)²               ab − 1

The hypothesis of interest here is H0 : α1 = . . . = αa = 0. If we find differences between the treatments, they are further investigated by looking at specific contrasts (Chapter 4):

1. planned comparisons: t-tests and confidence intervals
2. orthogonal contrasts
3. orthogonal polynomials, if the treatments are ordered and equally spaced
4. unplanned comparisons

Computing Formulae

C = (∑ Yij)²/(ab) = abȲ··²
SS"tot" = ∑ Y²ij − C
SSA = b ∑ Ȳ²i· − C
SSB = a ∑ Ȳ²·j − C
SSe = SS"tot" − SSA − SSB

Estimates

µ̂ = Ȳ·· ,   α̂i = Ȳi· − Ȳ·· ,   β̂ j = Ȳ·j − Ȳ··

Example: Timing of Nitrogen Fertilization for Wheat

Current recommendations for nitrogen fertilisation were developed using periodic stem tissue analysis of the nitrate content of the plant.
This was thought to be an effective way to monitor the nitrogen content of the crop and a basis for predicting optimum production. However, stem nitrate tests were found to over-predict nitrogen amounts. Consequently, the researcher wanted to evaluate the effect of several different fertilization timing schedules on stem tissue nitrate amounts and wheat production, to refine the recommendation procedure (Source: Kuehl 2000).

The treatment structure included six different nitrogen application timing and rate schedules, thought to provide the range of conditions necessary to evaluate the process. For comparison, a control treatment of no nitrogen was included, as was the current standard recommendation.

The experiment was conducted in an irrigated field with a water gradient along one direction of the experimental plot area, a result of the irrigation. Since plant responses are affected by variability in the amount of available moisture, the field plots were grouped into blocks of six plots such that each block occurred in the same part of the water gradient. Thus, any differences in plant response caused by the water gradient could be associated with the blocks. The resulting experimental design was a randomized (complete) block design with four blocks of six field plots, to which the nitrogen treatments were randomly allocated. The layout of the experimental plots in the field is shown in Table 5.4, with the observed nitrate content (ppm ×10²) from a sample of wheat stems for each plot.

Table 5.4: Observed nitrate content (ppm ×10²) from samples of wheat stems from each plot. The number before each observation is the treatment number; the irrigation gradient (⇓) runs across the blocks.

Block 1:  2 40.89 | 5 37.99 | 4 37.18 | 1 34.98 | 6 34.89 | 3 42.07
Block 2:  1 41.22 | 3 49.42 | 4 45.85 | 6 50.15 | 5 41.99 | 2 46.69
Block 3:  6 44.57 | 3 52.68 | 5 37.61 | 1 36.94 | 2 46.65 | 4 40.23
Block 4:  2 41.90 | 4 39.20 | 6 43.29 | 5 40.45 | 3 42.91 | 1 39.97

The linear model for this randomized block design is

Yij = µ + αi + β j + eij

where µ is the overall mean, αi is the nitrogen treatment effect, β j is the block effect, and eij is the experimental error, assumed ∼ N(0, σ²). Treatment and block effects are assumed to be additive.

Treatment Means:
----------------------------------------------
control       2       3       4       5       6
  38.28   44.03   46.77   40.62   39.51   43.23
----------------------------------------------

Block Means:
-----------------------------
     1       2       3       4
 38.00   45.89   43.11   41.29
-----------------------------

ANOVA Table:
----------------------------------------------
            Df  Sum Sq  Mean Sq  F value    Pr(>F)
TREATMNT     5  201.32   40.263   5.5917  0.004191
BLOCK        3  197.00   65.668   9.1198  0.001116
Residuals   15  108.01    7.201
----------------------------------------------

The blocked design will markedly improve the precision of the estimates of the treatment means if the reduction in SSE due to blocking is substantial. The F statistic for testing differences among the treatment means is F = 5.59, with p-value 0.004, suggesting differences between the nitrogen treatments with respect to stem nitrate. There is usually little interest in formal inference about block effects, although we might be interested in whether blocking increased the efficiency of the design, which it did if F > 1.

Treatment 4 was the standard fertilizer recommendation for wheat. We could now compare each of the treatments to treatment 4, to see whether any differ from the current recommended treatment. The control gives a means of evaluating the nitrogen available without fertilization.
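Comparisons of every treatment against one reference treatment (here, treatment 4) are 'many-to-one' comparisons; one way to obtain them with adjusted confidence intervals is Dunnett's method, e.g. via the multcomp package. A sketch, assuming a data frame wheat with the columns below and with treatment 4 set as the reference level (all names are ours):

--------------------------------------------------
## library(multcomp)
## wheat$TREATMNT <- relevel(wheat$TREATMNT, ref = "4")
## m <- aov(nitrate ~ BLOCK + TREATMNT, data = wheat)
## confint(glht(m, linfct = mcp(TREATMNT = "Dunnett")))
--------------------------------------------------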
5.2 Missing values – unbalanced data

The easy analysis of the randomised block design depends on having an observation in each cell of the two-way table, i.e. each treatment appearing exactly once in each block. We call this a balanced design. Balanced designs ensure that block and treatment effects can be estimated independently, which greatly simplifies the interpretation of results. More generally, data or designs are balanced when we have the same number of observations for all factor-level combinations. Missing observations result in unbalanced data.

What happens if some of the observations in our RBD experiment are missing? This could happen if an experimental unit runs away, explodes, dies, or becomes sick during the experiment and can no longer participate. Then we no longer have a balanced design.

Refer back to the layout of the RBD (Table 5.1). If we have no missing observations and we compare treatment 1 to treatment 2, then on average the two treatments don't differ with respect to anything except the treatment itself: exactly the same block contributions are made to each treatment mean, and there are no interactions, which means the block effect is the same for each treatment. Now, what happens if one observation is missing? The problem is the same as in regression, where coefficients are interpreted conditional on the values of all other variables in the model: there, the variables are all more or less correlated, so it is not entirely possible to isolate the effect of a single predictor variable; the coefficient (effect) depends on which other terms are in the model. The same happens when the data from an experiment become unbalanced. For example, the treatment i effect can no longer be estimated by Ȳi· − Ȳ·· ; this would give a biased estimate of αi.

There are two strategies for dealing with unbalanced data. The first is to estimate the missing value, substitute it back in, but reduce the error degrees of freedom accordingly. The advantage is that the data become balanced, and the results are as easy to interpret as before: we can exactly attribute variation caused by differences between treatments and variation caused by differences between blocks. The second strategy is to fit a regression model. The least squares estimates from this model will still give you the best possible estimates of the treatment effects, provided you have accounted for blocks, i.e. the blocking factor must be in the model. The sums of squares can no longer be split exactly, but we would base our F-test for treatments on the change in variation explained relative to the full model without the treatment term, i.e. the change in variation explained when the treatment factor is added last.

You don't need to remember the formulae for estimating the missing values. You would get the same value when fitting a regression model and from that obtaining the estimate for the missing value; you should know how to go about this in practice (both strategies).

1. In the case of only one or two observations missing, one could estimate the value of the missing observation from the other observations. The error degrees of freedom are reduced accordingly, by the number of estimated observations.

                          Blocks
Treatment     1    2   ···    j    ···    b   | Treatment Totals
1             -    -          -           -   |
2             -    -          -           -   |
...                                           |
i             -    -         Yij          -   | T′
...                                           |
a             -    -          -           -   |
----------------------------------------------|
Block Totals                 B′               | G′   (N = ab)

Suppose observation Yij is missing.
Let u be our estimate of the missing observation. The least squares estimate of the observation Yij would be

u = µ̂ + α̂i + β̂ j

Let T′ be the sum of the (b − 1) observations on the ith treatment, B′ the sum of the (a − 1) observations in the jth block, and G′ the sum of the N − 1 observations in the whole experiment. Then

µ̂ = (G′ + u)/N
α̂i = (T′ + u)/b − (G′ + u)/N
β̂ j = (B′ + u)/a − (G′ + u)/N

So

u = (G′ + u)/N + [ (T′ + u)/b − (G′ + u)/N ] + [ (B′ + u)/a − (G′ + u)/N ]

Hence

u = (aT′ + bB′ − G′) / ((a − 1)(b − 1))

The estimate of the missing value is a linear combination of the other observations. It can be shown that it is the value u which minimizes the SSE when the ordinary ANOVA is carried out on the N data points (the (N − 1) actual observations and u).

Since the missing value is a linear combination of the other observations, it follows that Ȳi· — the ith treatment mean — is correlated with the other means. If there is a missing observation on the ith treatment, it can be shown that the variance of the estimated difference between treatment i and any other treatment i′ is

Var(Ȳi· − Ȳi′·) = σ² [ 2/b + a/(b(b − 1)(a − 1)) ]

If there are 2 missing values we can repeat the procedure above and solve the simultaneous equations

u1 = µ̂ + α̂i + β̂ j
u2 = µ̂ + α̂i′ + β̂ j′

One degree of freedom is subtracted from the error degrees of freedom for each missing value estimated. Thus the degrees of freedom of s² (MSE) are (a − 1)(b − 1) − k, where k is the number of missing values. The degrees of freedom of the F tests are adjusted accordingly.

2. Alternatively, one can estimate the parameters using a linear regression model. But because treatments and blocks are no longer orthogonal (independent), the order in which the terms enter the model becomes important, and interpretation may become more difficult. The estimates obtained from fitting a regression model are 'last one in' estimates: they estimate the change in response after all other variables in the model have explained what they can, i.e. variation in the residuals. So are the t-tests. If we want an ANOVA table constructed in the same way (last one in), we cannot use R's aov function, which calculates sequential SS. Sequential ANOVA tables test the change in variance explained when adding each term, given all previous terms in the model; the SSs and F-tests will therefore change and give different results depending on the order in which the terms appear in the model. The Anova function in R's car package uses Type II sums of squares, i.e. last-one-in SSs, as the regression t-tests do: each SS is calculated as the change in SS explained relative to the model containing all other terms.
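A sketch of the second strategy in R, assuming a data frame dat (columns y, block, treatment; all names are ours) in which one response is NA:

--------------------------------------------------
## m <- lm(y ~ block + treatment, data = dat)  # blocks must be in the model
## car::Anova(m, type = 2)                     # last-one-in (Type II) tests
## predict(m, newdata = missing.cell)          # also recovers the estimate u
--------------------------------------------------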
5.3 Randomization Tests for Randomized Block Designs

Example: Boys' Shoes

Measurements were made of the amount of wear of the soles of the shoes worn by 10 boys. The shoe soles were made of two different synthetic materials, A and B. Each boy wore a pair of special shoes, the sole of one shoe having been made with A and the sole of the other with B. The decision as to whether the left or right sole was made with A or B was made by the flip of a coin. The following table gives the data. It suggests that A showed less wear than B, since (B − A) > 0 for most boys.

Boy              1     2     3     4     5     6     7     8     9    10
Material A    13.2   8.2  10.9  14.3  10.7   6.6   9.5  10.8   8.8  13.3
  side           L     L     R     L     R     L     L     L     R     L
Material B    14.0   8.8  11.2  14.2  11.8   6.4   9.8  11.3   9.3  13.6
  side           R     R     L     R     L     R     R     R     L     R
d = B − A      0.8   0.6   0.3  -0.1   1.1  -0.2   0.3   0.5   0.5   0.3

Average difference d̄ = 0.41

Since material A was the standard, and B a cheaper substitute, we wished to test whether B resulted in increased wear. This implies

H0 : µA = µB (no difference in wear between materials A and B)
H1 : µA < µB (increased wear with B)

For matched pairs we instead write the hypotheses as follows:

H0 : µD = 0 (no difference in wear between materials A and B)
H1 : µD < 0 (increased wear with B)

where µD is the mean difference in wear, calculated as wear with material A − wear with material B.

The observed sequence of tosses leading to the above treatment allocation was (heads implies A worn on the right foot):

T T H T H T T T H T

Under H0, A and B are merely labels and could be swapped without making a difference. Hence boy 1 could have worn A on the right and B on the left foot, resulting in a difference of B − A = 13.2 − 14.0 = −0.8. Similarly, boy 6 could have worn A on the right and B on the left foot, giving B − A = 6.6 − 6.4 = +0.2. The actual values of wear, and hence the absolute values of the differences, do not change, but the signs associated with these differences do. The given sequence of coin tosses is one of 2¹⁰ = 1024 equally probable outcomes: there are 2 orderings for each pair, and the 10 pairs are independent, hence 2 × 2 × 2 × . . . = 2¹⁰ different orderings.

To test H0, the observed average difference of 0.41 may be compared with all 1024 average differences that could have occurred as a result of different outcomes of the coin tosses. To obtain these 1024 average differences we need to average the differences over all possible combinations of + and − signs. This is hard work! So let's think about how we can obtain average differences greater than the observed 0.41: only when the positive differences stay the same and one or both of the negative differences become positive (since the 2 negative differences were associated with the smallest absolute values!). This gives 3 possible average differences > 0.41. Four further sign combinations give values of d̄ = 0.41. This implies a p-value of 7/1024 = 0.007.

Questions

1. What are your conclusions about the shoe sole materials?
2. What parametric test could be used for the above example? What are its assumptions about the data?
3. How does its p-value compare to the one obtained above?
4. What is the basic idea of randomization tests; how do they work?
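In R, the full randomisation distribution (all 1024 sign combinations) can be enumerated directly; a minimal sketch (object names are ours):

--------------------------------------------------
d <- c(0.8, 0.6, 0.3, -0.1, 1.1, -0.2, 0.3, 0.5, 0.5, 0.3)
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), 10)))  # all 2^10 sign patterns
all.means <- as.vector(signs %*% abs(d)) / 10  # mean difference for each pattern
mean(all.means >= mean(d) - 1e-9)              # 7/1024 = 0.0068
## (the small tolerance guards against floating-point ties at exactly 0.41)
--------------------------------------------------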
5.4 Friedman Test

The Friedman test is a non-parametric test for differences between treatments (k ≥ 3) in a randomized block design (several related samples). It is the non-parametric equivalent of the F-test in a two-way ANOVA, and an extension of matched pairs to more than two samples. The test can be used on ranked data.

The null and alternative hypotheses are:

H0 : the populations are identical
H1 : the populations are not identical (at least one treatment tends to yield larger values than at least one other treatment)

The test statistic is calculated as follows:

• Rank the observations within each block.
• Let Rj = sum of the ranks in column (treatment) j, (j = 1, . . . , k): Rj = ∑ᵢ₌₁ᵇ R(Xij).

Then

T = [12 / (bk(k + 1))] ∑ⱼ ( Rj − b(k + 1)/2 )² = [12 / (bk(k + 1))] ∑ⱼ R²j − 3b(k + 1)

If ties are present, the statistic T needs to be adjusted. In that case one can use the sums of squares of the ranks to calculate the statistic

T2 = MStreatment / MSE ∼ F k−1,(b−1)(k−1) ,

approximately, where the mean squares are computed from the ranks. This is always a one-sided test.

Critical Values

• For small b and k, use the Friedman table.
• For large b and k: T ≈ χ²k−1.

Example

Six quality control laboratories are asked to analyze 5 chemicals, to see whether they are performing the analyses in the same manner. Determine whether any of the labs differ from the others, given the within-chemical ranks below:

                   Lab
Chemical     1    2    3    4    5    6
A            1    5    3    2    4    6
B            3    2    1    4    6    5
C            3    4    2    5    1    6
D            1    4    6    3    2    5
E            4    5    1    2    3    6
Rj          12   20   13   16   16   28

H0 : all labs identical
H1 : labs not identical

k = 6, b = 5

T = [12 / (5 × 6 × 7)] (12² + 20² + 13² + 16² + 16² + 28²) − 3 × 5 × 7 = 9.8

Using T ∼ χ²5 we obtain a p-value of 0.08. There is little evidence to indicate that the labs are performing the analyses differently.
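R's built-in friedman.test() gives the unadjusted statistic directly; a sketch with the ranks above (when given a matrix, it treats rows as blocks and columns as groups):

--------------------------------------------------
ranks <- matrix(c(1, 3, 3, 1, 4,   5, 2, 4, 4, 5,   3, 1, 2, 6, 1,
                  2, 4, 5, 3, 2,   4, 6, 1, 2, 3,   6, 5, 6, 5, 6),
                nrow = 5,
                dimnames = list(chemical = LETTERS[1:5], lab = 1:6))
friedman.test(ranks)   # chi-squared = 9.8, df = 5, p-value = 0.081
--------------------------------------------------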
5.5 The Latin Square Design

A Latin Square of order p is an arrangement of p letters, each repeated p times, into a square of side p, so that each letter appears exactly once in each row and once in each column (a bit like Sudoku). It is called 'Latin' because the treatments are usually denoted by the Latin letters A, B, C, . . .

Two Latin Squares of order 2:

A B      B A
B A      A B

Three Latin Squares of order 3:

A B C      C A B      B A C
B C A      B C A      C B A
C A B      A B C      A C B

A Latin Square can be changed into another one of the same order by interchanging rows and columns. The Latin Square is a useful way of blocking for 2 factors, each with p levels, without increasing the number of treatments. The rows of the square denote one blocking factor and the columns the other. The entries are the p treatments which are to be compared. We require only p² experimental units. If we had made an observation on each combination of the 2 blocking factors and the p treatments, we would have needed p³ experimental units.

Model for a Latin Square Design Experiment

Let Yijk = the observation on the kth treatment in the ith row and jth column of the square. Then a suitable model is

Yijk = µ + αiᴿ + β jᶜ + γkᵀ + eijk

with identifiability constraints ∑i αi = ∑j β j = ∑k γk = 0, where

µ : general mean
αi : ith row effect
β j : jth column effect
γk : kth treatment effect
eijk : random error of observation; eijk ∼ N(0, σ²) and independent

Note that we cannot write i = 1, . . . , p, j = 1, . . . , p, and k = 1, . . . , p, because not all triplets (i, j, k) appear in the experiment. So we write {ijk} ∈ D, where D is the set of all triplets appearing. When calculating, the set D is obvious — it is only the derivation that is awkward notationally. We put the subscripts in brackets to denote that we sum over the ijk's actually present.

To obtain the least squares estimates we minimize

S = ∑{ijk} (Yijk − µ − αi − β j − γk)²

∂S/∂µ = 0 gives ∑{ijk}∈D Yijk = p²µ, since there are p² observations, so

µ̂ = Ȳ··· = ∑{ijk} Yijk / p²

∂S/∂γk = −2 ∑ij (Yijk − µ − αi − β j − γk) = 0

Using the constraints we find pµ + pγk = Y··k, so

γ̂k = Ȳ··k − Ȳ···

Similarly

α̂i = Ȳi·· − Ȳ···   and   β̂ j = Ȳ·j· − Ȳ···

The error is found by substituting µ̂, α̂i, β̂ j and γ̂k into the SSE to give

SSresidual = ∑{ijk}∈D (Yijk − Ȳi·· − Ȳ·j· − Ȳ··k + 2Ȳ···)²

and

σ̂² = SSresidual / ((p − 1)(p − 2))

Test of the hypothesis H0 : γ1 = γ2 = . . . = γp = 0

As in the other cases, we could derive a likelihood ratio test for H0 by fitting two models — one which contains the γ's and one that does not. We shall use the short method. Consider Yijk − µ̂ = α̂i + β̂ j + γ̂k + êijk:

(Yijk − Ȳ···) = (Ȳi·· − Ȳ···) + (Ȳ·j· − Ȳ···) + (Ȳ··k − Ȳ···) + (Yijk − Ȳi·· − Ȳ·j· − Ȳ··k + 2Ȳ···)

Squaring and summing over the p² (ijk)'s present in the design gives

∑(ijk) (Yijk − Ȳ···)² = p ∑(i) (Ȳi·· − Ȳ···)² + p ∑(j) (Ȳ·j· − Ȳ···)² + p ∑(k) (Ȳ··k − Ȳ···)² + ∑(ijk) (Yijk − Ȳi·· − Ȳ·j· − Ȳ··k + 2Ȳ···)²

and so

SStot = SSrows + SScol + SStreatment + SSE

with df

p² − 1 = (p − 1) + (p − 1) + (p − 1) + (p − 1)(p − 2)

As in previous examples, the SS's are independent and, when divided by σ², each has a χ² distribution with the appropriate degrees of freedom. The ANOVA table for the Latin square design is:

Source      SS                                                   df              MS                      F
Rows        SSrow = p ∑(i) (Ȳi·· − Ȳ···)²                        p − 1           SSrow/(p − 1)           MSrow/MSe
Columns     SScol = p ∑(j) (Ȳ·j· − Ȳ···)²                        p − 1           SScol/(p − 1)           MScol/MSe
Treatment   SStreat = p ∑(k) (Ȳ··k − Ȳ···)²                      p − 1           SStreat/(p − 1)         MStreat/MSe
Error       SSe = ∑(ijk) (Yijk − Ȳi·· − Ȳ·j· − Ȳ··k + 2Ȳ···)²    (p − 1)(p − 2)  SSe/((p − 1)(p − 2))
Total       SS"tot" = ∑(ijk) (Yijk − Ȳ···)²                      p² − 1

Usually only treatments are tested, but row and column differences can be tested in the usual way. The power of the F test can be found using the non-central F with (p − 1) and (p − 1)(p − 2) df and non-centrality parameter

λ = p ∑ γ²k / σ²

Disadvantages of Latin Squares: Latin squares can only be used when the number of treatments equals the number of rows equals the number of columns. The model assumes no interactions between rows, columns and treatments. If interactions are present between the rows and columns, then the treatment and error sums of squares are biased.

Other uses of Latin Squares

The Latin Square can also be used in factorial experiments, for experimenting with 3 factors, each with p levels, using only (1/p)th of the complete design. For example, a 5³ factorial experiment has 125 treatments. If we are willing to assume that there are no interactions, we can arrange the 3 factors in a Latin Square, with rows for one factor, columns for another and letters for the third. We then use only 25 treatments instead of 125, an 80% reduction.

Missing Values

The analysis of unbalanced data from Latin Square designs follows the same lines as for Randomized Block Designs (see Section 5.2). Suppose the Y value corresponding to the ith row, jth column and kth letter is missing. It can be estimated by

u = ( pR′i + pC′j + pT′k − 2G′ ) / ((p − 1)(p − 2))

where
R′i is the ith row sum without the missing value,
C′j is the jth column sum without the missing value,
T′k is the kth treatment sum without the missing value,
G′ is the total sum without the missing value.
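Before an example: generating a randomised p × p Latin square layout is straightforward in R. A minimal sketch (start from a cyclic square, then shuffle rows and columns, which preserves the Latin property):

--------------------------------------------------
p <- 5
square <- outer(1:p, 1:p, function(i, j) LETTERS[(i + j - 2) %% p + 1])  # cyclic
square <- square[sample(p), sample(p)]  # randomise row and column order
square
--------------------------------------------------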
Example: Rocket Propellant

Suppose that an experimenter is studying the effects of five different formulations of a rocket propellant used in aircrew escape systems on the observed burning rate. Each formulation is mixed from a batch of raw material that is only large enough for five formulations to be tested. Furthermore, the formulations are prepared by several operators, and there may be substantial differences in the skills and experience of the operators. (Source: Montgomery DC. Design and Analysis of Experiments.)

A Latin square is the right design to use here, because we only have enough material for 5 replicates of every treatment (formulation), but want to block for batch and for operator. Each operator prepares each formulation exactly once, and each formulation is mixed from each batch of raw material exactly once.

                              Operator
Batch        1        2        3        4        5
1          A = 24   B = 20   C = 19   D = 24   E = 24
2          B = 17   C = 24   D = 30   E = 27   A = 36
3          C = 18   D = 38   E = 26   A = 27   B = 21
4          D = 26   E = 31   A = 26   B = 23   C = 22
5          E = 22   A = 30   B = 20   C = 29   D = 31

Table 5.5: Rocket propellant data and layout; rows are batches of raw material, columns are operators.

              Df   Sum Sq   Mean Sq   F value   Pr(>F)
batch          4    68.00     17.00      1.59   0.2391
operator       4   150.00     37.50      3.52   0.0404
formulation    4   330.00     82.50      7.73   0.0025
Residuals     12   128.00     10.67

Table 5.6: ANOVA table for rocket propellant data.

With Latin square designs we have to assume an additive model (no interactions between the factors). The ANOVA table suggests that there are differences between the types of formulation in terms of burn rate (p = 0.0025), and that there are differences between operators (p = 0.04). There is no indication that burn rate differs between batches.

The next question of interest will be which formulations have the highest burning rates. We have no a-priori hypotheses on these, so a common approach is to compare all formulations pairwise, and use Tukey's method to adjust p-values and confidence intervals.

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = rate ~ batch + operator + propellant, data = rocket)

$propellant
     diff         lwr        upr     p adj
B-A  -8.4 -14.9839317 -1.8160683 0.0110827
C-A  -6.2 -12.7839317  0.3839317 0.0684350
D-A   1.2  -5.3839317  7.7839317 0.9754380
E-A  -2.6  -9.1839317  3.9839317 0.7194121
C-B   2.2  -4.3839317  8.7839317 0.8204614
D-B   9.6   3.0160683 16.1839317 0.0041583
E-B   5.8  -0.7839317 12.3839317 0.0944061
D-C   7.4   0.8160683 13.9839317 0.0254304
E-C   3.6  -2.9839317 10.1839317 0.4461852
E-D  -3.8 -10.3839317  2.7839317 0.3966727

[Figure 5.3: Boxplots of burning rate against formulation for rocket propellant data. On the right are confidence intervals for pairwise differences between formulation means, adjusted using Tukey's method.]

The compact letter display plot (cld in library multcomp) nicely summarises this: treatments with the same letter are not significantly different. Formulation D had the largest mean burning rate, but we can't be sure that this is higher than for formulations A and E; however, the data do suggest that its burning rate is higher than for formulations B and C.

Can we replicate the p-values and confidence intervals from R? Let's try the last line (formulations E and D):

µ̂D − µ̂E = 3.8
SE = √(10.67/5) = 1.460822
p-value = P(q5,12 ≥ 3.8/1.460822) = 0.3968151
          using ptukey(3.8 / 1.460822, nmeans = 5, df = 12, lower.tail = FALSE)
q5,12 = 4.50771   using qtukey(0.95, 5, 12)
CI: 3.8 ± 1.460822 × 4.50771 = [−2.784962, 10.38496]

The values agree to within 2-3 decimal places. That is fine; the difference is because the values we have taken from the above output are rounded, whereas R has used the exact values.

Blocking for 3 factors - Graeco-Latin Squares

This section is just added out of interest. Blocking for three factors, e.g. cars, wheel positions and drivers, can be achieved by using a Graeco-Latin square. A Graeco-Latin square is formed by taking a Latin square and superimposing a second square with the treatments in Greek letters.
For example

A B C D         α γ δ β          Aα Bγ Cδ Dβ
B A D C         β δ γ α          Bβ Aδ Dγ Cα
C D A B   and   γ α β δ   gives  Cγ Dα Aβ Bδ
D C B A         δ β α γ          Dδ Cβ Bα Aγ

If the two squares have the property that each Greek letter coincides with each Latin letter exactly once, then the squares are called orthogonal. Complete sets of (p − 1) mutually orthogonal Latin squares exist whenever p is a prime or a power of a prime. No Graeco-Latin square exists for p = 6.

6 Power and Sample Size in Experimental Design

6.1 Introduction

An important part of planning an experiment is deciding on the number of replicates required so that you will have enough power to detect differences if they are there. Experiments are time consuming and expensive, and it is always worthwhile to invest a little time calculating required sample sizes. These calculations may also show that you would need more replications than you can afford; then it is better not to start the experiment at all, because it is doomed to fail. Although this is often referred to as sample size calculation, in experiments we are not really talking about samples, but about the number of replicates needed per treatment.

Questions:

1. How can an experiment fail if the sample size was too small?

2. What is statistical power, in your own words?

3. Which are the 3 key ingredients that will determine statistical power (in the experimental setting)?

Basically, the smaller the differences (effects) that you want to detect, the larger the sample sizes will have to be!

Power is defined as:

1 − β = Pr[reject H0 | H0 false] = Pr[F > Fα; a−1, N−a | H0 false]

where Fα; a−1, N−a is the critical value of the F test. To calculate power we need the distribution of F if H0 is false. Recall the expected mean squares for treatment. If H0 is false, the test statistic F0 = MStreat/MSE has a noncentral F distribution with a − 1 and N − a degrees of freedom (in the case of a CRD) and noncentrality parameter

λ = r ∑(i=1 to a) (µi − µ̄)² / σ²

where r denotes the number of replicates per treatment. If λ = 0 we have a central F-distribution.

(Where did the noncentral F-distribution come from? Firstly, a noncentral chi-squared distribution results from ∑(i=1 to n) Xi²/σ², where Xi ∼ N(µi, σ²); this noncentral chi-squared distribution has noncentrality parameter λ = ∑ µi²/σ². If the null hypothesis of equal treatment means is false, the treatment effects don't have mean 0 but mean αi, and the treatment SS will have a non-central chi-squared distribution. The non-central F-distribution arises as the ratio of a non-central chi-squared distribution and an independent chi-squared distribution, each divided by its degrees of freedom.)

The power can be calculated for any given r. Power is often chosen to lie around 0.8-0.9. An estimate for σ² can come from knowledge you have or from a pilot study.

Rather than having to specify the size of all effects, it is easier to specify the smallest difference between any two treatment means that would be physically meaningful. Suppose we want to detect a significant difference if any two treatment means differ by D = µi − µj. With a larger non-centrality parameter, Pr[F > c] increases, and power increases. So we want to ensure that the smallest λ compatible with a difference of D will lead to a rejection; this ensures that the power is at least as specified. The minimum λ when there is a difference of at least D is

λ = rD² / (2σ²)
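To see where the factor D²/2 comes from: among all configurations of treatment means whose largest and smallest values are D apart, ∑(µi − µ̄)² is smallest when the two extreme means lie at µ̄ ± D/2 and all the other means are equal to µ̄. Then

∑(i=1 to a) (µi − µ̄)² = (−D/2)² + 0 + · · · + 0 + (D/2)² = D²/2

so the smallest noncentrality parameter compatible with a difference of D is λ = r(D²/2)/σ² = rD²/(2σ²). This is exactly the configuration used in the example below.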
Example: In a study to compare the effects of 4 diets on the weight of 20-day-old mice, the experimenter wishes to detect a difference of 10 grams. The experimenter estimates that the standard deviation σ is no larger than 5 grams. How many replicates are necessary to have a probability of 0.9 of detecting a difference of 10 grams, using the F test with significance level α = 0.05?

A difference of 10 grams means that the maximum and the minimum treatment means differ by 10 grams. The noncentrality parameter is smallest when the other two treatment means are both in the middle, i.e. the four treatment means are a, a + 5, a + 5, a + 10 for some constant a.

Here is some R code for the above example, followed by the output.

----------------------------------------------------------------------------
a     <- 4      # number of treatments (diets)
D     <- 10     # smallest meaningful difference
sigma <- 5
alpha <- 0.05
df1   <- a - 1
for (r in 2:9) {
  df2   <- a * (r - 1)
  ncp   <- r * D^2 / (2 * sigma^2)
  fcrit <- qf(alpha, df1, df2, lower.tail = FALSE)  # critical value of the
                                                    # F-distribution under H0
  power <- 1 - pf(fcrit, df1, df2, ncp)
  cat(r, power, ncp, "\n")
}
----------------------------------------------------------------------------
r   power       ncp
2   0.1698028    4
3   0.3390584    6
4   0.5037050    8
5   0.6442332   10
6   0.7545861   12
7   0.8361289   14
8   0.8935978   16
9   0.9325774   18
----------------------------------------------------------------------------

ncp is the non-centrality parameter. r = 9 replicates will give us a power > 0.90.

6.2 Two-way ANOVA model

Consider factor A with a levels, and a second factor B (a blocking or a treatment factor) with b levels. The non-centrality parameter will be

λ = b ∑(i=1 to a) (µi − µ̄)² / σ²

and power for detecting differences between the levels of A can be calculated similarly to above, except that the degrees of freedom in the F-tests will change.

(These notes refer to power for the special case of the ANOVA F-test. In general we would need to know the distribution of the test statistic under the alternative hypothesis in order to calculate power.)

References

http://www.stat.purdue.edu/~zhanghao/STAT514/handout/chapter03/PowerSampleSize.pdf

7 Factorial Experiments

7.1 Introduction

Up to now we have developed methods to compare a number of treatments. We have thought of the treatments as having no special relationships among themselves, and each treatment influences the response Y on its own. Any other factors which may have influenced Y were removed by blocking or randomisation, so that we could make more sensitive comparisons of the treatments. In the language of experimental design these are called single-factor experiments - the individual treatments are the levels of the factor.

There are many situations where the behaviour of the response Y cannot be understood by looking at factors one at a time. We need to consider the influence of several factors acting simultaneously. For example:

1. The yield of a crop might depend on the amount of nitrogen and the amount of potassium in the fertilizer.

2. The response to a drug may depend on the age of the patient and the severity of the illness.

3. The yield of a chemical compound may depend on the pressure and the temperature at which the chemical reaction takes place.

Factorial experiments allow us to evaluate the effect of each factor on its own and to study the effect of a number of them working together, or interacting.

Example 1: Bond Strength

Three types of adhesive (glue) are being tested in an adhesive assembly of glass specimens.
A tensile test is performed to determine the bond strength of the glass-to-glass assembly. Three different types of assembly (cross-lap, square-centre and round-centre) are tested. The following table shows the bond strength of 45 specimens.

                                  Assembly
Adhesive   Cross-lap           Square-Centre       Round-Centre
047        16 14 19 18 19      17 23 20 16 14      13 19 14 17 21
00T        23 18 21 20 21      24 20 12 21 17      24 21 25 29 24
001        27 28 14 26 17      14 26 14 28 27      17 18 13 16 18

                        No. of Levels    1        2         3
Adhesive (Factor A)     3                047      00T       001
Assembly (Factor B)     3                Cross    Square    Round
Response (Y)            Bond Strength

Model

Yijk = µ + αi^A + βj^B + (αβ)ij^AB + eijk

∑i αi = ∑j βj = ∑i (αβ)ij = ∑j (αβ)ij = 0

7.2 Basic Definitions

Consider an experiment where a response Y is measured a number of times. Each measurement is called a trial.

1. Factor: Any feature of the experiment that can be changed from trial to trial is called a factor. Factors can be qualitative or quantitative.

• Examples of qualitative factors are: colour, sex, social class, severity of disease, residential area. Strictly speaking, sex and social class are not treatment factors. However, one may still be interested in differences (between sexes, say). In this case one analyses such data exactly as a factorial experiment; however, interpretation can only be about association, not causality.

• Examples of quantitative factors are: temperature, pressure, age, income.

Factors are denoted by capital letters, e.g. A, B, C, etc.

2. Levels: The various values of the factor examined in the experiment are called its levels. Suppose temperature (T) is a factor in the experiment. Then the levels of T might be chosen as 0°C, 10°C, 20°C. If colour is a factor C, then the levels of C might be Red, Green, Blue. Sometimes the levels of a quantitative factor are treated qualitatively, e.g. the levels of temperature T are cold, warm and hot. The levels of a factor are denoted by subscripts: T1, T2, T3 are the levels of factor T.

3. Treatment: A combination of a single level from each factor in the experiment is called a treatment.

Example: Suppose we wish to determine the effects of temperature (T) and pressure (P) on the yield Y of a chemical compound. If T has two levels, 0° and 10°, and P has three levels, Low, Medium and High, the treatments would be:

0° and Low pressure        10° and Low pressure
0° and Medium pressure     10° and Medium pressure
0° and High pressure       10° and High pressure

There are 2 × 3 = 6 treatments in the experiment, and a number of measurements on Y would be made on each of the treatments.

4. Effect of a factor: The change in the response produced by a change in the level of the factor is called the effect of the factor. There are two types of effects, main effects and interaction effects.

5. Main effect: A main effect is the average change in the response produced by changing the level of a single factor; it is the average over all the levels of the other factors in the experiment. Thus in the experiment above, the main effect of a temperature of 0° would be the average change in yield of the compound, averaged over the three pressures low, medium and high, relative to the overall mean. All the effects we have looked at so far were main effects.

6. Interaction: If the effect of a factor depends on the level of another factor that is present, then the two factors interact.
For example, consider the amount of risk people are willing to take, and the two factors gender and situation. Women might be willing to take high risks in one situation but very little risk in another, while for men this pattern might be exactly the opposite. So the response depends not only on situation or only on gender; one has to look at the particular combination of factor levels. Therefore, if interactions are present, it is not meaningful to interpret the main effects on their own: it is not very informative to know what women risk on average; one will have to look at the combination of gender and situation to understand the willingness to take risks.

7. Interaction effect: The interaction effect is the change in response (compared to the overall mean) over and above the main effects, at a certain combination of factor levels.

8. Fixed and Random effects: Effects can also be classified as fixed or random. Suppose the experiment were repeated a number of times. If the levels of the factors are the same each time the experiment is repeated, then the effects are called fixed, and the results only apply to the levels used in the experiment. If the levels of a factor are chosen at random from a population of levels each time the experiment is repeated, then the effects are called random.

Example: If we are only interested in temperatures of 0° and 10°, then temperature would be a fixed effect: if the experiment were repeated we would use 0° and 10° again. If we were interested in the range of temperatures from 0° to 20°, say, and each time we ran the experiment we decided on two temperatures at random, then temperature would be a random effect. The arithmetic of the analysis of variance is exactly the same for both fixed and random effects, but the interpretations of the results, the expected mean squares and the tests are completely different. We shall deal mainly with fixed effects.

9. Complete factorial experiment: In a complete factorial experiment every combination of factor levels is studied, and the number of treatments is the product of the numbers of levels of each factor.

Example: If we examine the effect of 3 factors A, B and C on response Y, and A has two levels, B has 3 and C has 5, then we have 2 × 3 × 5 = 30 treatments. The design is called a 2 × 3 × 5 factorial design.

7.3 Design of Factorial Experiments

Factorial refers to the treatment structure. A factorial experiment is an experiment in which we have at least 2 treatment factors, and the treatments are constructed by crossing the treatment factors, i.e. the treatments are all possible combinations of factor levels (mathematically, the Cartesian product). This describes full factorial experiments. There are also fractional factorial experiments, where some of the treatment combinations are left out by design, but we will not deal with those in this course.

Very often, more than one factor affects the response. For example, in a chemical experiment, temperature and pressure affect yield. In an agricultural experiment, nitrogen and phosphate in the soil affect yield. In a sports health experiment, improvement in fitness may not only depend on the physical training program, but also on the type of motivation offered.

Example: Effect of selling price and type of promotional campaign on number of items sold

Effect of selling price (R55, R60, R65) and type of promotional campaign (radio, newspaper, website pop-ups) on the number of products sold (new type of cell-phone contract).
There are two treatment factors (price and type of promotion). If we are going to use a factorial treatment structure, there are 3 × 3 = 9 treatments. The experimental units could be different towns.

[Figure 7.1: Illustration of different ways of investigating two treatment factors: grids of price (R55, R60, R65) against type of campaign (radio, web, newspaper). The first two panels are one-factor-at-a-time experiments; the panel on the RHS, with all 9 price × campaign combinations marked, illustrates a factorial experiment.]

If we only experiment with one factor at a time, we need to keep all other factors constant. But in this way, we can never find out whether factor A would have influenced the response differently at another level of B, i.e. we cannot look at interactions (if we did two separate experiments to repeat all levels of A at another level of B, then B and time would be confounded).

Interactions

We have encountered interactions in the RBD chapter. There we specifically assumed NO interactions. Interactions are often the most interesting parts of experiments, and with factorial experiments we can investigate them.

Factors A and B are said to interact if the effects of factor A depend on the level of factor B (or the other way around).

[Figure 7.2: Two different scenarios of response in a factorial experiment. On the LHS, factors A and B do not interact; on the RHS, factors A and B do interact. Y is the response; A is a factor with levels a1 and a2, B is a factor with levels b1 and b2. The points indicate the mean response at a certain treatment, or factor level combination.]

Consider Figure 7.2. This is a factorial experiment (although it could also be illustrating results in a RBD, where B, with levels b1 and b2, would denote a blocking factor instead of a second treatment factor). Remember what we mean by effect: the change in mean response relative to some baseline level. In ANOVA models, effect usually refers to the change in mean response relative to the overall mean. But, to understand interactions, I am here going to use effect as the change in mean response relative to the baseline level (the first level of the factor).

In the left-hand plot, when changing from a1 to a2 at level b1, the mean response increases by a certain amount. When changing from a1 to a2 at level b2, that change in mean response is exactly the same. In other words, the effect of A (when changing from a1 to a2) on the mean response does not depend on the level of B, i.e. A and B do not interact.

In the right-hand plot, when changing from a1 to a2 at level b1, the mean response increases; when changing from a1 to a2 at level b2, the mean response decreases: the effect of A depends on the level of B, i.e. A and B interact.

If there is no interaction, the lines will be approximately parallel (remember random variation).

(Constructing interaction plots: 1. If one of the variables is continuous (even if only measured at a few discrete points), this should go on the x-axis. 2. Putting the variable with more levels on the x-axis makes the plot easier to interpret.)

Note that interaction (between A and B) and independence of A and B are two completely different concepts! We have designed the experiment so that A and B are independent! Yet they can
The interaction refers to what happens to the response at a certain combination of factor levels, not whether A and B are correlated or not. Interactions are interesting because they will indicate particularly good or bad combinations and how one factor effects the response in the presence of another factor, and very often the response of one factor depends on what else you manipulate or keep constant. x x x 60 x x x 55 x x x radio web newsp type of campaign 65 x x x 60 x x x 55 x x x radio web newsp type of campaign price (R) 65 price (R) price (R) If interactions are suspected to be present, factorial experiments are much more efficient than one-factor-at-a-time experiments. Consider again the campaign example above. If I investigated first the one factor, then in a second experiment only the second factor, I would need at least 12 experimental units (2 × 3 + 2 × 3, 2 replicates per treatment), and probably twice the amount of time. On the other hand, I could run a factorial experiment all at once, with a minimum of 9 experimental units which would allow me to estimate all main effects. I would need a minimum of 18 experimental units (two replicates for each treatment) to also estimate interaction effects. 65 x x x 60 x x x 55 x x x radio web newsp type of campaign In a factorial experiment one can estimate main effects even if treatments are not replicated. See Figure 7.3. On the LHS, I can estimate the average response (number of items sold) with a web campaign. This will give me the main effect of web campaign when compared to the overall mean, i.e. on average, what happens with web campaign. In other words, the main effects measure the average change in response with the particular level, averaged over all levels of the other factors, averaged over all levels of price in this example. Similarly, I can estimate the main effect of price R55, by taking the average response with R55 over all levels of campaign type. This is sometimes called hidden replication: even though the treatments are not replicated, the levels of each factor are. Can I estimate the interaction effects when there treatments are not replicated? In an a × b factorial experiment, there are a × b interaction effects, one for every treatment. The interaction effect measures how different the mean response is relative to the sum of the main effects (µ + αi + β j ). 119 sta2005s: design and analysis of experiments 120 Consider the RHS plot in Figure 7.3, and the typical model for a factorial experiment with 2 treatment factors: Yijk = µ + αi + β j + (αβ)ij + eijk In order to estimate the interaction effect (αβ)ij , we need to compare the mean response at this treatment to µ + αi + β j (the sum of the main effects). But there is only one observation here, and we need this observation to estimate eijk , i.e. the experimental unit and interaction effect are confounded here. The only solution to this is to have replication at that level. For example, if we want to estimate the interaction effect of newspaper campaign with a price of R65, we need to have replication at newspaper and R65 (and every other campaign x price treatment). One always needs to be able to estimate an error term (experimental unit effect). If there is only one observation per treatment, we need to assume that the effect of newspaper is the same for all price levels. Then we can estimate an average (main) effect for newspaper. But we cannot test or establish whether the effect of newspaper is different in the different price levels. 
Can I estimate the interaction effects when the treatments are not replicated? In an a × b factorial experiment there are a × b interaction effects, one for every treatment. The interaction effect measures how different the mean response is relative to the sum of the main effects (µ + αi + βj).

Consider the RHS plot in Figure 7.3, and the typical model for a factorial experiment with 2 treatment factors:

Yijk = µ + αi + βj + (αβ)ij + eijk

In order to estimate the interaction effect (αβ)ij, we need to compare the mean response at this treatment to µ + αi + βj (the sum of the main effects). But there is only one observation here, and we need this observation to estimate eijk, i.e. the experimental unit effect and the interaction effect are confounded here. The only solution to this is to have replication at that level. For example, if we want to estimate the interaction effect of a newspaper campaign with a price of R65, we need to have replication at newspaper and R65 (and at every other campaign × price treatment).

One always needs to be able to estimate an error term (experimental unit effect). If there is only one observation per treatment, we need to assume that the effect of newspaper is the same at all price levels; then we can estimate an average (main) effect for newspaper. But we cannot test or establish whether the effect of newspaper differs between the price levels. If I want to estimate the effect of a particular combination of factor levels, over and above the average effects (i.e. the interaction effect), then I need replication at that combination of treatment levels.

Sums of Squares

SStotal (abn − 1)
├─ SStreatment (ab − 1)
│    ├─ SS_A (a − 1)
│    ├─ SS_B (b − 1)
│    └─ SS_AB ((a − 1)(b − 1))
└─ SSerror (ab(n − 1))

Figure 7.3: Breakdown of the total sum of squares in a completely randomized factorial experiment, with two treatment factors.

To understand the sums of squares and the corresponding degrees of freedom, think in terms of the design that was used. For example, the diagram above shows the break-down of the total sum of squares in a completely randomized factorial experiment. Exactly as in a CRD, the total sum of squares is split into error and treatment sums of squares. There are a × b treatments, thus ab − 1 treatment degrees of freedom. There are abn experimental units in total (ab treatments, each replicated n times), thus abn − 1 total degrees of freedom. Sums of squares for main effects are as before, and the interaction degrees of freedom are just the remainder (or again the typical cross-tabulation degrees of freedom, (a − 1)(b − 1), that you have come across in the RBD and in the chi-squared test for independence).

The treatment mean is calculated as the average of the n observations for treatment ij, as before, and the interaction effect is estimated as

(αβ)ˆij = Ȳij· − (µ̂ + α̂i + β̂j) = Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···

7.4 The Design of Factorial Experiments

Note that 'factorial experiment' refers to the treatment structure and is not one of the basic experimental designs. Factorial experiments can be conducted as any of the 3 designs we have seen in earlier chapters.

Replication and Randomisation

1. To get a proper estimate of σ², more than one observation must be taken on each treatment - i.e. we must have replication. In factorial experiments a replication is a replication of all treatments (all factor level combinations).

2. The same number of units should be assigned to each treatment combination. The total sum of squares can then be uniquely split into components associated with the main effects of the factors and their interactions. The sums of squares are independent. This allows us to assess the effect of factor A, say, independently of factor B, etc. If unequal numbers of units are assigned to the treatments, the simplicity of the ANOVA and its interpretation breaks down. There is no longer a unique split of the total sum of squares into independent sums of squares associated with each factor: the sum of squares for factor A will depend upon whether or not factor B has been fitted. The conclusions that can be drawn are not as clear as they are with a balanced design. Unbalanced designs are difficult to analyse.

3. Factorial designs can generate a large number of treatments. If factor A has 2 levels, B has 3 and C has 4, then there are 24 treatments. If three replications are made, then 72 experimental units will be needed. If there are not sufficient homogeneous experimental units, they can be grouped into blocks. Each replication could be made in a different block. Incomplete factorial designs are available, in which only a carefully selected number of treatments are used. See any of the recommended texts for details.

Why are factorial experiments better than experimenting with one factor at a time?
Consider the following simple example. Suppose the yield of a chemical process depends on 2 factors: the temperature (T) at which the reaction takes place, and the pressure (P) at which the reaction takes place. Suppose T has 2 levels, T1 and T2, and P has 2 levels, P1 and P2.

Suppose we experiment with one factor at a time. We would need at least 3 observations to give us information on both factors. Thus we would observe Y at T1P1 and at T2P1, which would measure a change in temperature only, because the pressure is kept constant. If we then observed Y at T1P2 we could measure the effect of a change in pressure only. The results of the experiment could be tabulated as follows, where (1), (2) and (3) represent the observations:

        T1     T2
P1     (1)    (2)     ← change temperature only
P2     (3)            ← change pressure only

The effect of a change of temperature is given by (2) − (1). The effect of a change of pressure is given by (3) − (1). To find an estimate of the experimental error we must duplicate all of (1), (2) and (3). We then measure the effects of the factors by the appropriate averages of the readings, and also estimate σ² from the duplicate readings. Hence, for effect and error estimates, we need at least 6 readings.

For a factorial experiment with the above treatments, we consider every treatment combination:

        T1     T2
P1     (1)    (2)
P2     (3)    (4)

i. The effect of a change of temperature at P1 is given by (2) − (1).
ii. The effect of a change of temperature at P2 is given by (4) − (3).
iii. The effect of a change of pressure at T1 is given by (3) − (1).
iv. The effect of a change of pressure at T2 is given by (4) − (2).

If there is no interaction, i.e. the effect of changing temperature does not depend on the pressure level, then the estimates (i.) and (ii.) differ only by experimental error, and their average gives the effect of temperature just as precisely as the duplicated observations (1) and (2) did in the one-factor experiment. The same is true for the pressure effect. Hence, if the factors do not interact, we can obtain as much information with 4 observations in a factorial experiment as we did with 6 observations varying one factor at a time. This is because all 4 observations are used to measure each effect in the factorial experiment, whereas in the one-factor-at-a-time experiment only 2/3 of the observations are used to estimate each effect.

Suppose the factors interact. If we experiment with one factor at a time we have the situation shown above. We see that T1P2 and T2P1 give higher yields than T1P1. Could we assume that T2P2 would be better than both? This would be true if the factors didn't interact. If they interact, then T2P2 might be very much better than both T1P2 and T2P1, or it may be very much worse. The "one factor at a time" experiments do not tell us this, because we do not experiment at the most favourable (or least favourable) combination.

Performing Factorial Experiments

Suppose we investigate the effects of 2 factors, A and B, where A has a levels and B has b levels. The a × b treatments are arranged in a factorial design and the design is replicated n times. The a × b × n experimental units are assigned to the treatments as in the completely randomised design, with n units assigned to each treatment.

Layout of the factorial experiment: an a × b table with rows A1, ..., Aa and columns B1, ..., Bb, in which every cell contains one observation from each replicate (e.g. x from the 1st replicate, x̄ from the 2nd, x̃ from the 3rd, etc.).

The entire design should be completed, then a second replicate made, etc. This is relatively easy to do in agricultural experiments, since the replicates would be made simultaneously on different plots of land. In chemical, industrial or psychological experiments there is a tendency for a treatment combination to be set up and a number of observations made before passing on to the next treatment. This is not replication, and if the experiment is analysed as though it were, the estimate of the error variance will be too small. Performed this way, the observations within a treatment are correlated, and a different analysis should be used. See Winer (pg. 391-394) for details.

7.5 Interaction

Consider the true means µij of the (ij)th treatment combinations, with their marginal means:

        B1    B2   ...   Bj   ...   Bb
A1
A2                        .
 .                        .
Ai      . . . . . . .    µij        µi·
 .                        .
Aa                        .
                         µ·j        µ··

µij = true mean of the (ij)th treatment combination
µ·j = true mean of the jth level of B
µi· = true mean of the ith level of A
µ·· = true overall mean

Then

Main effect of the ith level of A                   = µi· − µ··        (7.1)
Effect of the ith level of A at the jth level of B  = µij − µ·j        (7.2)

If there is no interaction, Ai will have the same effect at each level of B. If there is interaction, it can be measured by the difference (7.2) − (7.1):

µij − µ·j − µi· + µ·· = (αβ)ij        (7.3)

The same formula for the AiBj interaction would be found if we started with the main effect of the jth level of B (µ·j − µ··) and compared it with the effect of Bj at the ith level of A, µij − µi·. From equation (7.3) we see that the interaction involves every cell in the table, so if any cells are empty (i.e. there are no observations on that treatment), the interaction cannot be found.

In practice we have random error as well. Replacing the true means in (7.3) by their sample means, we estimate the interaction by

Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···        (7.4)

7.6 Interpretation of results of factorial experiments

No interaction

Very rarely, if we a-priori do not expect the presence of any interactions, we would fit the following model:

Yijk = µ + αi + βj + eijk

Interpretation is very simple. No interaction means that the factors act on the response independently of each other. Apart from random variation, the difference between observations corresponding to any level of A is the same for all levels of B, and vice versa. The main effects summarise the whole experiment.

Interaction present

Most often, interactions are one of the main points of interest in factorial experiments, and we would fit the following model:

Yijk = µ + αi + βj + (αβ)ij + eijk

The plots of the means of B at each level of A may interweave or diverge, since the mean of B depends on what level of A is present. If interactions are present, main effects estimate the average effect (averaged over all levels of the other factor). For example, if α1 > α2, we must say that averaged over B, α1 > α2; but for some levels of B it may happen that α1 < α2.

Interaction plots are very useful when interpreting results in the presence of interaction effects. These plots can give a good indication of patterns and could be used to answer some of the following or similar questions (a small plotting sketch follows the list):

• What is the best/worst combination?
• How does the effect of A change with increasing levels of B?
• Which levels are always better than other levels, regardless of the level of B?
• What is the effect of A at low/medium/high levels of B?
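Such a plot can be drawn in R with interaction.plot; a minimal sketch, assuming a data frame d with response y and factors A and B (these names are hypothetical):

----------------------------------------------------------------------------
# Mean response at each A x B combination, one line per level of B
with(d, interaction.plot(x.factor = A, trace.factor = B, response = y,
                         xlab = "A", trace.label = "B",
                         ylab = "mean response"))
----------------------------------------------------------------------------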
Some of these questions should be part of your a-priori hypothesis. In statistical reports, however, plots are not enough. For a report on any of these questions, we would rephrase the question in the form of a contrast, and give a confidence interval to back up our statement.

Sometimes a large interaction indicates a non-linear relationship between the response and the treatment factors. In this case a non-linear transformation of the response variable (e.g. a log-transformation) may produce a model with no interactions. Such transformations are called power transformations. These are of the form

Z = (y^λ − 1)/λ    λ ≠ 0
Z = log(y)         λ = 0

Special cases of these include the square root transformation and the log transformation. A log transformation of the observations means that we really have a multiplicative model

Yijk = (e^µ)(e^αi)(e^βj)(e^eijk)

instead of the linear model Yijk = µ + αi + βj + eijk. If the data are transformed for analysis, then all inferences, such as mean differences and confidence intervals, are calculated from the transformed values. Afterwards, these quantities are back-transformed and the results expressed in terms of the original data. The value of λ has to be found by trial and error, or it can be estimated by maximum likelihood. A transformation can sometimes make the experiment difficult to interpret.

Alternatively, if the interaction is very large, the data can be analysed as a one-way layout in one factor with nb observations per level, or as a completely randomised design with ab treatments and n observations per treatment.

When experimenting with more than 2 factors, higher-order interactions may be present, for example a 3-factor interaction or a 4-factor interaction. Higher-order interactions are difficult to interpret, and a direct interpretation in terms of interactions is rarely enlightening. A good discussion of principles to follow when higher-order interactions are present is given by Cox (1984). If higher-order interactions are present, he recommends attempting one or more of the following approaches:

1. transformation of the response;

2. fitting a non-linear model rather than a linear model;

3. abandoning the factorial representation of the treatments in favour of a few possibly distinctive factor combinations (this will in effect group together certain cells in the table);

4. splitting the factor combinations on the basis of one or more factors, e.g. considering AB for each level of C;

5. adopting a new system of factors for the description of the treatment combinations.
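The maximum-likelihood estimate of the transformation parameter λ mentioned above can be obtained with the boxcox function in the MASS package; a sketch, again assuming a hypothetical data frame d with a positive response y and factors A and B:

----------------------------------------------------------------------------
library(MASS)
# Profile log-likelihood for the power-transformation parameter lambda
bc <- boxcox(y ~ A + B, data = d, lambda = seq(-1, 2, by = 0.1))
lambda.hat <- bc$x[which.max(bc$y)]   # lambda maximising the likelihood
lambda.hat
----------------------------------------------------------------------------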
7.7 Analysis of a 2-factor experiment

Let Yijk be the kth observation on the (ij)th treatment combination. The full model is:

Yijk = µ + αi + βj + (αβ)ij + eijk        i = 1, ..., a;  j = 1, ..., b;  k = 1, ..., n

∑i αi = ∑j βj = ∑i (αβ)ij = ∑j (αβ)ij = 0

µ      = general mean
αi     = main effect of the ith level of A
βj     = main effect of the jth level of B
(αβ)ij = interaction between the ith level of A and the jth level of B

Note that (αβ) is a single symbol and does not mean that the interaction is a product of the two main effects. The "sum to zero" constraints are defined as part of the model, and ensure that the parameter estimates subject to these constraints are unique. Other commonly used constraints are the "corner-point" constraints α1 = β1 = (αβ)1j = (αβ)i1 = 0. Again, these estimates are unique subject to the constraints, but different from those given by the "sum to zero" constraints.

The maximum likelihood/least squares estimates are found by minimizing

S = ∑ijk (Yijk − µ − αi − βj − (αβ)ij)²        (7.5)

Differentiating with respect to each of the (ab + a + b + 1) parameters and setting the derivatives equal to zero gives

∂S/∂µ      = −2 ∑ijk (Yijk − µ − αi − βj − (αβ)ij) = 0
∂S/∂αi     = −2 ∑jk  (Yijk − µ − αi − βj − (αβ)ij) = 0    i = 1, ..., a
∂S/∂βj     = −2 ∑ik  (Yijk − µ − αi − βj − (αβ)ij) = 0    j = 1, ..., b
∂S/∂(αβ)ij = −2 ∑k   (Yijk − µ − αi − βj − (αβ)ij) = 0    i = 1, ..., a;  j = 1, ..., b

Using the side conditions, the normal equations are

abnµ                      = Y···
bnµ + bnαi                = Yi··        i = 1, ..., a
anµ + anβj                = Y·j·        j = 1, ..., b
nµ + nαi + nβj + n(αβ)ij  = Yij·        i = 1, ..., a;  j = 1, ..., b

The solutions to these equations are the least squares estimates

µ̂       = Ȳ···
α̂i      = Ȳi·· − Ȳ···                    i = 1, ..., a
β̂j      = Ȳ·j· − Ȳ···                    j = 1, ..., b
(αβ)ˆij = Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···

An unbiased estimator of σ² is given by

s² = ∑ijk (Yijk − µ̂ − α̂i − β̂j − (αβ)ˆij)² / [ab(n − 1)] = ∑ijk (Yijk − Ȳij·)² / [ab(n − 1)]        (7.6)

Note that s² is obtained by pooling the within-cell variances and could be written as

s² = ∑ij (n − 1)s²ij / [ab(n − 1)]

where s²ij is the estimated variance in the (ij)th cell.

7.8 Testing Hypotheses

When interpreting an ANOVA table, one should always start with the highest-order interactions. If strong interaction effects are present, the interpretation of main effects needs to take this into account. For example, if there is no evidence for main effects of factor A, this DOES NOT mean that factor A does not affect the response.

• H_AB : (αβ)ij = 0 for all i and j    (factors A and B do not interact)
• H_A  : αi = 0,  i = 1, ..., a        (factor A has no effect)
• H_B  : βj = 0,  j = 1, ..., b        (factor B has no effect)

The alternative hypothesis is, in each case, that at least one of the parameters in H is non-zero. The F-test for each of these hypotheses effectively compares the full model to one of the three reduced models:

1. Yijk = µ + αi + βj + (αβ)ij + eijk    (the full model)
2. Yijk = µ + αi + βj + eijk             which is called the additive model
3. Yijk = µ + αi + eijk                  which omits effects due to B
4. Yijk = µ + βj + eijk                  which omits effects due to A

The residual sum of squares from the full model provides the sum of squares for error, s², with ab(n − 1) degrees of freedom. Denote this sum of squares by SSE. To obtain the appropriate sums of squares for each of the three hypotheses, we could obtain the residual sums of squares from each of the reduced models.

To test H_AB : (αβ)ij = 0, we find µ̂, α̂i and β̂j to minimize

S = ∑ijk (Yijk − µ − αi − βj)²        (7.7)

Equating to zero,

∂S/∂µ = −2 ∑ijk (Yijk − µ − αi − βj) = 0,

and using the 'sum to zero' constraints gives µ̂ = Ȳ···. From ∂S/∂αi = −2 ∑jk (Yijk − µ − αi − βj) = 0 for i = 1, ..., a, we find

α̂i = Ȳi·· − Ȳ···,   µ̂ + α̂i = Ȳi··,   and similarly   β̂j = Ȳ·j· − Ȳ···

Note that the least squares estimates for µ, αi and βj are the same as under the full model. This is because the X′X matrix is orthogonal (or equivalently block-diagonal). This will not be the case if the numbers of observations per treatment differ (unbalanced designs).
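One consequence of this orthogonality is that, for balanced data, the sums of squares for A and B do not depend on the order in which the factors are fitted. A quick R illustration with simulated balanced data:

----------------------------------------------------------------------------
set.seed(1)
d <- expand.grid(A = factor(1:2), B = factor(1:3), rep = 1:4)  # balanced 2 x 3, n = 4
d$y <- rnorm(nrow(d), mean = as.numeric(d$A) + as.numeric(d$B))
anova(aov(y ~ A * B, data = d))   # sequential SS, A fitted first
anova(aov(y ~ B * A, data = d))   # same SS for A and B: the design is balanced
----------------------------------------------------------------------------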
The residual sum of squares under H_AB is

SSres = ∑i ∑j ∑k (Yijk − µ̂ − α̂i − β̂j)²        (7.8)

The numerator sum of squares for the F test of H_AB is given by the difference between (7.8) and the residual sum of squares from the full model. Regrouping the terms of (7.5) as

∑ijk ((Yijk − µ̂ − α̂i − β̂j) − (αβ)ˆij)²        (7.9)

which, after squaring and summing, can be written as

∑ijk (Yijk − µ̂ − α̂i − β̂j)² − n ∑ij (αβ)ˆij²        (7.10)

since the cross-product terms are zero in summation. From (7.8) and (7.10) we see that the numerator sum of squares for the F test is

SS_AB = n ∑ij (αβ)ˆij² = n ∑ij (Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···)²

and SS_AB has (a − 1)(b − 1) degrees of freedom. Hence the F test of

H_AB0 : all (αβ)ij = 0    versus    H_AB1 : at least one interaction is non-zero

is made using

MS_AB / MSE ∼ F(a−1)(b−1), ab(n−1)

where MS_AB = SS_AB / [(a − 1)(b − 1)] and MSE = SSE / [ab(n − 1)].

Similar results can be derived for the tests of H_A and H_B. Because of the orthogonality of the design, the estimates of the main effects under the reduced models are the same as under the full model. Hence we can split the total sum of squares about the grand mean,

SStotal = ∑ijk (Yijk − µ̂)² = ∑ijk (Yijk − Ȳ···)²,

uniquely as

SStotal = SS_A + SS_B + SS_AB + SSE

and the degrees of freedom as

abn − 1 = (a − 1) + (b − 1) + (a − 1)(b − 1) + ab(n − 1)

The results are summarised in an analysis of variance table:

Source            SS                                           df               MS      F           Expected Mean Square
A Main Effects    SS_A = nb ∑i (Ȳi·· − Ȳ···)²                  a − 1            MS_A    MS_A/MSE    σ² + nb ∑i αi² / (a − 1)
B Main Effects    SS_B = na ∑j (Ȳ·j· − Ȳ···)²                  b − 1            MS_B    MS_B/MSE    σ² + na ∑j βj² / (b − 1)
AB Interactions   SS_AB = n ∑ij (Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···)²   (a − 1)(b − 1)   MS_AB   MS_AB/MSE   σ² + n ∑ij (αβ)ij² / [(a − 1)(b − 1)]
Error             SSE = ∑ijk (Yijk − Ȳij·)²                    ab(n − 1)        MSE                 σ²
Total             SStotal = ∑ijk (Yijk − Ȳ···)²                abn − 1

Table 7.1: Analysis of variance table for a two-factor factorial experiment.

The expected mean squares are found by replacing the observations in the sums of squares by their expected values under the full model, dividing by the degrees of freedom and adding σ². For example,

SS_A = nb ∑i (Ȳi·· − Ȳ···)²

Now (Ȳi·· − Ȳ···) = α̂i and E(α̂i) = αi, since the least squares estimates are unbiased. Hence

E(MS_A) = σ² + nb ∑i αi² / (a − 1)

7.9 Power analysis and sample size

The non-centrality parameters for the F-tests are:

H_A :   λ = nb ∑ αi² / σ²         and the non-central F has (a − 1) and ab(n − 1) df
H_B :   λ = na ∑ βj² / σ²         and the non-central F has (b − 1) and ab(n − 1) df
H_AB :  λ = n ∑ (αβ)ij² / σ²      and the non-central F has (a − 1)(b − 1) and ab(n − 1) df

The non-centrality parameters can be used to determine the number of replicates necessary to achieve a given power for certain specified configurations of the parameters. The error degrees of freedom are ab(n − 1), where a is the number of levels of A and b is the number of levels of B. As a rough rule of thumb, we should aim to have enough replicates to give about 20 degrees of freedom for error. In higher-way layouts, where some of the interactions may be zero, we can allow even fewer degrees of freedom for error. In practice, the number of replications possible is often determined by the amount of experimental material and the time and resources of the experimenter. Nonetheless, a few power calculations are helpful, especially if the F test fails to reject the null hypothesis: the reason for this could be that the F test had insufficient power.
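As an illustration, power for the interaction test can be computed exactly as in Chapter 6; the design dimensions and effect sizes below are hypothetical:

----------------------------------------------------------------------------
a <- 3; b <- 3; n <- 3          # design dimensions (hypothetical)
sigma2  <- 4                    # assumed error variance
sum.ab2 <- 6                    # assumed value of sum_ij (alpha beta)_ij^2
ncp   <- n * sum.ab2 / sigma2   # non-centrality parameter for H_AB
df1   <- (a - 1) * (b - 1)
df2   <- a * b * (n - 1)
fcrit <- qf(0.95, df1, df2)     # 5% critical value under H0
1 - pf(fcrit, df1, df2, ncp)    # power of the interaction F test
----------------------------------------------------------------------------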
7.10 Multiple Comparisons for Factorial Experiments

If the interactions were significant, it makes sense to compare the ab cell means (treatment combinations). If no interactions were found, one can compare the levels of A and the levels of B separately. If the treatment levels are ordered, it is preferable to test effects using orthogonal polynomials. Both the main effects and the interactions can be decomposed into orthogonal polynomial contrasts.

Questions:

1. On how many observations is each of the cell means based? What is the standard error for the difference between two cell means?

2. If we compare levels of factor A only, on how many observations are the means based? What is the standard error for a difference between two means now?

3. What are the degrees of freedom in the two cases above?

7.11 Higher Way Layouts

For p factors we have

(p choose 1) = p main effects A, B, C, etc.
(p choose 2) 2-factor interactions
(p choose 3) 3-factor interactions
...
(p choose p) = 1 p-factor interaction

If there are n > 1 observations per cell, we can split SStotal into 2^p − 1 component SS's and an error SS. If p > 4 or 5, factorial experiments are very difficult to carry out.

7.12 Examples

Example 2

A small experiment has been carried out investigating the response of a particular crop to two nutrients, N and K. A completely randomised design was used with six treatments arranged in a 3 × 2 factorial structure. N was applied at 0, 4 and 8 units, whilst K was applied at 0 and 4 units. The yields were as follows:

         K0                      K4
N0    10.02 11.74 13.27      10.72 14.08 11.87
N4    20.65 18.88 16.92      19.33 20.77 21.70
N8    19.47 20.06 20.74      21.45 20.92 24.87

             Df   Sum Sq   Mean Sq   F value   Pr(>F)
N             2   298.19    149.09     57.84   0.0000
K             1    10.83     10.83      4.20   0.0629
N:K           2     2.49      1.24      0.48   0.6286
Residuals    12    30.93      2.58

Table 7.2: ANOVA table for the nitrogen-potassium factorial experiment.

There is no evidence for an interaction between N and K (p = 0.63). There is strong evidence that different N (nitrogen) levels affect yield (p < 0.0001), but only little evidence that yield differs with different levels of K (potassium) (p = 0.063).

[Figure 7.4: Interaction plot for the nitrogen and potassium factorial experiment: mean yield against N (0, 4, 8), with one line per level of K.]

Note that the levels of nitrogen are equally spaced. As a next step we could fit orthogonal polynomials to see if the relationship between yield and nitrogen is quadratic (levels off), as perhaps suggested by the interaction plot. From the interaction plot it seems that perhaps the effect of K increases with increasing N, but the differences are too small to say anything with certainty (about K) from this experiment.

Example 1: Bond Strength cont.

Return to the bond strength example from the beginning of the chapter.

Model

Yijk = µ + αi^A + βj^B + (αβ)ij^AB + eijk

∑i αi = ∑j βj = ∑i (αβ)ij = ∑j (αβ)ij = 0

Analysis of Bond Strength of Glass-Glass Assembly

Cell means (and standard deviations):

Adhesive   Cross-lap      Square-Centre   Round-Centre
047        17.2 (2.2)     18.0 (3.5)      16.8 (3.3)
00T        20.6 (1.81)    18.8 (4.6)      24.6 (2.9)
001        22.4 (6.4)     21.8 (7.1)      16.4 (2.0)

Cross-lap and square-centre assembly with adhesive 001 appear to be more variable than the other treatments, but this is not significant. A modern robust test for the homogeneity of variances across groups is Levene's test. It is based on absolute deviations from the group medians. It is available in R as leveneTest from package car.

-----------------------------------------------------------
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  8  1.0689  0.406
      36
-----------------------------------------------------------

The null hypothesis is that all variances are equal; there is thus no evidence that the variances differ.

Here is the ANOVA table:

---------------------------------------------------
                  Df Sum Sq Mean Sq F value  Pr(>F)
adhesive           2 127.51  63.756  3.6432 0.03623
assembly           2   4.98   2.489  0.1422 0.86791
adhesive:assembly  4 196.09  49.022  2.8013 0.04015
Residuals         36 630.00  17.500
---------------------------------------------------

Adhesive and assembly method interact in their effects on strength (p = 0.04). So we look no further at the main effects, but instead look at the interaction plots (Figure 7.5) to give us an idea of how these factors interact to influence strength.

[Figure 7.5: Interaction plots showing (a) the mean bond strength for each adhesive at different levels of assembly, and (b) the mean bond strength for each assembly method at different levels of adhesive.]

The round-centre assembly method works well with adhesive 00T (best of all combinations), but relatively poorly with the other two adhesives. The other two assembly methods seem to work best with adhesive 001, intermediate with 00T, and worst with 047. We could now do some post-hoc tests (with corrections for multiple testing) to see, for example, whether the two assembly methods (square-centre and cross-lap) make any difference.
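For instance, Tukey comparisons of the nine cell means could be obtained as follows; this is a sketch, assuming a hypothetical data frame bond with columns strength, adhesive and assembly:

----------------------------------------------------------------------------
fit <- aov(strength ~ adhesive * assembly, data = bond)
# Tukey-adjusted pairwise comparisons of the 3 x 3 = 9 cell means
TukeyHSD(fit, which = "adhesive:assembly")
----------------------------------------------------------------------------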
Questions:

1. What experimental design do you think has been used?

2. Refer back to the ANOVA table, to the assembly line. Does the large p-value imply that assembly method has no effect on strength? With the help of the interaction plots, briefly explain.

Example 3

A hypothetical experiment from Dean and Voss (1999): In 1994 the Statistics Department at a university introduced a self-paced version of a data analysis course that is usually taught by lecture. Suppose the department is interested in student performance with each of the 2 methods, and also in how student performance is affected by the particular instructor teaching the course. The students are randomly assigned to one of the six treatments.

Figure 7.6 shows different hypothetical interaction plots that could result from this study. The y-axis shows student performance, the x-axis shows instructor. In which of these do method and instructor interact?

[Figure 7.6: Possible configurations of effects present for two factors, Instructor (1, 2, 3) and Teaching Method (L = lecture, SP = self-paced): a grid of interaction plots of mean student performance against instructor, with one line per method.]
8 Some other experimental designs and their models

8.1 Fixed and Random effects

So far we have assumed that the treatments used in an experiment are the only ones of interest. We aimed to estimate the treatment means and to compare differences between treatments. If the experiment were to be repeated, the same treatments would be used again. This means that each factor used in defining the treatments would have the same levels. When this is the case, we say that the treatments, or the factors defining them, are fixed. The model is referred to as a fixed effects model.

The simplest fixed effects model is the completely randomised design, or one-way layout, in which a treatments are compared. N experimental units are randomly assigned to the a treatments, usually with n subjects per treatment, and the jth observation on the ith treatment has the structure

Yij = µ + αi + eij        i = 1, ..., a;  j = 1, ..., n        (8.1)

∑ αi = 0

where

µ   = general mean
αi  = effect of treatment i
eij = random error such that eij ∼ N(0, σ²)

Example: The Department of Clinical Chemistry was interested in comparing the measurements of cholesterol made by 4 different laboratories in the Western Cape. Since the Department was only interested in these four laboratories, if they decided to repeat the experiment they would send samples to the same four laboratories. Ten samples of blood were taken; each sample was divided into four parts, and one part was sent to each laboratory. The cholesterol determinations were returned to the Department, and the results were analysed using a one-way analysis of variance (model (8.1)). Here the parameter αi measures the effect of the ith laboratory. Significant differences between the αi's would mean that some laboratories tend to return, on average, higher values of cholesterol than others.

Now consider the other situation. Suppose the Department believes that the determination of cholesterol levels varies from laboratory to laboratory about some mean value, and they want to measure the amount of variation. Now they are not interested in any particular laboratory, so they select 4 laboratories at random from a list of all laboratories that perform such analyses, and send each laboratory ten samples as before. If this experiment were repeated, there is very little chance of the same four laboratories being used, so the effect of the laboratory is random. We now write a model for the jth determination from the ith laboratory as:

Yij = µ + ai + eij        (8.2)

Here we assume that ai is a random variable such that E(ai) = 0 and Var(ai) = σa², where σa² measures the component in the variance of Yij that is due to laboratory. We also assume that ai is independent of eij. The term eij has E(eij) = 0 and Var(eij) = σe². Hence

E(Yij) = µ    and    Var(Yij) = σa² + σe²        i = 1, ..., a;  j = 1, ..., n

This is called the random effects model, or the variance components model.

To distinguish between a random and a fixed factor, ask yourself this question: If the experiment were repeated, would I observe the same levels of A? If the answer is Yes, then A is fixed. If No, then A is random.

Note that in the random effects model the means of the observations do not have a structure. In more complex situations it is possible to formulate models in which some effects are fixed and others are random; in this case a structure would be defined for the mean. For a two-way classification with A fixed and B random,

Yij = µ + αi^A + bj + eij        (8.3)

where αi is the fixed effect of A, and bj is the value of a random variable with E(bj) = 0 and Var(bj) = σb², due to the random effect of B. The model (8.3) is called a mixed effects model.
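As an aside, mixed models such as (8.3) are commonly fitted in R with the lme4 package; a sketch, assuming a hypothetical data frame d with columns y, A and B:

----------------------------------------------------------------------------
library(lme4)
# A fixed, B random: the intercept varies randomly across the levels of B
fit <- lmer(y ~ A + (1 | B), data = d)
summary(fit)   # shows the variance component for B and the residual variance
----------------------------------------------------------------------------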
For a two-way classification with A fixed and B random 139 sta2005s: design and analysis of experiments Yij = µ + αiA + b j + eij (8.3) where αi is a fixed effect of A, and b j is the value of a random variable with E(b j ) = 0 and Var (b j ) = α2b , due to the random effect of B. The model (9.3) is called a mixed effects model. 8.2 The Random Effects Model Assume that the levels of factor A are a random sample of size a from a large population of levels. Assume n observations are made on each level of A. Let Yij be the jth observation on the ith level of A, then Yij = µ + ai + eij where µ ai eij ai and eij is the general mean is a random variable with E( ai ) = 0 and Var ( ai ) = σa2 is a random variable with E(eij ) = 0 and Var (eij ) = σe2 are uncorrelated Further it is usually assumed that ai and eij are normally distributed. The Analysis of Variance table is set up as for the one-way fixed effects model. To test hypotheses about σa2 or to calculate confidence intervals, we need • an unbiased estimate of σe2 • an unbiased estimate of σa2 The fixed and random effects models are very similar and we can show that the fixed effects MSE provides an unbiased estimate of σe2 , i.e. E( MSE) = σ2 (the proof is left as an exercise). Does the fixed-effects mean square for treatments, MS A , provide an unbiased estimate for σa2 ? Not quite! However, we can show that MS A is an unbiased estimator for nσa2 + σe2 , i.e. E( MS A ) = nσa2 + σe2 . Then E( MS A − MSE ) = σa2 n NOTE: this estimator can be negative, even though σa2 cannot be zero. This will happen when MSE > MS A and is most likely to happen when σa2 is close to zero. If MSE is considerably greater than MS A , the model should be questioned. 140 sta2005s: design and analysis of experiments 141 Testing H0 : σa2 = 0 versus Ha : σa2 > 0 Can we use the same test as we used in the fixed effects model for testing the equality of treatment effects? If H0 is true, then σa2 =⇒ E( MS A ) =⇒ E( MS A ) E( MSE) = 0 = 0 + σe2 = σe2 = E( MSE) = 1 However, if σa2 is large, the expected value of the numerator is larger than the expected value of the denominator and the ratio should be large and positive, which is a similar situation to the fixed effects case. Source SS between groups within groups (error) total ∑i ni (Y i· − Y ·· )2 ∑i ∑ j (Yij − Y i· )2 ∑i ∑ j (Yij − Ȳ.. )2 Does that MS A MSe df Mean Square F EMS a−1 N−a N−1 SS A a −1 SSe N −a MS A MSe nσa2 + σe2 σ2 ∼ Fa−1;N −a under H0 ? We can show (see exercise below) SS A ∼ χ2a−1 nσa2 + σe2 and that SSE ∼ χ2N −a σe2 and that SS A and SSE are independent (Cochran’s Theorem). Under H0 , σa2 = 0 and MS A (nσa2 +σe2 ) MSe σe2 Under H1 , Exercise: MS A MSE Table 8.1: Anova for simple random effects model, with expected mean squares (EMS): = MS A MSE ∼ Fa−1,n− a has a non-central F distribution. sta2005s: design and analysis of experiments 1. Show that Cov(Yis , Yit ) = σa2 , where Yis and Yit are two observations in group i. 2. Use the above to show that (for the simple random effects model) 2 Var (Ȳi. ) = σa2 + σn . This implies that the observed variance i between group means does not directly estimate σa2 . 3. Hence show that of (Ȳi. − Ȳ.. )2 ] SS A nσa2 +σe2 ∼ χ2a−1 . [Hint: Consider the distribution Expected Mean Squares for the Random Effects Model SSE = ∑ ∑ Yij2 − ∑ ni Ȳi.2 i Now for any random variable X Var ( X ) = E( X 2 ) − [ E( X )]2 Then E[Yij2 ] = Var (Yij ) + [ E(Yij ]2 = σa2 + σe2 + µ2 Ȳi. = µ + ai + Var (Ȳi. ) E(SSE) 1 ni = σa2 + ∑ eij σ2 ni E(Ȳi. 
Expected Mean Squares for the Random Effects Model

    SSE = ∑_i ∑_j Y_ij² − ∑_i n_i Ȳ_i·²

For any random variable X, Var(X) = E(X²) − [E(X)]². Then

    E(Y_ij²) = Var(Y_ij) + [E(Y_ij)]² = σ_a² + σ_e² + µ².

Also

    Ȳ_i· = µ + a_i + (1/n_i) ∑_j e_ij,

so that E(Ȳ_i·) = µ, Var(Ȳ_i·) = σ_a² + σ_e²/n_i, and hence

    E(Ȳ_i·²) = σ_a² + σ_e²/n_i + µ².

Therefore

    E(SSE) = ∑_i ∑_j (σ_a² + σ_e² + µ²) − ∑_i n_i (σ_a² + σ_e²/n_i + µ²)
           = Nσ_e² − νσ_e²
           = (N − ν)σ_e²,

so E(MSE) = σ_e². Above, N = ∑ n_i, and ν is the number of random effect levels.

Similarly,

    SSA = ∑_i n_i Ȳ_i·² − N Ȳ··²,

where

    Ȳ·· = µ + (1/N) ∑_i n_i a_i + (1/N) ∑_i ∑_j e_ij,

so that

    E(Ȳ··) = µ,   Var(Ȳ··) = (∑_i n_i²/N²) σ_a² + σ_e²/N,

and, as before, E(Ȳ_i·) = µ and Var(Ȳ_i·) = σ_a² + σ_e²/n_i. Then

    E(SSA) = ∑_i n_i (σ_a² + σ_e²/n_i + µ²) − N [(∑_i n_i²/N²) σ_a² + σ_e²/N + µ²]
           = (N − ∑_i n_i²/N) σ_a² + (ν − 1) σ_e²

and

    E(MSA) = c σ_a² + σ_e²,   where  c = (N − ∑_i n_i²/N)/(ν − 1).

Thus

    E[(MSA − MSE)/c] = σ_a².

If all n_i = n, then c = n.

Rather than testing whether or not the variance of the population of treatment effects is zero, one may want to test whether this variance is equal to (or less than) some proportion of the error variance, i.e.

    H0: σ_a² = θ0 σ_e²,  for some constant θ0.

We can use the same F statistic, but reject H0 if F > (1 + nθ0) F_{a−1; N−a}(α).

Variance Components

Usually, estimation of the variance components is of greater interest than the tests. We have already shown that

    σ̂_a² = (MSA − MSE)/n   and   σ̂_e² = MSE.

An ice cream experiment

To determine whether or not flavours of ice cream melt at different speeds, a random sample of three flavours was selected from a large population of flavours. The three flavours of ice cream were stored in the same freezer in similar-sized containers. For each observation, one teaspoonful was taken from the freezer, transferred to a plate, and the melting time at room temperature was observed to the nearest second. Eleven observations were taken on each flavour:

    Flavour   Melting times (sec)
    1         24   1125  891   994   817   846  876  1075  982   960   1032
    2         840  1150  1066  1041  889   844  848  1053  977   1135  967
    3         841  848   1041  886   1019  838  785  832   1037  1093  823

ANOVA table:

    Source     df    SS             MS            F       p
    Flavour    2     173 009.8788   86 504.9394   12.76   0.0001
    Error      30    203 456.1818   6 781.8727
    Total      32    376 466.0306

An unbiased estimate of σ_e² is σ̂_e² = MSE = 6 781.8727 sec². An unbiased estimate of σ_a² is

    σ̂_a² = (MSA − MSE)/n = (86 504.9394 − 6 781.8727)/11 = 7 247.5515 sec².

H0: σ_a² = 0 vs Ha: σ_a² > 0 can be tested using F = 12.76, p = 0.0001, where the p-value comes from the F_{2;30} distribution.

In such an experiment there will be a lot of error variability in the data, due to fluctuations in room temperature and the difficulty of determining the exact time at which the ice cream has melted completely. Hence variability in melting times between flavours (σ_a²) is unlikely to be of interest unless it is larger than the error variance:

    H0: σ_a² ≤ σ_e² vs Ha: σ_a² > σ_e²,  equivalently  H0: σ_a²/σ_e² ≤ 1 vs Ha: σ_a²/σ_e² > 1.

Again use F = 12.76, but compare it with (1 + 11 × 1) F_{2;30} = 12 F_{2;30}: the 5% critical value becomes 12 × 3.32 ≈ 39.8, so there is no evidence that the variation between flavours is larger than the error variance.
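The calculations in this example are easy to script. The sketch below is a minimal illustration (the function and variable names are our own): given the melting times as an a × n array, it returns the variance component estimates and the quantities for the test of H0: σ_a² ≤ θ0 σ_e². Setting theta0=0 gives the ordinary test of σ_a² = 0, while theta0=1 carries out the comparison with 12 F_{2;30} made above.

```python
import numpy as np
from scipy import stats

def random_effects_anova(y, theta0=1.0, alpha_level=0.05):
    """One-way random effects ANOVA for a groups of equal size n (y has shape (a, n)).

    Tests H0: sigma_a^2 <= theta0 * sigma_e^2 by comparing F = MS_A/MSE
    with (1 + n*theta0) times the usual F critical value."""
    y = np.asarray(y, dtype=float)
    a, n = y.shape
    ybar_i = y.mean(axis=1)
    ms_a = n * np.sum((ybar_i - y.mean()) ** 2) / (a - 1)
    ms_e = np.sum((y - ybar_i[:, None]) ** 2) / (a * (n - 1))
    sigma_a2 = (ms_a - ms_e) / n        # may come out negative; then report 0
    f = ms_a / ms_e
    crit = (1 + n * theta0) * stats.f.ppf(1 - alpha_level, a - 1, a * (n - 1))
    return sigma_a2, ms_e, f, crit
```

Rejection occurs when f exceeds crit; with theta0=0 this reduces to the ordinary F test of Table 8.1.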
8.3 Nested Designs

Nested designs are common in sampling designs, and less common in real experiments. However, many of the principles of analysis, and the ANOVA, still apply for such carefully designed studies; for example, the data are still balanced. Nested designs that are more like real experiments occur in animal breeding studies, e.g. where a bull is crossed with a number of cows, but the cows are nested in bull; and in microbiology, where daughter clones are obtained from mother clones (bacteria or fungi).

Suppose a study is conducted to compare the domestic consumption of electricity in three cities. In each city, three streets are selected at random, and the annual amount of electricity used in three randomly selected houses in each street is recorded. This is a sampling design.

                 City 1                City 2                City 3
           S1     S2     S3      S1     S2     S3      S1     S2     S3
    H1    8876   9141   9785    9483  10049   9975    9990  10023  10451
    H2    8745   9827   9585    8461   9720  11230    9218  10777  11094
    H3    8601  10420   9009    8106  12080  10877    9472  11839  10287

At first glance these data appear to form a 3-way cross-classification (analogous to a 3-factor factorial experiment) with factors cities (C), streets (S) and houses (H). However, note the crucial difference: Street 1 in City 1 is different from Street 1 in City 2, and from Street 1 in City 3. Even though they have been given the same label, they are in fact quite different streets. To be precise, we should really label the streets as a factor with 9 levels, S1, S2, ..., S9, since there are 9 different streets. The same remarks apply to the houses. We say that the streets are nested in the cities, since to locate a street we must also state the city; likewise, the houses are nested in the streets. We denote a factor S nested in C by S(C). The effects associated with a nested factor have two subscripts, b_ij, where i denotes the level of C and j the level of S within that city.

Clearly we need another model, since the factor S (streets) is nested in the cities, and the factor H (houses) is nested in the streets. Also, since the streets and houses within each city were sampled, if the study were repeated the same houses and streets would not be selected again (assuming there is a large number of streets and houses in each city). So both S and H are random factors.

Let Y_ijk be the amount of electricity consumed by the kth household in the jth street in the ith city. Then

    Y_ijk = µ + α_i + b_ij + e_ijk,   i = 1, 2, 3;  j = 1, 2, 3;  k = 1, 2, 3,

where α_i is the city effect, b_ij the street effect and e_ijk the house effect. More precisely:

• α_i is the fixed effect of the ith city, with ∑ α_i = 0
• b_ij is the random effect of the jth street in city i, with E(b_ij) = 0 and Var(b_ij) = σ_b²
• e_ijk is the random effect of the kth house in the jth street in the ith city, with E(e_ijk) = 0 and Var(e_ijk) = σ_e²
• b_ij and e_ijk are independent.

Note that E(Y_ijk) = µ + α_i and Var(Y_ijk) = σ_b² + σ_e².

The aims of the analysis are:

1. To estimate the mean consumption in each city, and to compare the mean consumptions.
2. To estimate the variance components due to streets and to households within streets. We assume that these components are constant over the three cities.

    Ȳ···              estimates µ
    Ȳ_i·· − Ȳ···      estimates α_i,  i = 1, ..., a
    Ȳ_ij· − Ȳ_i··     measures the contribution of the jth street in city i
    Y_ijk − Ȳ_ij·     measures the contribution of the kth house in the jth street in the ith city

We can construct expressions for the ANOVA table from the identity

    (Y_ijk − Ȳ···) = (Ȳ_i·· − Ȳ···) + (Ȳ_ij· − Ȳ_i··) + (Y_ijk − Ȳ_ij·).

Squaring and summing over i, j, k, the cross products vanish on summation, and we find the sums of squares

    ∑_ijk (Y_ijk − Ȳ···)² = bn ∑_i (Ȳ_i·· − Ȳ···)² + n ∑_i ∑_j (Ȳ_ij· − Ȳ_i··)² + ∑_ijk (Y_ijk − Ȳ_ij·)².

We denote these sums of squares as

    SS_total = SS_C + SS_S(C) + SS_E,

and they have abn − 1 = (a − 1) + a(b − 1) + ab(n − 1) degrees of freedom. Here we assume that there are a cities, that b streets are sampled in each city, and that n houses are sampled in each street. Note that the last term, SS_E, should strictly speaking be written SS_H(S(C)). However, it is exactly the same expression as would be evaluated for an error sum of squares, so it is called SS_E.
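This decomposition can be computed directly from a data array. The following sketch (our own illustration) takes the observations as an a × b × n array, ordered city × street × house for the electricity data, and returns the three sums of squares.

```python
import numpy as np

def nested_sums_of_squares(y):
    """SS for the balanced nested model Y_ijk = mu + alpha_i + b_ij + e_ijk.

    y has shape (a, b, n): a cities, b streets per city, n houses per street."""
    a, b, n = y.shape
    grand = y.mean()
    city_means = y.mean(axis=(1, 2))                              # Ybar_i..
    street_means = y.mean(axis=2)                                 # Ybar_ij.
    ss_c = b * n * np.sum((city_means - grand) ** 2)              # df a-1
    ss_s = n * np.sum((street_means - city_means[:, None]) ** 2)  # df a(b-1)
    ss_e = np.sum((y - street_means[:, :, None]) ** 2)            # df ab(n-1)
    return ss_c, ss_s, ss_e
```

By the identity above, ss_c + ss_s + ss_e equals the total sum of squares ∑(Y_ijk − Ȳ···)².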
We give the ANOVA table and state the expected mean squares; a complete derivation is given in Scheffé (1959).

    Source              SS        df         MS                   EMS
    Cities (fixed)      SS_C      a−1        SS_C/(a−1)           σ_e² + nσ_b² + bn ∑ α_i²/(a−1)
    Streets (random)    SS_S(C)   a(b−1)     SS_S(C)/[a(b−1)]     σ_e² + nσ_b²
    Houses (random)     SS_E      ab(n−1)    SS_E/[ab(n−1)]       σ_e²

To test H0: α_1 = α_2 = ... = α_a = 0 against the alternative H1 that some or all of the α's differ, we refer to the EMS column and see that if H0 is true, ∑ α_i² = 0, so

    E(MS_C) = σ_e² + nσ_b² = E(MS_S(C)).

So the statistic to test H0 is

    F = MS_C / MS_S(C) ∼ F_{a−1; a(b−1)}.

Note the denominator! Further,

    σ_b² is estimated by (MS_S(C) − MS_E)/n,  and  σ_e² is estimated by MS_E.

These are method of moments estimators. The maximum likelihood estimates can also be found; see Graybill.

Calculation of the ANOVA table

No special program is needed for a balanced nested design: any program that will calculate a factorial ANOVA can be used. For our example we calculate the ANOVA table as though it were a three-way cross-classification with factors C, S and H. The ANOVA table is:

    Source        SS       df
    C (Cities)    488.4    2
    S (Streets)   1090.6   2
    H (Houses)    49.1     2
    C×S           142.6    4
    C×H           32.3     4
    S×H           592.6    4
    C×S×H         203.3    8

The sum of squares for streets (within cities) is

    SS_S(C) = SS_S + SS_CS = 1090.6 + 142.6 = 1233.3,  with 2 + 4 = 6 df.

The sum of squares for houses within streets is

    SS_H(S(C)) = SS_H + SS_CH + SS_SH + SS_CSH = 49.1 + 32.3 + 592.6 + 203.3 = 877.3,  with 2 + 4 + 4 + 8 = 18 df.

So the ANOVA table is

    Source    SS       df    MS       F
    Cities    488.4    2     244.20   1.19
    Streets   1233.3   6     205.55   4.22
    Houses    877.3    18    48.7

The F test for differences between cities is

    F = MS_C / MS_S(C) = 244.2 / 205.55 = 1.19.

This is distributed as F_{2;6} under H0 and is not significant: we conclude that there is no difference in mean consumption of electricity between the three cities.

To test H0: σ_s² = 0, use

    MS_S(C) / MS_E = 205.55 / 48.7 = 4.22 ∼ F_{6;18}.

Reject H0. To estimate σ_s², use (MS_S(C) − MS_E)/n:

    σ̂_s² = (205.55 − 48.7)/3 ≈ 52.3,  and  σ̂_e² = 48.7.

We note that the variation attributed to streets is about the same size as the variation attributed to houses. Since there is no significant difference in mean consumption between cities, we estimate the overall mean consumption as Ȳ··· = 9896.85, with

    Var(Ȳ···) = σ_s²/(ab) + σ_e²/(abn) = (nσ_s² + σ_e²)/(abn),

estimated by (3 × 52.3 + 48.7)/27 = MS_S(C)/27 ≈ 7.6.

For further reading on these designs, see Dunn and Clark (1974), Applied Statistics: Analysis of Variance and Regression.

The Nested Design: Schematic Representation

Factor B is nested in factor A. A has a levels, B has b levels, and n observations are taken on each B within A. The levels of A are fixed; the levels of B are random (sampled). For a = 3, b = 2, n = 4:

            A1              A2              A3
         B1     B2       B1     B2       B1     B2
        Ȳ_11·  Ȳ_12·    Ȳ_21·  Ȳ_22·    Ȳ_31·  Ȳ_32·
           Ȳ_1··           Ȳ_2··           Ȳ_3··
                           Ȳ···

Y_ijk is the kth observation at the jth level of B nested in the ith level of A:

    Y_ijk = µ + α_i + b_ij + e_ijk,   ∑ α_i = 0,   b_ij ~ N(0, σ_b²),   e_ijk ~ N(0, σ_e²),

with b_ij and e_ijk independent for all i, j, k. The main effect of the 2nd level of A is estimated by Ȳ_2·· − Ȳ···; the main effect b_22 of the 2nd level of B nested in the 2nd level of A is estimated by Ȳ_22· − Ȳ_2··; and the error e_222 is estimated by Y_222 − Ȳ_22·.

Factor B can also be regarded as fixed. In that case, instead of estimating an overall variance for factor B, the contribution of the levels of B within each level of A is of interest. The sum of squares for B nested in A is

    SS_B(A) = n ∑_{i=1}^{a} ∑_{j=1}^{b} (Ȳ_ij· − Ȳ_i··)²

and has a(b − 1) degrees of freedom. This can be split into a component sums of squares, each with b − 1 degrees of freedom:

    SS_B(A) = SS_B(A1) + SS_B(A2) + ... + SS_B(Aa),

where SS_B(Ai) = n ∑_{j=1}^{b} (Ȳ_ij· − Ȳ_i··)². Tests of the levels of B nested in level i of A can then be made using MS_B(Ai)/MS_E.
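When B is regarded as fixed, these per-level tests are straightforward to compute. The sketch below (our own illustration; the names are ours) returns SS_B(Ai), the statistic MS_B(Ai)/MS_E and its p-value for each level of A, with MS_E and its degrees of freedom taken from the nested ANOVA.

```python
import numpy as np
from scipy import stats

def per_level_b_tests(y, ms_e, df_e):
    """F tests of the b levels of B within each level of A (B fixed).

    y has shape (a, b, n); ms_e and df_e come from the nested ANOVA."""
    a, b, n = y.shape
    level_means = y.mean(axis=2)       # Ybar_ij.
    a_means = y.mean(axis=(1, 2))      # Ybar_i..
    results = []
    for i in range(a):
        ss_b_ai = n * np.sum((level_means[i] - a_means[i]) ** 2)  # b-1 df
        f = (ss_b_ai / (b - 1)) / ms_e
        results.append((ss_b_ai, f, stats.f.sf(f, b - 1, df_e)))
    return results
```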
Estimates of parameters of the nested design

    Parameter           Estimate                  Variance
    µ                   Ȳ···                      (nσ_b² + σ_e²)/(abn)
    µ + α_i             Ȳ_i··                     (nσ_b² + σ_e²)/(bn)
    α_i                 Ȳ_i·· − Ȳ···              (nσ_b² + σ_e²)(a−1)/(abn)
    α_1 − α_2, etc.     Ȳ_1·· − Ȳ_2··             2(nσ_b² + σ_e²)/(bn)
    ∑ h_i (µ + α_i)     ∑_{i=1}^{a} h_i Ȳ_i··     (nσ_b² + σ_e²)(∑ h_i²)/(bn)
    σ_e²                s² = MS_E                 −
    σ_e² + nσ_b²        MS_B(A)                   −
    σ_b²                (MS_B(A) − MS_E)/n        −

1. Confidence intervals for linear combinations of the means can be found; the variance is estimated by replacing nσ_b² + σ_e² with MS_B(A).

2. We have assumed that the levels of B are sampled from an infinite (or very large) "population" of levels. If there is not a large number of possible levels of B, a correction factor is included in the EMS for A. For example, if the b levels are drawn from a population of K possible levels, then

    E(MS_A) = σ_e² + n(1 − b/K)σ_b² + bn ∑ α_i²/(a−1).

8.4 Repeated Measures

A form of experimentation often used in medical and psychological studies is one in which a number of subjects are measured on several occasions, or undergo several different treatments. The aim of the experiment is to compare the treatments or occasions. It is hoped that, by using the same subjects on each treatment, more sensitive comparisons can be made, because variation between subjects is removed. The treatments are regarded as a fixed effect, and the subjects are assumed to be sampled from a large population of subjects, so we again have a mixed model. More complex experimental set-ups than the one described here are used; for details see Winer (1971). The general theory of balanced mixed models is given in Scheffé (1959), and other methods for such data are given in Hand and Crowder (1996). Repeated measures data, which is the term used to describe the data from such experiments, can also be analysed by methods of multivariate analysis. However, for relatively small numbers of subjects, the ANOVA methods described here are useful.

Example: A psychologist is studying memory retention. She takes 10 subjects, and each subject is asked to learn 50 nonsense words. She then tests each subject 4 times: 12, 36, 60 and 84 hours after learning the words. On each occasion she scores the subject's performance. The data have the form:

                          Subjects
    Times     1      2     ...     j      ...    10
    1        Y_11   Y_12
    ...
    i                              Y_ij
    ...
    4                                            Y_4,10

where Y_ij = score of subject j at time i (i = 1, ..., 4; j = 1, ..., 10). Interest centres on comparing the recall at each time. If the same subjects had not been tested each time, the design would have been a completely randomised design; because of the repeated measurement on each subject, it is not. Let

    Y_ij = µ + α_i + b_j + e_ij,   i = 1, ..., a;  j = 1, ..., b,

where

    µ    = general mean
    α_i  = effect of the ith occasion or treatment
    b_j  = effect of the jth subject
    e_ij = random error.

Assume that E(b_j) = 0 and Var(b_j) = σ_b², that E(e_ij) = 0 and Var(e_ij) = σ_e², and that b_j and e_ij are independent. From this formulation, we see that we have a two-way design with one fixed effect (times) and one random effect (subjects).
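Because every score from subject j contains the same b_j, two observations made on the same subject are correlated: Cov(Y_sj, Y_tj) = σ_b² for occasions s ≠ t, a point taken up below. A quick simulation (our own illustration; the parameter values are arbitrary, and the number of subjects is made deliberately large so that the sample covariance is stable) makes this visible.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 4, 500                              # occasions; many subjects for a stable estimate
sigma_b, sigma_e = 3.0, 1.0                # illustrative values
alpha = np.array([2.0, 1.0, -1.0, -2.0])   # fixed occasion effects, summing to zero

subj = rng.normal(0.0, sigma_b, size=b)    # subject effects b_j
y = 50 + alpha[:, None] + subj[None, :] + rng.normal(0.0, sigma_e, (a, b))

# Scores at occasions 1 and 2 share each subject's b_j:
print(np.cov(y[0], y[1])[0, 1])            # close to sigma_b^2 = 9
```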
Formally, the model appears to be the same as that of a randomised block design with the subjects forming the blocks, but there is one important difference: the times could not have been randomly assigned within the blocks. Thus, in our example, we have exactly 10 experimental units receiving 4 treatments each. Strictly speaking, with a randomised block design we would have had 40 experimental units, arranged in 10 blocks with four homogeneous units in each. The units within a block would have been assigned at random to the treatments, giving 40 independent observations; the observations are independent both within each block and between blocks. With repeated measures data, the four observations within a block, all made on the same subject, are possibly dependent, even if the treatments can be randomly assigned. Of course, the observations on different subjects are independent.

If the observations made on the same subject are correlated, the data should strictly speaking be handled as a multivariate normal vector with mean µ and covariance matrix Σ. Tests of hypotheses about the mean vector of treatments can be made (Morrison). However, if the covariance matrix Σ has a pattern such that σ_ss + σ_tt − 2σ_st is constant for all pairs of occasions s ≠ t (i.e. Var(Y_sj − Y_tj) is the same for every pair of occasions), then the ANOVA approach used here is valid. It can also be shown that for a small number of subjects the ANOVA test has greater power.

We proceed with the calculations in exactly the way we did for the randomised block design, and obtain the same ANOVA table (consult the earlier notes for the exact formulae and calculations). The ANOVA table is:

    Source               df           MS          Expected Mean Square
    Occasions (fixed)    a−1          MS_O        σ_e² + σ_os² + b ∑ α_i²/(a−1)
    Subjects (random)    b−1          MS_sub      σ_e² + aσ_sub²
    O×S                  (a−1)(b−1)   MS_O×sub    σ_e² + σ_os²

From the expected mean squares column, we see that the hypothesis H0 of no differences between occasions/treatments, i.e. α_1 = α_2 = ... = α_a = 0, can be tested using

    F = MS_O / MS_O×sub ∼ F_{(a−1); (a−1)(b−1)}.

The test is identical to that of the treatment effect in the randomised block design. For this and all other more complex repeated measurement designs, see Winer, Chapter 4. For other methods of handling repeated measurement data, see Hand and Crowder (1996).

References

1. Hand, D. and Crowder, M. (1996). Practical Longitudinal Data Analysis. Chapman and Hall, Texts in Statistical Science.
2. Winer, B.J. (1971). Statistical Principles in Experimental Design. McGraw-Hill. Gives a detailed account of the ANOVA approach to repeated measures.