Observational studies and Experimentation*

The vast majority of sample surveys implemented nowadays may be classified under the broad heading of observational studies. Measurements are made on a range of variables across a range of individuals (the sample). Given the values of the variables that happen to have been observed, relationships between variables will frequently emerge. In fact, in a sensibly planned survey, such relationships will be expected, and the purpose of the survey is to quantify them. For example, in researching attitudes to and interest in using a range of new mobile services, interest may focus on whether individuals are likely to purchase one or other of the proposed new mobile services and, if so, how much they are likely to be prepared to pay for such services. Answers to such questions may influence decision making about introducing new products. Given a positive decision, when it comes to deciding on marketing strategies for such a new product, the relationships of such variables with other variables that might be thought to influence them, for example, income, age and leisure activities, will become important.

However, there is an assumption being made here that the so-called response and explanatory variables are in a cause-and-effect relationship. This means that changes in the values of the explanatory variables, income, age, leisure activities etc., actually cause changes in the response variables, rather than being merely statistically or numerically related to the observed changes in the responses. Such assumptions, while essential if progress is to be made in business decision-making, are fallible in the absence of properly designed and controlled experiments.
Recall the discussion of "Simpson's paradox" in section 7.4, pages 20-21, where a simplistic analysis suggested that there was a relationship between loan size and default rate but where the relationship disappeared when the influence of a third variable, loan type, was taken into account. While the paradox was resolved in that situation, because the so-called "lurking variable" had in fact been observed, there is no guarantee in an observational study that a "lurking variable" has not been overlooked and, therefore, no guarantee that an observed statistical relationship corresponds to cause and effect. It cannot be stressed enough that statistical relationships, of themselves, do not imply cause and effect. It must also be said that the temptation to infer cause-and-effect relationships using business judgement is fraught with danger. As one commentator put it:

"The justification sometimes advanced that a multiple regression analysis on observational data can be relied upon if there is an adequate theoretical background is utterly specious and disregards the unlimited capability of the human intellect for producing plausible explanations by the carload lot." (K.A. Brownlee, Statistical Theory and Methodology in Science and Engineering, Wiley, 1965, page 454)

To ensure as far as possible that an observed relationship does correspond to cause and effect requires a properly designed experiment in a properly controlled environment. The simplest form of experiment involves studying the effect of a single change. In the experiment described in Section 1.9, the change was from one version of a process to another. A classic experiment in retail marketing is comparing sales of goods in large retail stores, using either mid-aisle or end-aisle displays. Drugs manufacturers conduct extensive trials to evaluate new drug treatments.
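The kind of reversal described above is easy to reproduce with a small numerical sketch. The figures below are invented for illustration and are not those of section 7.4; the loan types and counts are hypothetical.

```python
# Hypothetical loan counts, (defaults, loans), broken down by loan type and
# loan size. The figures are invented to illustrate Simpson's paradox.
data = {
    "secured":   {"small": (10, 1000),  "large": (45, 3000)},
    "unsecured": {"small": (180, 2000), "large": (30, 300)},
}

def rate(defaults, loans):
    return defaults / loans

# Aggregate default rate by loan size, ignoring loan type (the "lurking
# variable"): small loans appear riskier.
for size in ("small", "large"):
    d = sum(data[t][size][0] for t in data)
    n = sum(data[t][size][1] for t in data)
    print("aggregate", size, round(rate(d, n), 3))

# Within each loan type, the comparison reverses: large loans have the
# slightly higher default rate in both types.
for t in data:
    for size in ("small", "large"):
        print(t, size, round(rate(*data[t][size]), 3))
```

The aggregate comparison is misleading because unsecured loans, which default more often, happen to be mostly small; once loan type is held fixed, the apparent effect of loan size reverses.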
The simplest of these involves comparison of a new treatment with a "placebo", that is, a version of the treatment that omits the active drug ingredient. In all these examples, two key ingredients are the factor (or design variable), changes in which may bring about a desired effect, and the experimental unit to which the factor is applied. In the examples described, the factors (and their levels) were process (old or new), display location (mid-aisle or end-aisle) and drug (present or absent), respectively.

* From An Introduction to Statistical Analysis for Business and Industry by Michael Stuart, Section 11.4, pp. 356-359. Copyright © 2003.

The experimental unit in the case of the process change experiment described in Section 1.9 was a working day, chosen primarily for convenience. It could be argued that shorter time periods could have been chosen as experimental units, with a view to minimising variation within units, thus allowing possible variation between units associated with the process change to be more evident. However, economic and logistical considerations conspire against this.

In the retail marketing example, the experimental unit could be a time period such as a week, with the factor levels, mid-aisle and end-aisle display, alternating in successive pairs of weeks in a design similar to that used in the process change experiment. Ideally, the alternating pattern is chosen at random, with a view to reducing the effect of any other systematic pattern of variation that might be present without our knowledge. To achieve maximum homogeneity of experimental units, the experiment should be run entirely within a single store, necessarily over several weeks. Alternatively, in a chain of retail stores, a store could constitute an experimental unit, with the entire experiment being run in one week using several stores.
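The random alternation within successive pairs of weeks can be sketched as follows; the number of week pairs is an assumption for illustration, and the same coin-toss allocation applies equally within matched pairs of stores.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

levels = ("mid-aisle", "end-aisle")
n_pairs = 4  # four pairs of weeks, i.e. an eight-week experiment (assumed)

# Within each successive pair of weeks, the order in which the two display
# locations appear is chosen at random, so that any unknown systematic
# pattern over time cannot line up with the factor levels.
schedule = []
for pair in range(n_pairs):
    schedule.extend(random.sample(levels, 2))

for week, level in enumerate(schedule, start=1):
    print(f"week {week}: {level}")
```

Each pair of weeks contains both display locations exactly once; only their order is randomised.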
However, there are likely to be considerable differences between stores, other than a possible factor effect difference, thus making it difficult to detect a factor effect if present. It may be possible to reduce such differences by pairing stores with known similarities, then randomly allocating factor levels to stores within pairs to minimise the possibility of unknown systematic differences between stores affecting the result of the experiment¹.

In the drug testing example, the obvious experimental unit is the individual patient. The reason for administering an inactive form of the drug to those not receiving the active form is to allow for the well-established placebo effect whereby, in many cases, patients show some improvement when they think they have received a treatment, even when the treatment is inactive. Thus the placebo is administered so that all patients think that they are receiving the treatment. However, there is another actor involved, the doctor who prescribes the treatment for the patient, and there is a major ethical problem in that the doctor is required to prescribe a placebo for some patients who might well benefit from the real treatment. Another problem is that, in evaluating the result, the doctor making the diagnosis may be influenced by knowledge of which patient got which treatment.

To overcome these difficulties, treatments are allocated randomly to patients and identified with the patients' names in such a way that the doctor involved does not know which patient gets which treatment. Such experiments, in which neither patient nor doctor knows who got which treatment, are referred to as double blind experiments. They have the effect of sharing the ethical problem with the whole research team and avoiding possible doctor bias, either in prescription or subsequent evaluation.

¹ There is a third possibility in which the effects of differences over time and differences between stores can both be minimised.
This involves the use of what is called a latin square design. This is not pursued here. Details may be found in the Supplements and Extensions page of the book's website.

Randomised blocks

These three apparently simple examples illustrate the care that is needed in designing satisfactory experiments so as to achieve the level of control necessary to be able to infer a cause-and-effect relationship; there are many more complex issues that arise in different circumstances. Two basic principles that emerge are those of blocking and randomisation. When there are known differences between experimental units, it makes sense to group the units into blocks² that are as similar to each other (homogeneous) as possible. In the examples discussed above, blocks consist of pairs. If, in the process change experiment, there were six versions of the process to be compared, it would be sensible to form blocks of the six days in a week and apply each version of the process on one of the days. In order to minimise the effect of any other systematic pattern of variation that might exist unknown to the experimenter, versions of the process are allocated to days of the week at random.

Replication

This whole scheme may then be replicated, that is, repeated over a number of weeks. In theory, the number of replications is chosen so as to give desirable power, in the sense discussed in Section 5.3; the more replications, the better the chances of detecting real effects of the experimental changes. Replication also makes it more likely that random allocation of factor levels to experimental units will deliver the protection it promises. In practice, more often than not, the number of replications is determined by available resources.

The randomised block experimental design is a basic form of design that may be elaborated on in many ways. The design allows valid comparisons between levels of the experimental factor within each block, which may be combined across blocks.
In this way, systematic differences between blocks do not interfere in the assessment of factor effects, while randomisation gives some protection against unknown sources of systematic variation.

² The term originated in agricultural experimentation, where neighbouring experimental plots of land were grouped into relatively homogeneous blocks.
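The randomised block scheme just described, with six versions of the process allocated at random to the six days of each week and the whole arrangement replicated over several weeks, might be sketched as follows; the number of replications and the day labels are assumptions for illustration.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

versions = [f"version {i}" for i in range(1, 7)]  # six versions of the process
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]  # a six-day working week (assumed)
n_replications = 3  # in practice often fixed by available resources

# Each week is a block: every version appears exactly once, allocated to the
# six days in a fresh random order, so that day-of-week differences do not
# interfere with comparisons between versions.
plan = []
for week in range(n_replications):
    order = random.sample(versions, len(versions))
    plan.append(dict(zip(days, order)))

for week, block in enumerate(plan, start=1):
    print(f"week {week}:", block)
```

Comparisons between versions are made within each week and then combined across the weeks, which is what allows the between-week differences to drop out of the assessment.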