Experiment | Randomized experiment An experiment is an orderly procedure carried out with the goal of verifying, refuting, or establishing the validity of a hypothesis. Experiments provide insight into cause-and-effect by demonstrating what outcome occurs when a particular factor is manipulated. Experiments vary greatly in their goal and scale, but always rely on repeatable procedure and logical analysis of the results. There also exist natural experimental studies. A child may carry out basic experiments to understand the nature of gravity, while teams of scientists may take years of systematic investigation to advance the understanding of a phenomenon. Experiments and other types of hands-on activities are very important to student learning in the science classroom. Experiments can raise test scores and help a student become more engaged and interested in the material they are learning, especially when used over time. Experiments can vary from personal and informal natural comparisons (e.g. tasting a range of chocolates to find a favorite), to highly controlled (e.g. tests requiring complex apparatus overseen by many scientists that hope to discover information about subatomic particles). Uses of experiments vary considerably between the natural and human sciences. Experiments typically include controls, which are designed to minimize the effects of variables other than the single independent variable. This increases the reliability of the results, often through a comparison between control measurements and the other measurements. Scientific controls are a part of the scientific method. Ideally, all variables in an experiment will be controlled (accounted for by the control measurements) and none will be uncontrolled. In such an experiment, if all the controls work as expected, it is possible to conclude that the experiment is working as intended and that the results of the experiment are due to the effect of the variable being tested. Overview In the scientific method, an experiment is an empirical procedure that arbitrates between competing models or hypotheses. Researchers also use experimentation to test existing theories or new hypotheses to support or disprove them. An experiment usually tests a hypothesis, which is an expectation about how a particular process or phenomenon works. However, an experiment may also aim to answer a "what-if" question, without a specific expectation about what the experiment will reveal, or to confirm prior results. If an experiment is carefully conducted, the results usually either support or disprove the hypothesis. According to some Philosophies of science, an experiment can never "prove" a hypothesis, it can only add support. Similarly, an experiment that provides a counterexample can disprove a theory or hypothesis. An experiment must also control the possible confounding factors—any factors that would mar the accuracy or repeatability of the experiment or the ability to interpret the results. Confounding is commonly eliminated through scientific controls and/or, in randomized experiments, through random assignment. 1 In engineering and other physical sciences, experiments are a primary component of the scientific method. They are used to test theories and hypotheses about how physical processes work under particular conditions (e.g., whether a particular engineering process can produce a desired chemical compound). Typically, experiments in these fields will focus on replication of identical procedures in hopes of producing identical results in each replication. Random assignment is uncommon. In medicine and the social sciences, the prevalence of experimental research varies widely across disciplines. When used, however, experiments typically follow the form of the clinical trial, where experimental units (usually individual human beings) are randomly assigned to a treatment or control condition where one or more outcomes are assessed. In contrast to norms in the physical sciences, the focus is typically on the average treatment effect (the difference in outcomes between the treatment and control groups) or another test statistic produced by the experiment. A single study will typically not involve replications of the experiment, but separate studies may be aggregated through systematic review and meta-analysis. Of course, these differences between experimental practice in each of the branches of science have exceptions. For example, agricultural research frequently uses randomized experiments (e.g., to test the comparative effectiveness of different fertilizers). Similarly, experimental economics often involves experimental tests of theorized human behaviors without relying on random assignment of individuals to treatment and control conditions. History Francis Bacon (1561–1626), an English philosopher and scientist active in the 17th century, became an early and influential supporter of experimental science. He disagreed with the method of answering scientific questions by deduction and described it as follows: "Having first determined the question according to his will, man then resorts to experience, and bending her to conformity with his placets, leads her about like a captive in a procession." Bacon wanted a method that relied on repeatable observations, or experiments. Notably, he first ordered the scientific method as we understand it today. There remains simple experience; which, if taken as it comes, is called accident, if sought for, experiment. The true method of experience first lights the candle [hypothesis], and then by means of the candle shows the way [arranges and delimits the experiment]; commencing as it does with experience duly ordered and digested, not bungling or erratic, and from it deducing axioms [theories], and from established axioms again new experiments. — Francis Bacon. Novum Organum. 1620. In the centuries that followed, people who applied the scientific method in different areas made important advances and discoveries. For example, Galileo Galilei (1564-1642) accurately measured time and experimented to make accurate measurements and conclusions about the speed of a falling body. Antoine Lavoisier (1743-1794), a French chemist, used experiment to describe new areas, such as combustion and biochemistry and to develop the 2 theory of conservation of mass (matter). Louis Pasteur (1822-1895) used the scientific method to disprove the prevailing theory of spontaneous generation and to develop the germ theory of disease. Because of the importance of controlling potentially confounding variables, the use of well-designed laboratory experiments is preferred when possible. A considerable amount of progress on the design and analysis of experiments occurred in the early 20th century, with contributions from statisticians such as Ronald Fisher (1890-1962), Jerzy Neyman (1894-1981), Oscar Kempthorne (1919-2000), Gertrude Mary Cox (19001978), and William Gemmell Cochran (1909-1980), among others. This early work has largely been synthesized under the label of the Rubin causal model, which formalizes earlier statistical approaches to the analysis of experiments. TYPES OF EXPERIMENT Experiments might be categorized according to a number of dimensions, depending upon professional norms and standards in different fields of study. In some disciplines (e.g., Psychology or Political Science), a 'true experiment' is a method of social research in which there are two kinds of variables. The independent variable is manipulated by the experimenter, and the dependent variable is measured. The signifying characteristic of a true experiment is that it randomly allocates the subjects in order to neutralize the potential for experimenter bias, and ensures, over a large number of iterations of the experiment, that all confounding factors are controlled for. CONTROLLED EXPERIMENTS A controlled experiment often compares the results obtained from experimental samples against control samples, which are practically identical to the experimental sample except for the one aspect whose effect is being tested (the independent variable). A good example would be a drug trial. The sample or group receiving the drug would be the experimental group (treatment group); and the one receiving the placebo or regular treatment would be the control one. In many laboratory experiments it is good practice to have several replicate samples for the test being performed and have both a positive control and a negative control. The results from replicate samples can often be averaged, or if one of the replicates is obviously inconsistent with the results from the other samples, it can be discarded as being the result of an experimental error (some step of the test procedure may have been mistakenly omitted for that sample). Most often, tests are done in duplicate or triplicate. A positive control is a procedure that is very similar to the actual experimental test but which is known from previous experience to give a positive result. A negative control is known to give a negative result. The positive control confirms that the basic conditions of the experiment were able to produce a positive result, even if none of the actual experimental samples produce a positive result. The negative control demonstrates the base-line result obtained when a test does not produce a measurable positive result. Most often the value of the negative control is treated as a "background" value to subtract from the test sample results. Sometimes the positive control takes the quadrant of a standard curve. 3 An example that is often used in teaching laboratories is a controlled protein assay. Students might be given a fluid sample containing an unknown (to the student) amount of protein. It is their job to correctly perform a controlled experiment in which they determine the concentration of protein in fluid sample (usually called the "unknown sample"). The teaching lab would be equipped with a protein standard solution with a known protein concentration. Students could make several positive control samples containing various dilutions of the protein standard. Negative control samples would contain all of the reagents for the protein assay but no protein. In this example, all samples are performed in duplicate. The assay is a colorimetric assay in which a spectrophotometer can measure the amount of protein in samples by detecting a colored complex formed by the interaction of protein molecules and molecules of an added dye. In the illustration, the results for the diluted test samples can be compared to the results of the standard curve (the blue line in the illustration) in order to determine an estimate of the amount of protein in the unknown sample. Controlled experiments can be performed when it is difficult to exactly control all the conditions in an experiment. In this case, the experiment begins by creating two or more sample groups that are probabilistically equivalent, which means that measurements of traits should be similar among the groups and that the groups should respond in the same manner if given the same treatment. This equivalency is determined by statistical methods that take into account the amount of variation between individuals and the number of individuals in each group. In fields such as microbiology and chemistry, where there is very little variation between individuals and the group size is easily in the millions, these statistical methods are often bypassed and simply splitting a solution into equal parts is assumed to produce identical sample groups. Once equivalent groups have been formed, the experimenter tries to treat them identically except for the one variable that he or she wishes to isolate. Human experimentation requires special safeguards against outside variables such as the placebo effect. Such experiments are generally double blind, meaning that neither the volunteer nor the researcher knows which individuals are in the control group or the experimental group until after all of the data have been collected. This ensures that any effects on the volunteer are due to the treatment itself and are not a response to the knowledge that he is being treated. In human experiments, researchers may give a subject (person) a stimulus that the subject responds to. The goal of the experiment is to measure the response to the stimulus by a test method. In the design of experiments, two or more "treatments" are applied to estimate the difference between the mean responses for the treatments. For example, an experiment on baking bread could estimate the difference in the responses associated with quantitative variables, such as the ratio of water to flour, and with qualitative variables, such as strains of yeast. Experimentation is the step in the scientific method that helps people decide between two or more competing explanations – or hypotheses. These hypotheses suggest reasons to explain a phenomenon, or predict the results of an action. An example might be the hypothesis that "if I release this ball, it will fall to the floor": this suggestion can then be tested by carrying out the experiment of letting go of the ball, and observing the results. Formally, a hypothesis is compared against its opposite or null hypothesis ("if I release this ball, it will not fall to the floor"). The null hypothesis is that there is no explanation or predictive power of the 4 phenomenon through the reasoning that is being investigated. Once hypotheses are defined, an experiment can be carried out - and the results analysed - in order to confirm, refute, or define the accuracy of the hypotheses. NATURAL EXPERIMENTS The term "experiment" usually implies a controlled experiment, but sometimes controlled experiments are prohibitively difficult or impossible. In this case researchers resort to natural experiments or quasi-experiments. Natural experiments rely solely on observations of the variables of the system under study, rather than manipulation of just one or a few variables as occurs in controlled experiments. To the degree possible, they attempt to collect data for the system in such a way that contribution from all variables can be determined, and where the effects of variation in certain variables remain approximately constant so that the effects of other variables can be discerned. The degree to which this is possible depends on the observed correlation between explanatory variables in the observed data. When these variables are not well correlated, natural experiments can approach the power of controlled experiments. Usually, however, there is some correlation between these variables, which reduces the reliability of natural experiments relative to what could be concluded if a controlled experiment were performed. Also, because natural experiments usually take place in uncontrolled environments, variables from undetected sources are neither measured nor held constant, and these may produce illusory correlations in variables under study. Much research in several important science disciplines, including economics, political science, geology, paleontology, ecology, meteorology, and astronomy, relies on quasiexperiments. For example, in astronomy it is clearly impossible, when testing the hypothesis "suns are collapsed clouds of hydrogen", to start out with a giant cloud of hydrogen, and then perform the experiment of waiting a few billion years for it to form a sun. However, by observing various clouds of hydrogen in various states of collapse, and other implications of the hypothesis (for example, the presence of various spectral emissions from the light of stars), we can collect data we require to support the hypothesis. An early example of this type of experiment was the first verification in the 17th century that light does not travel from place to place instantaneously, but instead has a measurable speed. Observation of the appearance of the moons of Jupiter were slightly delayed when Jupiter was farther from Earth, as opposed to when Jupiter was closer to Earth; and this phenomenon was used to demonstrate that the difference in the time of appearance of the moons was consistent with a measurable speed. 5 FIELD EXPERIMENTS Field experiments are so named in order to draw a contrast with laboratory experiments, which enforce scientific control by testing a hypothesis in the artificial and highly controlled setting of a laboratory. Often used in the social sciences, and especially in economic analyses of education and health interventions, field experiments have the advantage that outcomes are observed in a natural setting rather than in a contrived laboratory environment. For this reason, field experiments are sometimes seen as having higher external validity than laboratory experiments. However, like natural experiments, field experiments suffer from the possibility of contamination: experimental conditions can be controlled with more precision and certainty in the lab. Yet some phenomena (e.g., voter turnout in an election) cannot be easily studied in a laboratory. Contrast with observational study An observational study is used when it is impractical, unethical, cost-prohibitive (or otherwise inefficient) to fit a physical or social system into a laboratory setting, to completely control confounding factors, or to apply random assignment. It can also be used when confounding factors are either limited or known well enough to analyze the data in light of them (though this may be rare when social phenomena are under examination). In order for an observational science to be valid, confounding factors must be known and accounted for. In these situations, observational studies have value because they often suggest hypotheses that can be tested with randomized experiments or by collecting fresh data. Fundamentally, however, observational studies are not experiments. By definition, observational studies lack the manipulation required for Baconian experiments. In addition, observational studies (e.g., in biological or social systems) often involve variables that are difficult to quantify or control. Observational studies are limited because they lack the statistical properties of randomized experiments. In a randomized experiment, the method of randomization specified in the experimental protocol guides the statistical analysis, which is usually specified also by the experimental protocol. Without a statistical model that reflects an objective randomization, the statistical analysis relies on a subjective model. Inferences from subjective models are unreliable in theory and practice. In fact, there are several cases where carefully conducted observational studies consistently give wrong results, that is, where the results of the observational studies are inconsistent and also differ from the results of experiments. For example, epidemiological studies of colon cancer consistently show beneficial correlations with broccoli consumption, while experiments find no benefit. A particular problem with observational studies involving human subjects is the great difficulty attaining fair comparisons between treatments (or exposures), because such studies are prone to selection bias, and groups receiving different treatments (exposures) may differ greatly according to their covariates (age, height, weight, medications, exercise, nutritional 6 status, ethnicity, family medical history, etc.). In contrast, randomization implies that for each covariate, the mean for each group is expected to be the same. For any randomized trial, some variation from the mean is expected, of course, but the randomization ensures that the experimental groups have mean values that are close, due to the central limit theorem and Markov's inequality. With inadequate randomization or low sample size, the systematic variation in covariates between the treatment groups (or exposure groups) makes it difficult to separate the effect of the treatment (exposure) from the effects of the other covariates, most of which have not been measured. The mathematical models used to analyze such data must consider each differing covariate (if measured), and the results will not be meaningful if a covariate is neither randomized nor included in the model. To avoid conditions that render an experiment far less useful, physicians conducting medical trials, say for U.S. Food and Drug Administration approval, will quantify and randomize the covariates that can be identified. Researchers attempt to reduce the biases of observational studies with complicated statistical methods such as propensity score matching methods, which require large populations of subjects and extensive information on covariates. Outcomes are also quantified when possible (bone density, the amount of some cell or substance in the blood, physical strength or endurance, etc.) and not based on a subject's or a professional observer's opinion. In this way, the design of an observational study can render the results more objective and therefore, more convincing. Ethics By placing the distribution of the independent variable(s) under the control of the researcher, an experiment - particularly when it involves human subjects - introduces potential ethical considerations, such as balancing benefit and harm, fairly distributing interventions (e.g., treatments for a disease), and informed consent. For example in psychology or health care, it is unethical to provide a substandard treatment to patients. Therefore, ethical review boards are supposed to stop clinical trials and other experiments unless a new treatment is believed to offer benefits as good as current best practice. It is also generally unethical (and often illegal) to conduct randomized experiments on the effects of substandard or harmful treatments, such as the effects of ingesting arsenic on human health. To understand the effects of such exposures, scientists sometimes use observational studies to understand the effects of those factors. Even when experimental research does not directly involve human subjects, it may still present ethical concerns. For example, the nuclear bomb experiments conducted by the Manhattan Project implied the use of nuclear reactions to harm human beings even though the experiments did not directly involve any human subjects. 7 Design of experiments Design of experiments with full factorial design (left), response surface with second-degree polynomial (right) In general usage, design of experiments (DOE) or experimental design is the design of any information-gathering exercises where variation is present, whether under the full control of the experimenter or not. However, in statistics, these terms are usually used for controlled experiments. Formal planned experimentation is often used in evaluating physical objects, chemical formulations, structures, components, and materials. Other types of study, and their design, are discussed in the articles on computer experiments, opinion polls and statistical surveys (which are types of observational study), natural experiments and quasi-experiments (for example, quasi-experimental design). See Experiment for the distinction between these types of experiments or studies. In the design of experiments, the experimenter is often interested in the effect of some process or intervention (the "treatment") on some objects (the "experimental units"), which may be people, parts of people, groups of people, plants, animals, etc. Design of experiments is thus a discipline that has very broad application across all the natural and social sciences and engineering. History SYSTEMATIC CLINICAL TRIALS In 1747, while serving as surgeon on HMS Salisbury, James Lind carried out a systematic clinical trial to compare remiedies for scurvy. Lind selected 12 men from the ship, all suffering from scurvy. Lind limited his subjects to men who "were as similar as I could have them", that is he provided strict entry requirements to reduce extraneous variation. He divided them into six pairs, giving each pair different supplements to their basic diet for two weeks. 8 The treatments were all remedies that had been proposed: A quart of cider every day Twenty five gutts (drops) of vitriol (sulphuric acid) three times a day upon an empty stomach One half-pint of seawater every day A mixture of garlic, mustard, and horseradish in a lump the size of a nutmeg Two spoonfuls of vinegar three times a day Two oranges and one lemon every day The citrus treatment stopped after six days when they ran out of fruit, but by that time one sailor was fit for duty while the other had almost recovered. Apart from that, only group one (cider) showed some effect of its treatment. The remainder of the crew presumably served as a control, but Lind did not report results from any control (untreated) group. STATISTICAL EXPERIMENTS, FOLLOWING CHARLES S. PEIRCE A theory of statistical inference was developed by Charles S. Peirce in "Illustrations of the Logic of Science" (1877–1878) and "A Theory of Probable Inference" (1883), two publications that emphasized the importance of randomization-based inference in statistics. RANDOMIZED EXPERIMENTS Charles S. Peirce randomly assigned volunteers to a blinded, repeated-measures design to evaluate their ability to discriminate weights. Peirce's experiment inspired other researchers in psychology and education, which developed a research tradition of randomized experiments in laboratories and specialized textbooks in the 1800s. OPTIMAL DESIGNS FOR REGRESSION MODELS Charles S. Peirce also contributed the first English-language publication on an optimal design for regression models in 1876. A pioneering optimal design for polynomial regression was suggested by Gergonne in 1815. In 1918 Kirstine Smith published optimal designs for polynomials of degree six (and less). Sequences of experiments The use of a sequence of experiments, where the design of each may depend on the results of previous experiments, including the possible decision to stop experimenting, is within the scope of Sequential analysis, a field that was pioneered by Abraham Wald in the context of sequential tests of statistical hypotheses. 9 Herman Chernoff wrote an overview of optimal sequential designs, while adaptive designs have been surveyed by S. Zacks. One specific type of sequential design is the "two-armed bandit", generalized to the multi-armed bandit, on which early work was done by Herbert Robbins in 1952. Principles of experimental design, following Ronald A. Fisher A methodology for designing experiments was proposed by Ronald A. Fisher, in his innovative books: "The Arrangement of Field Experiments" (1926) and The Design of Experiments (1935). Much of his pioneering work dealt with agricultural applications of statistical methods. As a mundane example, he described how to test the hypothesis that a certain lady could distinguish by flavour alone whether the milk or the tea was first placed in the cup (AKA the "Lady tasting tea" experiment). These methods have been broadly adapted in the physical and social sciences, and are still used in agricultural engineering. The concepts presented here differ from the design and analysis of computer experiments. COMPARISON In some fields of study it is not possible to have independent measurements to a traceable standard. Comparisons between treatments are much more valuable and are usually preferable. Often one compares against a scientific control or traditional treatment that acts as baseline. RANDOMIZATION Random assignment is the process of assigning individuals at random to groups or to different groups in an experiment. The random assignment of individuals to groups (or conditions within a group) distinguishes a rigorous, "true" experiment from an observational study or "quasi-experiment". There is an extensive body of mathematical theory that explores the consequences of making the allocation of units to treatments by means of some random mechanism such as tables of random numbers, or the use of randomization devices such as playing cards or dice. Assigning units to treatments at random tends to mitigate confounding, which makes effects due to factors other than the treatment to appear to result from the treatment. The risks associated with random allocation (such as having a serious imbalance in a key characteristic between a treatment group and a control group) are calculable and hence can be managed down to an acceptable level by using enough experimental units. The results of an experiment can be generalized reliably from the experimental units to a larger population of units only if the experimental units are a random sample from the larger population; the probable error of such an extrapolation depends on the sample size, among other things. Random does not mean haphazard, and great care must be taken that appropriate random methods are used. 10 REPLICATION Measurements are usually subject to variation and uncertainty. Measurements are repeated and full experiments are replicated to help identify the sources of variation, to better estimate the true effects of treatments, to further strengthen the experiment's reliability and validity, and to add to the existing knowledge of the topic. However, certain conditions must be met before the replication of the experiment is commenced: the original research question has been published in a peer-reviewed journal or widely cited, the researcher is independent of the original experiment, the researcher must first try to replicate the original findings using the original data, and the write-up should state that the study conducted is a replication study that tried to follow the original study as strictly as possible. BLOCKING Blocking is the arrangement of experimental units into groups (blocks/lots) consisting of units that are similar to one another. Blocking reduces known but irrelevant sources of variation between units and thus allows greater precision in the estimation of the source of variation under study. ORTHOGONALITY Orthogonality concerns the forms of comparison (contrasts) that can be legitimately and efficiently carried out. Contrasts can be represented by vectors and sets of orthogonal contrasts are uncorrelated and independently distributed if the data are normal. Because of this independence, each orthogonal treatment provides different information to the others. If there are T treatments and T – 1 orthogonal contrasts, all the information that can be captured from the experiment is obtainable from the set of contrasts. FACTORIAL EXPERIMENTS Use of factorial experiments instead of the one-factor-at-a-time method. These are efficient at evaluating the effects and possible interactions of several factors (independent variables). Analysis of experiment design is built on the foundation of the analysis of variance, a collection of models that partition the observed variance into components, according to what factors the experiment must estimate or test. Example Weights of eight objects are measured using a pan balance and set of standard weights. Each weighing measures the weight difference between objects in the left pan vs. any objects in the right pan by adding calibrated weights to the lighter pan until the balance is in equilibrium. Each measurement has a random error. The average error is zero; the standard deviations of the probability distribution of the errors is the same number σ on different weighings; and errors on different weighings are independent. 11 Denote the true weights by We consider two different experiments: 1. Weigh each object in one pan, with the other pan empty. Let Xi be the measured weight of the ith object, for i = 1, ..., 8. 2. Do the eight weighings according to the following schedule and let Yi be the measured difference for i = 1, ..., 8: Then the estimated value of the weight θ1 is Similar estimates can be found for the weights of the other items. For example The question of design of experiments is: which experiment is better? The variance of the estimate X1 of θ1 is σ2 if we use the first experiment. But if we use the second experiment, the variance of the estimate given above is σ2/8. Thus the second experiment gives us 8 times as much precision for the estimate of a single item, and estimates all items simultaneously, with the same precision. What the second experiment achieves with eight would require 64 weighings if the items are weighed separately. However, note that the estimates for the items obtained in the second experiment have errors that correlate with each other. Many problems of the design of experiments involve combinatorial designs, as in this example and others. 12 Avoiding false positives False positive conclusions, often resulting from the pressure to publish or the author's own confirmation bias, are an inherent hazard in many fields, and experimental designs with undisclosed degrees of freedom are a problem. This can lead to conscious or unconscious "phacking": trying multiple things until you get the desired result. It typically involves the manipulation - perhaps unconsciously - of the process of statistical analysis and the degrees of freedom until they return a figure below the p<.05 level of statistical significance. So the design of the experiment should include a clear statement proposing the analyses to be undertaken. Clear and complete documentation of the experimental methodology is also important in order to support replication of results. Discussion topics when setting up an experimental design An experimental design or randomized clinical trial requires careful consideration of several factors before actually doing the experiment. An experimental design is the laying out of a detailed experimental plan in advance of doing the experiment. Some of the following topics have already been discussed in the principles of experimental design section: 1. How many factors does the design have? and are the levels of these factors fixed or random? 2. Are control conditions needed, and what should they be? 3. Manipulation checks; did the manipulation really work? 4. What are the background variables? 5. What is the sample size. How many units must be collected for the experiment to be generalisable and have enough power? 6. What is the relevance of interactions between factors? 7. What is the influence of delayed effects of substantive factors on outcomes? 8. How do response shifts affect self-report measures? 9. How feasible is repeated administration of the same measurement instruments to the same units at different occasions, with a post-test and follow-up tests? 10. What about using a proxy pretest? 11. Are there lurking variables? 12. Should the client/patient, researcher or even the analyst of the data be blind to conditions? 13. What is the feasibility of subsequent application of different conditions to the same units? 14. How many of each control and noise factors should be taken into account? 13 Statistical control It is best that a process be in reasonable statistical control prior to conducting designed experiments. When this is not possible, proper blocking, replication, and randomization allow for the careful conduct of designed experiments. To control for nuisance variables, researchers institute control checks as additional measures. Investigators should ensure that uncontrolled influences (e.g., source credibility perception) do not skew the findings of the study. A manipulation check is one example of a control check. Manipulation checks allow investigators to isolate the chief variables to strengthen support that these variables are operating as planned. One of the most important requirements of experimental research designs is the necessity of eliminating the effects of spurious, intervening, and antecedent variables. In the most basic model, cause (X) leads to effect (Y). But there could be a third variable (Z) that influences (Y), and X might not be the true cause at all. Z is said to be a spurious variable and must be controlled for. The same is true for intervening variables (a variable in between the supposed cause (X) and the effect (Y)), and anteceding variables (a variable prior to the supposed cause (X) that is the true cause). When a third variable is involved and has not been controlled for, the relation is said to be a zero order relationship. In most practical applications of experimental research designs there are several causes (X1, X2, X3). In most designs, only one of these causes is manipulated at a time. Experimental designs after Fisher Some efficient designs for estimating several main effects were found independently and in near succession by Raj Chandra Bose and K. Kishen in 1940 at the Indian Statistical Institute, but remained little known until the Plackett-Burman designs were published in Biometrika in 1946. About the same time, C. R. Rao introduced the concepts of orthogonal arrays as experimental designs. This concept played a central role in the development of Taguchi methods by Genichi Taguchi, which took place during his visit to Indian Statistical Institute in early 1950s. His methods were successfully applied and adopted by Japanese and Indian industries and subsequently were also embraced by US industry albeit with some reservations. In 1950, Gertrude Mary Cox and William Gemmell Cochran published the book Experimental Designs, which became the major reference work on the design of experiments for statisticians for years afterwards. Developments of the theory of linear models have encompassed and surpassed the cases that concerned early writers. Today, the theory rests on advanced topics in linear algebra, algebra and combinatorics. As with other branches of statistics, experimental design is pursued using both frequentist and Bayesian approaches: In evaluating statistical procedures like experimental designs, frequentist statistics studies the sampling distribution while Bayesian statistics updates a probability distribution on the parameter space. 14 Some important contributors to the field of experimental designs are C. S. Peirce, R. A. Fisher, F. Yates, C. R. Rao, R. C. Bose, J. N. Srivastava, Shrikhande S. S., D. Raghavarao, W. G. Cochran, O. Kempthorne, W. T. Federer, V. V. Fedorov, A. S. Hedayat, J. A. Nelder, R. A. Bailey, J. Kiefer, W. J. Studden, A. Pázman, F. Pukelsheim, D. R. Cox, H. P. Wynn, A. C. Atkinson, G. E. P. Box and G. Taguchi. The textbooks of D. Montgomery and R. Myers have reached generations of students and practitioners. Human participant experimental design constraints Laws and ethical considerations preclude some carefully designed experiments with human subjects. Legal constraints are dependent on jurisdiction. Constraints may involve institutional review boards, informed consent and confidentiality affecting both clinical (medical) trials and behavioral and social science experiments. In the field of toxicology, for example, experimentation is performed on laboratory animals with the goal of defining safe exposure limits for humans. Balancing the constraints are views from the medical field. Regarding the randomization of patients, "... if no one knows which therapy is better, there is no ethical imperative to use one therapy or another." (p 380) Regarding experimental design, "...it is clearly not ethical to place subjects at risk to collect data in a poorly designed study when this situation can be easily avoided...". (p 393) Randomized experiment Flowchart of four phases (enrollment, intervention allocation, follow-up, and data analysis) of a parallel randomized trial of two groups, modified from the CONSORT 2010 Statement In science, randomized experiments are the experiments that allow the greatest reliability and validity of statistical estimates of treatment effects. Randomization-based inference is especially important in experimental design and in survey sampling. 15 Overview In the statistical theory of design of experiments, randomization involves randomly allocating the experimental units across the treatment groups. For example, if an experiment compares a new drug against a standard drug, then the patients should be allocated to either the new drug or to the standard drug control using randomization. Randomized experimentation is not haphazard. Randomization reduces bias by equalising other factors that have not been explicitly accounted for in the experimental design (according to the law of large numbers). Randomization also produces ignorable designs, which are valuable in model-based statistical inference, especially Bayesian or likelihood-based. In the design of experiments, the simplest design for comparing treatments is the "completely randomized design". Some "restriction on randomization" can occur with blocking and experiments that have hard-to-change factors; additional restrictions on randomization can occur when a full randomization is infeasible or when it is desirable to reduce the variance of estimators of selected effects. Randomization of treatment in clinical trials pose ethical problems. In some cases, randomization reduces the therapeutic options for both physician and patient, and so randomization requires clinical equipoise regarding the treatments. Online Randomized Controlled Experiments Web sites can run randomized controlled experiments to create a feedback loop. Key differences between offline experimentation and online experiments include: Logging: user interactions can be logged reliably. Number of users: large sites, such as Amazon, Bing/Microsoft, and Google run experiments, each with over a million users. Number of concurrent experiments: large sites run tens of overlapping, or concurrent, experiments. Robots, whether web crawlers from valid sources or malicious internet bots. Ability to ramp-up experiments from low percentages to higher percentages. Speed / performance has significant impact on key metrics. Ability to use the pre-experiment period as an A/A test to reduce variance. History The earliest controlled experiments appears to have been suggested in the Old Testament's Book of Daniel. King Nebuchadnezzar proposed that some Israelites eat "a daily amount of food and wine from the king's table." Daniel preferred a vegetarian diet, but the official was concerned that the king would "see you looking worse than the other young men your age? The king would then have my head because of you." Daniel then proposed the following 16 controlled experiment: "Test your servants for ten days. Give us nothing but vegetables to eat and water to drink. Then compare our appearance with that of the young men who eat the royal food, and treat your servants in accordance with what you see” (Daniel 1, 12– 13). Randomized experiments were institutionalized in psychology and education in the late eighteen-hundreds, following the invention of randomized experiments by C. S. Peirce. Outside of psychology and education, randomized experiments were popularized by R.A. Fisher in his book Statistical Methods for Research Workers, which also introduced additional principles of experimental design. Statistical Interpretation The Rubin Causal Model provides a common way to describe a randomized experiment. While the Rubin Causal Model provides a framework for defining the causal parameters (i.e., the effects of a randomized treatment on an outcome), the analysis of experiments can take a number of forms. Most commonly, randomized experiments are analyzed using ANOVA, Student's t-test, Regression analysis, or a similar statistical test. Empirical evidence that randomization makes a difference Empirically differences between randomized and non-randomized studies, and between adequately and inadequately randomized trials have been difficult to detect. Random assignment RANDOM ASSIGNMENT OR RANDOM PLACEMENT is an experimental technique for assigning human participants or animal subjects to different groups in an experiment (e.g., a treatment group versus a control group) using randomization, such as by a chance procedure (e.g., flipping a coin) or a random number generator. This ensures that each participant or subject has an equal chance of being placed in any group. Random assignment of participants helps to ensure that any differences between and within the groups are not systematic at the outset of the experiment. Thus, any differences between groups recorded at the end of the experiment can be more confidently attributed to the experimental procedures or treatment. Random assignment, blinding, and controlling are key aspects of the design of experiments, because they help ensure that the results are not spurious or deceptive via confounding. This is why randomized controlled trials are vital in clinical research, especially ones that can be double-blinded and placebo-controlled. Mathematically, there are distinctions between randomization, pseudorandomization, and quasirandomization, as well as between random number generators and pseudorandom number generators. How much these differences matter in experiments (such as clinical trials) is a matter of trial design and statistical rigor, which affect evidence grading. Studies done without randomization include quasi-experiments and open-label trials; those done with 17 pseudo- or quasirandomization are usually given nearly the same weight as those with true randomization but are viewed with a bit more caution. Benefits of random assignment Imagine an experiment in which the participants are not randomly assigned; perhaps the first 10 people to arrive are assigned to the Experimental Group, and the last 10 people to arrive are assigned to the Control group. At the end of the experiment, the experimenter finds differences between the Experimental group and the Control group, and claims these differences are a result of the experimental procedure. However, they also may be due to some other preexisting attribute of the participants, e.g. people who arrive early versus people who arrive late. Imagine the experimenter instead uses a coin flip to randomly assign participants. If the coin lands heads-up, the participant is assigned to the Experimental Group. If the coin lands tailsup, the participant is assigned to the Control Group. At the end of the experiment, the experimenter finds differences between the Experimental group and the Control group. Because each participant had an equal chance of being placed in any group, it is unlikely the differences could be attributable to some other preexisting attribute of the participant, e.g. those who arrived on time versus late. Potential issues Random assignment does not guarantee that the groups are matched or equivalent. The groups may still differ on some preexisting attribute due to chance. The use of random assignment cannot eliminate this possibility, but it greatly reduces it. To express this same idea statistically - If a randomly assigned group is compared to the mean it may be discovered that they differ, even though they were assigned from the same group. If a test of statistical significance is applied to randomly assigned groups to test the difference between sample means against the null hypothesis that they are equal to the same population mean (i.e., population mean of differences = 0), given the probability distribution, the null hypothesis will sometimes be "rejected," that is, deemed not plausible. That is, the groups will be sufficiently different on the variable tested to conclude statistically that they did not come from the same population, even though, procedurally, they were assigned from the same total group. For example, using random assignment may create an assignment to groups that has 20 blue-eyed people and 5 brown-eyed people in one group. This is a rare event under random assignment, but it could happen, and when it does it might add some doubt to the causal agent in the experimental hypothesis. 18 Random Sampling Random sampling is a related, but distinct process. Random sampling refers to recruiting participants in a way that they represent a larger population. Because most basic statistical tests require the hypothesis of an independent randomly sampled population, random assignment is the desired assignment method because it provides control for all attributes of the members of the samples—in contrast to matching on only one or more variables—and provides the mathematical basis for estimating the likelihood of group equivalence for characteristics one is interested in, both for pretreatment checks on equivalence and the evaluation of post treatment results using inferential statistics. More advanced statistical modeling can be used to adapt the inference to the sampling method. History Randomization was emphasized in the theory of statistical inference of Charles S. Peirce in "Illustrations of the Logic of Science" (1877–1878) and "A Theory of Probable Inference" (1883). Peirce applied randomization in the Peirce-Jastrow experiment on weight perception. Charles S. Peirce randomly assigned volunteers to a blinded, repeated-measures design to evaluate their ability to discriminate weights. Peirce's experiment inspired other researchers in psychology and education, which developed a research tradition of randomized experiments in laboratories and specialized textbooks in the eighteen-hundreds. Jerzy Neyman advocated randomization in survey sampling (1934) and in experiments (1923). Ronald A. Fisher advocated randomization in his book on experimental design (1935). Replication (statistics) In engineering, science, and statistics, replication is the repetition of an experimental condition so that the variability associated with the phenomenon can be estimated. ASTM, in standard E1847, defines replication as "the repetition of the set of all the treatment combinations to be compared in an experiment. Each of the repetitions is called a replicate." Replication is not the same as repeated measurements of the same item: they are dealt with differently in statistical experimental design and data analysis. For proper sampling, a process or batch of products should be in reasonable statistical control; inherent random variation is present but variation due to assignable (special) causes is not. Evaluation or testing of a single item does not allow for item-to-item variation and may not represent the batch or process. Replication is needed to account for this variation among items and treatments. 19 Example As an example, consider a continuous process which produces items. Batches of items are then processed or treated. Finally, tests or measurements are conducted. Several options might be available to obtain ten test values. Some possibilities are: One finished and treated item might be measured repeatedly to obtain ten test results. Only one item was measured so there is no replication. The repeated measurements help identify observational error. Ten finished and treated items might be taken from a batch and each measured once. This is not full replication because the ten samples are not random and not representative of the continuous nor batch processing. Five items are taken from the continuous process based on sound statistical sampling. These are processed in a batch and tested twice each. This includes replication of initial samples but does not allow for batch-to-batch variation in processing. The repeated tests on each provide some measure and control of testing error. Five items are taken from the continuous process based on sound statistical sampling. These are processed in five different batches and tested twice each. This plan includes proper replication of initial samples and also includes batch-to-batch variation. The repeated tests on each provide some measure and control of testing error. Each option would call for different data analysis methods and yield different conclusions. Factorial experiment In statistics, a full factorial experiment is an experiment whose design consists of two or more factors, each with discrete possible values or "levels", and whose experimental units take on all possible combinations of these levels across all such factors. A full factorial design may also be called a fully crossed design. Such an experiment allows the investigator to study the effect of each factor on the response variable, as well as the effects of interactions between factors on the response variable. For the vast majority of factorial experiments, each factor has only two levels. For example, with two factors each taking two levels, a factorial experiment would have four treatment combinations in total, and is usually called a 2×2 factorial design. If the number of combinations in a full factorial design is too high to be logistically feasible, a fractional factorial design may be done, in which some of the possible combinations (usually at least half) are omitted. 20 History Factorial designs were used in the 19th century by John Bennet Lawes and Joseph Henry Gilbert of the Rothamsted Experimental Station. Ronald Fisher argued in 1926 that "complex" designs (such as factorial designs) were more efficient than studying one factor at a time. Fisher wrote, "No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question, at a time. The writer is convinced that this view is wholly mistaken." Nature, he suggests, will best respond to "a logical and carefully thought out questionnaire". A factorial design allows the effect of several factors and even interactions between them to be determined with the same number of trials as are necessary to determine any one of the effects by itself with the same degree of accuracy. Frank Yates made significant contributions, particularly in the analysis of designs, by the Yates analysis. The term "factorial" may not have been used in print before 1935, when Fisher used it in his book The Design of Experiments. Example The simplest factorial experiment contains two levels for each of two factors. Suppose an engineer wishes to study the total power used by each of two different motors, A and B, running at each of two different speeds, 2000 or 3000 RPM. The factorial experiment would consist of four experimental units: motor A at 2000 RPM, motor B at 2000 RPM, motor A at 3000 RPM, and motor B at 3000 RPM. Each combination of a single level selected from every factor is present once. This experiment is an example of a 22 (or 2x2) factorial experiment, so named because it considers two levels (the base) for each of two factors (the power or superscript), or #levels#factors, producing 22=4 factorial points. 21 Designs can involve many independent variables. As a further example, the effects of three input variables can be evaluated in eight experimental conditions shown as the corners of a cube. This can be conducted with or without replication, depending on its intended purpose and available resources. It will provide the effects of the three independent variables on the dependent variable and possible interactions. Notation 2×2 factorial experiment A B (1) − − a + − b − + ab + + The notation used to denote factorial experiments conveys a lot of information. When a design is denoted a 23 factorial, this identifies the number of factors (3); how many levels each factor has (2); and how many experimental conditions there are in the design (23=8). Similarly, a 25 design has five factors, each with two levels, and 25=32 experimental conditions; and a 32 design has two factors, each with three levels, and 32=9 experimental conditions. Factorial experiments can involve factors with different numbers of levels. A 243 design has five factors, four with two levels and one with three levels, and has 16 X 3=48 experimental conditions. 22 To save space, the points in a two-level factorial experiment are often abbreviated with strings of plus and minus signs. The strings have as many symbols as factors, and their values dictate the level of each factor: conventionally, for the first (or low) level, and for the second (or high) level. The points in this experiment can thus be represented as , , , and . The factorial points can also be abbreviated by (1), a, b, and ab, where the presence of a letter indicates that the specified factor is at its high (or second) level and the absence of a letter indicates that the specified factor is at its low (or first) level (for example, "a" indicates that factor A is on its high setting, while all other factors are at their low (or first) setting). (1) is used to indicate that all factors are at their lowest (or first) values. Implementation For more than two factors, a 2k factorial experiment can be usually recursively designed from a 2k-1 factorial experiment by replicating the 2k-1 experiment, assigning the first replicate to the first (or low) level of the new factor, and the second replicate to the second (or high) level. This framework can be generalized to, e.g., designing three replicates for three level factors, etc. A factorial experiment allows for estimation of experimental error in two ways. The experiment can be replicated, or the sparsity-of-effects principle can often be exploited. Replication is more common for small experiments and is a very reliable way of assessing experimental error. When the number of factors is large (typically more than about 5 factors, but this does vary by application), replication of the design can become operationally difficult. In these cases, it is common to only run a single replicate of the design, and to assume that factor interactions of more than a certain order (say, between three or more factors) are negligible. Under this assumption, estimates of such high order interactions are estimates of an exact zero, thus really an estimate of experimental error. When there are many factors, many experimental runs will be necessary, even without replication. For example, experimenting with 10 factors at two levels each produces 210=1024 combinations. At some point this becomes infeasible due to high cost or insufficient resources. In this case, fractional factorial designs may be used. As with any statistical experiment, the experimental runs in a factorial experiment should be randomized to reduce the impact that bias could have on the experimental results. In practice, this can be a large operational challenge. Factorial experiments can be used when there are more than two levels of each factor. However, the number of experimental runs required for three-level (or more) factorial designs will be considerably greater than for their two-level counterparts. Factorial designs are therefore less attractive if a researcher wishes to consider more than two levels. 23 Analysis A factorial experiment can be analyzed using ANOVA or regression analysis. It is relatively easy to estimate the main effect for a factor. To compute the main effect of a factor "A", subtract the average response of all experimental runs for which A was at its low (or first) level from the average response of all experimental runs for which A was at its high (or second) level. Other useful exploratory analysis tools for factorial experiments include main effects plots, interaction plots, and a normal probability plot of the estimated effects. When the factors are continuous, two-level factorial designs assume that the effects are linear. If a quadratic effect is expected for a factor, a more complicated experiment should be used, such as a central composite design. Optimization of factors that could have quadratic effects is the primary goal of response surface methodology. 24