Randomized experiment

An experiment is an orderly procedure carried out with the goal of verifying, refuting, or
establishing the validity of a hypothesis. Experiments provide insight into cause-and-effect by
demonstrating what outcome occurs when a particular factor is manipulated. Experiments
vary greatly in their goal and scale, but always rely on repeatable procedure and logical
analysis of the results. There also exist natural experimental studies.
A child may carry out basic experiments to understand the nature of gravity, while teams of
scientists may take years of systematic investigation to advance the understanding of a
phenomenon. Experiments and other types of hands-on activities are very important to student
learning in the science classroom. Experiments can raise test scores and help a student
become more engaged and interested in the material they are learning, especially when used
over time. Experiments can vary from personal and informal natural comparisons (e.g. tasting
a range of chocolates to find a favorite), to highly controlled (e.g. tests requiring complex
apparatus overseen by many scientists that hope to discover information about subatomic
particles). Uses of experiments vary considerably between the natural and human sciences.
Experiments typically include controls, which are designed to minimize the effects of
variables other than the single independent variable. This increases the reliability of the
results, often through a comparison between control measurements and the other
measurements. Scientific controls are a part of the scientific method. Ideally, all variables in
an experiment will be controlled (accounted for by the control measurements) and none will
be uncontrolled. In such an experiment, if all the controls work as expected, it is possible to
conclude that the experiment is working as intended and that the results of the experiment are
due to the effect of the variable being tested.
Overview
In the scientific method, an experiment is an empirical procedure that arbitrates between
competing models or hypotheses. Researchers also use experimentation to test existing
theories or new hypotheses to support or disprove them.
An experiment usually tests a hypothesis, which is an expectation about how a particular
process or phenomenon works. However, an experiment may also aim to answer a "what-if"
question, without a specific expectation about what the experiment will reveal, or to confirm
prior results. If an experiment is carefully conducted, the results usually either support or
disprove the hypothesis. According to some philosophies of science, an experiment can never
"prove" a hypothesis; it can only add support. Similarly, an experiment that provides a
counterexample can disprove a theory or hypothesis. An experiment must also control the
possible confounding factors—any factors that would mar the accuracy or repeatability of the
experiment or the ability to interpret the results. Confounding is commonly eliminated
through scientific controls and/or, in randomized experiments, through random assignment.
In engineering and other physical sciences, experiments are a primary component of the
scientific method. They are used to test theories and hypotheses about how physical processes
work under particular conditions (e.g., whether a particular engineering process can produce a
desired chemical compound). Typically, experiments in these fields will focus on replication
of identical procedures in hopes of producing identical results in each replication. Random
assignment is uncommon.
In medicine and the social sciences, the prevalence of experimental research varies widely
across disciplines. When used, however, experiments typically follow the form of the clinical
trial, where experimental units (usually individual human beings) are randomly assigned to a
treatment or control condition where one or more outcomes are assessed. In contrast to norms
in the physical sciences, the focus is typically on the average treatment effect (the difference
in outcomes between the treatment and control groups) or another test statistic produced by
the experiment. A single study will typically not involve replications of the experiment, but
separate studies may be aggregated through systematic review and meta-analysis.
Of course, these differences between experimental practice in each of the branches of science
have exceptions. For example, agricultural research frequently uses randomized experiments
(e.g., to test the comparative effectiveness of different fertilizers). Similarly, experimental
economics often involves experimental tests of theorized human behaviors without relying on
random assignment of individuals to treatment and control conditions.
History
Francis Bacon (1561–1626), an English philosopher and scientist active in the 17th century,
became an early and influential supporter of experimental science. He disagreed with the
method of answering scientific questions by deduction and described it as follows: "Having
first determined the question according to his will, man then resorts to experience, and
bending her to conformity with his placets, leads her about like a captive in a procession."
Bacon wanted a method that relied on repeatable observations, or experiments. Notably, he
was the first to order the scientific method as we understand it today.
There remains simple experience; which, if taken as it comes, is called accident, if sought for,
experiment. The true method of experience first lights the candle [hypothesis], and then by
means of the candle shows the way [arranges and delimits the experiment]; commencing as it
does with experience duly ordered and digested, not bungling or erratic, and from it deducing
axioms [theories], and from established axioms again new experiments.
— Francis Bacon. Novum Organum. 1620.
In the centuries that followed, people who applied the scientific method in different areas
made important advances and discoveries. For example, Galileo Galilei (1564–1642)
accurately measured time and experimented to make accurate measurements and conclusions
about the speed of a falling body. Antoine Lavoisier (1743–1794), a French chemist, used
experiment to describe new areas, such as combustion and biochemistry, and to develop the
theory of conservation of mass (matter). Louis Pasteur (1822–1895) used the scientific method
to disprove the prevailing theory of spontaneous generation and to develop the germ theory of
disease. Because of the importance of controlling potentially confounding variables, the use
of well-designed laboratory experiments is preferred when possible.
A considerable amount of progress on the design and analysis of experiments occurred in the
early 20th century, with contributions from statisticians such as Ronald Fisher (1890–1962),
Jerzy Neyman (1894–1981), Oscar Kempthorne (1919–2000), Gertrude Mary Cox (1900–1978), and William Gemmell Cochran (1909–1980), among others. This early work has
largely been synthesized under the label of the Rubin causal model, which formalizes earlier
statistical approaches to the analysis of experiments.
TYPES OF EXPERIMENT
Experiments might be categorized according to a number of dimensions, depending upon
professional norms and standards in different fields of study. In some disciplines (e.g.,
Psychology or Political Science), a 'true experiment' is a method of social research in which
there are two kinds of variables. The independent variable is manipulated by the
experimenter, and the dependent variable is measured. The signifying characteristic of a true
experiment is that it randomly allocates the subjects in order to neutralize the potential for
experimenter bias, and ensures, over a large number of iterations of the experiment, that all
confounding factors are controlled for.
CONTROLLED EXPERIMENTS
A controlled experiment often compares the results obtained from experimental samples
against control samples, which are practically identical to the experimental sample except for
the one aspect whose effect is being tested (the independent variable). A good example would
be a drug trial. The sample or group receiving the drug would be the experimental group
(treatment group); and the one receiving the placebo or regular treatment would be the control
group. In many laboratory experiments it is good practice to have several replicate samples for
the test being performed and have both a positive control and a negative control. The results
from replicate samples can often be averaged, or if one of the replicates is obviously
inconsistent with the results from the other samples, it can be discarded as being the result of
an experimental error (some step of the test procedure may have been mistakenly omitted for
that sample). Most often, tests are done in duplicate or triplicate. A positive control is a
procedure that is very similar to the actual experimental test but which is known from
previous experience to give a positive result. A negative control is known to give a negative
result. The positive control confirms that the basic conditions of the experiment were able to
produce a positive result, even if none of the actual experimental samples produce a positive
result. The negative control demonstrates the base-line result obtained when a test does not
produce a measurable positive result. Most often the value of the negative control is treated as
a "background" value to subtract from the test sample results. Sometimes the positive control
takes the form of a standard curve.
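As an illustration of how control measurements enter the arithmetic, here is a minimal Python sketch with invented readings: replicates are averaged, an obviously inconsistent replicate is discarded, the positive control is checked, and the negative-control reading is subtracted as background.

    import statistics

    negative_control = [0.05, 0.06]        # readings with no analyte (background)
    positive_control = [1.10, 1.08]        # readings known to give a positive result
    test_replicates = [0.52, 0.50, 0.95]   # triplicate readings of the test sample

    background = statistics.mean(negative_control)

    # The positive control must clearly exceed background; otherwise the
    # basic conditions of the assay are suspect.
    assert statistics.mean(positive_control) > background

    # Discard a replicate that is obviously inconsistent with the others
    # (here: more than 50% away from the median), then average the rest.
    median = statistics.median(test_replicates)
    kept = [r for r in test_replicates if abs(r - median) / median < 0.5]
    corrected = statistics.mean(kept) - background
    print(f"background-corrected signal: {corrected:.3f}")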
An example that is often used in teaching laboratories is a controlled protein assay. Students
might be given a fluid sample containing an unknown (to the student) amount of protein. It is
their job to correctly perform a controlled experiment in which they determine the
concentration of protein in the fluid sample (usually called the "unknown sample"). The teaching
lab would be equipped with a protein standard solution with a known protein concentration.
Students could make several positive control samples containing various dilutions of the
protein standard. Negative control samples would contain all of the reagents for the protein
assay but no protein. In this example, all samples are performed in duplicate. The assay is a
colorimetric assay in which a spectrophotometer can measure the amount of protein in
samples by detecting a colored complex formed by the interaction of protein molecules and
molecules of an added dye. The results for the diluted test samples can be compared to the
results of the standard curve in order to determine an estimate of the amount of protein in the
unknown sample.
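A minimal sketch of that computation, with invented absorbance values: a straight line is fitted to the known standards and inverted to estimate the unknown sample's concentration (a real assay may call for a nonlinear fit).

    import numpy as np

    std_conc = np.array([0.0, 0.25, 0.5, 1.0, 2.0])      # protein standard, mg/mL
    std_abs = np.array([0.02, 0.15, 0.28, 0.55, 1.08])   # measured absorbance

    slope, intercept = np.polyfit(std_conc, std_abs, deg=1)  # the standard curve

    unknown_abs = np.mean([0.40, 0.42])     # duplicate readings of the unknown
    unknown_conc = (unknown_abs - intercept) / slope
    print(f"estimated protein concentration: {unknown_conc:.2f} mg/mL")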
Controlled experiments can be performed when it is difficult to exactly control all the
conditions in an experiment. In this case, the experiment begins by creating two or more
sample groups that are probabilistically equivalent, which means that measurements of traits
should be similar among the groups and that the groups should respond in the same manner if
given the same treatment. This equivalency is determined by statistical methods that take into
account the amount of variation between individuals and the number of individuals in each
group. In fields such as microbiology and chemistry, where there is very little variation
between individuals and the group size is easily in the millions, these statistical methods are
often bypassed and simply splitting a solution into equal parts is assumed to produce identical
sample groups.
Once equivalent groups have been formed, the experimenter tries to treat them identically
except for the one variable that he or she wishes to isolate. Human experimentation requires
special safeguards against outside variables such as the placebo effect. Such experiments are
generally double blind, meaning that neither the volunteer nor the researcher knows which
individuals are in the control group or the experimental group until after all of the data have
been collected. This ensures that any effects on the volunteer are due to the treatment itself
and are not a response to the knowledge that he is being treated.
In human experiments, researchers may give a subject (person) a stimulus that the subject
responds to. The goal of the experiment is to measure the response to the stimulus by a test
method.
In the design of experiments, two or more "treatments" are applied to estimate the difference
between the mean responses for the treatments. For example, an experiment on baking bread
could estimate the difference in the responses associated with quantitative variables, such as
the ratio of water to flour, and with qualitative variables, such as strains of yeast.
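As a hedged sketch of such an analysis, the following fragment (with invented loaf-volume data) estimates the difference between the mean responses for two yeast strains and attaches a two-sample t-test:

    import numpy as np
    from scipy import stats

    strain_a = np.array([510, 524, 498, 531, 515])   # loaf volume, mL
    strain_b = np.array([545, 538, 552, 529, 560])

    difference = strain_b.mean() - strain_a.mean()   # estimated treatment effect
    t, p = stats.ttest_ind(strain_b, strain_a)
    print(f"difference in means: {difference:.1f} mL (t = {t:.2f}, p = {p:.3f})")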
Experimentation is the step in the scientific method that helps people decide between two or
more competing explanations – or hypotheses. These hypotheses suggest reasons to explain a
phenomenon, or predict the results of an action. An example might be the hypothesis that "if I
release this ball, it will fall to the floor": this suggestion can then be tested by carrying out the
experiment of letting go of the ball, and observing the results. Formally, a hypothesis is
compared against its opposite or null hypothesis ("if I release this ball, it will not fall to the
floor"). The null hypothesis is that there is no explanation or predictive power of the
phenomenon through the reasoning that is being investigated. Once hypotheses are defined,
an experiment can be carried out - and the results analysed - in order to confirm, refute, or
define the accuracy of the hypotheses.
NATURAL EXPERIMENTS
The term "experiment" usually implies a controlled experiment, but sometimes controlled
experiments are prohibitively difficult or impossible. In this case researchers resort to natural
experiments or quasi-experiments. Natural experiments rely solely on observations of the
variables of the system under study, rather than manipulation of just one or a few variables as
occurs in controlled experiments. To the degree possible, they attempt to collect data for the
system in such a way that contribution from all variables can be determined, and where the
effects of variation in certain variables remain approximately constant so that the effects of
other variables can be discerned.
The degree to which this is possible depends on the observed correlation between explanatory
variables in the observed data. When these variables are not well correlated, natural
experiments can approach the power of controlled experiments. Usually, however, there is
some correlation between these variables, which reduces the reliability of natural experiments
relative to what could be concluded if a controlled experiment were performed. Also, because
natural experiments usually take place in uncontrolled environments, variables from
undetected sources are neither measured nor held constant, and these may produce illusory
correlations in variables under study.
Much research in several important science disciplines, including economics, political
science, geology, paleontology, ecology, meteorology, and astronomy, relies on quasi-experiments. For example, in astronomy it is clearly impossible, when testing the hypothesis
"suns are collapsed clouds of hydrogen", to start out with a giant cloud of hydrogen, and then
perform the experiment of waiting a few billion years for it to form a sun.
However, by observing various clouds of hydrogen in various states of collapse, and other
implications of the hypothesis (for example, the presence of various spectral emissions from
the light of stars), we can collect data we require to support the hypothesis. An early example
of this type of experiment was the first verification in the 17th century that light does not
travel from place to place instantaneously, but instead has a measurable speed.
Observations of the appearance of the moons of Jupiter were slightly delayed when Jupiter was
farther from Earth, as opposed to when Jupiter was closer to Earth; and this phenomenon was
used to demonstrate that the difference in the time of appearance of the moons was consistent
with a measurable speed.
FIELD EXPERIMENTS
Field experiments are so named in order to draw a contrast with laboratory experiments,
which enforce scientific control by testing a hypothesis in the artificial and highly controlled
setting of a laboratory. Often used in the social sciences, and especially in economic analyses
of education and health interventions, field experiments have the advantage that outcomes are
observed in a natural setting rather than in a contrived laboratory environment.
For this reason, field experiments are sometimes seen as having higher external validity than
laboratory experiments. However, like natural experiments, field experiments suffer from the
possibility of contamination: experimental conditions can be controlled with more precision
and certainty in the lab. Yet some phenomena (e.g., voter turnout in an election) cannot be
easily studied in a laboratory.
Contrast with observational study
An observational study is used when it is impractical, unethical, cost-prohibitive (or otherwise
inefficient) to fit a physical or social system into a laboratory setting, to completely control
confounding factors, or to apply random assignment. It can also be used when confounding
factors are either limited or known well enough to analyze the data in light of them (though
this may be rare when social phenomena are under examination). In order for an observational
science to be valid, confounding factors must be known and accounted for. In these situations,
observational studies have value because they often suggest hypotheses that can be tested with
randomized experiments or by collecting fresh data.
Fundamentally, however, observational studies are not experiments. By definition,
observational studies lack the manipulation required for Baconian experiments. In addition,
observational studies (e.g., in biological or social systems) often involve variables that are
difficult to quantify or control. Observational studies are limited because they lack the
statistical properties of randomized experiments. In a randomized experiment, the method of
randomization specified in the experimental protocol guides the statistical analysis, which is
usually specified also by the experimental protocol.
Without a statistical model that reflects an objective randomization, the statistical analysis
relies on a subjective model. Inferences from subjective models are unreliable in theory and
practice. In fact, there are several cases where carefully conducted observational studies
consistently give wrong results, that is, where the results of the observational studies are
inconsistent and also differ from the results of experiments. For example, epidemiological
studies of colon cancer consistently show beneficial correlations with broccoli consumption,
while experiments find no benefit.
A particular problem with observational studies involving human subjects is the great
difficulty attaining fair comparisons between treatments (or exposures), because such studies
are prone to selection bias, and groups receiving different treatments (exposures) may differ
greatly according to their covariates (age, height, weight, medications, exercise, nutritional
status, ethnicity, family medical history, etc.). In contrast, randomization implies that for each
covariate, the mean for each group is expected to be the same. For any randomized trial, some
variation from the mean is expected, of course, but the randomization ensures that the
experimental groups have mean values that are close, due to the central limit theorem and
Markov's inequality.
With inadequate randomization or low sample size, the systematic variation in covariates
between the treatment groups (or exposure groups) makes it difficult to separate the effect of
the treatment (exposure) from the effects of the other covariates, most of which have not been
measured. The mathematical models used to analyze such data must consider each differing
covariate (if measured), and the results will not be meaningful if a covariate is neither
randomized nor included in the model.
To avoid conditions that render an experiment far less useful, physicians conducting medical
trials, say for U.S. Food and Drug Administration approval, will quantify and randomize the
covariates that can be identified. Researchers attempt to reduce the biases of observational
studies with complicated statistical methods such as propensity score matching methods,
which require large populations of subjects and extensive information on covariates.
Outcomes are also quantified when possible (bone density, the amount of some cell or
substance in the blood, physical strength or endurance, etc.) and not based on a subject's or a
professional observer's opinion. In this way, the design of an observational study can render
the results more objective and therefore, more convincing.
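A hedged sketch of one such method, propensity score matching, on simulated data (the data-generating step builds in selection bias on the first covariate): a logistic regression estimates each subject's probability of treatment, and each treated subject is paired with the control subject whose estimated score is closest.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    covariates = rng.normal(size=(500, 3))           # e.g., age, weight, ...
    p_treat = 1 / (1 + np.exp(-covariates[:, 0]))    # built-in selection bias
    treated = rng.binomial(1, p_treat)

    model = LogisticRegression().fit(covariates, treated)
    score = model.predict_proba(covariates)[:, 1]    # propensity scores

    treated_idx = np.flatnonzero(treated == 1)
    control_idx = np.flatnonzero(treated == 0)
    matches = {i: control_idx[np.argmin(np.abs(score[control_idx] - score[i]))]
               for i in treated_idx}
    print(len(matches), "matched treated-control pairs")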
Ethics
By placing the distribution of the independent variable(s) under the control of the researcher,
an experiment - particularly when it involves human subjects - introduces potential ethical
considerations, such as balancing benefit and harm, fairly distributing interventions (e.g.,
treatments for a disease), and informed consent. For example, in psychology or health care, it
is unethical to provide a substandard treatment to patients.
Therefore, ethical review boards are supposed to stop clinical trials and other experiments
unless a new treatment is believed to offer benefits as good as current best practice. It is also
generally unethical (and often illegal) to conduct randomized experiments on the effects of
substandard or harmful treatments, such as the effects of ingesting arsenic on human health.
To understand the effects of such exposures, scientists sometimes use observational studies to
understand the effects of those factors.
Even when experimental research does not directly involve human subjects, it may still
present ethical concerns. For example, the nuclear bomb experiments conducted by the
Manhattan Project implied the use of nuclear reactions to harm human beings even though the
experiments did not directly involve any human subjects.
Design of experiments
Design of experiments with full factorial design (left), response surface with second-degree
polynomial (right)
In general usage, design of experiments (DOE) or experimental design is the design of any
information-gathering exercises where variation is present, whether under the full control of
the experimenter or not. However, in statistics, these terms are usually used for controlled
experiments. Formal planned experimentation is often used in evaluating physical objects,
chemical formulations, structures, components, and materials.
Other types of study, and their design, are discussed in the articles on computer experiments,
opinion polls and statistical surveys (which are types of observational study), natural
experiments and quasi-experiments (for example, quasi-experimental design). See Experiment
for the distinction between these types of experiments or studies.
In the design of experiments, the experimenter is often interested in the effect of some process
or intervention (the "treatment") on some objects (the "experimental units"), which may be
people, parts of people, groups of people, plants, animals, etc. Design of experiments is thus a
discipline that has very broad application across all the natural and social sciences and
engineering.
History
SYSTEMATIC CLINICAL TRIALS
In 1747, while serving as surgeon on HMS Salisbury, James Lind carried out a systematic
clinical trial to compare remedies for scurvy. Lind selected 12 men from the ship, all
suffering from scurvy.
Lind limited his subjects to men who "were as similar as I could have them", that is, he
provided strict entry requirements to reduce extraneous variation. He divided them into six
pairs, giving each pair different supplements to their basic diet for two weeks.
The treatments were all remedies that had been proposed:
• A quart of cider every day
• Twenty five gutts (drops) of vitriol (sulphuric acid) three times a day upon an empty stomach
• One half-pint of seawater every day
• A mixture of garlic, mustard, and horseradish in a lump the size of a nutmeg
• Two spoonfuls of vinegar three times a day
• Two oranges and one lemon every day
The citrus treatment stopped after six days when they ran out of fruit, but by that time one
sailor was fit for duty while the other had almost recovered. Apart from that, only group one
(cider) showed some effect of its treatment. The remainder of the crew presumably served as
a control, but Lind did not report results from any control (untreated) group.
STATISTICAL EXPERIMENTS, FOLLOWING CHARLES S. PEIRCE
A theory of statistical inference was developed by Charles S. Peirce in "Illustrations of the
Logic of Science" (1877–1878) and "A Theory of Probable Inference" (1883), two
publications that emphasized the importance of randomization-based inference in statistics.
RANDOMIZED EXPERIMENTS
Charles S. Peirce randomly assigned volunteers to a blinded, repeated-measures design to
evaluate their ability to discriminate weights. Peirce's experiment inspired other researchers in
psychology and education, who developed a research tradition of randomized experiments
in laboratories and specialized textbooks in the 1800s.
OPTIMAL DESIGNS FOR REGRESSION MODELS
Charles S. Peirce also contributed the first English-language publication on an optimal design
for regression models in 1876. A pioneering optimal design for polynomial regression was
suggested by Gergonne in 1815. In 1918 Kirstine Smith published optimal designs for
polynomials of degree six (and less).
Sequences of experiments
The use of a sequence of experiments, where the design of each may depend on the results of
previous experiments, including the possible decision to stop experimenting, is within the
scope of Sequential analysis, a field that was pioneered by Abraham Wald in the context of
sequential tests of statistical hypotheses.
Herman Chernoff wrote an overview of optimal sequential designs, while adaptive designs
have been surveyed by S. Zacks. One specific type of sequential design is the "two-armed
bandit", generalized to the multi-armed bandit, on which early work was done by Herbert
Robbins in 1952.
Principles of experimental design, following Ronald A. Fisher
A methodology for designing experiments was proposed by Ronald A. Fisher, in his
innovative books: "The Arrangement of Field Experiments" (1926) and The Design of
Experiments (1935). Much of his pioneering work dealt with agricultural applications of
statistical methods. As a mundane example, he described how to test the hypothesis that a
certain lady could distinguish by flavour alone whether the milk or the tea was first placed in
the cup (AKA the "Lady tasting tea" experiment). These methods have been broadly adapted
in the physical and social sciences, and are still used in agricultural engineering.
The concepts presented here differ from the design and analysis of computer experiments.
COMPARISON
In some fields of study it is not possible to have independent measurements to a traceable
standard. Comparisons between treatments are much more valuable and are usually preferable.
Often one compares against a scientific control or traditional treatment that acts as baseline.
RANDOMIZATION
Random assignment is the process of assigning individuals at random to groups or to different
groups in an experiment. The random assignment of individuals to groups (or conditions
within a group) distinguishes a rigorous, "true" experiment from an observational study or
"quasi-experiment". There is an extensive body of mathematical theory that explores the
consequences of making the allocation of units to treatments by means of some random
mechanism such as tables of random numbers, or the use of randomization devices such as
playing cards or dice. Assigning units to treatments at random tends to mitigate confounding,
which makes effects due to factors other than the treatment appear to result from the
treatment.
The risks associated with random allocation (such as having a serious imbalance in a key
characteristic between a treatment group and a control group) are calculable and hence can be
managed down to an acceptable level by using enough experimental units. The results of an
experiment can be generalized reliably from the experimental units to a larger population of
units only if the experimental units are a random sample from the larger population; the
probable error of such an extrapolation depends on the sample size, among other things.
Random does not mean haphazard, and great care must be taken that appropriate random
methods are used.
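A minimal sketch of a completely randomized allocation in Python (unit and treatment names are invented); a seeded generator stands in for the tables of random numbers mentioned above.

    import random

    units = list(range(24))           # experimental unit identifiers
    treatments = ["A", "B", "C"]

    rng = random.Random(42)           # a proper random mechanism, seeded
    rng.shuffle(units)

    size = len(units) // len(treatments)
    allocation = {t: sorted(units[i * size:(i + 1) * size])
                  for i, t in enumerate(treatments)}
    print(allocation)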
REPLICATION
Measurements are usually subject to variation and uncertainty. Measurements are repeated and
full experiments are replicated to help identify the sources of variation, to better estimate the
true effects of treatments, to further strengthen the experiment's reliability and validity, and to
add to the existing knowledge of the topic. However, certain conditions must be met before
the replication of the experiment is commenced: the original research question has been
published in a peer-reviewed journal or widely cited, the researcher is independent of the
original experiment, the researcher must first try to replicate the original findings using the
original data, and the write-up should state that the study conducted is a replication study that
tried to follow the original study as strictly as possible.
BLOCKING
Blocking is the arrangement of experimental units into groups (blocks/lots) consisting of units
that are similar to one another. Blocking reduces known but irrelevant sources of variation
between units and thus allows greater precision in the estimation of the source of variation
under study.
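A small sketch of randomization within blocks (block and treatment names invented): each block is randomized separately, so every treatment appears in every block.

    import random

    rng = random.Random(7)
    blocks = {"field 1": ["u1", "u2", "u3", "u4"],
              "field 2": ["u5", "u6", "u7", "u8"]}
    treatments = ["control", "fertilizer"]

    for block, units in blocks.items():
        rng.shuffle(units)                       # randomize inside the block only
        half = len(units) // 2
        for t, group in zip(treatments, (units[:half], units[half:])):
            print(block, t, group)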
ORTHOGONALITY
Orthogonality concerns the forms of comparison (contrasts) that can be legitimately and
efficiently carried out. Contrasts can be represented by vectors and sets of orthogonal contrasts
are uncorrelated and independently distributed if the data are normal. Because of this
independence, each orthogonal treatment provides different information to the others. If there
are T treatments and T – 1 orthogonal contrasts, all the information that can be captured from
the experiment is obtainable from the set of contrasts.
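A small numerical illustration with T = 3 treatments, so T − 1 = 2 orthogonal contrasts; orthogonality shows up as a zero dot product between the contrast vectors.

    import numpy as np

    c1 = np.array([1, -1, 0])    # treatment 1 versus treatment 2
    c2 = np.array([1, 1, -2])    # average of 1 and 2 versus treatment 3

    print(np.dot(c1, c2))        # 0: the contrasts are orthogonal

    means = np.array([5.0, 6.0, 9.0])   # hypothetical treatment means
    print(c1 @ means, c2 @ means)       # each contrast captures separate information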
FACTORIAL EXPERIMENTS
Factorial experiments are used instead of the one-factor-at-a-time method. These are efficient at
evaluating the effects and possible interactions of several factors (independent variables).
Analysis of experiment design is built on the foundation of the analysis of variance, a
collection of models that partition the observed variance into components, according to what
factors the experiment must estimate or test.
Example
Weights of eight objects are measured using a pan balance and set of standard weights. Each
weighing measures the weight difference between objects in the left pan vs. any objects in the
right pan by adding calibrated weights to the lighter pan until the balance is in equilibrium.
Each measurement has a random error. The average error is zero; the standard deviation of
the probability distribution of the errors is the same number σ on different weighings; and
errors on different weighings are independent.
Denote the true weights by θ1, θ2, ..., θ8.
We consider two different experiments:
1. Weigh each object in one pan, with the other pan empty. Let Xi be the measured weight of the
ith object, for i = 1, ..., 8.
2. Do the eight weighings according to the following schedule and let Yi be the measured
difference for i = 1, ..., 8:

    Weighing   Left pan           Right pan
    1st        1 2 3 4 5 6 7 8    (empty)
    2nd        1 2 3 8            4 5 6 7
    3rd        1 4 5 8            2 3 6 7
    4th        1 6 7 8            2 3 4 5
    5th        2 4 6 8            1 3 5 7
    6th        2 5 7 8            1 3 4 6
    7th        3 4 7 8            1 2 5 6
    8th        3 5 6 8            1 2 4 7

Then the estimated value of the weight θ1 is

    θ̂1 = (Y1 + Y2 + Y3 + Y4 − Y5 − Y6 − Y7 − Y8) / 8.

Similar estimates can be found for the weights of the other items. For example

    θ̂2 = (Y1 + Y2 − Y3 − Y4 + Y5 + Y6 − Y7 − Y8) / 8.
The question of design of experiments is: which experiment is better?
The variance of the estimate X1 of θ1 is σ² if we use the first experiment. But if we use the
second experiment, the variance of the estimate given above is σ²/8. Thus the second
experiment gives us 8 times as much precision for the estimate of a single item, and estimates
all items simultaneously, with the same precision.
What the second experiment achieves with eight weighings would require 64 weighings if the
items are weighed separately. However, note that the estimates for the items obtained in the
second experiment all depend on the same eight measurements.
Many problems of the design of experiments involve combinatorial designs, as in this
example and others.
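The claimed variance can be checked numerically. The sketch below encodes the schedule above as a sign matrix (+1 for the left pan, −1 for the right pan), simulates measurement errors with σ = 1, and recovers a variance near σ²/8 for each estimate.

    import numpy as np

    # Sign matrix of the second experiment: row = weighing, column = object.
    D = np.array([[ 1,  1,  1,  1,  1,  1,  1,  1],
                  [ 1,  1,  1, -1, -1, -1, -1,  1],
                  [ 1, -1, -1,  1,  1, -1, -1,  1],
                  [ 1, -1, -1, -1, -1,  1,  1,  1],
                  [-1,  1, -1,  1, -1,  1, -1,  1],
                  [-1,  1, -1, -1,  1, -1,  1,  1],
                  [-1, -1,  1,  1, -1, -1,  1,  1],
                  [-1, -1,  1, -1,  1,  1, -1,  1]])

    print((D.T @ D == 8 * np.eye(8)).all())   # True: the columns are orthogonal

    rng = np.random.default_rng(0)
    theta = np.arange(1.0, 9.0)                        # hypothetical true weights
    Y = D @ theta + rng.normal(0, 1, size=(20000, 8))  # 20,000 simulated runs
    estimates = Y @ D / 8                   # theta_hat_j = (column j of D) . Y / 8
    print(estimates.var(axis=0).round(3))   # each close to 1/8 = sigma^2/8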
Avoiding false positives
False positive conclusions, often resulting from the pressure to publish or the author's own
confirmation bias, are an inherent hazard in many fields, and experimental designs with
undisclosed degrees of freedom are a problem. This can lead to conscious or unconscious "p-hacking": trying multiple things until you get the desired result. It typically involves the
manipulation - perhaps unconsciously - of the process of statistical analysis and the degrees of
freedom until they return a figure below the p<.05 level of statistical significance. So the
design of the experiment should include a clear statement proposing the analyses to be
undertaken.
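An illustrative simulation (not from the text) of why undisclosed degrees of freedom matter: if ten outcome variables are tested and only the "best" one is reported, the false-positive rate rises far above the nominal 5%.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    trials, n_outcomes, false_positives = 1000, 10, 0

    for _ in range(trials):
        # The treatment has NO real effect on any of the ten outcomes.
        control = rng.normal(size=(n_outcomes, 30))
        treated = rng.normal(size=(n_outcomes, 30))
        pvals = [stats.ttest_ind(t, c).pvalue for t, c in zip(treated, control)]
        if min(pvals) < 0.05:      # report whichever outcome "worked"
            false_positives += 1

    print(false_positives / trials)   # roughly 1 - 0.95**10 = 0.40, not 0.05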
Clear and complete documentation of the experimental methodology is also important in
order to support replication of results.
Discussion topics when setting up an experimental design
An experimental design or randomized clinical trial requires careful consideration of several
factors before actually doing the experiment. An experimental design is the laying out of a
detailed experimental plan in advance of doing the experiment. Some of the following topics
have already been discussed in the principles of experimental design section:
1. How many factors does the design have, and are the levels of these factors fixed or
random?
2. Are control conditions needed, and what should they be?
3. Manipulation checks: did the manipulation really work?
4. What are the background variables?
5. What is the sample size? How many units must be collected for the experiment to be
generalisable and have enough power?
6. What is the relevance of interactions between factors?
7. What is the influence of delayed effects of substantive factors on outcomes?
8. How do response shifts affect self-report measures?
9. How feasible is repeated administration of the same measurement instruments to the
same units at different occasions, with a post-test and follow-up tests?
10. What about using a proxy pretest?
11. Are there lurking variables?
12. Should the client/patient, researcher or even the analyst of the data be blind to
conditions?
13. What is the feasibility of subsequent application of different conditions to the same
units?
14. How many of each control and noise factors should be taken into account?
Statistical control
It is best that a process be in reasonable statistical control prior to conducting designed
experiments. When this is not possible, proper blocking, replication, and randomization allow
for the careful conduct of designed experiments. To control for nuisance variables,
researchers institute control checks as additional measures. Investigators should ensure that
uncontrolled influences (e.g., source credibility perception) do not skew the findings of the
study. A manipulation check is one example of a control check. Manipulation checks allow
investigators to isolate the chief variables to strengthen support that these variables are
operating as planned.
One of the most important requirements of experimental research designs is the necessity of
eliminating the effects of spurious, intervening, and antecedent variables. In the most basic
model, cause (X) leads to effect (Y). But there could be a third variable (Z) that influences
(Y), and X might not be the true cause at all. Z is said to be a spurious variable and must be
controlled for. The same is true for intervening variables (a variable in between the supposed
cause (X) and the effect (Y)), and anteceding variables (a variable prior to the supposed cause
(X) that is the true cause). When a third variable is involved and has not been controlled for,
the relation is said to be a zero order relationship. In most practical applications of
experimental research designs there are several causes (X1, X2, X3). In most designs, only
one of these causes is manipulated at a time.
Experimental designs after Fisher
Some efficient designs for estimating several main effects were found independently and in
near succession by Raj Chandra Bose and K. Kishen in 1940 at the Indian Statistical Institute,
but remained little known until the Plackett-Burman designs were published in Biometrika in
1946. About the same time, C. R. Rao introduced the concepts of orthogonal arrays as
experimental designs. This concept played a central role in the development of Taguchi
methods by Genichi Taguchi, which took place during his visit to the Indian Statistical Institute
in the early 1950s. His methods were successfully applied and adopted by Japanese and Indian
industries and subsequently were also embraced by US industry albeit with some reservations.
In 1950, Gertrude Mary Cox and William Gemmell Cochran published the book Experimental
Designs, which became the major reference work on the design of experiments for
statisticians for years afterwards.
Developments of the theory of linear models have encompassed and surpassed the cases that
concerned early writers. Today, the theory rests on advanced topics in linear algebra, algebra
and combinatorics.
As with other branches of statistics, experimental design is pursued using both frequentist and
Bayesian approaches: In evaluating statistical procedures like experimental designs,
frequentist statistics studies the sampling distribution while Bayesian statistics updates a
probability distribution on the parameter space.
Some important contributors to the field of experimental designs are C. S. Peirce, R. A.
Fisher, F. Yates, C. R. Rao, R. C. Bose, J. N. Srivastava, S. S. Shrikhande, D. Raghavarao, W.
G. Cochran, O. Kempthorne, W. T. Federer, V. V. Fedorov, A. S. Hedayat, J. A. Nelder, R. A.
Bailey, J. Kiefer, W. J. Studden, A. Pázman, F. Pukelsheim, D. R. Cox, H. P. Wynn, A. C.
Atkinson, G. E. P. Box and G. Taguchi. The textbooks of D. Montgomery and R. Myers have
reached generations of students and practitioners.
Human participant experimental design constraints
Laws and ethical considerations preclude some carefully designed experiments with human
subjects. Legal constraints are dependent on jurisdiction. Constraints may involve institutional
review boards, informed consent and confidentiality affecting both clinical (medical) trials
and behavioral and social science experiments. In the field of toxicology, for example,
experimentation is performed on laboratory animals with the goal of defining safe exposure
limits for humans. Balancing the constraints are views from the medical field. Regarding the
randomization of patients, "... if no one knows which therapy is better, there is no ethical
imperative to use one therapy or another." (p 380) Regarding experimental design, "...it is
clearly not ethical to place subjects at risk to collect data in a poorly designed study when this
situation can be easily avoided...". (p 393)
Randomized experiment
Flowchart of four phases (enrollment, intervention allocation, follow-up, and data analysis) of
a parallel randomized trial of two groups, modified from the CONSORT 2010 Statement
In science, randomized experiments are the experiments that allow the greatest reliability
and validity of statistical estimates of treatment effects. Randomization-based inference is
especially important in experimental design and in survey sampling.
Overview
In the statistical theory of design of experiments, randomization involves randomly allocating
the experimental units across the treatment groups. For example, if an experiment compares a
new drug against a standard drug, then the patients should be allocated to either the new drug
or to the standard drug control using randomization.
Randomized experimentation is not haphazard. Randomization reduces bias by equalising
other factors that have not been explicitly accounted for in the experimental design (according
to the law of large numbers). Randomization also produces ignorable designs, which are
valuable in model-based statistical inference, especially Bayesian or likelihood-based. In the
design of experiments, the simplest design for comparing treatments is the "completely
randomized design". Some "restriction on randomization" can occur with blocking and
experiments that have hard-to-change factors; additional restrictions on randomization can
occur when a full randomization is infeasible or when it is desirable to reduce the variance of
estimators of selected effects.
Randomization of treatment in clinical trials poses ethical problems. In some cases,
randomization reduces the therapeutic options for both physician and patient, and so
randomization requires clinical equipoise regarding the treatments.
Online Randomized Controlled Experiments
Web sites can run randomized controlled experiments to create a feedback loop. Key
differences between offline experimentation and online experiments include:
• Logging: user interactions can be logged reliably.
• Number of users: large sites, such as Amazon, Bing/Microsoft, and Google run experiments, each with over a million users.
• Number of concurrent experiments: large sites run tens of overlapping, or concurrent, experiments.
• Robots, whether web crawlers from valid sources or malicious internet bots.
• Ability to ramp-up experiments from low percentages to higher percentages.
• Speed / performance has significant impact on key metrics.
• Ability to use the pre-experiment period as an A/A test to reduce variance.
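As a sketch of the mechanics (function and experiment names are invented), large sites typically randomize by hashing a stable user id, which gives a deterministic, uniform assignment so a returning user always sees the same variant.

    import hashlib

    def assign_variant(user_id: str, experiment: str) -> str:
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100          # uniform bucket in 0..99
        return "treatment" if bucket < 50 else "control"

    print(assign_variant("user-12345", "new-checkout-flow"))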
History
The earliest controlled experiment appears to have been suggested in the Old Testament's
Book of Daniel. King Nebuchadnezzar proposed that some Israelites eat "a daily amount of
food and wine from the king's table." Daniel preferred a vegetarian diet, but the official was
concerned that the king would see him "looking worse than the other young men your age",
and that "The king would then have my head because of you." Daniel then proposed the following
controlled experiment: "Test your servants for ten days. Give us nothing but vegetables to eat
and water to drink. Then compare our appearance with that of the young men who eat the
royal food, and treat your servants in accordance with what you see" (Daniel 1:12–13).
Randomized experiments were institutionalized in psychology and education in the late
eighteen-hundreds, following the invention of randomized experiments by C. S. Peirce.
Outside of psychology and education, randomized experiments were popularized by R.A.
Fisher in his book Statistical Methods for Research Workers, which also introduced additional
principles of experimental design.
Statistical Interpretation
The Rubin Causal Model provides a common way to describe a randomized experiment.
While the Rubin Causal Model provides a framework for defining the causal parameters (i.e.,
the effects of a randomized treatment on an outcome), the analysis of experiments can take a
number of forms. Most commonly, randomized experiments are analyzed using ANOVA,
Student's t-test, Regression analysis, or a similar statistical test.
Empirical evidence that randomization makes a difference
Empirically, differences between randomized and non-randomized studies, and between
adequately and inadequately randomized trials, have been difficult to detect.
Random assignment
RANDOM ASSIGNMENT OR RANDOM PLACEMENT is an experimental technique for assigning
human participants or animal subjects to different groups in an experiment (e.g., a treatment
group versus a control group) using randomization, such as by a chance procedure (e.g.,
flipping a coin) or a random number generator. This ensures that each participant or subject
has an equal chance of being placed in any group. Random assignment of participants helps to
ensure that any differences between and within the groups are not systematic at the outset of
the experiment. Thus, any differences between groups recorded at the end of the experiment
can be more confidently attributed to the experimental procedures or treatment.
Random assignment, blinding, and controlling are key aspects of the design of experiments,
because they help ensure that the results are not spurious or deceptive via confounding. This
is why randomized controlled trials are vital in clinical research, especially ones that can be
double-blinded and placebo-controlled.
Mathematically, there are distinctions between randomization, pseudorandomization, and
quasirandomization, as well as between random number generators and pseudorandom
number generators. How much these differences matter in experiments (such as clinical trials)
is a matter of trial design and statistical rigor, which affect evidence grading. Studies done
without randomization include quasi-experiments and open-label trials; those done with
pseudo- or quasirandomization are usually given nearly the same weight as those with true
randomization but are viewed with a bit more caution.
Benefits of random assignment
Imagine an experiment in which the participants are not randomly assigned; perhaps the first
10 people to arrive are assigned to the Experimental Group, and the last 10 people to arrive
are assigned to the Control group. At the end of the experiment, the experimenter finds
differences between the Experimental group and the Control group, and claims these
differences are a result of the experimental procedure. However, they also may be due to
some other preexisting attribute of the participants, e.g. people who arrive early versus people
who arrive late.
Imagine the experimenter instead uses a coin flip to randomly assign participants. If the coin
lands heads-up, the participant is assigned to the Experimental Group. If the coin lands tails-up, the participant is assigned to the Control Group. At the end of the experiment, the
experimenter finds differences between the Experimental group and the Control group.
Because each participant had an equal chance of being placed in any group, it is unlikely the
differences could be attributable to some other preexisting attribute of the participant, e.g.
those who arrived on time versus late.
Potential issues
Random assignment does not guarantee that the groups are matched or equivalent. The groups
may still differ on some preexisting attribute due to chance. The use of random assignment
cannot eliminate this possibility, but it greatly reduces it.
To express this same idea statistically: if a randomly assigned group is compared to the mean
it may be discovered that they differ, even though they were assigned from the same group. If
a test of statistical significance is applied to randomly assigned groups to test the difference
between sample means against the null hypothesis that they are equal to the same population
mean (i.e., population mean of differences = 0), given the probability distribution, the null
hypothesis will sometimes be "rejected," that is, deemed not plausible. That is, the groups will
be sufficiently different on the variable tested to conclude statistically that they did not come
from the same population, even though, procedurally, they were assigned from the same total
group. For example, using random assignment may create an assignment to groups that has 20
blue-eyed people and 5 brown-eyed people in one group. This is a rare event under random
assignment, but it could happen, and when it does it might add some doubt to the causal agent
in the experimental hypothesis.
Random sampling
Random sampling is a related, but distinct, process. Random sampling refers to recruiting
participants in a way that makes them representative of a larger population. Because most basic
statistical tests require the assumption of an independently and randomly sampled population,
random assignment is the desired assignment method: it provides control for all attributes of
the members of the samples—in contrast to matching on only one or more variables—and
provides the mathematical basis for estimating the likelihood of group equivalence for
characteristics one is interested in, both for pretreatment checks on equivalence and the
evaluation of post-treatment results using inferential statistics. More advanced statistical
modeling can be used to adapt the inference to the sampling method.
History
Randomization was emphasized in the theory of statistical inference of Charles S. Peirce in
"Illustrations of the Logic of Science" (1877–1878) and "A Theory of Probable Inference"
(1883). Peirce applied randomization in the Peirce-Jastrow experiment on weight perception.
Charles S. Peirce randomly assigned volunteers to a blinded, repeated-measures design to
evaluate their ability to discriminate weights. Peirce's experiment inspired other researchers in
psychology and education, who developed a research tradition of randomized experiments
in laboratories and specialized textbooks in the eighteen-hundreds.
Jerzy Neyman advocated randomization in survey sampling (1934) and in experiments
(1923). Ronald A. Fisher advocated randomization in his book on experimental design (1935).
Replication (statistics)
In engineering, science, and statistics, replication is the repetition of an experimental
condition so that the variability associated with the phenomenon can be estimated. ASTM, in
standard E1847, defines replication as "the repetition of the set of all the treatment
combinations to be compared in an experiment. Each of the repetitions is called a replicate."
Replication is not the same as repeated measurements of the same item: they are dealt with
differently in statistical experimental design and data analysis.
For proper sampling, a process or batch of products should be in reasonable statistical control;
inherent random variation is present but variation due to assignable (special) causes is not.
Evaluation or testing of a single item does not allow for item-to-item variation and may not
represent the batch or process. Replication is needed to account for this variation among items
and treatments.
Example
As an example, consider a continuous process which produces items. Batches of items are
then processed or treated. Finally, tests or measurements are conducted. Several options might
be available to obtain ten test values. Some possibilities are:
• One finished and treated item might be measured repeatedly to obtain ten test results. Only one item was measured so there is no replication. The repeated measurements help identify observational error.
• Ten finished and treated items might be taken from a batch and each measured once. This is not full replication because the ten samples are not random and not representative of either the continuous or the batch processing.
• Five items are taken from the continuous process based on sound statistical sampling. These are processed in a batch and tested twice each. This includes replication of initial samples but does not allow for batch-to-batch variation in processing. The repeated tests on each provide some measure and control of testing error.
• Five items are taken from the continuous process based on sound statistical sampling. These are processed in five different batches and tested twice each. This plan includes proper replication of initial samples and also includes batch-to-batch variation. The repeated tests on each provide some measure and control of testing error.
Each option would call for different data analysis methods and yield different conclusions.
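A small numerical sketch (invented measurements) of what the repeated tests buy in the plans above: testing each sampled item twice lets the within-item spread (testing error) be separated from item-to-item variation.

    import numpy as np

    # rows = 5 sampled items, columns = 2 repeated tests per item
    data = np.array([[10.1, 10.3],
                     [ 9.7,  9.6],
                     [10.8, 10.9],
                     [ 9.9, 10.0],
                     [10.4, 10.2]])

    testing_error_var = data.var(axis=1, ddof=1).mean()   # within-item spread
    item_var = data.mean(axis=1).var(ddof=1)   # spread of item means (item-to-item
                                               # variation plus residual testing error)
    print(f"testing error variance ~ {testing_error_var:.3f}")
    print(f"item-to-item variance  ~ {item_var:.3f}")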
Factorial experiment
In statistics, a full factorial experiment is an experiment whose design consists of two or
more factors, each with discrete possible values or "levels", and whose experimental units
take on all possible combinations of these levels across all such factors. A full factorial
design may also be called a fully crossed design. Such an experiment allows the investigator
to study the effect of each factor on the response variable, as well as the effects of interactions
between factors on the response variable.
For the vast majority of factorial experiments, each factor has only two levels. For example,
with two factors each taking two levels, a factorial experiment would have four treatment
combinations in total, and is usually called a 2×2 factorial design.
If the number of combinations in a full factorial design is too high to be logistically feasible, a
fractional factorial design may be done, in which some of the possible combinations (usually
at least half) are omitted.
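Enumerating a fully crossed design is mechanical; this hedged sketch (factor names invented) generates every treatment combination with itertools.product:

    from itertools import product

    factors = {"temperature": [150, 180],    # two levels each: a 2^3 design
               "time": [30, 60],
               "yeast": ["strain 1", "strain 2"]}

    runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]
    print(len(runs), "runs")     # 2 * 2 * 2 = 8
    for run in runs:
        print(run)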
History
Factorial designs were used in the 19th century by John Bennet Lawes and Joseph Henry
Gilbert of the Rothamsted Experimental Station.
Ronald Fisher argued in 1926 that "complex" designs (such as factorial designs) were more
efficient than studying one factor at a time.
Fisher wrote,
"No aphorism is more frequently repeated in connection with field trials, than that we must
ask Nature few questions, or, ideally, one question, at a time. The writer is convinced that this
view is wholly mistaken."
Nature, he suggests, will best respond to "a logical and carefully thought out questionnaire".
A factorial design allows the effect of several factors and even interactions between them to
be determined with the same number of trials as are necessary to determine any one of the
effects by itself with the same degree of accuracy.
Frank Yates made significant contributions, particularly in the analysis of designs, by the
Yates analysis.
The term "factorial" may not have been used in print before 1935, when Fisher used it in his
book The Design of Experiments.
Example
The simplest factorial experiment contains two levels for each of two factors. Suppose an
engineer wishes to study the total power used by each of two different motors, A and B,
running at each of two different speeds, 2000 or 3000 RPM. The factorial experiment would
consist of four experimental units: motor A at 2000 RPM, motor B at 2000 RPM, motor A at
3000 RPM, and motor B at 3000 RPM. Each combination of a single level selected from
every factor is present once.
This experiment is an example of a 2² (or 2×2) factorial experiment, so named because it
considers two levels (the base) for each of two factors (the power or superscript), or
#levels^#factors, producing 2² = 4 factorial points.
Designs can involve many independent variables. As a further example, the effects of three
input variables can be evaluated in eight experimental conditions shown as the corners of a
cube.
This can be conducted with or without replication, depending on its intended purpose and
available resources. It will provide the effects of the three independent variables on the
dependent variable and possible interactions.
Notation
2×2 factorial experiment

           A    B
    (1)    −    −
    a      +    −
    b      −    +
    ab     +    +
The notation used to denote factorial experiments conveys a lot of information. When a
design is denoted a 2³ factorial, this identifies the number of factors (3); how many levels
each factor has (2); and how many experimental conditions there are in the design (2³ = 8).
Similarly, a 2⁵ design has five factors, each with two levels, and 2⁵ = 32 experimental
conditions; and a 3² design has two factors, each with three levels, and 3² = 9 experimental
conditions.
Factorial experiments can involve factors with different numbers of levels. A 2⁴×3 design has
five factors, four with two levels and one with three levels, and has 2⁴ × 3 = 48 experimental
conditions.
To save space, the points in a two-level factorial experiment are often abbreviated with strings
of plus and minus signs. The strings have as many symbols as factors, and their values dictate
the level of each factor: conventionally, − for the first (or low) level, and + for the second (or
high) level. The points in this experiment can thus be represented as −−, +−, −+, and ++.
The factorial points can also be abbreviated by (1), a, b, and ab, where the presence of a letter
indicates that the specified factor is at its high (or second) level and the absence of a letter
indicates that the specified factor is at its low (or first) level (for example, "a" indicates that
factor A is on its high setting, while all other factors are at their low (or first) setting). (1) is
used to indicate that all factors are at their lowest (or first) values.
Implementation
For more than two factors, a 2ᵏ factorial experiment can usually be designed recursively from
a 2ᵏ⁻¹ factorial experiment by replicating the 2ᵏ⁻¹ experiment, assigning the first replicate to
the first (or low) level of the new factor, and the second replicate to the second (or high) level.
This framework can be generalized to, e.g., designing three replicates for three-level factors,
etc.
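The recursive construction just described is a few lines of code; a minimal sketch using −1/+1 level coding:

    def full_factorial(k: int) -> list[list[int]]:
        """All 2^k runs of a two-level factorial, built recursively:
        two copies of the 2^(k-1) design, one per level of the new factor."""
        if k == 0:
            return [[]]
        smaller = full_factorial(k - 1)
        return [run + [level] for level in (-1, +1) for run in smaller]

    for run in full_factorial(3):    # the 2^3 = 8 runs
        print(run)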
A factorial experiment allows for estimation of experimental error in two ways. The
experiment can be replicated, or the sparsity-of-effects principle can often be exploited.
Replication is more common for small experiments and is a very reliable way of assessing
experimental error. When the number of factors is large (typically more than about 5 factors,
but this does vary by application), replication of the design can become operationally difficult.
In these cases, it is common to only run a single replicate of the design, and to assume that
factor interactions of more than a certain order (say, between three or more factors) are
negligible. Under this assumption, estimates of such high order interactions are estimates of
an exact zero, thus really an estimate of experimental error.
When there are many factors, many experimental runs will be necessary, even without
replication. For example, experimenting with 10 factors at two levels each produces 2¹⁰ = 1024
combinations. At some point this becomes infeasible due to high cost or insufficient
resources. In this case, fractional factorial designs may be used.
As with any statistical experiment, the experimental runs in a factorial experiment should be
randomized to reduce the impact that bias could have on the experimental results. In practice,
this can be a large operational challenge.
Factorial experiments can be used when there are more than two levels of each factor.
However, the number of experimental runs required for three-level (or more) factorial designs
will be considerably greater than for their two-level counterparts. Factorial designs are
therefore less attractive if a researcher wishes to consider more than two levels.
Analysis
A factorial experiment can be analyzed using ANOVA or regression analysis. It is relatively
easy to estimate the main effect for a factor. To compute the main effect of a factor "A",
subtract the average response of all experimental runs for which A was at its low (or first)
level from the average response of all experimental runs for which A was at its high (or
second) level.
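For the 2×2 notation above, the main effect of A is the average response at A's high level minus the average at its low level; a minimal sketch with invented responses:

    # Responses for the four runs of a 2x2 design, in standard notation.
    responses = {"(1)": 20.0, "a": 30.0, "b": 22.0, "ab": 34.0}

    a_high = (responses["a"] + responses["ab"]) / 2   # runs with A at its high level
    a_low = (responses["(1)"] + responses["b"]) / 2   # runs with A at its low level
    print("main effect of A:", a_high - a_low)        # 11.0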
Other useful exploratory analysis tools for factorial experiments include main effects plots,
interaction plots, and a normal probability plot of the estimated effects.
When the factors are continuous, two-level factorial designs assume that the effects are linear.
If a quadratic effect is expected for a factor, a more complicated experiment should be used,
such as a central composite design. Optimization of factors that could have quadratic effects is
the primary goal of response surface methodology.