Chapter 6: Basics of Experimentation

Experiment—A test designed to arrive at a causal
explanation (Cook & Campbell, 1979)
Mill (1843)—Joint method of agreement and difference:
causation can be inferred if some result, X, follows an event,
A, if A and X vary together and it can be shown that event
A produces result X
 If A occurs, then so will X, and if A does not occur, then
neither will X
If event B occurs, then X does not occur
Chapter 6 continued:
Tip-of-the-Tongue (TOT) example: X = correct resolution of the TOT state, A
= presenting letter initials, B = repeat question, C = present picture of
the celebrity
Subjects were instructed to name celebrities, and 10.5 instances per subject
resulted in TOT states (the name of the celebrity was on the “tip of the subject’s
tongue,” but they could not actually recall it)
Subjects showed significantly better resolution to TOT states with letter initials as a cue
compared to either repeating the question cue or presenting a picture of the celebrity
This suggests that memory for celebrities is coded using letter-level orthographic
information rather than visual or “data warehouse” related information codes
However, we are not told whether conditions B or C differed significantly from chance—if
they are above chance, then this suggests that this type of coding occurs, but is less common
than orthographic coding
Also, in so-called TOT states, it could have been that the unresolved cases were actually due
to subjects not knowing the name of the celebrity
Chapter 6 continued:
Joint Method of Agreement and Difference continued:
Note that in the real world of science, A does not always produce X, and
the absence of event A does not always fail to produce X (because
science is inductive, or probabilistic rather than deductive)
Thus, the inductive version of Mill’s joint method of agreement and
difference is that Event A (presenting letter initials) produces significantly
more resolution of X than event B (repeating the question) or event C
(presenting a picture of the celebrity)
So, “more” is defined by statistical significance
Statistical significance tests whether two (or more) means differ even when we
consider error variance (noise)
This is why we refer to statistics as the “language of science”
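The idea of a mean difference judged against error variance can be sketched in a few lines. This is only an illustration: the scores are hypothetical (they are not data from the TOT study), and Welch's t statistic is used here as one common way to scale a difference between two means by its noise.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return Welch's t statistic for two independent groups."""
    mean_diff = mean(group_a) - mean(group_b)
    # Standard error of the difference: the "noise" (error variance) term.
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return mean_diff / se

# Hypothetical percent-resolved TOT scores per subject (made up for illustration).
initials = [60, 55, 70, 65, 58, 62]          # condition A: letter initials
repeat_question = [40, 45, 38, 50, 42, 44]   # condition B: repeat the question

t = welch_t(initials, repeat_question)
print(f"t = {t:.2f}")
```

A large t means the difference between the means is large relative to the noise; the t value is then compared with a critical value (or converted to a p value) to decide whether the difference is statistically significant.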
Chapter 6 continued:
In many experiments in psychology, you compare an experimental condition with a neutral
baseline (e.g., repeating a question in our TOT example; although if this question was
coded as a contextual cue with the memory for the celebrity’s name, then this would not
have been a neutral baseline!)
The experimental condition (presenting celebrity’s initials) should show a significantly
larger effect on the DV (percent recall of the celebrity’s name) than the control condition
Experimental control is central to an experiment because it allows the production of
a comparison by controlling the occurrence or nonoccurrence of a variable (while
holding all other possible causes constant so they cannot affect the outcome)
Control has three components:
Comparison (the control condition is used as a comparison)
Production (levels or values of the IV can be produced)
Constancy (the experimental setting can be controlled by holding certain aspects constant)
Chapter 6 continued:
Advantages of Experimentation:
Using animal models can sometimes save money and is considered to be
more ethical by many
E.g., cosmetics are frequently tested on rabbits
But this has been very controversial!
Experimentation has more control than ex post facto research in which
levels of the IV are selected after the fact (selected rather than
manipulated with control)
E.g., research on the health consequences of smoking has been almost entirely
ex post facto
The tobacco lobby actually argued in its defense at trial that cigarette smokers were
more likely to develop lung cancer than non-smokers because smokers were more
neurotic, and that it was really the higher levels of neuroticism that were causing the
cancer risk!
One cannot rule this out because neuroticism was not controlled
Chapter 6 continued:
Variables in Experimentation:
IVs—are manipulated by the experimenter because they are hypothesized to cause
changes on the DV
Failure to find an effect of the IV on the DV is termed a “null result”
This can be due to a lack of an effect, an invalid manipulation, or a lack of statistical power
DV—the performance variable observed and recorded by the experimenter. A good DV
(e.g., RT or accuracy) should be reliable and should not be overly sensitive to floor or
ceiling effects
Floor effect—when it is impossible to do any worse on a task because you are already at the
lowest possible level of performance
Ceiling effect—when it is impossible to improve because you already are at perfect performance
CV—potential IVs that are held constant during an experiment
This is usually because one can only manipulate a small number of variables (usually five or fewer)
in any given experiment
If a potential IV that is not manipulated is not controlled, then it can become a confounded variable
Chapter 6 continued:
Review four examples of experiments from the text
Chapter 6 continued:
More than one IV: A typical experiment will
manipulate 2-4 IVs
This is done because it is more efficient—experimental
control is typically superior with multiple IVs, and the results
can be generalized across a group of IVs rather than just a
single IV
Multiple IVs also allow a researcher to examine both main
effects (an effect of just one IV in isolation) and interactions
(when the effects produced by one IV are not the same across the
levels of a second IV)
Interactions allow us to examine joint effects of multiple IVs and add
increased precision
An interaction takes precedence over main effects
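As a rough illustration of main effects versus an interaction, the sketch below works through a 2 x 2 design. The cell means, factor names, and level labels are all hypothetical.

```python
# Hypothetical cell means (e.g., mean RT in ms) for a 2 x 2 factorial design.
# Keys are (level of IV A, level of IV B).
cell_means = {
    ("a1", "b1"): 500, ("a1", "b2"): 520,
    ("a2", "b1"): 540, ("a2", "b2"): 640,
}

def marginal_mean(factor_index, level):
    """Mean of all cells at one level of a factor, collapsed across the other factor."""
    values = [m for key, m in cell_means.items() if key[factor_index] == level]
    return sum(values) / len(values)

# Main effects: difference between marginal means of one IV.
main_effect_a = marginal_mean(0, "a2") - marginal_mean(0, "a1")
main_effect_b = marginal_mean(1, "b2") - marginal_mean(1, "b1")

# Interaction: does the effect of B differ across the levels of A?
effect_b_at_a1 = cell_means[("a1", "b2")] - cell_means[("a1", "b1")]
effect_b_at_a2 = cell_means[("a2", "b2")] - cell_means[("a2", "b1")]
interaction = effect_b_at_a2 - effect_b_at_a1

print(main_effect_a, main_effect_b, interaction)
```

In these made-up numbers the effect of B is 20 ms at a1 but 100 ms at a2, so the main effect of B (an average of the two) does not describe either level of A well. That is the sense in which an interaction takes precedence over the main effects.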
Chapter 6 continued:
More than one DV: we analyze just one DV at a time in univariate statistics
If we truly wish to analyze two or more DVs at a time, this requires a multivariate
statistical technique (such as MANOVA)
In a MANOVA, we form a composite DV from multiple DVs
But MANOVAs do not tell us whether the pattern of effects is consistent
across DVs
We can use correlations across trial blocks, diffusion models, or entropy/RT
models to look at the overall pattern of results across multiple DVs (e.g., RT and
accuracy)
However, these techniques are complicated
Consequently, most experiments in psychology use a single DV and are analyzed using
univariate statistics
Chapter 6 continued:
Possible sources of experimental error:
Demand characteristics or reactivity—Hawthorne effect (Homans, 1965)
Deception can be used to prevent demand characteristics
Because subjects do not know what is being tested, they cannot be biased through reactivity
However, if an experimenter uses deception, they typically
need to debrief participants after the study
Chapter 6 continued:
External validity of the research procedure:
Representativeness of subjects—the ability to generalize
across different participant populations
Are rats really representative of humans?
E.g., rats’ basal ganglia system is probably different from that of
humans
Variable representativeness—the ability to generalize
across different experimental manipulations
E.g., the relationship of background noise to studying efficiency (do
noise and music both impair performance?)
Setting representativeness—the representativeness of the
experimental setting (or ecological validity)
Realism is not the same as generalizability, though
Chapter 7: Validity and Reliability in
Psychological Research
Validity—the truth of an observation
Types of Validity:
Predictive validity—checking the truth of an observation by comparing it to
another criterion that is thought to measure the same thing
We will use SAT I as an example
Criterion—another measurement of behavior that serves as a standard for the
measurement in question (e.g., ACT, college freshman GPA)
In predictive validity, the relation between two scores is typically assessed by a statistic
termed the correlation coefficient (e.g., Pearson’s product-moment correlation coefficient)
The better the prediction of the observation (e.g., SAT I score predicting college
freshman GPA), the greater the predictive validity of the predictor score
However, predictive validity does not define a measure or construct
E.g., We cannot assume that a person with a higher SAT I score than another person is
smarter than the other person because predictive validity does not allow us to do this unless
our criterion is that sort of measurement
 E.g., an intelligence test score rather than freshman GPA
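The correlation coefficient used to assess predictive validity can be computed directly from paired scores. The sketch below uses hypothetical SAT I and freshman GPA values; the function name and the data are assumptions made only for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical predictor (SAT I) and criterion (college freshman GPA) scores.
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.6, 2.9, 3.0, 3.4, 3.5]

r = pearson_r(sat, gpa)
print(f"r = {r:.3f}")
```

The closer r is to 1 (or -1), the better the predictor score predicts the criterion, and so the greater its predictive validity for that criterion.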
Chapter 7 continued:
Types of Validity continued:
Construct Validity—the degree to which the independent and dependent variables
accurately reflect or measure what they are intended to measure (Cook & Campbell, 1979;
Judd et al., 1991)—really, are the names accurate?
In our Stroop experiment from Chapter 1, did our tasks really reflect reading and does scan really
reflect reading performance?
Extraneous Variables—confounding variables that may be a source of invalidity can threaten
construct validity
Counting the number of digits in a row is probably not a good measure of reading
Reading aloud requires speech production processes that are not required in reading, and Tasks 2 and 3
required counting which is not the same as reading
Katz et al. (1990) have also claimed that the SAT I is not construct valid
Freedle and Kostin (1994) found that SAT test takers did use the passages to respond, so they found some
construct validity
Reactivity and Random Error
Subjects could have been afraid of looking like a poor reader on a Stroop task
Some subjects could have been tested with a second hand on a watch, and others could have been timed
with a chronograph (a stopwatch); this could have led to random error in timing precision
Chapter 7 continued:
Construct Validity continued:
We can improve construct validity by using an operational definition (a recipe for
specifying how a construct, such as reading, is produced and measured)
This is because operational definitions allow the conditions that produce the concept to be measured
and defined
In our Stroop example, reading is reduced to the independent variables that produce it and the dependent
variable(s) that are used to measure it
Protocols—the specification of how the measurement and procedures are to be
undertaken—also reduce the risk of construct invalidity because they reduce the likelihood
of random error
Circular reasoning is a potential problem when using an operational definition, though
We need to have a method of defining something independent of how we measure it
Some have claimed that the concept of processing resources suffers from this problem (circularity,
Navon, 1979)
However, we can use PRP and coactivation methods
Chapter 7 continued:
Construct validity is usually demonstrated using
psychometric methods:
Factor Analysis
A data reduction method in which you determine which measured
variables are related to which constructs
You can also show that your constructs from factor analysis are related
in the manner predicted by your theory using causal analysis:
Path analysis or Structural Equation Modeling (or covariance structure
modeling)
Item Response Theory (or IRT)—is a mathematical technique
for determining which items on a test measure the same
construct
Chapter 7 continued:
Types of Validity continued:
External Validity—the extent that we can generalize our
research results (in this setting measured on this sample) to
other settings and other populations or samples
To demonstrate external validity, we need to replicate our initial
results in other settings and on different people
Hypertension, gender and race
Our experimental setting needs to be representative of the typical
situation (e.g., reading is typically tested using a reading out-loud
method in elementary school even though this is not an accurate
measure of reading comprehension—it is more of a measure of
speech perception or production)
Chapter 7 continued:
Internal Validity—when we can make causal statements about
the relationship between IVs and DVs
Specifically, when your IV causes an effect on the DV (are we testing
what we claim to be testing—although this can be similar to construct
validity)
Without internal validity, we are not doing science
Internal validity requires good experimental control
This is at odds with external validity because as we increase experimental
control, our results become less generalizable!
A major challenge in science is to maximize both internal and external
validity even though they are negatively correlated
We can do this by keeping good experimental control and by comparing our
results across multiple samples with large sample sizes
Chapter 7 continued:
Reliability—the consistency of behavioral measures
Types of Reliability:
Test-Retest: giving the same test twice in succession over a short time
interval in order to measure consistency (using a correlation coefficient
to measure consistency)
Parallel Forms: giving two versions of a test on two testing occasions to
determine whether they result in consistent scores
Split-Half: dividing test items from a single test into two arbitrary groups
and correlating the resulting scores after administration—if the
correlation is sufficiently high, then test reliability is confirmed (this also
establishes the equivalency of your test items)
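Split-half reliability can be sketched as follows. The item responses are hypothetical, and the odd/even split is just one assumed way of forming the two arbitrary groups of items.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_r(item_scores):
    """Correlate each subject's total on odd-position items with the total on even-position items."""
    first_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return pearson_r(first_half, second_half)

# Hypothetical test: 5 subjects x 6 items, scored 1 (correct) or 0 (incorrect).
item_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

reliability = split_half_r(item_scores)
print(f"split-half r = {reliability:.3f}")
```

The same `pearson_r` helper would also serve for test-retest or parallel-forms reliability, with the two lists being the scores from the two testing occasions.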
Chapter 7 continued:
Statistical Reliability and Validity:
Statistical Reliability determines whether findings are the result of chance
If not, we assume that the results occur because of the effect of the IV(s) on the DV
We sample subjects from a population when we use inferential statistics
Statistical validity is whether we are measuring what we claim to be measuring
The sample size needs to be large enough in order for the sample to estimate its
underlying population(s)
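The Central Limit Theorem states that the sampling distribution of the mean approaches a
normal distribution as sample size increases; a common rule of thumb is that samples of
20-30 are large enough for this approximation to hold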
Increasing sample size typically increases statistical power—the ability to reject
a false null hypothesis
Random Sampling increases the likelihood that the obtained sample does
estimate accurately the characteristics of the population that it is attempting to
estimate
Chapter 7 continued:
Types of errors in inferential statistics:
Type I error—rejecting a true null hypothesis (the probability of doing so is the alpha level)
Type II error—when you fail to reject a false null hypothesis
1 - probability of a Type II error = power
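These error rates can be illustrated with a small simulation. Everything in the sketch is an assumption chosen for illustration: a one-sample z test with a known population standard deviation of 1, n = 25, a two-tailed alpha of .05 (critical z = 1.96), and a true effect of 0.5 standard deviations when the null hypothesis is false.

```python
import random
from math import sqrt

def rejection_rate(true_mean, n=25, sims=2000, z_crit=1.96, seed=0):
    """Proportion of simulated experiments in which H0 (mean = 0) is rejected."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    rejections = 0
    for _ in range(sims):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        # z = sample mean divided by the standard error (sigma / sqrt(n), sigma = 1).
        z = (sum(sample) / n) / (1.0 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / sims

type_i_rate = rejection_rate(true_mean=0.0)  # H0 is true: rejections are Type I errors
power = rejection_rate(true_mean=0.5)        # H0 is false: rejections are correct
type_ii_rate = 1 - power                     # failures to reject a false H0

print(type_i_rate, power, type_ii_rate)
```

Under these assumptions the Type I rate should land near the alpha level, and rerunning `rejection_rate` with a larger `n` shows power rising with sample size.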
Chapter 7 continued:
Measurement procedures—a systematic method of
assigning numbers or names to objects and their
attributes
Nominal scale—labels with no quantitative significance
Ordinal scale—measures differences in magnitude (ranks),
but not how much
Interval scale—measures differences in magnitude as well as
how much different
 Ratio scale—same as interval except with an added
absolute zero—so you can determine how many times
greater something is
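The interval versus ratio distinction can be made concrete with temperature, used here as an assumed example (it is not from the text): Celsius is an interval scale, while Kelvin has an absolute zero and is a ratio scale.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, absolute zero)."""
    return celsius + 273.15

cool_c, warm_c = 10.0, 20.0
cool_k, warm_k = celsius_to_kelvin(cool_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios are only meaningful on the ratio scale.
ratio_c = warm_c / cool_c  # 2.0, but 20 C is not "twice as hot" as 10 C
ratio_k = warm_k / cool_k  # about 1.035, the physically meaningful ratio

print(diff_c, diff_k, ratio_c, ratio_k)
```

The differences match across scales, but only the Kelvin ratio tells you how many times greater one temperature is than the other.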
Chapter 8: Experimental Design
Internal Validity in Experiments—by using experimental
control, the researcher can rule out confounding variables as a
cause, so that one’s results really do reflect an effect of the IV
on the DV
Internal validity requires careful selection of IVs and a well thought-out
experimental design
You can never “fix” design problems at the analysis stage
Although you can use “statistical control” through the use of ANCOVA
In this chapter, we will discuss two main types of experimental designs—
between subjects and within subjects
Between Subjects—independent groups of subjects receive the different
levels of the IV
Within Subjects—all subjects receive all levels of the IV
Chapter 8 continued:
Crossed versus Nested designs:
A crossed design is a factorial design—there are no empty cells
 A nested design is when subjects receive different levels of
the IV
You have empty cells
You only use this design in special situations because you cannot
interpret interactions
You might use a nested design to save money when only certain cells are
of interest
A placebo design is nested
 But you can treat this as a crossed design—see example
Chapter 8 continued:
Why experimental design matters and how even with the best of intentions
you must be very careful in interpreting your results
Example of a between subjects design: Executive Monkeys—Brady (1958) found
that “executive monkeys” in control of when they were shocked were more likely
to develop ulcers than “blue-collar” monkeys that had no control over when they
were shocked
However, Weiss (1968, 1971) found that executive rats that had control over when an
electric shock was administered were less likely to develop ulcers than helpless rats that
had no control over when electric shocks were administered (this is an example of
learned helplessness)
The discrepancy occurred because Brady did not randomly assign monkeys: high response-rate
monkeys (“neurotic monkeys”) were placed in the executive monkey condition, an individual
difference confound
The moral of the story is that individual differences are ALWAYS confounded with IV effects
in a between subjects design
 With large sample sizes, hopefully this would not occur
 Also, replication is essential to catch these errant results
Chapter 8 continued:
To see if the animal results of the effect of unavoidable stress
on performance generalizes to humans, many researchers look
at the effect of different stressors on cortisol (a stress hormone)
Meta-analysis (Dickerson & Kemeny, 2004) has shown that cognitive tasks
(e.g., mental arithmetic) and public speaking cause cortisol levels to rise,
but that noise exposure and emotion induction do not
So, stress does increase cortisol levels in humans as well as non-human animals
Chronically high levels of cortisol can cause cell death in the hippocampus
and amygdala
Chapter 8 continued:
Example of a within subjects design: experiments with LSD
Jarrard (1963) looked at the dose response curve of LSD on rats (by
looking at the rate of lever pressing with salt water being the control)
Jarrard counterbalanced the order of the doses (.05, .10, .20, .40, .80 milligram
per kilogram of body weight)
Jarrard found that the two smallest doses slightly enhanced the response rate
but that the two highest doses severely impaired response rate
One problem with drug studies using a within subjects design is that carryover
effects may be so strong that counterbalancing cannot correct them
 So, you may need to use a between subjects design for this type of study
Chapter 8 continued:
Types of Experimental Designs:
Between-subjects—a conservative design that prevents
carryover effects (by using different subjects for different
levels of the IV)
However, this design is extremely susceptible to individual differences
confounding results
In order to minimize individual differences confounding one’s results, one
can use matching (important subject characteristics are matched in the
various treatment conditions) and randomization (random assignment)
 However, subject attrition can make matching difficult, although
newer mixed models can be used to analyze the data with missing
data points
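Matching and random assignment can be combined, as in the matched-pairs sketch below. The subject IDs, the matching variable, and the function name are all hypothetical: subjects are ranked on the matching score, adjacent subjects form a pair, and a random draw decides which member of each pair gets the treatment.

```python
import random

def matched_random_assignment(subjects, seed=0):
    """Split (subject_id, matching_score) tuples into matched treatment/control groups."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    ranked = sorted(subjects, key=lambda s: s[1])
    treatment, control = [], []
    # Walk through adjacent pairs; with an odd count the last subject is left unassigned.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # random assignment within the matched pair
        treatment.append(pair[0])
        control.append(pair[1])
    return treatment, control

# Hypothetical subjects with a pretest score used as the matching variable.
subjects = [("s1", 90), ("s2", 85), ("s3", 110), ("s4", 105),
            ("s5", 100), ("s6", 95), ("s7", 120), ("s8", 115)]

treatment, control = matched_random_assignment(subjects)
print(treatment)
print(control)
```

Matching keeps the groups close on the pretest score, while the random draw within each pair guards against the experimenter systematically placing one kind of subject in one condition.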
Chapter 8 continued:
Within subjects designs—are more efficient and control for
individual differences (because each subject serves as their
own control), but this design is sensitive to carryover effects
(e.g., practice and fatigue effects)
Counterbalancing can help minimize carryover effects
Factorial counterbalancing is the most comprehensive method, although it may
not be practical (go over factorials)
A Latin square design can simplify counterbalancing
Balanced Latin square: for an even number of conditions: 1, 2, n, 3, n-1, 4, n-2 …
 For an odd number of conditions, two squares are needed (the one above
and a second reversed square)
 Another option is to use a modular counterbalancing scheme (n-1)
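The balanced Latin square rule above (first row 1, 2, n, 3, n-1, ...) can be sketched as a short function. It assumes the usual construction in which each later row adds 1 to every entry of the row before it, wrapping around after n; for an odd number of conditions it appends the reversed rows as the second square. The function name is hypothetical.

```python
from math import factorial

def balanced_latin_square(n):
    """Return condition orders (lists of 1..n), one per row, for n conditions."""
    # Build the first row as 0, 1, n-1, 2, n-2, ... (zero-based).
    first = [0]
    low, high = 1, n - 1
    take_low = True
    while len(first) < n:
        if take_low:
            first.append(low)
            low += 1
        else:
            first.append(high)
            high -= 1
        take_low = not take_low
    # Each later row shifts every entry up by one, wrapping around after n.
    rows = [[(c + shift) % n + 1 for c in first] for shift in range(n)]
    if n % 2 == 1:
        # Odd n: add a second, reversed square to balance order effects.
        rows += [list(reversed(row)) for row in rows]
    return rows

square = balanced_latin_square(4)
for row in square:
    print(row)

# Complete (factorial) counterbalancing would need n! orders instead of n.
print(factorial(4), "orders for complete counterbalancing of 4 conditions")
```

With four conditions the square needs only 4 orders, against 24 for complete counterbalancing, and each condition still appears once in every position and immediately precedes every other condition exactly once.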
Chapter 8 continued:
Control condition—in its simplest form, a group that
does not receive a treatment
It is a baseline against which some other variable in the
experiment can be compared
Mixed designs—when you have at least one between
subjects variable and at least one within subjects variable
Choosing an experimental design: Issues to consider
Carryover effects in a within subjects design
 Individual differences in a between subjects design
Chapter 9: Complex Designs
Factorial Designs—we use these complex designs
because real-world information processing is
complex and requires multiple IVs
As we begin to understand a phenomenon better, the
complexity of our experiments tends to increase from
single IVs to many IVs
Chapter 9 continued:
Main Effects and Interactions
Color (hue), Case Type, and Spacing in visual word recognition
If we use a fast achromatic (magnocellular) channel and two slower
parvocellular (one chromatic and one achromatic) channels to
recognize words on a lexical decision task, then we should see a
different pattern of hue effects for consistent lowercase versus mixed-case presentation
If this effect is due to the channel dynamics mentioned above, it should
be relatively consistent for spaced and unspaced words
Chapter 9 continued: Experiment 1
Chapter 9 continued: Experiment 4
Chapter 9 continued:
Main Effects: when we look at the effect of one IV collapsed
across all other IVs
Interaction: when the effects of one IV depend upon the levels
of another IV
In our case, a main effect for case type
In our case, a Case Type x Hue Type interaction, but no three-way interaction
Because interactions typically qualify main effects, if you have
an interaction, then you need to make sure that the interaction
does not attenuate or eliminate your main effects
Control in between subjects designs: random-groups and
matched-groups designs
Chapter 9 continued:
Complex within subjects designs (such as our
example above)
Block randomization or complete randomization (we
used complete randomization)
Mixed designs: when you have at least one between
subjects variable and at least one within subjects variable
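The two randomization schemes can be contrasted in a short sketch. The definitions used here are assumptions: block randomization presents every condition once, in a fresh random order, within each block, while complete randomization shuffles the whole trial list with no block constraint. The condition labels and function names are hypothetical.

```python
import random

def block_randomized(conditions, n_blocks, rng):
    """Each block contains every condition once, in a fresh random order."""
    order = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)
        order.extend(block)
    return order

def completely_randomized(conditions, n_repeats, rng):
    """All trials are shuffled together with no block structure."""
    order = list(conditions) * n_repeats
    rng.shuffle(order)
    return order

rng = random.Random(1)  # seeded so the sketch is repeatable
conditions = ["A", "B", "C"]

blocked = block_randomized(conditions, 4, rng)
complete = completely_randomized(conditions, 4, rng)
print(blocked)
print(complete)
```

Both orders contain each condition equally often; the difference is that block randomization also spreads the conditions evenly across the session, which limits how much practice or fatigue can pile up on one condition.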
Chapter 10: Small-n Experimentation
Small-n Experimentation—when a very few subjects are studied intensively
This design framework is often used for non-human animal research (because of
the expense and logistic complexity of testing large numbers of, say, rats)
It is also used for special populations of humans that are difficult to obtain (e.g.,
progeria cases) and for clinical populations (e.g., ADHD children or patients with
peculiar brain damage—such as H.M.—that can be difficult to obtain because
of privacy issues—although this is of questionable validity because there are
costs to using this approach)
The main cost in using small-n designs is that you are using descriptive statistics
That is, you are not obtaining a sample and assuming that this sample estimates a
population—you are simply describing this small group of individuals
You must be very careful in assuming that these results generalize to a population, as a whole
Chapter 10 continued:
Types of Small-n designs:
The AB design—A represents a baseline condition before,
say, therapy (the control condition of the IV) and B
represents the condition after the introduction of therapy
(the treatment condition of the IV)
This design is used in some research—although it is a very poor
design because changes that occur during treatment in the B phase
may be caused by other uncontrolled variables that are confounded
with therapy in that they really cause the change on the DV
E.g., development (the passage of time during which we mature)
Chapter 10 continued:
Small-n designs continued:
ABA (or ABAB) or reversal design—a design in which there are interspersed
baseline (A) and treatment (B) phases
This design rules out maturation, so it is superior to an AB design
Chapter 10 continued:
Small-n designs continued:
Before an ABA design is used, usually researchers use a “functional
analysis of behavior” (a la Skinner) approach to better understand the
phenomenon of interest
In a functional analysis of behavior study, one attempts to discover the
antecedents and consequences of a given behavior in considerable detail
Functional relationship—the functional relation between what leads to the
target behavior and the consequences that it produces
The Contingency—the relationship between the behavior and the outcome
(includes reinforcement, punishment, escape, and avoidance)
The Discriminative Stimulus—the controlling stimulus or stimuli that cause the
unwanted behavior
Chapter 10 continued:
Small-n designs continued:
Alternating Treatments Design (ACABCBCB) (A = no treatment, B =
cookie with no dye, C= cookie with dye that potentially causes
hyperactivity)—more than one IV is used, and there may be numerous
baseline periods
This design extends the ABAB design because it allows multiple IVs (or at
least a control condition)
However, it does not work well when carryover effects are present with some
or all of the IVs (but the same holds for an ABAB design)
In the Rose (1978) study, the two hyperactive girls showed no difference
between A and B, but they did show more hyperactivity in the C condition—
suggesting that the dye caused the increase in hyperactivity rather than the
cookie, per se
Chapter 10 continued:
Small-n designs continued:
The Multiple-Baseline design—can be used with a between-subjects design to
overcome carryover effects—several behaviors (within subjects) or several
people (between subjects) receive baseline periods of varying length, after
which the IV is introduced (you can also look across settings)
One behavior is allowed to occur under baseline conditions (e.g., crying) and then the
experimenter switches to the treatment
The timing of the onset of treatment is varied across subjects—if the treatment consistently is
associated with a change in behavior (when other potential causes are held constant), then it
is assumed that the treatment caused the change in behavior
You can use this same approach with the same subjects across different behaviors with
different timing of the onset of the treatment—if the treatment for crying reduces crying but
it does not affect fighting (and vice versa), then you can assume that your treatment caused
the change in behavior
Chapter 10 continued:
Small-n designs continued:
The changing-criterion design—a method in which the researcher
changes the behavior necessary to obtain reinforcement
If the behavior changes systematically with the changing criteria (e.g., you
have to ride 5 miles instead of 3 miles on a stationary bike to get bonus
points), then one assumes that the reinforcement criteria are producing the
change in behavior
That is, if the experimenter removes the incentive completely (e.g., points that
can be used to buy video games if 11-year-old boys exercise, DeLuca &
Holborn, 1992), the level of exercise decreases back to zero
Note that if people base behavior on just external rewards, then this is not a
good situation (e.g., if children do not clean their room unless they get paid to do
so, then their house will be in bad shape when they are adults)
Chapter 10 continued:
Clinical Psychology—case studies: typically based on
one patient with a disorder (e.g., H.M.)
Nissen et al.’s (1988) study of a dissociative identity
disorder (multiple personality disorder) patient using
memory tasks
This study is interesting because the explicit task showed an effect but
the implicit task did not—contrary to the authors’ interpretation, it
could have been that the DID patient was simply not able to catch the
automatic processing but could catch the processing of which he was
consciously aware