
Effects in Experiments: Simulating the Counterfactual
• Effects: We define an effect in terms of the notion of the counterfactual. If in an experiment you observe what does happen as a result of the application of a treatment (e.g., a condition of the independent variable), the counterfactual is what would have happened if the treatment had simultaneously not been applied (for example, if the same subjects simultaneously had and had not heard the message/seen the video/etc.).
• Obviously you can't expose the same subjects to the treated and untreated conditions simultaneously, so through randomization or matching of cases you attempt to simulate the counterfactual by approximating the "not treated" under conditions as close as possible to the "treated." In short, you try to simulate in the experiment the effects of observing the presence and the absence of the condition simultaneously (see the sketch below).
• An effect of the treatment is the difference between what did happen when the treatment was applied and what would have happened had it not been applied.
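To make the logic concrete, here is a minimal Python simulation of how random assignment approximates the counterfactual comparison. All numbers (the baseline of 50, the treatment effect of 2.0) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
true_effect = 2.0                         # hypothetical treatment effect

untreated_outcome = rng.normal(50, 5, n)  # what each subject would score untreated
treated = rng.random(n) < 0.5             # random assignment to conditions
observed = untreated_outcome + true_effect * treated  # we see only one "version" per subject

# The difference of group means estimates the treated-vs-counterfactual difference
estimate = observed[treated].mean() - observed[~treated].mean()
print(f"estimated effect: {estimate:.2f}")  # close to 2.0
```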
Simulating the Counterfactual, cont'd
• Another approach to simulating the counterfactual: case-control studies.
• While it is said that the "gold standard" of research is the prospective, double-blind design with random assignment of subjects to conditions of the IV, in some cases it's not ethically justifiable, nor practically feasible in terms of how long it might take to get an adequate number of cases.
• A case-control study is an alternative where you find cases already assigned to a level of the independent variable (people who have diabetes, people who have multiple DUIs, etc.), match them (on possible confounding variables), and compare them with controls who don't have these issues, as if the "nots" were the counterfactual (what would happen if the people who have the condition didn't have it, counter to fact; e.g., the very same people both with and without diabetes).
• This type of design is retrospective, not double-blind, and does not involve random assignment, but it can still be very powerful. It looks backwards to try to identify the causes of effects that have already occurred.
Causal Relationship
• A causal relationship is said to exist if
• the putative cause (X) preceded the effect (Y),
• the putative cause (X) is associated with the effect (Y),
• and other plausible explanations have been ruled out (not just explanations which explain Y, but explanations Z which may explain X, Y, or the X-Y relationship).
• Thus, given our data from the employment2.sav file, although gender precedes employment category, we could not conclude that gender (X) was the cause of the individual's job category (Y) despite a strong X-Y association, because we could not rule out the impact of gender on educational attainment (Z) and the subsequent impact of educational attainment on job category.
Causal Relationship, cont’d
• In an experiment we select variables which
are logically/chronologically prior to the
dependent variable to be treatment
variables (IV)
• We observe the effect on the DV of
variation or manipulation of the IV (we
note their association)
• We attempt to rule out (control for)
competing explanations, e.g. identifying
confounds which might explain the
observed association between the IV and
the DV
Nonmanipulable Variables, Analogue Experiments, Causal Description vs. Causal Explanation
• The Shadish et al. book would not consider gender to be a 'cause' in a proper experiment because it can't be manipulated to see what happens.
• They argue that naturally occurring IVs like gender have so many covariates due to life experience that it is a different order of problem to try to find and attribute causes to them.
• Much stronger inference is possible if you can manipulate IVs, for example, dosage in a medical study, word choice in media messages, etc.
• Analogue experiments: taking a nonmanipulable variable like gender and simulating it, such as dressing a confederate of the experimenter as male or female, or even finer gradations of the "femininity" variable.
• Causal description (being able to show that systematic manipulation of the IV produces consequences for the DV) is different from being able to explain why this effect occurs.
Molar vs. Molecular Causation
• Molar causation: IV conceptualized and
measured at a macro level encompassing
all constituent elements, for example,
exposure to graphic violence
• Enables descriptive causation
• Enables explanatory causation in part by virtue of enabling the detection of interaction effects and limiting conditions
• Molecular causation: IV is conceptualized
and measured at the micro level of its
constituent elements such as level of
nervous system arousal, empathy, social
learning, etc.
Moderator vs. Mediator Variables
• Moderator variable: one which determines the conditions under which a described causal relationship holds (increasing the frequency of broadcast of car commercials in which the dealer himself appears increases his sales among low-income prospective buyers but not among high-income prospective buyers). The effect of the IV depends on the value of the ModV.
• Mediator variable: a link in the causal chain between the IV and the DV. For example, educational attainment might be called a mediator variable between gender and job category. A variable is a strong mediator if it has a strong association with both IV and DV, but the relationship between the IV and the DV reduces to zero when the MedV is entered into the relationship model (see the sketch below).
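A minimal Python sketch of that mediation logic, using simulated data; the variable names (x for the IV, m for the mediator, y for the DV) and the coefficients are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
x = rng.integers(0, 2, n).astype(float)   # IV (e.g., a binary gender code)
m = 2.0 * x + rng.normal(0, 1, n)         # mediator driven by X (e.g., education)
y = 3.0 * m + rng.normal(0, 1, n)         # DV driven only by the mediator

def ols_coefs(y, *cols):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols_coefs(y, x)[1])     # ~6: strong X-Y association on its own
print(ols_coefs(y, x, m)[1])  # ~0: X effect vanishes once the mediator is entered
```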
Review of Types of Experimentation
• Experiment: Manipulate levels of an IV (treatments) to observe its effects.
• Randomized experiment: Assign cases to levels of the treatment by some random process such as using a random number table or generator.
• Quasi-experiments involve comparisons between naturally occurring treatment groups (formed by self-selection or administrative selection). The researcher does not control group assignment or treatment, but has control over when/what to observe (DV).
• An example might be people who work regular daytime hours vs. the night shift.
• The researcher must rely on statistical controls to rule out extraneous variables, such as other ways in which the treatment groups differ besides the IV of interest.
• The search for counterexamples and competing explanations is inherently falsificationist, as is searching for moderators and limiting conditions.
Types of Experiments, cont’d
• “Natural experiments” might typically involve
before and after designs where you look at a
DV of interest before and after some
phenomenon that has occurred, for example,
tying Presidential approval ratings to
revelations about bailed-out bank excesses
• Non-experimental designs are basically cross-sectional, correlational studies in which the researcher makes an effort to establish causal influence through measurement and statistical control of competing explanations
Construct Validity and External
Validity
• Suppose I am doing a study on the impact of font size and face on usability of Web pages by the elderly. If I conduct a study in which I vary Web page default font size (10 pt, 12 pt, 14 pt, 16 pt) and face (serif, sans-serif) and then measure the time from first page pull to 1 minute after last page pull by a group of people in an assisted living facility, I have two sorts of generalizability concerns.
• One, called construct validity, is how do I get from the
particular units, treatments, and observations of my study
to the more general constructs they represent. That is, is
this study useful in answering the question I really want
to get at, which is, if we make adaptations to Web pages
that take into account the physical limitations associated
with aging, will people spend more time on a Web site?
Do these specific operationalizations tap the actual
constructs (page design, time spent on the site) whose
causal relationship we are seeking to understand?
External validity
• The other, called external validity, is
whether the causal relationships
observed in this particular study hold
across variations in units, treatments,
observations, and settings. Would
elderly still living at home respond in the
same way? Would the results apply to
cases where the elderly were allowed to
set their own font size? Can we
generalize our results to unstudied
persons, web sites, page designs, etc.?
Improving the Ability to Generalize
• What are some ways to improve construct and external validity (i.e., to overcome the "local" nature of the typical experiment)?
• Most obvious, and most difficult to achieve in practice, is some form of probability sampling (random, cluster, stratified random, etc.) of units (subjects or cases), treatments, observations, and settings.
• In practice, where do you get the sampling frame (the list of all units making up the population of treatments, settings, etc.) from which to randomly draw?
• More likely the researcher will seek, in the selection of units, treatments, settings, and observations, to emphasize diversity (heterogeneity) and/or representativeness (typicality).
Validity
• The notion of validity as a property of inferences, rather than a property of the experiment
• The truth of any claims made about the results of an experiment is assessed by various standards:
• Correspondence between the claim and the “external
world” of empirical evidence
• Embedding of a claim within a network of relevant
theory and claims; internal consistency, fidelity
• Pragmatic utility of the claim in explaining that which
is difficult to understand, ruling out alternative
explanations;
• Acceptance by other scientists; truth as a social
construction
Types of Validity
• Statistical conclusion validity: proper
use of statistics to make inferences about
• The nature of the covariation between variables
• The strength of that relationship
• Internal validity: extent to which a causal
inference can be reasonably made about
the observed covariation given the
particulars of the manipulation (treatments)
and measurements
Types of validity, cont’d
• Construct validity: extent to which experimental operationalizations and procedures are valid indicators of the higher-order constructs they represent
• External validity: extent to which inferences about the causal relationship hold up under other UTOS (units, treatments, observations, settings)
• Validity analysis is the process of identifying potential threats to these four types of validity, controlling or eliminating them wherever possible, and directly assessing their impact if not
• This is a theory-laden process and in practice it is
difficult to identify all relevant threats, particularly all
plausible alternatives
Statistical Conclusion Validity
• Threats to statistical conclusion validity:
improper use of statistics to make inferences
about the nature of the covariation between
variables (e.g., making a type I or type II
error) and the strength of that relationship
(mistakenly estimating the magnitude of
covariation or the degree of confidence we
can have in it)
• It is recommended that statistical hypothesis test reporting be supplemented with reporting of effect sizes (r² or partial η²), power, and confidence intervals around the effect sizes (see the sketch below)
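As an illustration of that reporting practice, here is a Python sketch with invented data; Cohen's d is used as the effect size here as a stand-in for r² or partial η²:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(50, 5, 40)   # control scores (invented)
b = rng.normal(52, 5, 40)   # treatment scores (invented)

t, p = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd                  # Cohen's d
se = pooled_sd * np.sqrt(1 / len(a) + 1 / len(b))      # SE of the mean difference
half_width = stats.t.ppf(0.975, len(a) + len(b) - 2) * se
diff = b.mean() - a.mean()
print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}, "
      f"95% CI [{diff - half_width:.2f}, {diff + half_width:.2f}]")
```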
Threats to Statistical Conclusion
Validity
• Identification of several specific threats to statistical conclusion validity
• Low statistical power
• Power analysis has the purposes of deciding how large a
sample size you need to get reliable results, and how
much power you have to detect a significant covariation
among variables if it in fact exists
• Beyond a certain sample size the law of diminishing
returns applies and in fact if a sample is large it can
“detect” an effect that is of little real-world significance
(i.e., you will obtain statistical significance but the
amount of variation in DV explained by IV will be very
small)
Statistical conclusion validity,
cont’d
• Statistical power usually should be .80 or higher.
• Example of a low-power problem: failing to reject the null hypothesis when it is false because your sample size is too small. Suppose there is in fact a significant increase in side effects associated with higher doses of a drug, but you did not detect it in your sample of size 40 because your power was too low; doctors will then go ahead and prescribe the higher dose without warning their patients that they could experience an increase in side effects. You could deal with this problem by increasing the sample size and/or setting your alpha (Type I error rate) to a larger value than .05, for example .10 or .20.
Calculating Statistical Power
• Power can vary as a function of the robustness of the statistical test, the sample size, and anything that could make an effect "hard to detect," such as measurement error or the fact that it really is a small effect.
• Here's an online power calculator that tells you, for various sample sizes, alpha levels, and expected mean differences and standard deviations in the populations, what level of statistical power you can expect. Let's suppose you anticipate a small effect (a difference of means of only two points between your two populations), but you believe that this is not just the result of sampling error. Will you be able to detect this effect?
Calculating statistical power, con’td
•
Try out the calculator with these values: The mean of
population 1 = 50, the mean of population 2 = 52, their
standard deviations each = 5, you’re doing a one-tailed
test, the significance level (likelihood of rejecting the null
hypothesis when it is in fact true, or Type I error rate) is
.05, and the sample size from each of your two
populations is 20 and 20, respectively. What is your
power level? .352. Now increase your sample sizes to
40 and 40. How does that affect your power to detect
the effect? Still not very good. How about decreasing
your measurement error? (SDs = 2 for both samples)
That helps a lot (99.8%). Now set your SDs back to 5
but change the confidence level from .05 to .20. That
also improves the power with a small sample with large
error, but probably won’t impress journal editors.
Threats to Statistical Conclusion
Validity, cont’d
• The power to detect an effect is a complicated product of several interacting factors such as measurement error, size of the predicted effect, sample size, and Type I error rate.
• Shadish et al. provide a comprehensive list of strategies for improving power to detect an effect (i.e., to find differences between treatments or levels of the IV) (Table 2.3). These include increasing the reliability of measures, increasing treatment strength, measuring and correcting for covariates, using homogeneous participants, and equalizing cell sizes (N of subjects assigned to conditions). Many of these have to do with reducing possible sources of random and measurement error.
• In addition to inadequate power, there are further threats to statistical conclusion validity:
Threats to Statistical Conclusion
Validity, cont’d
• Failing to meet the assumptions of the test statistic: for example, the assumption in a t-test that observations within a sample are independent. Violating it might result in getting significant differences between two samples where the real difference is more attributable to factors the subjects had in common, such as being from the same neighborhood or SES, rather than the treatment they were exposed to. Other assumptions can also be violated, such as equality of population variances, interval-level data, normality of the populations with respect to the variable of interest, etc.
Statistical Conclusion Validity,
cont’d
• Type I Error rate when there are
multiple statistical tests. What starts
out as .05 with one test becomes a very
large probability of rejecting the null
hypothesis when it is in fact true with
repeated consultations of the table of the
underlying distribution (normal table, t,
etc.). It’s not the done thing to correlate
20 variables with each other (or to do
multiple post-hoc comparisons after an
ANOVA) and see what turns up significant,
then go back and write your paper about
that “relationship”
Protecting the error rate when
there are multiple tests
• The Bonferroni correction divides the alpha error rate by the number of tests and then uses the corrected value in all the tests. This is the way to "play fair."
• If there are 10 correlations in our fishing expedition, we would set the significance level required for rejection of the null hypothesis at alpha = .05/10, or .005. Then even if we conduct ten tests our experimentwise error rate is still under .05 (see the sketch below).
• Not everybody agrees with this assessment: critics say that we already obsess so much about alpha levels and keeping them small that many interesting effects go undetected and get tossed in the trash can, and this correction just makes it worse.
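A quick arithmetic check of the experimentwise (familywise) error rate and the Bonferroni correction, assuming independent tests:

```python
n_tests = 10
alpha = 0.05

# Chance of at least one false rejection across 10 independent tests at .05 each:
print(1 - (1 - alpha) ** n_tests)        # ~0.401 -- far above .05

# Bonferroni: test each correlation at alpha / n_tests instead
per_test = alpha / n_tests               # 0.005
print(1 - (1 - per_test) ** n_tests)     # ~0.049 -- experimentwise rate back under .05
```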
Threats to Statistical Conclusion
Validity, cont’d
• Unreliability of measure
• Restriction of range: avoid dichotomizing continuous measures (for example, substituting "tall" and "short" for actual height) or using dependent variables whose distribution is highly skewed, with only a few cases at one or the other end of the scale (see the sketch after this list)
• Lack of standardized implementation of
the treatment or level of the independent
variable (we talked about this before in terms
of things like instructions being memorized
over time, experimenter effects, etc) unless
adaptive application of the treatment is a
more valid instantiation of how the treatment
would occur in the real world
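A quick simulated demonstration (hypothetical height/weight numbers) of how a median split attenuates an observed correlation:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
height = rng.normal(170, 10, n)                 # continuous measure
weight = 0.8 * height + rng.normal(0, 8, n)     # outcome correlated with height

tall = (height > np.median(height)).astype(float)   # "tall"/"short" median split

print(np.corrcoef(height, weight)[0, 1])  # ~0.71 with the raw measure
print(np.corrcoef(tall, weight)[0, 1])    # ~0.56 -- attenuated by dichotomizing
```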
Threats to statistical conclusion validity, cont'd
• Within-subjects variability: In most analyses that look at effects of treatments you are going to want your between-treatment variability to be large, in accordance with your research hypothesis, and if there is a lot of variability among the subjects within a treatment that may make it more difficult to detect the predicted effect. There is a trade-off between ensuring subject homogeneity within treatments, which increases power to detect the effect, and a possible loss of external validity.
• Inaccurate effect-size estimation: recall how we talked about how the mean is affected by outliers. Sometimes there are extreme cases or outliers that can adversely affect, and perhaps inflate, the estimates of effect sizes (differences on the DV attributable to the treatment or levels of the IV). See the sketch below.
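A small numeric illustration (made-up scores) of how a single outlier can inflate an effect-size estimate such as the mean difference:

```python
import numpy as np

control = np.array([50.0, 51, 49, 52, 48, 50, 51, 49])
treated = np.array([52.0, 53, 51, 54, 52, 53, 51, 90])   # note the extreme 90

print(treated.mean() - control.mean())          # 7.0 -- inflated by the outlier
print(np.median(treated) - np.median(control))  # 2.5 -- a more robust contrast
```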
Internal Validity
• Does the observed covariation between the
IV and the DV constitute a causal
relationship, given the way in which the
variables were manipulated/measured (its
local or molar circumstances)? To qualify it
must be the case that
• The IV is chronologically prior to the DV
• No other explanation is plausible
• What are the threats to internal validity?
Threats to Internal Validity
• Lack of clarity about causal ordering (more of a problem in correlational studies than in experiments, in which you expose respondents to the treatment and then measure the outcome)
• Systematic differences in respondent characteristics on variables other than the IV of interest. People in the treatment condition already have more of the DV property for some unknown, unmeasured reasons. Random assignment and pre-testing can reduce this threat from confounding variables.
• History: any events which intervene between the treatment and the outcome measure. Example: subjects are presented with anti-smoking messages but are allowed a break before completing the post-test, and various events happen during their break, such as seeing smokers who are/are not attractive role models. More of a problem in studies which assess effects over long periods of time.
• Maturation: Both history and maturation are problems for reliability of measure as well as causal attribution. Could changes in the units of analysis (people, or elements such as neighborhoods or organizations) which have occurred naturally be responsible for changes in the outcome which the experimenter is trying to attribute to the treatment?
Threats to Internal Validity, con’t
•
•
Regression to the mean: likely to be a problem in quasiexperiments when members of the group were selected (self- or
administratively-) based on having high or low scores on the DV of
interest. Testing on a subsequent occasion may exhibit
“regression to the mean” where the once-high scorers score lower,
or the once-low scorers score higher, and a treatment effect might
appear when there really isn’t one. Having a really high score on
something (like weight, cholesterol, blood sugar) etc might be
sufficient to motivate a person to self-select into a treatment but
the score might fall back to a lower level just naturally or through
simply deciding to “get help,” although it could be attributed to the
effects of the treatment.
Attrition; selective dropping out of a particular condition or level
of the independent variable by people who had the most extreme
pre-test scores on the DV, so when they drop out it makes the
post-test mean for that condition “look better” and as if that
treatment had a stronger effect since its mean would be lower
without the extreme people
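A minimal Python simulation (invented numbers) of the regression-to-the-mean pattern: select the top scorers on a noisy measure, retest them with no treatment at all, and their group mean falls anyway:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
true_level = rng.normal(100, 10, n)           # stable underlying level
test1 = true_level + rng.normal(0, 10, n)     # noisy first measurement
test2 = true_level + rng.normal(0, 10, n)     # noisy retest, no treatment given

selected = test1 > np.percentile(test1, 90)   # "high scorers" self-select in
print(test1[selected].mean())   # ~125: selected partly for lucky high scores
print(test2[selected].mean())   # ~112: falls back toward the mean, untreated
```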
Threats to Internal Validity, con’t
•
•
•
Testing; as mentioned before simply taking a test can create
change which can be mistaken for a treatment effect; can
increase awareness of the DV and induce desire to change
independent of what the treatment can produce (called test
reactivity)
Instrumentation: changes in a measure over time (for
example, coders may become more skilled, may develop
favorite categories as they code more samples) or changes in
its meaning over time
Random assignment can eliminate many potential threats to
internal validity in experiments; in quasi experiments where
that is not possible the experimenter should try to identify as
many threats as possible and eliminate or control for them
Construct Validity
• Construct validity has to do with the process of making inferences from the particulars of an experiment, for example its measuring instruments, to the higher-order constructs they represent.
• It applies to the "units" (the subjects or cases), the treatments, the outcomes (measurements of the DVs), and the setting.
• Two problems: understanding the construct, and measuring it.
• With respect to understanding the construct, the most difficult task is a definitional one, in which the researcher decides what the central or prototypical features of the construct are and delimits it such that it is also clear what the construct is not.
• Example: recent research on frustration based on an understanding of the concept grounded in the notion of the inability to reach a specific desired goal owing to circumstances over which the individual has little or no control. This was a very circumscribed notion, placing the locus of frustration squarely in the particulars of a problematic situation and not, say, in the more popular use of the term when we describe a person as frustrated, meaning that they have a general life issue of being unable to meet goals.
• Under this notion of the construct, prototypical features would include (a) a specific, short-term desired goal, (b) barriers to goal attainment, and (c) little to no ability to remove the barriers.
Practices that Promote Construct
Validity
• Clear description of the units (subjects, cases), treatments, outcomes (DV measures), and settings at the desired level of generalization
• For example, clearly describe the outcome construct "parents' beliefs about the role of internal and external factors in their children's health care," identifying the prototypical feature of interest as parental beliefs about the role of luck, own agency, and experts in children's health outcomes. Distinguish it from related constructs such as "parents' fears for their children's health," "parents' trust of medical professionals," or "parents' approach/avoidance with respect to health issues"
• Select specific instantiations of the construct, such as the "Health Locus of Control-Parent/Child" scale, which has items like the following:
• If my child feels sick, I have to wait for other people to tell me what to do.
• Whenever my child feels sick, I take my child to the doctor right away.
• There is nothing I can do to make sure my child has healthy teeth.
• I can do many things to prevent my child from having accidents.
Practices that Promote Construct
Validity, cont’d
• In an iterative process, compare the specific instantiation (the measuring instrument) to the construct and assess for goodness of fit. Note points of departure and make adjustments (to the measure, or to the description of the higher-order domain covered by the construct) as appropriate
• Realize that the match will never be perfect because both the construct and its operational definition are socially constructed; realize that definitions are consensually arrived-at constructions and that the consensus can and will change
Practices to promote construct
validity, cont’d
• Think about a concept such as "middle class" and all the political baggage that attaches to it. If one wants to make a case that the middle class is better off today, or worse off today, the construct can be defined and operationalized in such a way as to support one's preferred outcome
• Similarly, "middle class" can refer as much to behaviors, practices, values, or perceived social standing as it does to economic indicators like household income
• If you ask people what class they belong to, most people will select "middle class," regardless of their SES
• Realize also that constructs and the way they are used to classify people can have major social ramifications if social research is taken up and used to justify workplace/policy decisions, e.g., who is "needy," who is an "employment risk," etc.
Threats to Construct Validity
• Inadequate construct explication: the researcher hasn't done an adequate job of describing the prototypical characteristics of the construct and distinguishing it from other related constructs
• Too general, too specific, inaccurate, doesn’t
incorporate method
• Example: women and “spatial reasoning”: how
well or poorly women perform compared to men is
a function of the testing environment (paper and
pencil vs. 3D immersive)
• Explicating a construct like “jurors” in jury
research: People who volunteer for jury studies in
exchange for a free meal are different from people
who resentfully show up for jury duty and try to
get out of it but are impaneled anyhow
More Threats to Construct Validity
• Construct confounding: the measurements may tap some extraneous constructs not part of the construct of interest. Subjects in the sample are thought to represent the "impoverished urban elderly" because of their participation in free/low-cost meal programs, but they may also be the healthy/ambulatory/psychologically sturdy elderly who can walk to the centers for their meals, or who can afford to pay but come for the company. They may differ from other urban seniors on a host of factors.
• "Mono-operation bias": using only one measure of a construct, say only one example of a "pro-safe-sex" message to represent the larger construct, or one dependent measure of, say, loneliness, where several different measures of the same construct (for example, several subsets of items from the same "item universe") would lend weight to results.
Threats to Construct Validity, cont’d
• "Mono-method bias": this is a tricky one, as using multiple methods and getting similar outcomes (a similar IV→DV relationship) may improve construct validity, but may also introduce method variance, which can result in nonsignificant findings due to different sources of random error that would not have been present with a single method.
• Confounding constructs with levels of constructs: failing to adequately calibrate the independent variable, at a conceptual and operational level, such that variations in its levels can be observed independently and assessed for effect.
• May be particularly problematic for effects whose impact on the DV is curvilinear, such that there is impact at very high or very low levels of the IV, but not at intermediate levels. Example: satisfaction with a day's shopping may be greatest when the expenditures were very high (I have made an investment purchase that will last for years) or very low (I am a smart shopper who got a great bargain), but lowest with a medium level of expenditure.
Threats to Construct Validity, cont’d
• Reactive self-report changes: Reactivity refers to the property of treatments and experimental settings that produces changes in DVs that can get confused with the intended effects of a treatment or level of the IV
• For instance, people wanting acceptance into a
clinical program may over-report or under-report
their symptoms depending upon what they think is
required. Similar to the regression to the mean issue
but the notion is not of a statistical regression where
the need for the treatment peaks before assignment
and then declines due to relief after treatment is
provided, but a change from behavior designed to
induce/avoid assignment (like playing sick to get out
of work) to more representative behavior after the
assignment is made.
Threats to Construct Validity, cont’d
• Reactivity to the experimental situation: refers to demand characteristics of the treatments or settings which induce subjects to form hypotheses about what's going on and act accordingly. There are various solutions, including separating treatments and outcome measurement as far as possible in time and space.
• Use of the Solomon four-group design if sensitivity to the pre-test is an issue. In the Solomon design, one experimental and one control group receive the pre-test, then the treatment (or control equivalent), then the post-test, while another pair of experimental and control groups receives the treatment (or non-treatment) and post-test only, with no pre-test. Then do a 2 x 2 between-groups analysis of variance (Treatment/No Treatment x Pre-test/No Pre-test) on the post-test scores. A significant interaction between the Treatment and Pre-test factors would indicate that subjects were sensitized by the pre-test (see the sketch after this list).
• Experimenter expectancies: the experimenter may convey his/her hopes and expectations without being aware of it. This is a principal motivation for double-blind procedures, or for the use of experimenters who are unfamiliar with the hypotheses.
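A sketch of the Solomon four-group analysis in Python with simulated data; the cell means, the +3 "sensitization" interaction, and the statsmodels-based approach are illustrative assumptions, not a prescribed method:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(7)
rows = []
for treat in (0, 1):
    for pretest in (0, 1):
        # Invented effects: +5 for treatment, +3 extra only when a pre-test
        # preceded the treatment (i.e., a sensitization interaction)
        mean = 50 + 5 * treat + 3 * treat * pretest
        for score in rng.normal(mean, 5, 50):   # 50 subjects per cell
            rows.append({"treat": treat, "pretest": pretest, "post": score})

df = pd.DataFrame(rows)
model = smf.ols("post ~ C(treat) * C(pretest)", data=df).fit()
print(anova_lm(model, typ=2))   # check the C(treat):C(pretest) interaction row
```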
Threats to Construct Validity, cont’d
• Novelty and disruption effects: the Hawthorne effect; positive change may not be due to the treatment but to the excitement of being in an experiment (being paid attention to, something to liven up the workday).
• Compensatory equalization: refusal of managers or other nonresearchers who administer the treatment to cooperate with the random assignment schedule, because they want to make benefits available to all.
• Compensatory rivalry: occurs when people in a control or less-favorable treatment condition are aware of the other, more favorable condition and put forth extra effort to score "high" (or low if required) on the outcome measures.
Threats to Construct Validity, cont’d
• Resentful demoralization: the opposite effect, where receiving the less favorable treatment can cause scores on the outcome measure to be lower than they would have been without the knowledge that others were getting a "better" treatment. These problems are likely to occur in quasi-experimental, real-world designs but can usually be controlled in the laboratory setting.
• Treatment diffusion: when participants in the control group somehow receive all or part of the treatment. For example, in the Shanghai study of the effect of BSE training on breast cancer mortality, no reduction of mortality was found for the trained groups as opposed to the untrained groups. There was some speculation afterwards (just speculation) that women who received the training actually trained their friends and neighbors, some of whom were in the control group.
Wow, what a long list of things that can go wrong!
• But there's more…
External Validity
• External validity is the extent to which a causal relationship, obtained (or not found) under specific units, treatments, observations, and settings, would hold over variations in the units, treatments, outcomes, and settings. In short, it is mostly concerned with the UTOS which were *not* in the study.
• For example, with respect to the aforementioned Shanghai study, would the obtained lack of a (BSE training → decreased mortality) relationship apply to women in Western societies, where mammography might be more readily available, where BSE had been preached for years as good practice, where the typical woman might have a higher BMI, etc.?
• If you were an administrator at the NCI and you were charged with making a recommendation to the nation's women about BSE, would you recommend against it on the basis of such a study? It ran for nearly ten years and featured thousands of women, so there was ample time and statistical power to detect an effect for BSE if one were present.
• What about the recent research suggesting that HRT increases the risk of stroke? Knowing that many of the study respondents were women who first began taking HRT when they were in their 60s and older, would you recommend that women who started taking HRT in their 50s now stop taking it?
Threats to External Validity
• Here are ways a causal relationship might not hold across UTOS:
• Interaction of the causal relationship with units: an effect found with women might not hold for men; it might apply in some zip codes but not others; it might apply to guinea pigs but not people.
• Interaction of the causal relationship over treatment variations: for example, the relationship between training in using Blackboard and the likelihood of using it in class may vary if one group is taught with PowerPoint slides or handouts and another has to take notes.
• Interaction of the causal relationship with observations: variations in measurement of the outcome variable will affect whether or not the obtained causal relationship holds, for example, measuring hours spent watching TV with self-report vs. using a usage monitor attached to the TV.
• Interaction of the causal relationship with settings: in the seniors and Internet study, the treatment effect (availability of computer classes in increasing Internet-based social support networks) may vary considerably between the beautiful new cybercafe at one meal site and the unglamorous, less well-furnished premises available at another.
Threats to External Validity, cont’d
• One final threat is the effect of a mediating variable which has an impact in one setting or with one class of subjects, but not with another.
• Example: effects of gender on job category may be mediated by educational attainment for certain types of industry but not for others (e.g., for creative industries vs. traditional, conservative fields where even women with advanced degrees may not have equal job status).
• In general we can't hope to get the same effect sizes for a causal relationship across moderators (different UTOS), but we could at least hope that the *direction* of the causal impact is consistent, e.g., more X leads to more Y.
• The only place a consistent effect size (at least consistent with respect to small or large) would be really important would be in medical applications or in research that has clear implications for policy.
• Random sampling can reduce some but not all threats to external validity; purposive sampling of groups known to be diverse on unit or setting variables can improve the ability to generalize about causal relations.
So Many Threats to Validity
• Given all of these possible ways in which your experimental design can be compromised, why do experiments? Because they provide the closest match to the process of causal reasoning (cause before effect, plausible alternatives accounted for)
• Most of the threats happen infrequently
• If they happen frequently, they can usually be
anticipated and controlled for with physical controls
or statistically
• They may be discovered after the fact, but it may be
possible to re-analyze data, collapse categories, etc
to deal with the problem
So Why do Experiments?
• Studies may be extended to evaluate the role of moderator variables which might threaten external validity
• External validity is the most obvious source of lay criticism, but perhaps the last concern of an extended research program, which must address the other validity issues first
• Reporting of research should lay out the possible threats to the four types of validity and say how they were addressed. What wasn't addressed goes under "limitations"
• Your greatest attribute as an experimenter is your common sense and your everyday understanding of how the world works; gaps in this knowledge can be filled in by reading and communicating with others