Basic Concepts

William D. Crano
Claremont Graduate University
1



Who are we?
Why are we here?
And now, some problems for team
consideration and discussion
2
• Group 1. How would you test an anti-drug media campaign (e.g., “This is your brain on drugs”) if, for political reasons, the program was rolled out before a pretest was given, and with no control group?
• Group 2. Daughters of women who took DES while pregnant seem to be more prone to cervical cancer. Did DES cause this? Think about design features, alternative hypotheses, etc.
• Group 3. Has California’s 3-Strikes law delivered on its promise to deter serious crime?
• Group 4. The 2006 Massachusetts health care reform law requires nearly every resident to obtain health insurance. Has this helped the health of the Bay State’s citizenry?
• Group 5. Is receipt of a National Merit Scholarship a harbinger of later academic achievement?
3
Basics: Science, causation, reliability, validity
Two Randomized Experimental Designs
Threats to internal validity
Added threats in quasi-experimental designs
Regression artifact
Primitive quasi-experimental designs
Case control designs
Slightly less primitive quasi-experimental designs
Matching
Interrupted time series
Regression/Discontinuity analysis
4
Science -- Series of agreed-upon operations involving logic, data-checking feedback, and consensus
Critical Feature: We consider only observable phenomena measured by replicable instruments
Critical Step: Moving from Concept to Operation (measure)
• Redefine abstraction in empirical terms
• Operationalize (specify instruments & procedures)
Problem: Lack of fit between concept and operation
Our compromise: Indirect Operationism & Triangulation -- can lead to problems of validity (we’ll discuss later)
5
Possible Solution: Multiple Operationism & Triangulation
• No single translation (concept to measure) is perfect
• All miss mark, but miss differently if error is random
• Therefore, aim for a “heterogeneity of irrelevancies”
• A term coined by Donald Campbell – if you understand
this, you understand a lot about proper measurement
6
Testing in science = a competition among theories; therefore,
defeating a set of strong rival hypotheses lends more credibility
than any single one-shot study (vs. critical test)
Most theories specify a causal relationship among variables.
WHY?
7
Understanding cause = primary role of scientist or evaluator –
allows us to understand or control the nature of events
• How do we infer cause?
• Strength of Association
• Consistency of linkage
• Time precedence -- if A is followed by B consistently, and
opposite is not true, we generally conclude A causes B
• Investigating causation requires control over phenomena of
interest – That kind of control usually is available only in the
randomized experiment (ideally)
8
Variables of concern in experiment:
• Independent: (manipulated independently of its natural
sources of temporal covariation); also called treatment,
manipulation, or intervention (or “disruption” in time series
studies)
• Dependent (affected by, or dependent on, independent);
measure
Only in experimental methods is there a clear specification of
independent and dependent variables
Some believe only experiments can plausibly specify causal
relations
IN THIS WORKSHOP, WE SOMETIMES DISAGREE.
9
Reliability & Validity
• Important features of our constructs and measures:
Notes on Reliability
“Establishes upper-bound of validity” (Huh?)
• Reliability = Replicability = True score (later)
• Internal consistency – items “hang together”
  • Coefficient alpha (a minimal computation sketch follows this slide)
• Temporal stability
  • Test-retest
  • Alternate forms
10
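As a concrete aside on coefficient alpha, here is a minimal computation sketch in Python (using NumPy, with simulated item responses; the sample size, item count, and error scale are purely illustrative). Alpha rises as the items share variance, i.e., as they “hang together.”

    import numpy as np

    def cronbach_alpha(items):
        """Coefficient alpha for an (n_respondents x k_items) score matrix."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)       # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical data: 100 respondents answer 5 items driven by one true score
    rng = np.random.default_rng(0)
    true_score = rng.normal(size=(100, 1))
    items = true_score + rng.normal(scale=0.8, size=(100, 5))  # true score + random error
    print(round(cronbach_alpha(items), 2))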
Notes on Validity
• Construct validity: the extent to which our operations map onto the higher-order construct we think we are studying -- the correspondence between variations in the scores on our instrument and variation among respondents on the underlying attribute being studied
  • Research shows AIDS knowledge is not related to behavior
  • Do you believe this result?
  • You might want to consider the knowledge test items before accepting this
  • Can you avoid HIV if you are ignorant of causes? How?
11
Notes on Validity
• Could it be that the measure of HIV/AIDS knowledge was invalid? Consider the scale:
  • Item 1. Have you ever heard of HIV/AIDS?
  • Item 2. Is HIV/AIDS dangerous to health?
12
Notes on Validity
• Internal validity has to do with the relationship between treatment & outcome: Are variations in outcomes in our study attributable to the treatment – and only to the treatment? Experimental design is concerned primarily with this factor.
13
Pretest-Posttest Control Group Experiment
  O  X  O
  O     O

Posttest-Only Control Group Experiment
  X  O
     O
14
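A minimal sketch (hypothetical data, assuming SciPy is available) of how the posttest-only control group experiment is commonly analyzed: participants are randomly assigned to two groups and a single between-group comparison of posttest scores is made.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(5)
    n = 60                                    # hypothetical n per group
    control = rng.normal(50, 10, n)           # R     O
    treatment = rng.normal(55, 10, n)         # R  X  O  (a 5-point effect is assumed)

    result = ttest_ind(treatment, control)    # did X produce a posttest difference?
    print(result.statistic, result.pvalue)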
Threats to internal validity:
• History
• Maturation
• Testing
• Instrumentation
• Statistical Regression
• Selection (also Attrition)
• Mortality
• Selection by X interactions

Also, in quasi-experiments:
• Diffusion of treatment information
  • Alvaro transplantation study
• Compensatory treatment equalization
• Compensatory rivalry
• Demoralization of controls
• History x selection: an event affects only one group in the design
15
Statistical Conclusion Validity Threats (power, significance, effect size)
1. Low power (a power sketch follows this slide)
2. Violated assumptions (e.g., nonlinearity)
3. Multiple tests, inflating alpha
4. Unreliability of measures
5. Unreliability of treatment implementation
6. Random irrelevancies in the experimental setting
7. Random variations among respondents
16
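To make the “low power” threat concrete, here is a small planning sketch assuming statsmodels is installed; the effect size, alpha, and power values are illustrative assumptions, not recommendations.

    from statsmodels.stats.power import TTestIndPower

    # Hypothetical planning values: Cohen's d = 0.3, alpha = .05, desired power = .80
    n_per_group = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.80)
    print(f"Required n per group: {n_per_group:.0f}")   # roughly 175 per group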
Construct Validity Threats
1. Inadequate preoperational explication of constructs
2. Mono-operation bias: no single indicator is adequate
3. Hypothesis guessing or HARKing
4. Evaluation apprehension
5. Experimenter expectancies
6. Interactions of multiple (unanticipated) treatments
7. Restricted generalizability of treatments
17
Random Assignment and Experimental Control
Randomization requires that every individual available for a particular study be able, in principle, to end up in either the experimental or the control group, and that group assignment be determined by chance alone.
Assuming a large sample, this feature equates individual
variations between groups: thus, the experimental and control
groups are equivalent, initially. Any differences, given appropriate
design, are therefore attributable to the only systematic difference
between the two groups, namely, the experimental treatment.
This is the fundamental logic of experimentation.
BUT, what if randomization is NOT an option?
18
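A small simulation sketch of this logic (all numbers hypothetical): when chance alone determines assignment and the sample is reasonably large, a pre-existing characteristic such as ability is balanced across groups in expectation, so the posttest difference recovers the treatment effect.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 1000
    ability = rng.normal(100, 15, size=n)      # pre-existing individual differences
    assign = rng.permutation(n) < n // 2       # chance alone determines group membership

    treatment, control = ability[assign], ability[~assign]
    print(treatment.mean() - control.mean())   # near 0: groups start out equivalent

    # Add a hypothetical 5-point treatment effect, then compare posttests
    post_t = treatment + 5 + rng.normal(0, 5, treatment.size)
    post_c = control + rng.normal(0, 5, control.size)
    print(post_t.mean() - post_c.mean())       # recovers roughly the 5-point effect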
Project Head Start: Economically disadvantaged preschoolers are given classroom training to bring their achievement scores level with more advantaged kids.
• Does the program work? Your job is to determine if it does.
• You are hired to evaluate the effectiveness of Head Start in Cincinnati, Ohio. You have limited but adequate funding for the job. How would you do this?
• This is a real-world example. Children cannot be treated as guinea pigs (or college sophomores).
• The problem is finding an appropriate control group.
• How would you do this job?
19
[Figure: frequency distributions of Achievement Scores for the Treatment and Control groups; x-axis: Achievement Score, y-axis: Frequency]
20
Observed = True Score + error
If groups are chosen on the basis of extreme scores, we
inevitably witness regression to the mean on subsequent
measures.
Why?
Because the use of extreme scores on unreliable tests (they all
are) always capitalizes on error. Error is random. Thus, on
retest, scores will regress toward the means of their
respective distributions. If the means of groups are
different, the direction of the regression artifact will be
different, potentially creating or attenuating apparent
differences between groups.
21
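A minimal simulation sketch of the artifact (hypothetical numbers): observed = true score + random error, a group is selected on extreme first-test scores, and on retest the group mean drifts back toward the distribution mean even though the true scores never changed.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    true = rng.normal(50, 10, n)               # latent true scores
    obs1 = true + rng.normal(0, 5, n)          # first (unreliable) measurement
    obs2 = true + rng.normal(0, 5, n)          # retest: same true scores, new random error

    low = obs1 < np.percentile(obs1, 10)       # group "selected" on extreme first scores
    print(obs1[low].mean(), obs2[low].mean())  # retest mean regresses back toward 50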
• How would you test an anti-drug media campaign (e.g., “This is your brain on drugs”) if, for political reasons, the program was rolled out before a pretest was given, and with no control group?
22
One-group posttest-only design:  X  O
23
One-group posttest-only design:  X  O
Variation 1:  X  O  O  O
24
One-group posttest-only design:  X  O
Variation 1:  X  O  O  O
Variation 2:  O  X  O
25
Variation 3:  O  O  X  O
Variation 4:  O  X  O  (X)  O
26
[Figure: plotted observations for the sequence O1 O2 X O3 O4 (X), contrasting a pattern that is interpretable with one that is not]
27
• Group 2. Daughters of women who took DES while pregnant seem to be more prone to cervical cancer. Did DES cause this? Think about design features, alternative hypotheses, etc.
28
Not feasible or ethical to experiment:
• DES, given in the 1960s, is suspected of causing vaginal or cervical cancer in daughters
• Obviously cannot run an experiment
29
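The outline lists case-control designs, and the DES question is the classic case: compare exposure (mother took DES) among cases (daughters with cancer) and among controls, and summarize the association with an odds ratio. A worked sketch with made-up counts (not the real DES data):

    # Hypothetical 2x2 case-control table: [exposed to DES, unexposed]
    cases = [40, 60]        # daughters with vaginal/cervical cancer
    controls = [10, 90]     # comparison daughters without cancer

    # Odds ratio: odds of exposure among cases / odds of exposure among controls
    odds_cases = cases[0] / cases[1]
    odds_controls = controls[0] / controls[1]
    print(odds_cases / odds_controls)   # 6.0: exposure is six times more likely among cases

Because nothing is manipulated, the causal question still turns on ruling out alternative explanations for the association.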
Nonequivalent control group design (nonequivalent because there is NO random assignment to conditions)
  O1  X  O2
  O1     O2
30
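The slides that follow plot possible outcome patterns for this design. One common (though not assumption-free) way to quantify the treatment effect in the O1 X O2 / O1 O2 layout is a difference-in-differences contrast; a sketch with hypothetical group means:

    # Hypothetical pretest/posttest means for a nonequivalent control group design
    pre_treat, post_treat = 30.0, 45.0
    pre_ctrl, post_ctrl = 32.0, 38.0

    # Change in the treated group minus change in the controls.
    # Valid only if the groups would have changed in parallel absent the treatment.
    did = (post_treat - pre_treat) - (post_ctrl - pre_ctrl)
    print(did)   # 9.0 points attributed to X under the parallel-trends assumption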
[Figure, Outcome 1: Control vs. Treatment scores at two time points; Outcomes 1 through 5 show different possible result patterns for this design]
31
[Figure, Outcome 2: Control vs. Treatment scores at two time points]
32
[Figure, Outcome 3: Control vs. Treatment scores at two time points]
33
[Figure, Outcome 4: Control vs. Treatment scores at two time points]
34
[Figure, Outcome 5: Control vs. Treatment scores at two time points]
35
Double-pretest design:
  O1  O2  X  O3
  O1  O2     O3

Switching Replications:
  O1  X  O2     O3
  O1     O2  X  O3
36
Propensity Scoring
• Conceptually, matching treated and control participants on a host of theory-indicated variables (not just 1 or 2 fallible tests)
• Include all variables thought to be related to the outcome
• Propensity score = the predicted probability of being in the X or C group; match on this score (a minimal sketch follows this slide)
37
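A minimal propensity-scoring sketch, assuming scikit-learn and NumPy are available; the covariates, the selection model, and the greedy 1:1 matching rule are all illustrative choices rather than the only way to proceed.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)
    n = 500
    X = rng.normal(size=(n, 4))                      # theory-indicated covariates
    treated = (X @ np.array([0.8, 0.5, 0.0, -0.3]) + rng.normal(0, 1, n)) > 0  # self-selection

    # Propensity score: predicted probability of being in the treated (X) group
    ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

    # Greedy 1:1 nearest-neighbor matching of each treated case to a control on the score
    controls = np.flatnonzero(~treated)
    matches = {i: controls[np.argmin(np.abs(ps[controls] - ps[i]))]
               for i in np.flatnonzero(treated)}
    print(len(matches), "treated cases matched to controls")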
Stratification
• Use conceptually relevant variables
• Break the sample into strata
• Reduces “unexplained” variance
• Danger: need sufficient N so that power isn’t compromised
38
Interrupted time series:
• Series of observations on the same variable
• Should have many observations (pre & post interruption)
• Interruption = treatment or intervention
  • Shocking event (e.g., the suicide of Robin Williams)
  • Killing of Osama bin Laden – effect on casualties?
  • New law (Connecticut speeding crackdown)
  • New form of surgery, drug, or diagnostic device
39
• Autocorrelation (adjacent data points are correlated; this violates the assumption of independent observations)
• Cyclicity: systematic (or repeating) patterns in the time series – noncausal
• Both can be controlled for via ARIMA modeling (sketched below), but some worry that this makes the analysis overly conservative, or creates data that do not reflect reality
40
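A minimal intervention-analysis sketch, assuming statsmodels is available: the series is simulated with AR(1) autocorrelation plus a 10-point level shift, and the interruption enters the ARIMA model as a step-function regressor. The (1, 0, 0) order and all numbers are illustrative.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(3)
    n, break_point = 100, 50
    step = (np.arange(n) >= break_point).astype(float)   # 0 before, 1 after the interruption

    # Hypothetical series: AR(1) autocorrelated noise plus a 10-point shift at the break
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = 0.6 * y[t - 1] + rng.normal(0, 2)
    y = y + 50 + 10 * step

    fit = ARIMA(y, exog=step, order=(1, 0, 0)).fit()     # AR(1) noise + step regressor
    print(fit.params)   # the exogenous (step) coefficient estimates the level shift (~10)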
• Group 3. Has California’s 3-Strikes law delivered on its promise to deter serious crime?
• Group 4. The 2006 Massachusetts health care reform law requires nearly every resident to obtain health insurance. Has this helped the health of the Bay State’s citizenry?
41
O1 O2 O3 … O50  X  O51 O52 O53 … O100
• For analytic purposes, it’s dangerous to use small numbers of observations or data points – before or after the interruption
42
[Figure: an interrupted time series across observations T1–T19, with the interruption marked]
43
[Figure: a second interrupted time series across observations T1–T19, with the interruption marked]
44
Threats in interrupted time-series designs:
• History
• Instrumentation: is the indicator (crime, disease) really increasing, or is it simply a change in classification?
• Selection (would effects be the same with a different cohort?)
• Statistical conclusion validity (huge potential problem)
• Maturation (less likely)
• Change of indicator at precisely the wrong time
• Construct validity – typically we use only 1 indicator; this is always dangerous
45
Nonequivalent No-treatment Control
• Barcelona’s law requiring helmets for riders of small motorcycles (riders of large motorcycles had had the law in effect for years, and served as the control).
• Ramirez & Crano’s 3-Strikes analysis – petty crime is not penalized with strikes, and thus should not be affected. [Problem: assumes criminals know the intricacies of the law, and also know how a crime will go down.]
46
Does a National Merit Scholarship affect achievement (of already high-achieving high school students)?
47
Problems?
• Selection x Maturation? No, unless a discontinuous maturation process occurred precisely at the cut point – most unlikely
• Regression artifact? No, it shouldn’t produce a discontinuity
• Modeling the relationship: if the relation is nonlinear, you’re cooked
• Low power
• Why use it? Because it delivers the treatment precisely to those who need it, or who will profit most from it
• Can’t have cutoff overrides (politics), crossovers, fuzzy cutoffs, or post-hoc elimination of cases to eliminate curvilinearity
(An analysis sketch of the regression-discontinuity setup follows this slide.)
52
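A minimal regression-discontinuity analysis sketch, assuming pandas and statsmodels are available; the running variable, cutoff, and 3-point jump are hypothetical. The score is centered at the cutoff so the slopes may differ on each side, and the coefficient on the treatment indicator estimates the discontinuity.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(11)
    n, cutoff = 400, 200
    score = rng.uniform(100, 300, n)                 # assignment (running) variable
    treated = (score >= cutoff).astype(int)          # award given strictly by the cutoff
    outcome = 0.05 * score + 3.0 * treated + rng.normal(0, 2, n)   # hypothetical 3-point jump

    df = pd.DataFrame({"y": outcome, "centered": score - cutoff, "treated": treated})
    fit = smf.ols("y ~ centered + treated + centered:treated", data=df).fit()
    print(fit.params["treated"])   # estimated discontinuity at the cutoff (about 3)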
Problems? (continued)
• History? Not unless the event occurred precisely at the cutoff
• Testing? No, both groups receive the same test
• Instrumentation? No – both groups are scored by the same model
• Maturation – those above the cutoff may be maturing faster than those below – this could happen, but it would produce a curvilinear plot that you would model
• Mortality (attrition) – could be a major problem if weaker students dropped out in our National Merit example – but we would know if this were happening
53
Thank you for your attention and hard work
54