William D. Crano
Claremont Graduate University

Who are we? Why are we here?
And now, some problems for team consideration and discussion

Group 1. How would you test an anti-drug media campaign (e.g., “This is your brain on drugs”) if, for political reasons, the program was rolled out before a pretest was given, and with no control group?
Group 2. Daughters of women who took DES while pregnant seem to be more prone to cervical cancer. Did DES cause this? Think about design features, alternative hypotheses, etc.
Group 3. Has California’s 3-Strikes law delivered on its promise to deter serious crime?
Group 4. The 2006 Massachusetts health care reform law requires nearly every resident to obtain health insurance. Has this helped the health of the Bay State’s citizenry?
Group 5. Is receipt of a National Merit Scholarship a harbinger of later academic achievement?

Basics: Science, causation, reliability, validity
Two Randomized Experimental Designs
Threats to internal validity
Added threats in quasi-experimental designs
Regression artifact
Primitive quasi-experimental designs
Case control designs
Slightly less primitive quasi-experimental designs
Matching
Interrupted time series
Regression/Discontinuity analysis

Science -- Series of agreed-upon operations involving logic, data-checking feedback, and consensus
Critical Feature: We consider only observable phenomena measured by replicable instruments
Critical Step: Moving from Concept to Operation (measure)
• Redefine abstraction in empirical terms
• Operationalize (Specify instruments & procedures)
Problem: Lack of Fit between concept and operation
Our compromise: Indirect Operationism & Triangulation -- can lead to problems of validity (we’ll discuss later)

Possible Solution: Multiple Operationism & Triangulation
• No single translation (concept to measure) is perfect
• All miss mark, but miss differently if error is random
• Therefore, aim for a “heterogeneity of irrelevancies”
• A term coined by Donald Campbell – if you understand this, you understand a lot about proper measurement

Testing in science = a competition among theories; therefore, defeating a set of strong rival hypotheses lends more credibility than any single one-shot study (vs. critical test)
Most theories specify a causal relationship among variables. WHY?

Understanding cause = primary role of scientist or evaluator – allows us to understand or control the nature of events
• How do we infer cause?
  • Strength of Association
  • Consistency of linkage
  • Time precedence -- if A is followed by B consistently, and opposite is not true, we generally conclude A causes B
• Investigating causation requires control over phenomena of interest – That kind of control usually is available only in the randomized experiment (ideally)

Variables of concern in experiment:
• Independent: (manipulated independently of its natural sources of temporal covariation); also called treatment, manipulation, or intervention (or “disruption” in time series studies)
• Dependent (affected by, or dependent on, independent); measure
Only in experimental methods is there a clear specification of independent and dependent variables
Some believe only experiments can plausibly specify causal relations
IN THIS WORKSHOP, WE SOMETIMES DISAGREE.

Reliability & Validity
• Important features of our constructs and measures:
Notes on Reliability
“Establishes upper-bound of validity” (Huh?)
• Reliability = Replicability = True score (later)
• Internal consistency – Items “hang together”
  • Coefficient alpha
• Temporal stability
  • Test-retest
  • Alternate forms

Notes on Validity
Construct validity: extent to which our operations map onto the higher order construct we think we are studying -- the correspondence between variations in the scores on our instrument and variation among respondents on the underlying attribute being studied
Research shows AIDS knowledge is not related to behavior
Do you believe this result?
You might want to consider the knowledge test items before accepting this
Can you avoid HIV if you are ignorant of causes? How?

Notes on Validity
Could it be that the measure of HIV/AIDS knowledge was invalid?
Consider the scale:
Item 1. Have you ever heard of HIV/AIDS?
Item 2. Is HIV/AIDS dangerous to health?

Notes on Validity
Internal validity has to do with the relationship between treatment & outcome: Are variations in outcomes in our study attributable to the treatment – and only to the treatment?
Experimental design is concerned primarily with this factor.

Pretest-Posttest Control Group Experiment
  Treatment group: O X O
  Control group:   O   O
Posttest-Only Control Group Experiment
  Treatment group: X O
  Control group:     O

History
Maturation
Testing
Instrumentation
Statistical Regression
Selection
Mortality (also Attrition)
Selection by X interactions
Also, in quasi-experiments
Diffusion of treatment information
  Alvaro transplantation study
Compensatory treatment equalization
Compensatory rivalry
Demoralization of controls
history x selection: event affects only one group in the design

Statistical Conclusion Validity Threats (power, significance, effect size)
1. Low power
2. Violated assumptions (e.g., nonlinearity)
3. Multiple tests, inflating alpha
4. Unreliability of measures
5. Unreliability of treatment implementation
6. Random irrelevancies in the experimental setting
7. Random variations among respondents

Construct Validity Threats
1. Inadequate preoperational explication of constructs
2. Mono-operation bias: No single indicator is adequate
3. Hypothesis guessing or HARKing
4. Evaluation apprehension
5. Experimenter Expectancies
6. Interactions of multiple (unanticipated) treatments
7. Restricted generalizability of treatments

Random Assignment and Experimental Control
Randomization requires that all individuals available for a particular study be potentially able to participate in either the experimental or control group, and only chance determines group assignment.
Assuming a large sample, this feature equates individual variations between groups: thus, the experimental and control groups are equivalent, initially. Any differences, given appropriate design, are therefore attributable to the only systematic difference between the two groups, namely, the experimental treatment. This is the fundamental logic of experimentation.
BUT, what if randomization is NOT an option?

Project Head Start: Economically disadvantaged preschoolers are given classroom training to bring their achievement scores level with more advantaged kids.
Does the program work? Your job is to determine if it does. You are hired to evaluate the effectiveness of Head Start in Cincinnati, Ohio. You have limited but adequate funding for the job. How would you do this?
This is a real-world example. Children cannot be treated as guinea pigs (or college sophomores).
The problem is finding an appropriate control group. How would you do this job?

[Figure: frequency distributions of Achievement Score for the Treatment and Control groups; x-axis: Achievement Score, y-axis: Frequency]

Observed = True Score + error
If groups are chosen on the basis of extreme scores, we inevitably witness regression to the mean on subsequent measures. Why? Because the use of extreme scores on unreliable tests (they all are) always capitalizes on error. Error is random. Thus, on retest, scores will regress toward the means of their respective distributions.
If the means of groups are different, the direction of the regression artifact will be different, potentially creating or attenuating apparent differences between groups.

How would you test an anti-drug media campaign (e.g., “This is your brain on drugs”) if, for political reasons, the program was rolled out before a pretest was given, and with no control group?
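The regression artifact described under “Observed = True Score + error” can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population mean, the standard deviations, the sample size, and the 10% selection rule are invented for illustration and do not come from the workshop data. No treatment is applied at all, yet the group chosen for its extremely low pretest scores “improves” on retest.

```python
import random
import statistics

def simulate_regression_artifact(n=10_000, true_mean=20.0, true_sd=4.0,
                                 error_sd=4.0, select_fraction=0.10, seed=1):
    """Select the lowest pretest scorers and retest them with NO treatment.

    All parameter values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    # Each person has a stable true score.
    true_scores = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    # Observed = True Score + error, with fresh random error on each testing.
    pretest = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    posttest = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    # Choose the group on the basis of extreme (lowest) pretest scores.
    k = int(n * select_fraction)
    lowest = sorted(range(n), key=lambda i: pretest[i])[:k]
    pre_mean = statistics.mean(pretest[i] for i in lowest)
    post_mean = statistics.mean(posttest[i] for i in lowest)
    return pre_mean, post_mean, statistics.mean(pretest)

pre_mean, post_mean, grand_mean = simulate_regression_artifact()
print(f"Selected group, pretest mean:  {pre_mean:.2f}")
print(f"Selected group, posttest mean: {post_mean:.2f}")
print(f"Full sample, pretest mean:     {grand_mean:.2f}")
```

The selected group's posttest mean moves back toward the full-sample mean even though nothing was done to the group, which is the pattern an evaluator could mistake for a program effect.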
One group posttest only: X O
Variation 1: X O O O
Variation 2: O X O
Variation 3: O O X O
Variation 4: O X O (X) O

[Figure: line chart across O1, O2, X, O3, O4, (X), with series labeled “Interpretable” and “Not”]

Group 2. Daughters of women who took DES while pregnant seem to be more prone to cervical cancer. Did DES cause this? Think about design features, alternative hypotheses, etc.

Not feasible or ethical to experiment
DES given in 1960s suspected of causing vaginal or cervical cancer in daughters
Obviously cannot run experiment

Nonequivalent control group design (nonequivalent because there is NO random assignment to conditions)
  Treatment group: O1 X O2
  Control group:   O1   O2

[Figures: Outcome 1 through Outcome 5, five plots of SCORE over TIME (2nd Qtr, 3rd Qtr) for the Control and Treatment groups]

  Treatment group: O1 O2 X O3
  Control group:   O1 O2   O3
Switching Replications
  Group 1: O1 X O2   O3
  Group 2: O1   O2 X O3

Propensity Scoring
Conceptually, matching treated and control participants on a host of theory-indicated variables (not just 1 or 2 fallible tests)
Include all variables thought to be related to the outcome
Propensity Score – predicted probability of being in X or C group – match on score

Stratification
Use conceptually relevant variables
Break on strata
Reduces “unexplained” variance
Danger: need sufficient N so that power isn’t compromised

Series of observations on same variable
Should have many observations (pre & post interruption)
Interruption = treatment or intervention
Shocking event (e.g., Suicide of Robin Williams)
Killing of Osama bin Laden – effect on casualties?
New law (Connecticut speed crack-down)
New form of surgery or drug or diagnostic device

Autocorrelation (adjacent data points are correlated; this defeats the assumption of independent observations)
Cyclicity: systematic (or repeating) patterns in the time series – noncausal
Both can be controlled for, via ARIMA modeling, but some worry about overly conservatizing the data, or creating data that do not reflect reality.

Group 3. Has California’s 3-Strikes law delivered on its promise to deter serious crime?
Group 4. The 2006 Massachusetts health care reform law requires nearly every resident to obtain health insurance. Has this helped the health of the Bay State’s citizenry?

O1 O2 O3 … O50 X O51 O52 O53 … O100
For analytic purposes, it’s dangerous to use small #s of observations or data points – before or after the interruption

[Figures: two interrupted time series plots, observations T1 through T19, each with the Interruption marked]

History
Instrumentation: (crime, disease) increasing, or simply a change in classification?
Selection (would effects be the same with different cohort)
Statistical conclusion validity (huge potential problem)
Maturation (less likely)
Change of indicator at precisely the wrong time
Construct Validity – Typically we use only 1 indicator. This is always dangerous

Nonequivalent No-treatment Control
Barcelona’s law that helmets be used by riders of small motorcycles (large cycle riders had law in effect for years, & served as control).
Ramirez & Crano’s 3-strikes analysis – petty crime not penalized with strikes, thus should not be affected [Problem: assumes criminals know intricacies of law, and also know how a crime will go down]

Does a National Merit Scholarship affect achievement (of already high achieving high school students)?

Problems?
Selection x Maturation? No, unless a discontinuous maturation process occurred precisely at the cut point – most unlikely
Regression artifact? No, shouldn’t produce discontinuity
Modeling the relationship: If relation is nonlinear, you’re cooked
Low power
Why use it? Because it delivers treatment precisely to those who need it, or will profit most from it
Can’t have cutoff overrides (politics); Crossovers; fuzzy cutoff, or post-hoc elimination to eliminate curvilinearity

Problems?
History? Not unless the event occurred precisely at the cutoff
Testing? No, both groups receive same test
Instrumentation? No – both groups scored by same model
Maturation – those above cutoff may be maturing faster than those below – could happen – but this would cause a curvilinear plot that you would model
Mortality (attrition) – could be a major problem. If worse students dropped from our National Merit example – but we would know if this were happening

Thank you for your attention and hard work
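For teams who want to try the regression/discontinuity logic from the two “Problems?” slides, here is a minimal sketch. It rests on loud assumptions: the data are simulated, the cutoff is sharp (no overrides or crossovers), and the score-outcome relation is linear on both sides of the cutoff. The cutoff of 50, the jump of 5, and the function names `fit_line` and `rd_estimate` are hypothetical, not taken from the National Merit example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rd_estimate(scores, outcomes, cutoff):
    """Estimate the discontinuity in the outcome at the cutoff.

    Scores are centered on the cutoff, so each fitted intercept is the
    predicted outcome exactly at the cut point for that side.
    """
    below = [(s - cutoff, y) for s, y in zip(scores, outcomes) if s < cutoff]
    above = [(s - cutoff, y) for s, y in zip(scores, outcomes) if s >= cutoff]
    at_cut_below, _ = fit_line(*zip(*below))
    at_cut_above, _ = fit_line(*zip(*above))
    return at_cut_above - at_cut_below

# Simulated (hypothetical) data: treatment goes to everyone at or above the cutoff.
rng = random.Random(7)
cutoff, true_jump = 50.0, 5.0
scores = [rng.uniform(0.0, 100.0) for _ in range(2000)]
outcomes = [10.0 + 0.3 * s + (true_jump if s >= cutoff else 0.0) + rng.gauss(0.0, 1.0)
            for s in scores]

estimate = rd_estimate(scores, outcomes, cutoff)
print(f"Estimated discontinuity at the cutoff: {estimate:.2f}")
```

The estimate is the gap between the two fitted lines at the cut point. If the true relation were curved, these straight-line fits could manufacture or hide a discontinuity, which is the “if relation is nonlinear, you’re cooked” warning in practice.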