◦ Internal Validity
◦ Construct Validity
◦ External Validity
* In the context of a research study, i.e., not measurement validity.

Internal Validity
Generally relevant only to studies with causal relationships. Establishing a causal relationship requires:
◦ Temporal precedence
◦ Correlation
◦ No plausible alternative explanation
Key question: can the outcome be attributed to causes other than the designed interventions?
◦ If so, internal validity likely needs to be tightened up.

Threats to Internal Validity
◦ Single group threats
◦ Multiple group threats
◦ Social threats to internal validity

Imagine an educational program where two different testing regimens are used. In the first, an intervention is followed by a post-test. In the second, a pretest, an intervention, and a post-test are used. What are the single group threats for this design?

Single Group Threats
◦ History (something else happened at the same time)
◦ Maturation (something would have happened anyway over the same period)
◦ Testing (testing itself induced an effect)
◦ Instrumentation (changes in the testing)
◦ Mortality (attrition in study participants)
◦ Regression (regression to the mean)

Suppose the previous study had multiple groups instead of a single group.

Multiple Group Threats
Multiple group threats are variations on the single group threats with selection bias added. If the added second group is a control, for instance, it must be selected in a way that makes it fully comparable to the first group (random assignment). If participants cannot be randomly assigned, the study becomes a quasi-experimental design.

Social Threats to Internal Validity
Applicable to the social sciences (because people do not react simply to stimuli).
◦ Diffusion (participants in the treatment and comparison groups talk to one another)
◦ Compensatory rivalry (groups know what is happening and develop a rivalry)
◦ Resentful demoralization (same as above, but with the opposite sign)
◦ Compensatory equalization (researchers or others equalize the groups)

External Validity
Are the results valid for other persons, in other places, and at other times?
◦ Do they generalize?
◦ Types of generalization
◦ Threats to external validity

Generalizations
◦ Sampling model: try to make certain that your study groups are a random sample of the population to which you wish to generalize.
◦ "Proximal similarity": measure or stratify the sample on the things you cannot randomize.

Threats to external validity
◦ People
◦ Places
◦ Times

Construct Validity
An assessment of how well ideas or theories are translated into actual programs: a mapping of concrete activities onto theoretical constructs.

Formal articulations:
◦ Nomological network (Cronbach and Meehl, 1955): researchers establish a theoretical framework for what to measure, an empirical framework for how to measure it, and the linkages between the two.
◦ Multitrait-Multimethod Matrix (Campbell and Fiske, 1959): measures of convergent concepts should show higher correlations; measures of divergent concepts should show lower correlations.
◦ Pattern matching (Trochim, 1985): linking a theoretical pattern with an operational pattern.

Threats to Construct Validity
◦ Poorly defined constructs
◦ Mono-operation bias: the construct is larger than the single program or treatment you devised.
◦ Mono-method bias: the construct is larger than the limited set of measurements you devised.
◦ Test and treatment interaction: measurement changes the treatment group.
◦ Other threats generally fall under "labeling" threats: a construct is essentially a metaphor, and if it is not precisely articulated, different persons can hold differing meanings.

Social Threats to Construct Validity
◦ Hypothesis guessing: participants guess at the purpose of your study and attempt to game it.
◦ Evaluation apprehension: if apprehension causes participants to do poorly (or to pose as doing well), the apprehension becomes a confounding factor.
◦ Researcher expectancies: the researcher's expectations confound the outcome.
Hawthorne effect: people change their behavior when observed.
Rosenthal effect: researcher expectations can change outcomes even when subjects are uninformed.
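The Multitrait-Multimethod idea listed under the formal articulations above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the two traits ("A" and "B"), the two methods ("survey" and "observation"), the noise level, and the function names are all hypothetical, and the traits are generated as independent with no shared method variance.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate_mtmm(n=500, noise=0.5, seed=0):
    """Simulate two independent traits, each measured by two noisy methods.

    All names and parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    trait_a = [rng.gauss(0, 1) for _ in range(n)]
    trait_b = [rng.gauss(0, 1) for _ in range(n)]

    def measure(trait):
        # Each method observes the true trait score plus independent noise.
        return [t + rng.gauss(0, noise) for t in trait]

    return {
        ("A", "survey"): measure(trait_a),
        ("A", "observation"): measure(trait_a),
        ("B", "survey"): measure(trait_b),
        ("B", "observation"): measure(trait_b),
    }

scores = simulate_mtmm()
# Same trait, different methods: expected to correlate strongly (convergent).
convergent = pearson(scores[("A", "survey")], scores[("A", "observation")])
# Different traits, same method: expected to correlate weakly (divergent).
discriminant = pearson(scores[("A", "survey")], scores[("B", "survey")])
print(f"same trait, different methods: r = {convergent:.2f}")
print(f"different traits, same method: r = {discriminant:.2f}")
```

In real data, measures that share a method often correlate somewhat even when the traits differ. The full matrix is meant to expose exactly that, and this simplified simulation does not model it.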
The authors see methodology as intellectual infrastructure and believe that rapid change in CS produces outdated methodology.

Three key claims:
◦ The workloads used need to be appropriate
◦ The experimental design needs to be appropriate
◦ The analysis needs to be rigorous

For this paper, the authors focus on Java.
◦ Java incorporates modern language features (type safety, memory management, secure execution).
◦ The authors believe that these features make previous benchmarks untenable:
  ◦ Tradeoffs due to garbage collection, where heap size is a control variable
  ◦ Non-determinism due to adaptive optimization and sampling technologies
  ◦ System warm-up from dynamic class loading and just-in-time compilation

The authors created a suite (DaCapo) of benchmark tools suitable for research. The suite consists of open source applications. DaCapo validates its diversity by running a variety of tests and then applying PCA (principal component analysis).

The authors point to "cherry picking" research by Perez, showing that reducing the diversity of measures increases the number of ambiguous and incorrect conclusions.

In their results, the authors show four ways to evaluate garbage collection. Any specific measure can be "gamed" to produce a desired result.

Classic comparison of Fortran / C / C++: control for host platform and language runtime.
New comparisons: control for host platform, language runtime, heap size, nondeterminism, and warm-up.

To obtain meaningful data from noisy estimates, data must be collected and aggregated. Current practices sometimes lack statistical rigor. Presenting all the results from the suite (as opposed to one number) will reduce "cherry picking".
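One common way to aggregate noisy timings, sketched below, is to discard warm-up iterations and report a mean with a confidence interval rather than a single number. This is not necessarily the authors' procedure: the function name `summarize`, the sample timings, and the choice of a normal approximation are assumptions made for illustration.

```python
import math
import statistics

def summarize(times, warmup=0, confidence=0.95):
    """Return (mean, low, high) for the measurements after warm-up.

    Uses a normal approximation for the confidence interval; this is an
    illustrative assumption, not a procedure taken from the paper.
    """
    steady = times[warmup:]  # drop iterations affected by class loading / JIT
    if len(steady) < 2:
        raise ValueError("need at least two post-warm-up measurements")
    mean = statistics.mean(steady)
    stdev = statistics.stdev(steady)  # sample standard deviation
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev / math.sqrt(len(steady))
    return mean, mean - half_width, mean + half_width

# Hypothetical iteration times in milliseconds; the first two are warm-up.
times_ms = [412.0, 250.0, 101.0, 99.0, 103.0, 97.0, 100.0]
mean, low, high = summarize(times_ms, warmup=2)
print(f"mean = {mean:.2f} ms, 95% CI = [{low:.2f}, {high:.2f}] ms")
```

With only a handful of measurements, the normal approximation understates the interval width. A Student's t critical value would be the more careful choice for small samples.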