Evaluation Methodologies




Internal Validity
Construct Validity
External Validity
* In the context of a research study, i.e., not
measurement validity.

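Internal validity is generally relevant only to
studies that examine causal relationships.
Establishing a causal relationship requires:
◦ Temporal precedence (the cause precedes the effect)
◦ Correlation (cause and effect vary together)
◦ No plausible alternative explanation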

Key question: can the outcome be attributed
to causes other than the designed
interventions?
◦ If so, internal validity likely needs to be
tightened up

Threats to Internal Validity
◦ Single Group Threats
◦ Multiple Group Threats
◦ Social threats to internal validity
Imagine an educational program where two
different testing regimens are used.
In the first, an intervention and then a
post-test are used.
In the second, a pretest, an intervention and a
post-test are used.
What are the single group threats for this
design?

Single Group Threats
◦ History (something happened at the same time)
◦ Maturation (something would have happened anyway
over the same period)
◦ Testing (testing itself induced an effect)
◦ Instrumentation (changes in the testing)
◦ Mortality (attrition in study participants)
◦ Regression (regression to the mean)
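
Regression to the mean is the least intuitive threat in the list above, so a small simulation may help. The sketch below assumes a hypothetical model in which each score is a stable ability plus independent test noise, and no intervention is applied at all. The population size, means and noise levels are illustrative assumptions, not values from any study.

```python
import random
import statistics

def simulate_regression_to_mean(n=10_000, seed=0):
    """Pretest and post-test share a stable ability plus independent noise.

    No intervention happens between the two tests (hypothetical model).
    """
    rng = random.Random(seed)
    ability = [rng.gauss(50, 10) for _ in range(n)]
    pretest = [a + rng.gauss(0, 10) for a in ability]
    posttest = [a + rng.gauss(0, 10) for a in ability]

    # Select the lowest-scoring 10% on the pretest, as a remedial
    # program might do when choosing its participants.
    cutoff = sorted(pretest)[n // 10]
    selected = [i for i in range(n) if pretest[i] <= cutoff]

    pre_mean = statistics.mean(pretest[i] for i in selected)
    post_mean = statistics.mean(posttest[i] for i in selected)
    return pre_mean, post_mean

pre_mean, post_mean = simulate_regression_to_mean()
print(f"selected group pretest mean:  {pre_mean:.1f}")
print(f"selected group posttest mean: {post_mean:.1f}")
```

The selected group's post-test mean moves back toward the population mean even though nothing was done to the group, which is the improvement a single-group design could wrongly credit to the intervention.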



Suppose the previous study had multiple
groups instead of a single group.
Multiple Group Threats are variations on the
Single Group Threat with selection bias
added. If the added second group is a
control, for instance, it must be selected in a
way that makes it fully comparable to the first
group (random assignment).
If participants cannot be randomly assigned,
then we get quasi-experimental design.
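
Random assignment can be sketched in a few lines. The participant identifiers and the fixed seed below are hypothetical and serve only to make the illustration reproducible; a real study would document its own randomization procedure.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant identifiers P01..P20.
participants = [f"P{i:02d}" for i in range(1, 21)]
treatment, control = randomly_assign(participants, seed=42)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because group membership depends only on the shuffle, no characteristic of a participant can systematically influence which group they land in, which is what makes the groups comparable in expectation.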

Social threats are applicable to the social
sciences (because people do not react simply to
stimuli)
◦ Diffusion (people in treatment and comparison
groups talk to one another)
◦ Compensatory rivalry (groups know what is
happening and develop a rivalry)
◦ Resentful demoralization (same as above, but with
the opposite sign: the group becomes discouraged)
◦ Compensatory equalization (researchers or others
equalize the groups).

Are the results valid for other persons in
other places and at other times?
◦ Do they generalize?


Types of generalization
Threats to external validity

Generalizations
◦ Sampling Model: try to make certain that your study
groups are a random sample of the population you
wish your generalization to extend to.
◦ “Proximal Similarity”: measure or stratify the sample
on the things you cannot randomize.
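
As one possible illustration of stratifying a sample, the sketch below draws the same fraction from each stratum so that the sample mirrors the population on a chosen characteristic. The "site" attribute, the population and the 10% fraction are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Draw the same fraction from every stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for label in sorted(strata):  # sorted for a reproducible order
        members = strata[label]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 150 urban and 50 rural participants.
population = [{"id": i, "site": "urban" if i % 4 else "rural"} for i in range(200)]
sample = stratified_sample(population, key=lambda r: r["site"], fraction=0.1, seed=1)
rural_count = sum(1 for r in sample if r["site"] == "rural")
print(len(sample), "sampled,", rural_count, "of them rural")
```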

Threats to external validity
◦ People
◦ Places
◦ Times


An assessment of how well ideas or theories
are translated into actual programs.
Mapping of concrete activities into theoretical
constructs.

Formal articulations:
◦ Nomological network (Cronbach and Meehl, 1955):
researchers should establish a theoretical framework
of what to measure, an empirical framework of how
to measure it, and the linkages between the two.
◦ Multitrait-Multimethod Matrix (Campbell and Fiske,
1959): Convergent concepts should show higher
correlations, and divergent concepts lower correlations.
◦ Pattern matching (Trochim, 1985): Linking a
theoretical pattern with an operational pattern.
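
The multitrait-multimethod idea can be illustrated with a toy calculation: two measures of the same construct should correlate more strongly than measures of different constructs. The scores and variable names below are made up for illustration and do not come from Campbell and Fiske.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data for six students.
math_test = [55, 60, 65, 70, 75, 80]   # math ability, measured by a written test
math_rating = [3, 3, 4, 4, 5, 5]       # math ability, measured by teacher rating
anxiety_survey = [4, 2, 5, 1, 3, 2]    # a different construct, measured by survey

convergent = pearson(math_test, math_rating)    # same trait, different methods
divergent = pearson(math_test, anxiety_survey)  # different traits

print(f"convergent correlation: {convergent:.2f}")
print(f"divergent correlation:  {divergent:.2f}")
```

A full matrix would repeat this for every trait and method pair; the sketch shows only the comparison the matrix is meant to support.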

Threats to Construct Validity
◦ Poorly defined constructs
◦ Mono-operation bias: The construct is larger than
the single program / treatment you devised.
◦ Mono-method bias: the construct is larger than the
limited set of measurements you devised.
◦ Test and treatment interaction: measurement
changes the treatment group
◦ Other threats generally fall under “labeling” threats:
a construct is essentially a metaphor, and if not
precisely articulated differing meanings can be held
by different persons.

Social Threats to Construct Validity
◦ Hypothesis guessing: participants guess at the
purpose of your study and attempt to game it.
◦ Evaluation apprehension: if apprehension causes
participants to do poorly (or to pose as doing well)
then the apprehension becomes a confounding
factor.
◦ Researcher expectancies: Researcher expectancies
confound the outcome.
▪ Hawthorne effect: people change behavior when
observed
▪ Rosenthal effect: researcher expectations can change
outcomes even when subjects are uninformed.



Authors see methodology as intellectual
infrastructure.
Believe that rapid change in CS produces
outdated methodology.
Three key claims:
◦ Workloads used need to be appropriate
◦ Experimental design needs to be appropriate
◦ Analysis needs to be rigorous

For this paper, the authors focus on Java
◦ Java incorporates modern language features (type
safety, memory management, secure execution)
◦ Authors believe that these features make previous
benchmarking practices untenable:
▪ Tradeoffs due to garbage collection, where heap size is a
control variable
▪ Non-determinism due to adaptive optimization and
sampling technologies
▪ System warm-up from dynamic class loading and
just-in-time compilation



Authors created a suite (DaCapo) of
benchmark tools suitable for research. The
suite consists of open source applications.
The authors validate the suite's diversity by
collecting a variety of measurements and then
applying PCA (principal component analysis).
Authors point to “cherry picking” research by
Perez, showing that reducing the diversity of
measures increases the number of ambiguous and
incorrect conclusions.



The authors in their results show four ways to
evaluate garbage collection. Any specific
measure can be “gamed” to produce a desired
result.
Classic comparison of Fortran / C / C++:
control for host platform and language
runtime.
New comparisons: control for host platform,
language runtime, heap size, nondeterminism
and warm-up.
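
The sketch below shows one way a measurement harness might separate warm-up from timed iterations. It is a structural illustration in Python rather than the authors' Java harness: the workload function and the iteration counts are hypothetical, and on a JVM the heap size would additionally be fixed for each configuration (for example with the -Xms and -Xmx options).

```python
import time

def measure(workload, warmup_iterations=3, timed_iterations=5):
    """Run the workload several times, discarding warm-up iterations."""
    # Warm-up: on a JVM these iterations would absorb class loading
    # and just-in-time compilation; their timings are thrown away.
    for _ in range(warmup_iterations):
        workload()

    # Timed iterations: keep every timing so variation stays visible.
    timings = []
    for _ in range(timed_iterations):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings

def example_workload():
    """Hypothetical stand-in for a benchmark."""
    return sum(i * i for i in range(50_000))

timings = measure(example_workload)
print([f"{t * 1000:.2f} ms" for t in timings])
```

Returning every timing instead of a single number leaves the choice of aggregation to the analysis step, where non-determinism can be handled explicitly.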



To obtain meaningful data from noisy
estimates, data must be collected and
aggregated.
Current practices sometimes lack statistical
rigor.
Presenting all the results from the suite (as
opposed to one number) will reduce “cherry
picking”.
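
As a minimal sketch of aggregating noisy measurements, the code below reports a mean together with a confidence interval. The run times are hypothetical, and the interval uses a normal approximation; with only a handful of runs, a Student's t quantile would give a somewhat wider and more appropriate interval.

```python
import statistics

def mean_with_confidence_interval(samples, confidence=0.95):
    """Return (mean, low, high) using a normal approximation."""
    n = len(samples)
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / n ** 0.5
    # Two-sided critical value from the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_err, mean + z * std_err

# Hypothetical run times (ms) for one benchmark over eight invocations.
run_times_ms = [102.1, 98.7, 105.3, 99.9, 101.4, 97.8, 103.6, 100.2]
mean, low, high = mean_with_confidence_interval(run_times_ms)
print(f"mean {mean:.1f} ms, 95% CI [{low:.1f}, {high:.1f}] ms")
```

Reporting an interval like this for each benchmark in the suite, instead of one summary number, makes it easier to see whether a claimed difference exceeds the noise.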