Strategies to Help Strengthen Validity and Reliability of Data

Copyright Kathryn E. Newcomer
January 2011
Validity generally refers to the accuracy and representativeness of a study or data, and the term is
sometimes casually used to characterize whole studies as valid, or “scientifically valid.” But
validity is a multidimensional concept, and since its many dimensions are never attained
perfectly, the question is not “Is the study or are the data valid” but “How valid are they in terms
of measurement, internal, external, and statistical conclusion validity?”
Limitations to any of the dimensions will reduce the ability to draw causal inferences (internal
validity). One might view the types of validity as a pyramid with measurement validity and
reliability on the bottom, then external and statistical conclusion validity, with internal validity on
the top. Each limitation affects all of the dimensions of validity above it.
Measurement Validity
Measurement validity refers to the question, Are we accurately measuring what we really intend
to measure? Measurement validity is concerned with the accuracy of measurement.
We start with a concept, such as maternal health, and we identify measurement procedures that
we can use to operationalize the more abstract concept into empirically-observable indicators.
For example, we may measure student reading ability with standardized reading exams.
Evaluators may record the empirical indicators from existing data sets or create new ways to
measure concepts of interest. Measurement validity refers to whether or not the empirical
indicators accurately portray the concept of interest. Any time we use a proxy variable for
something we want to capture or measure, we have to consider the degree to which our measure
is valid. We need to make judgments about the adequacy of our measures, and we typically try to
validate them by testing how well they capture what we want to measure.
Measurement Reliability
Measurement reliability is the extent to which a measurement can be expected to produce similar
results on repeated observations of the same condition or event. Is a question asked in the same
way? Is information collected in the same way from one item to the next? Would anyone get the
same answer if they repeated the question or data collection task? If not, your evidence may not
be reliable or, therefore, competent.
Measurement reliability pertains to both reliable measures and reliable measurement.
Reliable measures means that operations consistently measure the same phenomena.
Reliable measurement means consistently recording data with the same decision rules.
External Validity or Generalizability
External validity refers to the question, Are we able to generalize from the results? If a study is
generalizable (or externally valid), we are able to apply the study’s results to groups or contexts
beyond those being studied. For example, if we studied occurrences at three high-priority
Superfund sites, to what degree could the study’s results be generalized to the overall
management of the Superfund program? How large would a survey sample need to be to
generalize to the universe of high-school students in the U.S.?
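The sample-size question above can be approached with the standard formula for estimating a proportion. The sketch below is illustrative only: it assumes simple random sampling, a 95 percent confidence level (z = 1.96), and the most conservative proportion (p = 0.5). The function name and the 3-point margin of error are hypothetical choices, not part of this handout.

```python
import math

def sample_size_for_proportion(margin_of_error, z=1.96, p=0.5, population=None):
    """Respondents needed to estimate a proportion within +/- margin_of_error.

    Assumes simple random sampling. p=0.5 gives the largest (most cautious) answer.
    If a population size is given, the finite population correction is applied.
    """
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

# Very large universe (e.g., all U.S. high-school students), +/- 3 points.
n_large = sample_size_for_proportion(0.03)

# The same precision for a hypothetical universe of only 2,000 students.
n_small = sample_size_for_proportion(0.03, population=2000)

print(n_large, n_small)
```

Note that the required sample barely depends on the size of a very large universe. Generalizing to subgroups, or sampling in clusters such as schools, would call for larger samples than this simple calculation suggests.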
Internal Validity
Internal validity refers to the question, Are we able to definitely establish that there is a causal
relationship between a specified cause and potential effect?
If a study is internally valid, it means that it can be determined whether A (a program, policy,
external event, regulation, management action, etc.) caused B (a gain in reading, a drop in
employment, a change in mortality, etc.) and in what magnitude. To conclude that our alleged
“cause” had the alleged “effect”, we must ascertain that a) the cause preceded the effect in time;
b) the change in the cause can be linked to the change in the effect; and c) no other plausible
factors could have caused the change we observe in the effect. For example, were the higher
levels of math achievement found in children taught with new curricula caused by the curricula?
Or are other factors responsible for the math gain?
Possible alternative explanations for the effects are frequently numerous. It is critical that
plausible “causal factors” that were not amenable to measurement in an evaluation are at least
identified in an evaluation write-up.
Statistical Conclusion Validity
Statistical conclusion validity refers to the question, Do the numbers we generate accurately
detect the presence of a factor, relationship, or effect of a specific or reasonable magnitude?
For example, is the proposed design and analysis approach capable of detecting an increase in
reading achievement over 3 months between children taught with a new curriculum approach
versus children taught with the traditional approach? Is the proposed approach capable of
detecting mortality changes as small as 2 percent? Is it possible that methodological weaknesses
in the application of a statistical technique have reduced (or increased) the likelihood of finding
a factor to be an important predictor of the dependent variable?
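Whether a design is capable of detecting an effect of a given size is a question of statistical power. The following is a rough sketch, not a definitive procedure: it uses the normal approximation for a two-sided comparison of two group means, and the standardized effect size of 0.5, the 5 percent significance level, and the 80 percent power target are illustrative assumptions.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-group comparison of means.

    effect_size is the standardized difference (difference in means / common SD).
    The normal approximation slightly understates what an exact t-test requires.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

def approx_power(effect_size, n_each, alpha=0.05):
    """Approximate power with n_each units per group (ignores the negligible far tail)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size * math.sqrt(n_each / 2) - z_alpha)

needed = n_per_group(0.5)                # units per group for 80 percent power
power_small_study = approx_power(0.5, 20)  # power with only 20 units per group

print(needed, round(power_small_study, 2))
```

Under these assumptions, a study with 20 children per group would miss a medium-sized reading gain more often than it would detect it.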
The following tables list, for each type of validity and for reliability, the threats, the potential
causes or a definition of each threat, and examples of the threats.
Measurement Validity
Measurement validity is concerned with the accuracy of measurement: Are we accurately
measuring what we really intend to measure?
Threat: [...]
Potential Causes/Defined: Evaluators have insufficient knowledge about the concept of interest or
the target population with which the concept will be measured, or the concept is impossible or
too expensive to measure directly, so approximate or “proxy” measures are used.
Examples: Questions for some psychological concepts (e.g., self-esteem, alienation) are standard;
for other concepts (e.g., legal quality, sexual harassment), the means of operationalizing the
occurrence of the concepts are still being explored.

Threat: [...]
Potential Causes/Defined: Respondent intentionally distorts facts to hide a [...]
Examples: An agency official or program participant provides an answer that is technically
accurate but is misleading as to the essence of the inquiry.

Threat: [...]
Potential Causes/Defined: Faulty memory, or records are not updated in a timely manner.
Accidental misrepresentation is especially a problem when significant calendar time has elapsed.
Examples: Inventory labels may not reflect what is in boxes, or a warehouse’s computerized
inventory list may not match what is found in the [...] An agency official or program participant
unintentionally gives false information due to faulty memory of facts or [...] Computerized
inventory records may not be updated in a timely manner, creating a misleading impression of
amounts in storage.

Threat: [...]
Potential Causes/Defined: The respondent tells the interviewer what he or she believes the
interviewer wants to hear with the aim of receiving approval or a desire to please.
Examples: Agency officials report that financial records accurately reflect inventory.

Threat: Sleeper Effects
Potential Causes/Defined: Effects lag beyond the time of measurement. In other words, what’s
being measured may be right, but the measurement is being taken at the wrong time.
Examples: The effects of television viewing on children’s attitudes may not be immediate but may
be long-term. Other examples are business cycles, cycles in unemployment rates, or participation
in welfare [...]

Threat: Change in Definitions
Potential Causes/Defined: Redefining the data describing or monitoring an entity makes data from
two or more time periods not comparable.
Examples: What is considered a “family” for qualifying for welfare? What is considered a
“misdemeanor” or a “felony”?

Threat: Lack of Dosage [...]
Potential Causes/Defined: Measuring a treatment as received or not received when in fact
program participants receive widely varying amounts of “treatment” (i.e., program services or
policy) due to groupings, geographic areas, individuals, etc. Another type of treatment distortion
is introduced when survey recipients give inaccurate information about the programs they
participate in or the benefits they receive.
Examples: Assuming that persons enrolled in a program receive the same amount of services,
students in class receive the same amount of training, or taxpayers receive the same level of
scrutiny.

Threat: Mono-Operation Bias
Potential Causes/Defined: Any one operationalization of a construct may underrepresent the
construct of interest or measure irrelevant constructs, complicating inference.
Examples: Measuring attainment of a job as a measure of the effectiveness of a job training [...]

Threat: Mono-Method Bias
Potential Causes/Defined: When only one method is used to operationalize the concept (e.g.,
self-report).
Examples: Using only Body Mass Index (BMI) to measure obesity; or using only self-reports on
amount of time spent [...]
Strategies to Enhance Measurement Validity
Ask relevant experts to examine proposed measures or use previously validated
measures, i.e., face validity.
Ascertain whether or not the measures covary with other variables with which you would
expect them to covary, i.e., construct validity.
Test whether the measures predict the appropriate consequences, i.e., predictive validity.
Use multiple measures wherever possible.
Precisely delineate operational means of measuring a concept.
Measurement Reliability
Measurement reliability is the extent to which a measurement can be expected to produce similar
results on repeated observations of the same condition or event. Reliability pertains to both
reliable measures and reliable measurement.
Threat: Lost in Translation
Potential Causes/Defined: Questions are translated into multiple languages but the words do not
really capture the same concepts.
Examples: Questions that include words such as political and bureaucratic are not easily
transferred into multiple [...]

Threat: Multiple Judgment Calls
Potential Causes/Defined: Questions rely too heavily on subjective assessments, and different
respondents may view the adjectives differently.
Examples: Questions that ask respondents to make distinctions between adjectives that may be
interpreted differently, such as “poor, fair, average and above average,” may elicit different
responses.

Threat: Capacity Dependent
Potential Causes/Defined: Inputting data from multiple locations may be overly dependent upon
the capacity of those responsible for collecting and/or coding the data to carefully apply the same
criteria in their decisions on how to collect or code, and high turnover, heavy workloads and/or
lack of technical capacity may render the collection/coding inconsistent across locations.
Examples: Busy front-line social service delivery staff, e.g., social workers, may not have the
time to input data; and staff in developing countries may have neither the time nor the
technological support to input the data.

Threat: Premature or Insufficiently Prepared Data Collection and/or Coding
Potential Causes/Defined: Insufficient training of data collectors, interviewers, observers, and/or
coders may render collection and/or coding inconsistent.
Examples: Overly ambitious timelines may push collection into the field too quickly, or efforts to
save resources by cutting training may leave staff unprepared to ensure consistent collection
and/or [...]
Strategies to Enhance Reliable Measures
• Focus measures at the appropriate level of analysis. For example, use the percentage of job
placements and clients’ average length of employment after placement for each U.S.
Department of Labor Job Training Partnership program site to measure the labor force
participation of program enrollees.
• Ensure the measure used provides appropriate levels of calibration. For example, if you
are examining the efficiency of Food Stamp program operations at the local level, you
may not develop reliable measures of efficiency if you only gather information at the
state and national levels. Also, if you are rounding numbers to the nearest million, you
may not find errors in the thousands. You need to compare apples with apples to ensure
that your scales of measure are consistent and appropriate for the assignment question.
• Take extra care when translating a survey into multiple languages.
• Stay away from overly ambiguous adjectives, such as poor, average, excellent, and
somewhat, in the wording of questions.
Strategies to Enhance Reliable Measurement
• Consistently record data.
• Train data collectors to enhance inter-observer (or coder) and intra-observer (or coder)
reliability. (It’s advisable to conduct inter-observer reliability checks whenever feasible.)
• Continually use training to maintain reliable coders.
• Use multiple items to measure concepts so that the relationship among the items can be
empirically analyzed.
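One common way to run the inter-observer reliability check mentioned above is Cohen’s kappa, which compares the agreement two coders actually reach with the agreement expected by chance. The sketch below is a minimal illustration; the function name and the two coders’ yes/no codes are made up for the example and do not come from this handout.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders who rated the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must rate the same non-empty set of items.")
    n = len(coder_a)
    # Share of items on which the two coders assigned the same code.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected if each coder assigned codes independently at their own rates.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / n ** 2
    if expected == 1:
        return 1.0  # both coders used one identical code throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by two coders to the same ten case files.
coder_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
coder_b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]

kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

Here the coders agree on 8 of 10 files, but because half of that agreement would be expected by chance, kappa is noticeably lower than 0.8. Low values suggest that decision rules or coder training need attention before full data collection.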
Threats to both Internal Validity and External Validity
Internal validity is concerned with our ability to determine whether A caused B and in
what magnitude: Are we able to definitely establish that there is a causal relationship
between a specified cause and potential effect?
External validity is concerned with our ability to generalize beyond the groups or
context being studied: Are we able to generalize from the results?
Note that virtually any threat to internal validity also affects external validity.
Threat: History or Intervening [...]
Potential Causes/Defined: The observed effect is due not to the program or treatment but to some
other event that has taken place. For example, while a program is operating, many events may
intervene that could distort pre- and post-measurements as they relate to the outcome being
studied.
Examples: A dramatic increase in media coverage on AIDS distorts the measurements about the
effect of a school-based program.

Threat: [...]
Potential Causes/Defined: The observed effect is due not to the program but to the respondents
growing older, wiser, stronger/weaker, etc. over [...]
Examples: Juveniles often outgrow delinquent behavior as they age, making it difficult to
disentangle maturation effects from the effects of a new community program. As the elderly age,
their health problems may become more pronounced, leading to an underestimation of the actual
success of an exercise program to increase mobility (i.e., they would have been even worse off
without the exercise program).

Threat: Testing or the Learning Curve
Potential Causes/Defined: The observed effect is due to taking a test or being observed/measured
several times. In a pre- and post-test design, group members could have scored better in the
post-period because they were more familiar with the test or measurement process and test [...]
Examples: Participants in a training program learned from the test rather than from the [...]

Threat: Program Not Fully [...]
Potential Causes/Defined: If inadequate resources or other factors have led to implementation
problems, it is premature to test for effects. Even when programs or interventions have been
implemented as prescribed by law, it is still wise for evaluators to measure the extent to which
program participants or service recipients actually received the [...]
Examples: Did the parolees designated to receive group counseling actually attend all sessions?
Did the teachers all receive the training to implement the new curriculum?

Threat: Regression to the Mean or Regression Artifacts
Potential Causes/Defined: The observed effect is due to the selection of a sample on the basis of
extremely high or extremely low scores of some variable of interest. Change in the scores or
values on the criterion of interest may be due to a natural tendency for extremely high or
extremely low performers to fall back toward the average value. It would be misleading to
attribute this change to the [...] These threats arise when a program or other intervention occurs
at or near a crisis point. To the degree that the fluctuation is random or occurrence idiosyncratic
due to some cause of short duration, it is easy to incorrectly estimate the effects of whatever
action or response is made.
Examples: Participant exam scores, crime rates, and claims processing rates are all likely to rise
and fall over [...]

Threat: Selection or Selection [...]
Potential Causes/Defined: The observed effect is due to preexisting differences between the types
of individuals in the study and comparison groups rather than to the treatment or program
experience. When the assignment of subjects to comparison and treatment groups is not random,
the groups may differ in the variable being measured. “Volunteerism” can have a significant
effect of its own.
Examples: Those who volunteer for a health promotion program may already be different
(healthier) than those who do not.

Threat: [...]
Potential Causes/Defined: Individuals drop out of an experimental or treatment group between
the pre-test and the post-test, potentially exaggerating the magnitude of the observed effect
because subjects who drop out of a program may have characteristics that differ from those who
remain. Therefore, before-and-after comparisons may not be valid.
Examples: More highly motivated teens remain in a program designed to increase the teens’
self-esteem.

Threat: [...]
Potential Causes/Defined: Selection biases result in differential rates of “maturation” or
autonomous change within the treatment group. There may also be an interaction between
selection biases and any of the other [...]
Examples: Volunteers for a job training program may be more disposed to follow the advice
offered them.

Threat: Measurement Effects
Potential Causes/Defined: A pre-test or the process of taking observations may have a systematic
effect on respondents, thus making the results obtained for a pretested or observed population
unrepresentative of the unpretested universe.
Examples: Training participants who have taken a pretest may be sensitive to the intent of the
training and pay more attention to the information highlighted by the test.

Threat: Situational Effects (Hawthorne, staff, [...]
Potential Causes/Defined: The observed effect is due to multiple factors associated with the
experiment or study itself, such as the extent to which people are aware they are part of a study
(Hawthorne effect), the newness of a program, and the particular time period in which a study
takes place. This threat also includes atypical situation effects that make the selected context
nonrepresentative on some [...]
Examples: Instructors selected to offer new training on sexual harassment in an agency may be
unusually enthusiastic due to the unique and timely nature of the topic.

Threat: [...]
Potential Causes/Defined: When a treatment provides desirable goods or services, administrators
or staff may provide compensatory goods or services to those not receiving [...]
Examples: Teachers who are not implementing a new math curriculum (i.e., they teach a
comparison group) work harder with the students.

Threat: [...]
Potential Causes/Defined: Participants not receiving a desirable treatment may be so resentful or
demoralized that they may respond more negatively than otherwise.
Examples: Comparison group members seek out training or treatment from other [...]

Threat: Treatment Diffusion
Potential Causes/Defined: Participants may receive services from a condition to which they were
not assigned, or learn from participants in the treatment group.
Examples: Students from new math curriculum treatment and comparison groups study math
together outside of [...]

Threat: Ambiguous Temporal [...]
Potential Causes/Defined: Lack of clarity about which variable occurred first may yield confusion
about which variable is the cause and which is the effect.
Examples: Schools in high income areas may adopt healthy food policies (e.g., removing soda
machines) in response to parent demands (as the parents already are pushing healthy eating at
[...]
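Regression to the mean, one of the threats listed in the table above, can be seen in a small simulation. The sketch below is illustrative only: the score scale, the amount of measurement noise, and the rule of selecting the lowest-scoring 10 percent are all assumptions, and no program is delivered to anyone in the simulation.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the illustration is repeatable

def simulate(n_people=10_000, share_selected=0.10):
    """Select the lowest pre-test scorers, then re-test them with NO treatment."""
    # Each person has a stable true ability; each test adds independent noise.
    ability = [rng.gauss(50, 10) for _ in range(n_people)]
    pre = [a + rng.gauss(0, 10) for a in ability]
    post = [a + rng.gauss(0, 10) for a in ability]
    # Pick the "crisis" group: people whose pre-test fell below the cutoff.
    cutoff = sorted(pre)[int(n_people * share_selected)]
    selected = [i for i in range(n_people) if pre[i] < cutoff]
    pre_mean = mean([pre[i] for i in selected])
    post_mean = mean([post[i] for i in selected])
    return pre_mean, post_mean

pre_mean, post_mean = simulate()
apparent_gain = post_mean - pre_mean
print(round(pre_mean, 1), round(post_mean, 1), round(apparent_gain, 1))
```

The selected group’s average rises on the second test even though nothing was done for them. A before-and-after comparison alone would credit that rise to whatever intervention happened to follow the selection, which is why a comparison group selected in the same way is needed.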
Strategies for Enhancing Internal Validity
Carefully design the study to rule out or estimate the effect of potential competing
Identify other potential “causes” of the “effects” prior to collecting data so that
these other variables can be measured.
Question carefully findings regarding covariation to identify other preceding,
intervening, or interactive variables that produce variation in the “effect” of
Additional Threats to External Validity or Generalizability
External validity is concerned with our ability to generalize beyond the groups or
context being studied: Are we able to generalize from the results?
Threat: General Selection Effects
Potential Causes/Defined: Program results may only be applicable to the population/context that
is directly studied. This threat occurs by reviewing or studying nonrepresentative cases,
situations, or people.
  Selection by Excellence: We may observe a situation because we believe it provides the best
  chance of seeing a hypothesized effect (e.g., the Job Corps increases the probability that
  teenagers will obtain jobs). However, a sound estimate of effect for an excellent program in one
  city may not be replicable in other locations. Thus, we may have only a “best practice”
  estimate.
  Selection by Expedience: We may observe a situation because it is accessible (e.g., available
  travel funds, proximity, persons who are willing to be interviewed). This is often a dangerous
  practice in that we have no way of knowing how representative the results [...]
  Selection by Problem: We may choose to look at locations or programs because we have some
  reason to believe that there is a severe problem there; e.g., we have some reason to believe
  that there is a contamination problem at a particular nuclear weapons production plant.
  Selection by “Where the Ducks Are”: We may observe locations or programs because they
  correspond to where large amounts of dollars are spent or large numbers of people are served.
  In this case, we are balancing limited resources, maximum payoff, and [...] Again, we need to
  be careful not to generalize to the universe of locations and programs, but it may not matter
  much to us if our chosen locations/groups account for a very large proportion (70 percent) of
  all dollars or activities.

Threat: Time Effects
Potential Causes/Defined: The time frame of our [...] Or when using secondary data from other
researchers, the data may be so outdated that they are no longer relevant to the problem. Thus,
although we may have a sound evaluation of some past regulation, policy, or program, there is no
reason to believe that it bears any relationship to what is going on [...]
Examples: The performance of a weapons system tested during the day may bear no relationship
to its performance at night.

Threat: Geographic Effects
Potential Causes/Defined: The evaluation may have been conducted in a specific area of the
country or type of environment and its results are not generalizable to other [...]
Examples: A drug intervention program for urban youth in Chicago may not provide guidance on
what should be done in rural [...]

Threat: Multiple Treatment Interference Effect
Potential Causes/Defined: A number of treatments or programs are jointly applied and the effects
are confounded and not representative of the effects of a separate application of any one
treatment or program. Treatments are complex, and replications of them may fail to include those
components actually responsible for the effects. An effect found with one treatment variation
might not hold with other variations of that treatment, or when that treatment is combined with
other treatments, or when only part of that treatment is [...]
Examples: A drug abuse program designed for preteens may include several components (e.g.,
lectures, essay contests), making it difficult to separate out the effects of the different
components.

Threat: Interaction of the Causal Relationship with Units
Potential Causes/Defined: An effect found with certain kinds of units might not hold if other
kinds of units had been [...]
Examples: When results are reported for schools rather than individual students.

Threat: Interactions of the Causal Relationship with Settings
Potential Causes/Defined: An effect found in one kind of setting may not hold if other kinds of
settings were to be used.
Examples: New health programs tried in Central America may not work in predominantly
Muslim [...]

Threat: [...]
Potential Causes/Defined: An explanatory mediator of a causal relationship in one context may
not mediate in another [...]
Examples: New health curricula may work in mixed sex schools but not in single sex schools (or
vice [...]
Strategies for Enhancing External Validity
Identify all pertinent subgroups prior to selecting a sample.
Stratify random sampling; that is, draw samples from within the subgroups of the
population to which generalization is desired.
Boost sample size within pertinent subgroups.
Validate aggregate results with experts.
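The stratified sampling and subgroup boosting strategies above might be sketched as follows. This is a simplified illustration rather than a recommended procedure: the function, the 5 percent sampling fraction, the minimum of 20 units per subgroup, and the urban/rural sampling frame are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(units, stratum_of, fraction, min_per_stratum=1, seed=0):
    """Draw a simple random sample within each stratum (subgroup).

    fraction sets the proportional allocation; min_per_stratum boosts the
    sample in small subgroups so they can be analyzed separately.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        target = max(min_per_stratum, round(fraction * len(members)))
        sample.extend(rng.sample(members, min(len(members), target)))
    return sample

# Hypothetical sampling frame: 900 urban sites and 100 rural sites.
frame = [("urban", i) for i in range(900)] + [("rural", i) for i in range(100)]

sample = stratified_sample(frame, stratum_of=lambda u: u[0],
                           fraction=0.05, min_per_stratum=20)
counts = Counter(stratum for stratum, _ in sample)
print(counts)
```

Because the rural subgroup is deliberately oversampled here, aggregate estimates would need to weight each subgroup by its share of the population before generalizing to the whole.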
Statistical Conclusion Validity
Statistical conclusion validity is concerned with our ability to detect an effect, a
relationship, or a factor, if it is present, and/or the magnitude of an effect: Do the
numbers we generate accurately detect the presence of a factor, relationship, or effect of
a specific or reasonable magnitude?
Threat: Too Small a Sample Size
Potential Causes/Defined: An effect or relationship of a specific size, regardless of the analytic
approach used, is not statistically detected; there is low statistical power due to small sample
[...]
Examples: An effect of some magnitude in math achievement due to a new curriculum is not
detected because too few students are included in the study.

Threat: Applying Statistical Analyses to Data Inappropriate for the [...]
Potential Causes/Defined: Appropriateness of the technique given the data and the underlying
dynamics in measured relationships. Application of inappropriate statistical techniques for the
data at hand may produce numbers that are misleading or incorrect. Each statistical technique is
designed for application to certain types of data (i.e., nominal, ordinal and interval/ratio), and
for certain types of relationships between variables, e.g., linear.
Examples: T-tests should not be applied to ordinal measures (e.g., Likert 5-point scales), and
nominal and short ordinal variables should be converted to dummy variables for use in
regression.

Threat: Violation of Assumptions Unique to a Statistical [...]
Potential Causes/Defined: It may very well be that a particular type of test may not have
sufficient power to detect an effect or relationship that is present but that another technique will
be able to do so. Differences depend on assumptions made by the statistical techniques.
Examples: A t-test of means applied to two groups of respondents in which the variability is
quite different may not provide an accurate test of differences. Regular OLS regression should
not be used to model non-linear [...]

Threat: Measurement Problems
Potential Causes/Defined: If a measure has a high degree of error, it threatens our ability to
statistically identify relationships or differences and effects that are actually present; or other
measurement problems such as unreliable proxy variables, or limited range in variables of [...]
Examples: If attitudinal scales contain adjectives that may have various connotations for
respondents (e.g., good, fair, outstanding), the respondents may not be comparable across the
sample, or if proxy measures are used and are inconsistently affected by other factors in the
environment, or if age is an important independent variable but in your sample the participants
are only between 21 and 28 years of [...]

Threat: Fishing and the Error Rate Problem
Potential Causes/Defined: Repeated tests for significant relationships, if uncorrected for the
number of tests, can artificially inflate statistical [...]
Examples: For example, when large numbers of correlations or regression coefficients are tested
in one study with a 95% confidence rule, at least 5% of the tests could be false positives.

Threat: Unreliability of Treatment [...]
Potential Causes/Defined: If a treatment that is intended to be implemented in a standardized
manner is implemented only partially for some respondents, effects may be underestimated
compared with full implementation.
Examples: When treatments or programs are implemented in a variety of contexts, the results
may not be statistically generalizable to all contexts.

Threat: Overfitting Models
Potential Causes/Defined: Over-fitting, e.g., sample to variables ratios for entire sample and for
key groups.
Examples: When too many independent variables are included for a given sample size, the
mathematical computations may result in showing inflated levels of both correlation and
statistical significance, say 15 predictors in a regression using a sample of 50 units.

Threat: Specification Error
Potential Causes/Defined: Specification effects may include either omission of other factors that
may affect the outcomes of interest (similar to the history threat under internal validity) or
inclusion of factors that are not relevant in an analytical model devised to predict specific
outcomes.
Examples: When irrelevant variables are included in a regression model they may inflate the
coefficient of determination (R²) but not truly help predict the dependent variable of interest, and
they may be collinear with predictors that are important, and thus reduce the statistical
significance of these more relevant [...]
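The overfitting example of 15 predictors with a sample of 50 units can be made concrete with the adjusted coefficient of determination, which penalizes R² for the number of predictors. In the sketch below, the raw R² of 0.40 is a made-up value chosen only to illustrate the size of the correction.

```python
def adjusted_r_squared(r_squared, n_units, n_predictors):
    """Adjusted R-squared for an OLS regression that includes an intercept."""
    if n_units - n_predictors - 1 <= 0:
        raise ValueError("Need more units than predictors plus one.")
    penalty = (n_units - 1) / (n_units - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty

# Hypothetical raw R-squared of 0.40 from 15 predictors and 50 units.
crowded = adjusted_r_squared(0.40, n_units=50, n_predictors=15)

# The same raw R-squared from only 3 predictors is penalized far less.
lean = adjusted_r_squared(0.40, n_units=50, n_predictors=3)

print(round(crowded, 3), round(lean, 3))
```

The gap between the raw and adjusted values is one quick signal that a model has too many variables for its sample, both for the entire sample and for any key groups analyzed separately.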
Strategies for Enhancing Statistical Conclusion Validity
Select appropriate analytical techniques.
Draw an appropriate sample size.
Select appropriate units of analysis.
Ensure adequate variance in variables.
Consistently apply appropriate pre-set decision rules.
Provide all statistics that the audience may need to make informed judgments
about the meaning of the analytical results.
Reduce measurement error through more precise measures and/or use of
multiple measures.
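One pre-set decision rule that addresses the fishing and error rate problem is to adjust the significance threshold for the number of tests. The sketch below assumes the tests are independent and that no real relationships exist. The count of 20 tests is an arbitrary illustration, and the Bonferroni adjustment shown is only one of several possible corrections.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall error rate at or below alpha."""
    return alpha / n_tests

n_tests = 20
uncorrected = familywise_error_rate(0.05, n_tests)
per_test_alpha = bonferroni_threshold(0.05, n_tests)
corrected = familywise_error_rate(per_test_alpha, n_tests)

print(round(uncorrected, 3), per_test_alpha, round(corrected, 3))
```

With 20 uncorrected tests at the 5 percent level, the chance of at least one spurious “significant” finding is roughly 64 percent. Testing each relationship at the adjusted level brings the overall rate back under 5 percent, at the cost of lower power for each individual test.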