Strategies to Help Strengthen Validity and Reliability of Data
Copyright Kathryn E. Newcomer, January 2011

Validity

Validity generally refers to the accuracy and representativeness of a study or data, and the term is sometimes used casually to characterize whole studies as valid, or "scientifically valid." But validity is a multidimensional concept, and since its many dimensions are never attained perfectly, the question is not "Is the study or are the data valid?" but "How valid are they in terms of measurement, internal, external, and statistical conclusion validity?" Limitations to any of the dimensions will reduce the ability to draw causal inferences (internal validity). One might view the types of validity as a pyramid, with measurement validity and reliability on the bottom, then external and statistical conclusion validity, and internal validity on the top. Each limitation affects all of the dimensions of validity above it.

Measurement Validity

Measurement validity refers to the question, Are we accurately measuring what we really intend to measure? It is concerned with the accuracy of measurement. We start with a concept, such as maternal health, and we identify measurement procedures that we can use to operationalize the more abstract concept into empirically observable indicators. For example, we may measure student reading ability with standardized reading exams. Evaluators may record the empirical indicators from existing data sets or create new ways to measure concepts of interest. Measurement validity refers to whether or not the empirical indicators accurately portray the concept of interest. Any time we use a proxy variable for something we want to capture or measure, we have to consider the degree to which our measure is valid. We need to make judgments about the adequacy of our measures, and we typically try to validate them by testing how well they capture what we want to measure.
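One common way to make that judgment concrete is to check how strongly a proxy covaries with a more direct criterion measure in cases where both are available. A minimal sketch, using entirely hypothetical exam scores and teacher ratings and a plain Pearson correlation:

```python
# Validating a proxy: correlate proxy scores with a more direct criterion
# measure for cases where both are available. (All data are hypothetical.)
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical: standardized reading-exam scores (the proxy) vs. a
# teacher-rated reading assessment (the criterion) for 8 students.
exam_scores    = [61, 75, 58, 90, 70, 82, 65, 88]
teacher_rating = [2, 3, 2, 5, 3, 4, 3, 5]
print(round(pearson_r(exam_scores, teacher_rating), 2))  # → 0.97
```

A high correlation with the criterion supports (but does not by itself prove) the validity of the proxy; a low one signals that the proxy may be capturing something else.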
Measurement Reliability

Measurement reliability is the extent to which a measurement can be expected to produce similar results on repeated observations of the same condition or event. Is a question asked in the same way? Is information collected in the same way from one item to the next? Would anyone get the same answer if they repeated the question or data collection task? If not, your evidence may not be reliable or, therefore, competent. Measurement reliability pertains to both reliable measures and reliable measurement. Reliable measures means that operations consistently measure the same phenomena. Reliable measurement means consistently recording data with the same decision criteria.

External Validity or Generalizability

External validity refers to the question, Are we able to generalize from the results? If a study is generalizable (or externally valid), we are able to apply the study's results to groups or contexts beyond those being studied. For example, if we studied occurrences at three high-priority Superfund sites, to what degree could the study's results be generalized to the overall management of the Superfund program? How large would a survey sample need to be to generalize to the universe of high-school students in the U.S.?

Internal Validity

Internal validity refers to the question, Are we able to definitively establish that there is a causal relationship between a specified cause and potential effect? If a study is internally valid, it means that it can be determined whether A (a program, policy, external event, regulation, management action, etc.) caused B (a gain in reading, a drop in employment, a change in mortality, etc.) and in what magnitude. To conclude that our alleged "cause" had the alleged "effect," we must ascertain that a) the cause preceded the effect in time; b) the change in the cause can be linked to the change in the effect; and c) no plausible other factors could have caused the change we observe in the effect.
For example, were the higher levels of math achievement found in children taught with new curricula caused by the curricula? Or are other factors responsible for the math gain? Possible alternative explanations for the effects are frequently numerous. It is critical that plausible "causal factors" that were not amenable to measurement in an evaluation are at least identified in the evaluation write-up.

Statistical Conclusion Validity

Statistical conclusion validity refers to the question, Do the numbers we generate accurately detect the presence of a factor, relationship, or effect of a specific or reasonable magnitude? For example, is the proposed design and analysis approach capable of detecting an increase in reading achievement over 3 months between children taught with a new curriculum approach versus children taught with the traditional approach? Is the proposed approach capable of detecting mortality changes as small as 2 percent? Is it possible that methodological weaknesses in the application of a statistical technique may have reduced (or increased) the likelihood of finding a factor to be an important predictor of the dependent variable?

The following tables list, for each type of validity, the threats to that validity (or to reliability), the potential causes or a definition of each threat, and examples of the threats.

Measurement Validity

Measurement validity is concerned with the accuracy of measurement: Are we accurately measuring what we really intend to measure?
Threat: Inappropriate Operationalization
Potential Causes/Defined: Evaluators have insufficient knowledge about the concept of interest or the target population with which the concept will be measured, or the concept is impossible or too expensive to measure directly, so approximate or "proxy" measures are used.
Examples: Questions for some psychological concepts (e.g., self-esteem, alienation) are standard; for other concepts (e.g., legal quality, sexual harassment), the means of operationalizing the occurrence of the concepts are still being explored.

Threat: Purposeful Misrepresentation
Potential Causes/Defined: The respondent intentionally distorts facts to hide a problem.
Examples: An agency official or program participant provides an answer that is technically accurate but misleading as to the essence of the inquiry.

Threat: Accidental Misrepresentation
Potential Causes/Defined: Faulty memory, or records are not updated in a timely manner. Accidental misrepresentation is especially a problem when significant calendar time has elapsed.
Examples: An agency official or program participant unintentionally gives false information due to faulty memory of facts or events. Inventory labels may not reflect what is in boxes, or a warehouse's computerized inventory list may not match what is found in the warehouse. Computerized inventory records may not be updated in a timely manner, creating a misleading impression of amounts in storage.

Threat: Social Desirability/Evaluation Apprehension
Potential Causes/Defined: The respondent tells the interviewer what he or she believes the interviewer wants to hear, with the aim of receiving approval or a desire to please.
Examples: Agency officials report that financial records accurately reflect inventory.

Threat: Sleeper Effects
Potential Causes/Defined: Effects lag beyond the time of measurement. In other words, what is being measured may be right, but the measurement is being taken at the wrong time.
Examples: The effects of television viewing on children's attitudes may not be immediate but may be long-term. Other examples are business cycles, cycles in unemployment rates, or participation in welfare programs.

Threat: Change in Definitions
Potential Causes/Defined: Redefining the data describing or monitoring an entity makes data from two or more time periods not comparable.
Examples: What is considered a "family" for qualifying for welfare? What is considered a "misdemeanor" or a "felony"?

Threat: Lack of Dosage Differentiation
Potential Causes/Defined: Measuring a treatment as received or not received when in fact program participants receive widely varying amounts of "treatment" (i.e., program services or policy) due to groupings, geographic areas, individuals, etc. Another type of treatment distortion is introduced when survey recipients give inaccurate information about the programs they participate in or the benefits they receive.
Examples: Assuming that persons enrolled in a program receive the same amount of services, that students in a class receive the same amount of training, or that taxpayers receive the same level of scrutiny.

Threat: Mono-Operation Bias
Potential Causes/Defined: Any one operationalization of a construct may underrepresent the construct of interest or measure irrelevant constructs, complicating inference.
Examples: Measuring attainment of a job as a measure of the effectiveness of a job training program.

Threat: Mono-Method Bias
Potential Causes/Defined: Only one method is used to operationalize the concept (e.g., self-report).
Examples: Using only Body Mass Index (BMI) to measure obesity, or using only self-reports on the amount of time spent studying.

Strategies to Enhance Measurement Validity

Ask relevant experts to examine proposed measures, or use previously validated measures, i.e., face validity.
Ascertain whether or not the measures covary with other variables with which you would expect them to covary, i.e., construct validity.
Test whether the measures predict the appropriate consequences, i.e., predictive validity.
Use multiple measures wherever possible.
Precisely delineate the operational means of measuring a concept.

Measurement Reliability

Measurement reliability is the extent to which a measurement can be expected to produce similar results on repeated observations of the same condition or event. Reliability pertains to both reliable measures and reliable measurement.

Threat: Lost in Translation
Potential Causes/Defined: Questions are translated into multiple languages, but the words do not really capture the same concepts.
Examples: Questions that include words such as "political" and "bureaucratic" are not easily transferred into multiple languages.

Threat: Multiple Judgment Calls
Potential Causes/Defined: Questions rely too heavily on subjective assessments, and different respondents may view the adjectives differently.
Examples: Questions that ask respondents to make distinctions between adjectives that may be interpreted differently, such as "poor, fair, average, and above average," may elicit different responses.

Threat: Capacity-Dependent Collection/Coding
Potential Causes/Defined: Inputting data from multiple locations may be overly dependent upon the capacity of those responsible for collecting and/or coding the data to carefully apply the same criteria in their decisions on how to collect or code; high turnover, heavy workloads, and/or lack of technical capacity may render the collection/coding inconsistent across locations.
Examples: Busy front-line social service delivery staff, e.g., social workers, may not have the time to input data; and staff in developing countries may not have the time or the technological support to input the data.

Threat: Premature or Insufficiently Prepared Data Collection and/or Coding
Potential Causes/Defined: Insufficient training of data collectors, interviewers, observers, and/or coders may render collection and/or coding inconsistent.
Examples: Overly ambitious timelines may push collection into the field too quickly, or efforts to save resources by cutting training may leave staff unprepared to ensure consistent collection and/or coding.

Strategies to Enhance Reliable Measures

Focus measures at the appropriate level of analysis. For example, use the percentage of job placements and clients' average length of employment after placement for each U.S. Department of Labor Job Training Partnership program site to measure the labor force participation of program enrollees.
Ensure the measure used provides appropriate levels of calibration. For example, if you are examining the efficiency of Food Stamp program operations at the local level, you may not develop reliable measures of efficiency if you only gather information at the state and national levels. Also, if you are rounding numbers to the nearest million, you may not find errors in the thousands. You need to compare apples with apples to ensure that your scales of measure are consistent and appropriate for the assignment question.
Take extra care when translating surveys into multiple languages.
Stay away from overly ambiguous adjectives, such as poor, average, excellent, and somewhat, in the wording of questions.

Strategies to Enhance Reliable Measurement

Consistently record data.
Train data collectors to enhance inter-observer (or coder) and intra-observer (or coder) reliability. (It is advisable to conduct inter-observer reliability checks whenever feasible.)
Continually use training to maintain reliable coders.
Use multiple items to measure concepts so that the relationship among the items can be empirically analyzed.
Threats to both Internal Validity and External Validity

Internal validity is concerned with our ability to determine whether A caused B and in what magnitude: Are we able to definitively establish that there is a causal relationship between a specified cause and potential effect? External validity is concerned with our ability to generalize beyond the groups or context being studied: Are we able to generalize from the results? Note that virtually any threat to internal validity also affects external validity.

Threat: History or Intervening Events
Potential Causes/Defined: The observed effect is due not to the program or treatment but to some other event that has taken place. While a program is operating, many events may intervene that could distort pre- and post-measurements as they relate to the outcome being studied.
Examples: A dramatic increase in media coverage on AIDS distorts the measurements of the effect of a school-based program.

Threat: Maturation
Potential Causes/Defined: The observed effect is due not to the program but to the respondents growing older, wiser, stronger/weaker, etc. over time.
Examples: Juveniles often outgrow delinquent behavior as they age, making it difficult to disentangle maturation effects from the effects of a new community program. As the elderly age, their health problems may become more pronounced, leading to an underestimation of the actual success of an exercise program to increase mobility (i.e., they would have been even worse off without the exercise program).

Threat: Testing or the Learning Curve
Potential Causes/Defined: The observed effect is due to taking a test or being observed/measured several times. In a pre- and post-test design, group members could have scored better in the post-period because they were more familiar with the test or measurement process and test situation.
Examples: Participants in a training program learned from the test rather than from the program.

Threat: Program Not Fully Implemented
Potential Causes/Defined: If inadequate resources or other factors have led to implementation problems, it is premature to test for effects. Even when programs or interventions have been implemented as prescribed by law, it is still wise for evaluators to measure the extent to which program participants or service recipients actually received the benefit.
Examples: Did the parolees designated to receive group counseling actually attend all sessions? Did the teachers all receive the training to implement the new curriculum?

Threat: Regression to the Mean or Regression Artifacts
Potential Causes/Defined: The observed effect is due to the selection of a sample on the basis of extremely high or extremely low scores on some variable of interest. Change in the scores or values on the criterion of interest may be due to a natural tendency for extremely high or extremely low performers to fall back toward the average value; it would be misleading to attribute this change to the intervention. To the degree that the fluctuation is random or idiosyncratic due to some cause of short duration, it is easy to incorrectly estimate the effects of whatever action or response is made.
Examples: Participant exam scores, crime rates, and claims processing rates are all likely to rise and fall over time. These threats arise when a program or other intervention occurs at or near a crisis point.

Threat: Selection or Selection Bias
Potential Causes/Defined: The observed effect is due to preexisting differences between the types of individuals in the study and comparison groups rather than to the treatment or program experience. When the assignment of subjects to comparison and treatment groups is not random, the groups may differ in the variable being measured. "Volunteerism" can have a significant effect of its own.
Examples: Those who volunteer for a health promotion program may already be different (healthier) than those who do not.

Threat: Experimental Mortality
Potential Causes/Defined: Individuals drop out of an experimental or treatment group between the pre-test and the post-test, potentially exaggerating the magnitude of the observed effect because subjects who drop out of a program may have characteristics that differ from those who remain. Therefore, before-and-after comparisons may not be valid.
Examples: More highly motivated teens remain in a program designed to increase the teens' self-esteem.

Threat: Selection-Maturation Interaction
Potential Causes/Defined: Selection biases result in differential rates of "maturation" or autonomous change within the treatment group. There may also be an interaction between selection biases and any of the other threats.
Examples: Volunteers for a job training program may be more disposed to follow the advice offered them.

Threat: Measurement Effects
Potential Causes/Defined: A pre-test, or the process of taking observations, may have a systematic effect on respondents, thus making the results obtained for a pretested or observed population unrepresentative of the un-pretested universe.
Examples: Training participants who have taken a pretest may be sensitive to the intent of the training and pay more attention to the information highlighted by the test.

Threat: Situational Effects (Hawthorne, Staff, Novelty)
Potential Causes/Defined: The observed effect is due to multiple factors associated with the experiment or study itself, such as the extent to which people are aware they are part of a study (Hawthorne effect), the newness of a program, and the particular time period in which a study takes place. This threat also includes atypical situation effects that make the selected context nonrepresentative on some dimension.
Examples: Instructors selected to offer new training on sexual harassment in an agency may be unusually enthusiastic due to the unique and timely nature of the topic.

Threat: Compensatory Equalization
Potential Causes/Defined: When a treatment provides desirable goods or services, administrators or staff may provide compensatory goods or services to those not receiving the treatment.
Examples: Teachers who are not implementing a new math curriculum (i.e., they teach a comparison group) work harder with the students.

Threat: Resentful Demoralization
Potential Causes/Defined: Participants not receiving a desirable treatment may be so resentful or demoralized that they respond more negatively than they otherwise would.
Examples: Comparison group members seek out training or treatment from other sources.

Threat: Treatment Diffusion
Potential Causes/Defined: Participants may receive services from a condition to which they were not assigned, or learn from participants in the treatment group.
Examples: Students from the new-math-curriculum treatment and comparison groups study math together outside of school.

Threat: Ambiguous Temporal Precedence
Potential Causes/Defined: Lack of clarity about which variable occurred first may yield confusion about which variable is the cause and which is the effect.
Examples: Schools in high-income areas may adopt healthy food policies (e.g., removing soda machines) in response to parent demands (as the parents already are pushing healthy eating at home).

Strategies for Enhancing Internal Validity

Carefully design the study to rule out or estimate the effect of potential competing factors.
Identify other potential "causes" of the "effects" prior to collecting data so that these other variables can be measured.
Carefully question findings regarding covariation to identify other preceding, intervening, or interactive variables that produce variation in the "effect" of interest.

Additional Threats to External Validity or Generalizability

External validity is concerned with our ability to generalize beyond the groups or context being studied: Are we able to generalize from the results?

Threat: General Selection Effects
Potential Causes/Defined: Program results may only be applicable to the population or context that is directly studied. This threat occurs when we review or study nonrepresentative cases, situations, or people. Several variants are common:
Selection by Excellence: We may observe a situation because we believe it provides the best chance of seeing a hypothesized effect (e.g., that the Job Corps increases the probability that teenagers will obtain jobs). However, a sound estimate of effect for an excellent program in one city may not be replicable in other locations; thus, we may have only a "best practice" estimate.
Selection by Expedience: We may observe a situation because it is accessible (e.g., available travel funds, proximity, persons who are willing to be interviewed). This is often a dangerous practice in that we have no way of knowing how representative the results are.
Selection by Problem Severity: We may choose to look at locations or programs because we have some reason to believe that there is a severe problem there; e.g., we have some reason to believe that there is a contamination problem at a particular nuclear weapons production plant.
Selection by "Where the Ducks Are": We may observe locations or programs because they correspond to where large amounts of dollars are spent or large numbers of people are served. In this case, we are balancing limited resources, maximum payoff, and representativeness. Again, we need to be careful not to generalize to the universe of locations and programs, but it may not matter much to us if our chosen locations/groups account for a very large proportion (say, 70 percent) of all dollars or activities.

Threat: Time Effects
Potential Causes/Defined: The time frame of our observations may limit generalizability. Or, when using secondary data from other researchers, the data may be so outdated that they are no longer relevant to the problem. Thus, although we may have a sound evaluation of some past regulation, policy, or program, there is no reason to believe that it bears any relationship to what is going on currently.
Examples: The performance of a weapons system tested during the day may bear no relationship to its performance at night.

Threat: Geographic Effects
Potential Causes/Defined: The evaluation may have been conducted in a specific area of the country or type of environment, and its results are not generalizable to other settings.
Examples: A drug intervention program for urban youth in Chicago may not provide guidance on what should be done in rural Alabama.

Threat: Multiple Treatment Interference Effect
Potential Causes/Defined: A number of treatments or programs are jointly applied, and the effects are confounded and not representative of the effects of a separate application of any one treatment or program. Treatments are complex, and replications of them may fail to include those components actually responsible for the effects. An effect found with one treatment variation might not hold with other variations of that treatment, when that treatment is combined with other treatments, or when only part of that treatment is used.
Examples: A drug abuse program designed for preteens may include several components (e.g., lectures, essay contests), making it difficult to separate out the effects of the different components.

Threat: Interaction of the Causal Relationship with Units
Potential Causes/Defined: An effect found with certain kinds of units might not hold if other kinds of units had been studied.
Examples: When results are reported for schools rather than for individual students.

Threat: Interaction of the Causal Relationship with Settings
Potential Causes/Defined: An effect found in one kind of setting may not hold if other kinds of settings were to be used.
Examples: New health programs tried in Central America may not work in predominantly Muslim countries.

Threat: Context-Dependent Mediation
Potential Causes/Defined: An explanatory mediator of a causal relationship in one context may not mediate in another context.
Examples: New health curricula may work in mixed-sex schools but not in single-sex schools (or vice versa).

Strategies for Enhancing External Validity

Identify all pertinent subgroups prior to selecting a sample.
Stratify random sampling; that is, draw samples from within the subgroups of the population to which generalization is desired.
Boost sample size within pertinent subgroups.
Validate aggregate results with experts.

Statistical Conclusion Validity

Statistical conclusion validity is concerned with our ability to detect an effect, a relationship, or a factor, if it is present, and/or the magnitude of an effect: Do the numbers we generate accurately detect the presence of a factor, relationship, or effect of a specific or reasonable magnitude?
Threat: Too Small a Sample Size
Potential Causes/Defined: An effect or relationship of a specific size, regardless of the analytic approach used, is not statistically detected; there is low statistical power due to small sample size.
Examples: An effect of some magnitude in math achievement due to a new curriculum is not detected because too few students are included in the study.

Threat: Applying Statistical Analyses to Data Inappropriate for the Technique
Potential Causes/Defined: The appropriateness of the technique depends on the data and the underlying dynamics in the measured relationships. Application of statistical techniques inappropriate for the data at hand may produce numbers that are misleading or incorrect. Each statistical technique is designed for application to certain types of data (i.e., nominal, ordinal, and interval/ratio) and for certain types of relationships between variables, e.g., linear.
Examples: T-tests should not be applied to ordinal measures (e.g., Likert 5-point scales), and nominal and short ordinal variables should be converted to dummy variables for use in regression.

Threat: Violation of Assumptions Unique to a Statistical Technique
Potential Causes/Defined: A particular type of test may not have sufficient power to detect an effect or relationship that is present, while another technique will be able to do so; the differences depend on the assumptions made by the statistical techniques.
Examples: A t-test of means applied to two groups of respondents in which the variability is quite different may not provide an accurate test of differences. Regular OLS regression should not be used to model non-linear relationships.

Threat: Measurement Problems
Potential Causes/Defined: If a measure has a high degree of error, it threatens our ability to statistically identify relationships, differences, or effects that are actually present; other measurement problems include unreliable proxy variables or limited range in the variables of interest.
Examples: Attitudinal scales may contain adjectives that have various connotations for respondents (e.g., good, fair, outstanding), so that responses are not comparable across the sample; proxy measures may be inconsistently affected by other factors in the environment; or age may be an important independent variable, but the participants in your sample are only between 21 and 28 years of age.

Threat: Fishing and the Error Rate Problem
Potential Causes/Defined: Repeated tests for significant relationships, if uncorrected for the number of tests, can artificially inflate statistical significance.
Examples: When large numbers of correlations or regression coefficients are tested in one study with a 95% confidence rule, roughly 5% of the tests could be expected to be false positives.

Threat: Unreliability of Treatment Implementation
Potential Causes/Defined: If a treatment that is intended to be implemented in a standardized manner is implemented only partially for some respondents, effects may be underestimated compared with full implementation.
Examples: When treatments or programs are implemented in a variety of contexts, the results may not be statistically generalizable to all contexts.

Threat: Overfitting Models
Potential Causes/Defined: Over-fitting occurs when the ratio of variables to cases is too high, whether for the entire sample or for key subgroups.
Examples: When too many independent variables are used for a given sample size, the mathematical computations may show inflated levels of both correlation and statistical significance, say, 15 predictors in a regression using a sample of 50 units.

Threat: Specification Error
Potential Causes/Defined: Specification errors may include either the omission of other factors that may affect the outcomes of interest (similar to the history threat under internal validity) or the inclusion of factors that are not relevant in an analytical model devised to predict specific outcomes.
Examples: When irrelevant variables are included in a regression model, they may inflate the coefficient of determination (R²) but not truly help predict the dependent variable of interest, and they may be collinear with predictors that are important, thus reducing the statistical significance of these more relevant predictors.

Strategies for Enhancing Statistical Conclusion Validity

Select appropriate analytical techniques.
Draw an appropriate sample size.
Select appropriate units of analysis.
Ensure adequate variance in variables.
Consistently apply appropriate pre-set decision rules.
Provide all statistics that the audience may need to make informed judgments about the meaning of the analytical results.
Reduce measurement error through more precise measures and/or use of multiple measures.
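The fishing and error-rate threat listed above lends itself to a quick demonstration. A minimal simulation sketch, assuming purely null data (no real effects anywhere), shows how many "significant" findings emerge anyway:

```python
import random

random.seed(1)

# Simulate a "fishing expedition": 2000 candidate predictors, none of
# which has any real relationship to the outcome. For each one, draw
# 30 observations from a standard normal (true mean 0) and run a
# two-sided z-test of the mean at the .05 level (|z| > 1.96).
n_tests, n_obs = 2000, 30
false_positives = 0
for _ in range(n_tests):
    sample = [random.gauss(0, 1) for _ in range(n_obs)]
    z = (sum(sample) / n_obs) * n_obs ** 0.5  # mean / (sigma / sqrt(n)), sigma = 1
    if abs(z) > 1.96:
        false_positives += 1

# With no true effects at all, about 5% of the tests still come out
# "significant" at the conventional .05 level.
print(false_positives / n_tests)
```

Roughly 5 percent of the 2,000 tests register as significant even though no effect exists, which is why pre-set decision rules and corrections for the number of tests (e.g., tightening the per-test threshold) matter when many relationships are screened in one study.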