Packing and Unpacking Sources of Validity Evidence: History Repeats Itself Again

Stephen G. Sireci
University of Massachusetts Amherst

Presentation for the conference "The Concept of Validity: Revisions, New Directions, & Applications"
October 9, 2008, University of Maryland, College Park

Validity
- A concept that has evolved and is still evolving
- The most important consideration in educational and psychological testing
- Simple, but complex
  – Can be misunderstood
  – Disagreements regarding what it is, and what is important

Purposes of this presentation
- Provide some historical context on the concept of validity in testing
- Present current, consensus definitions of validity
- Describe the validation framework implied in the Standards
- Discuss limitations of the current framework
- Suggest new directions for validity research and practice

Packing and unpacking: A prelude
- Packing
  – Does the test measure what it purports to measure?
  – A test is valid for anything with which it correlates.
  – Validity is a unitary concept.
- Unpacking
  – Predictive, status, content, and congruent validity
  – Clarity, coherence, and plausibility of assumptions (validity argument)
  – Five sources of validity evidence

Validity defined
- What is validity? How have psychometricians come to define it?
- What does "valid" mean? Truth?

According to Webster's
Valid:
1. having legal force; properly executed and binding under the law
2. sound; well grounded on principles or evidence; able to withstand criticism or rejection
3. effective, effectual, cogent
4. robust, strong, healthy (rare)

What is validity? According to Webster's
Validity:
1. the state or quality of being valid; specifically, (a) strength or force from being supported by fact; justness; soundness; (b) legal strength or force
2. strength or power in general
3. value (rare)

How have psychometricians defined validity? Some history.

In the beginning
- Modern measurement started at the turn of the 20th century
- 1905: Binet-Simon scale
  – A 30-item scale designed to ensure that no child could be denied instruction in the Paris school system without a formal examination
  • Binet died in 1911 at age 54
- Note: the College Board was established in 1900
  – Began essay testing in 1901

What else was happening around the turn of the century?
- 1896: Karl Pearson, Galton Professor of Eugenics at University College, published the formula for the correlation coefficient
- Given the predictive purpose of Binet's test, interest in heredity and individual differences, and a new statistical formula relating variables to one another, validity was initially defined in terms of correlation

Earliest definitions of Validity
- "Valid scale" (Thorndike, 1913)
- A test is valid for anything with which it correlates
  – Kelley, 1927; Thurstone, 1932; Bingham, 1937; Guilford, 1946; others
- Validity coefficients: correlations of test scores with grades, supervisor ratings, etc.

Validation started with group tests
- 1917: Army Alpha and Army Beta (Yerkes)
  – Classification of 1.5 million recruits
- Borrowed items and ideas from the Otis Tests
  – Otis was one of Terman's graduate students

Military Testing
- Tests were added to or subtracted from batteries based solely on correlational evidence (e.g., an increase in R²; see the brief formulas below).
- How well does the test predict a pass/fail criterion several weeks later?
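As a brief aside, to make the correlational definition of validity concrete (these are standard formulas, stated here under classical test theory assumptions): Pearson's product-moment coefficient for test scores x and criterion scores y is

\[
r_{xy} \;=\; \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}},
\]

and a "validity coefficient" was simply this correlation computed against grades, supervisor ratings, or a pass/fail criterion. The battery decisions described above amount to checking the increment in explained criterion variance when a test is added:

\[
\Delta R^{2} \;=\; R^{2}_{\text{battery}+\text{test}} \;-\; R^{2}_{\text{battery}}.
\]

The same framework also previews the criterion problems raised next: by Spearman's attenuation result, an unreliable criterion caps the observed validity coefficient at

\[
\lvert r_{xy} \rvert \;\le\; \sqrt{r_{xx}\,r_{yy}},
\]

where \(r_{xx}\) and \(r_{yy}\) are the reliabilities of the test and the criterion.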
- Jenkins (1946) and others emerged in response to problems with the notion that validity = correlation
  – See also Pressey (1920)

Problems with the notion that validity = correlation
- Finding criterion data
- Establishing the reliability of the criterion
- Establishing the validity of the criterion
- If valid, measurable criteria exist, why do we need the test?

What did critics of correlational evidence of validity suggest for validating tests?
- Professional judgment
  – "...it is proper for the test developer to use his individual judgment in this matter though he should hardly accept it as being on par with, or as worthy of credence as, experimentally established facts showing validity." (Kelley, 1927, pp. 30-31)
- Appraisal of test content with respect to the purpose of testing (Rulon, 1946): a rational relationship
- Sound familiar?

Early notions of content validity
- Kelley, Mosier, Rulon, Thorndike, and others
- But notice Kelley's hesitation in endorsing this evidence, or in going against the popular notion

Other precursors to content validity
- Guilford (1946): validity by inspection?
- Gulliksen (1950): "Intrinsic Validity"
  – Pre/post-instruction test score change
  – Consensus of expert judgment regarding test content
  – Examining the relationship of the test to other tests measuring the same objectives
- Herring (1918): 6 experts evaluated the "fitness of items"

Development of Validity Theory
- By the 1950s, there was consensus that correlational evidence was not enough and that judgmental data on the adequacy of test content should be gathered
- Growing idea of multiple lines of "validity evidence"

Emergence of Professional Standards
- Cureton (1951): the first "Validity" chapter, in the first edition of Educational Measurement (edited by Lindquist)
- Two aspects of validity:
  – Relevance (what we would call criterion-related)
  – Reliability

Cureton (1951)
- Validity defined as "the correlation between actual test scores and true criterion scores," but "curricular relevance or content validity" may be appropriate in some situations.

Emergence of Professional Standards (cont.)
- 1952: APA Committee on Test Standards
  – Technical Recommendations for Psychological Tests and Diagnostic Techniques: A Preliminary Proposal
- Four "categories of validity": predictive, status, content, congruent
- 1954: APA, AERA, & NCMUE produced the Technical Recommendations for Psychological Tests and Diagnostic Techniques
- Four "types" or "attributes" of validity:
  – Construct validity (instead of congruent)
  – Concurrent validity (instead of status)
  – Predictive
  – Content

1954 Standards
- The chair was Cronbach, and guess who else was on the committee? (Hint: a philosopher)
- Promoted the idea of:
  – Different types of validity
  – Multiple types of evidence being preferred
  – Some types being preferable in some situations

Subsequent Developments
- 1955: Cronbach and Meehl
  – Formally defined and elaborated the concept of construct validity
  – Introduced the term "criterion-related validity"
- 1956: Lennon
  – Formally defined and elaborated the concept of content validity
- Loevinger (1957): a big promoter of the construct validity idea
- Ebel (1961…): a big antagonist of unified validity theory
  – Preferred "meaningfulness"

Evolution of Professional Standards
- 1966: AERA, APA, NCME Standards for Educational and Psychological Tests and Manuals
- Three "aspects" of validity:
  – Criterion-related (concurrent + predictive)
  – Construct
  – Content

1966 Standards
- Introduced the notion that test users are also responsible for test validity
- Specific testing purposes called for specific types of validity evidence
  – Three "aims of testing":
    • present performance
    • future performance
    • standing on a trait of interest
- Important developments in content validation

Evolution of Professional Standards (cont.)
- 1974: AERA, APA, NCME Standards for Educational and Psychological Tests
- Validity descriptions borrowed heavily from Cronbach (1971)
  – The validity chapter in the 2nd edition of Educational Measurement (edited by R.L. Thorndike)

1974 Standards
- Defined content validity in operational, rather than theoretical, terms
- The beginning of the notion that construct validity is much cooler than content or criterion-related validity
- Early consensus on a "unitary" conceptualization of validity

Evolution of Professional Standards (cont.)
- 1985: AERA, APA, NCME Standards for Educational and Psychological Testing (note the "ing")
- Described validity as a unitary concept
- Notion of validating score-based inferences
- Very Messick-influenced

1985 Standards
- More responsibility on test users
- More standards on applications and equity issues
- Separate chapters for:
  – Validity
  – Reliability
  – Test development
  – Scaling, norming, and equating
  – Technical manuals
- New chapters on specific testing situations:
  – Clinical
  – Educational
  – Counseling
  – Employment
  – Licensure & certification
  – Program evaluation
  – Linguistic minorities
  – "People who have handicapping conditions"
- New chapters on:
  – Administration, scoring, and reporting
  – Protecting the rights of test takers
  – General principles of test use
- Listed standards as primary, secondary, or conditional

1999 Standards
- New "Fairness in Testing" section
- No more "primary," "secondary," "conditional"
- Three-part organizational structure:
  1. Test construction, evaluation, & documentation
  2. Fairness in testing
  3. Testing applications

1999 Standards (2)
- Incorporated the "argument-based approach to validity"
- Five "Sources of Validity Evidence":
  1. Test content
  2. Response processes
  3. Internal structure
  4. Relations to other variables
  5. Testing consequences
- We'll return to these sources later.

Comparing the Standards: Packing & Unpacking Validity Evidence

Edition   Validity
1954      Construct, concurrent, predictive, content
1966      Criterion-related, construct, content
1974      Criterion-related, construct, content
1985      Unitary (but content-related evidence, etc.)
1999      Unitary: 5 sources of evidence

What are the current and influential definitions of validity?
- Cronbach: influential, but not current (1971…)
- Messick (1989…)
- Shepard (1993)
- Standards (1999)
- Kane (1992, 2006)

Messick (1989): 1st sentence
- "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment." (p. 13)
- This "integrated" judgment led Messick, and others, to conclude that all validity is construct validity.
- It is outside my purpose today to debate the unitary conceptualization of validity, but like all theories, it has strengths and limitations.
But two quick points…

Unitary conceptualization of validity
- Focuses on inferences derived from test scores
  – Assumes that measurement of a construct motivates test development and purpose
- The focus on the analysis of scores may undermine attention to content validity
- Removal of the term "content validity" may have had a negative effect on validation practices

Consider Ebel (1956)
- "The degree of construct validity of a test is the extent to which a system of hypothetical relationships can be verified on the basis of measures of the construct…but this system of relationships always involves measures of observed behaviors which must be defended on the basis of their content validity" (p. 274).
- "Statistical validation is not an alternative to subjective evaluation, but an extension of it. All statistical procedures for validating tests are based ultimately upon common sense agreement concerning what is being measured by a particular measurement process" (p. 274).

The 1999 Standards
- Accepted the unitary conceptualization, but also took a practical stance
- The practical stance stems from the use of an argument-based approach to validity
  – Cronbach (1971, 1988); Kane (1992, 2006)

The Standards (1999) succinctly defined validity
- "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests." (p. 9)

Why do I say the Standards incorporated the argument-based approach to validation?
- "Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use." (AERA et al., 1999, p. 9)

Kane (1992)
- "it is not possible to verify the interpretive argument in any absolute sense. The best that can be done is to show that the interpretive argument is highly plausible, given all available evidence" (p. 527).

Kane: Argument-based approach
a) Decide on the statements and decisions to be based on the test scores.
b) Specify the inferences and assumptions leading from the test scores to those statements and decisions.
c) Identify competing interpretations.
d) Seek evidence supporting the inferences and assumptions and refuting counterarguments.

Philosophy of Validity
Messick (1989)
- "if construct validity is considered to be dependent on a singular philosophical base such as logical positivism and that basis is seen to be deficient or faulty, then construct validity might be dismissed out of hand as being fundamentally flawed" (p. 22).
- "nomological networks are viewed as an illuminating way of speaking systematically about the role of constructs in psychological theory and measurement, but not as the only way" (p. 23).

Three perspectives on the relationship between a test and other indicators of a construct
1. Test and nontest consistencies are manifestations of real traits.
2. Test and nontest consistencies are defined by the relationships among constructs in a theoretical framework.
3. Test and nontest consistencies are attributable to real entities but are understood in terms of constructs.
- See Messick (1989), Figures 2.1-2.3

Messick on test validation
- "test validation is a process of inquiry" (p. 31)
- Five systems of inquiry:
  – Leibnizian
  – Lockean
  – Kantian
  – Hegelian
  – Singerian

Systems of inquiry: main points
- Validation can seek to confirm; validation can seek consensus (Leibniz, Locke)
- Validation can seek alternative hypotheses; validation can seek to disconfirm (Kant, Hegel, Singer)

Two other important points by Messick (1989)
- "the major limitation is shortsightedness with respect to other possibilities" (p. 33)
- "The very variety of methodological approaches in the validational armamentarium, in the absence of specific criteria for choosing among them, makes it possible to select evidence opportunistically and to ignore negative findings" (p. 33)

If you look at the seminal papers and textbooks, and the various editions of the Standards, there are several fundamental, consensus tenets about validity theory and test validation.

Fundamental Validity Tenets
- Validity is NOT a property of a test. A test cannot be valid or invalid. What we seek to validate are the inferences derived from test scores, and the uses of those scores.
- Validity is not all or none.
- Test validity must be evaluated with respect to a specific testing purpose. Thus, a test may be appropriate for one purpose, but not for another.

Fundamental Validity Tenets (cont.)
- Evaluating the validity of inferences derived from test scores requires multiple lines of evidence (i.e., different types of validity evidence).
- Test validation never ends; it is an ongoing process.

I believe these tenets can be considered "consensus" due to their incorporation in the Standards and their predominance in the literature. But of course, not everyone need agree with a consensus, and we will hear important points from detractors over the next two days.

Criticisms of this perspective
- Tests are never truly validated (we are never done).
- No prescription or guidance regarding the specific types of evidence to gather and how to gather them.
- Ideal goals with no guidance lead to inaction.

The argument-based approach is a compromise between sophisticated validity theory and the reality that, at some point, we must make a judgment about the defensibility and suitability of the use of a test for a particular purpose.

What guidance does the Standards give us?
- Five "sources of evidence that might be used in evaluating a proposed interpretation of test scores for particular purposes"
- "Validation is a matter of making the most reasonable case to guide both current use of the test and current research to advance understanding of what the test scores mean… To validate an interpretive inference is to ascertain the degree to which multiple lines of evidence are consonant with the inference, while establishing that alternative inferences are less well supported." (Messick, 1989, p. 13)

The current Standards
- Provide a useful framework for evaluating the use of a test for a particular purpose
  – And for documenting validity evidence
- Allow us to use multiple lines of evidence to support the use of a test for a particular purpose
- But are not prescriptive and do not provide examples of, or references to, "adequate" validity arguments

Standards' Validation Framework
Validity evidence based on:
1. Test content
2. Response processes
3. Internal structure
4. Relations to other variables
5. Testing consequences

What is helpful in the Standards framework?
- It provides a system for categorizing validity evidence so that a coherent set of evidence can be put forward.
- It provides a way of standardizing the reporting of validity evidence.
- It focuses on both test construction and test score validation activities.
- It emphasizes the importance of evaluating consequences.

What are the limitations of the Standards framework?
- Not all types of validity evidence fit into the five source categories.
- No examples of good validation studies, or of when sufficient evidence has been put forth
- No statistical guidance
- No references
- Vagueness in some areas

Suggestions for revising the Standards (1)
- Need to refine the sources of validity evidence to accommodate:
  – Analysis of group differences
  – Alignment research
  – Differential item functioning
  – Statistical analysis of test bias
- Need more clarity on validity evidence for accountability testing (groups, rather than individuals)

Suggestions for revising the Standards (2)
- Need to define "score comparability":
  – Across subgroups of examinees taking a single assessment
  – Across accommodations to standardized assessments
  – Across different language versions of an assessment
  – Across different tests in CAT/MST
  – Across different modes of assessment

Suggestions for revising the Standards (3)
- Include specific examples of laudable test validation analyses and references to studies that exemplify sound validity arguments.

Closing remarks
- There are different perspectives on validity theory.
- Whether a test is valid for a particular purpose will always be a question of judgment.
- A sound validity argument makes that judgment an easy one to make.

Closing remarks (2)
- For educational tests, validity evidence based on test content is fundamental.
- Without confirming that the content tested is consistent with curricular goals, that the test adequately represents the intended domain, and that the test is free of construct-irrelevant material, the utility of the test for making educational decisions will be undermined.

Why are there different perspectives on validity?
- It's philosophy. It's okay to disagree, but we need consensus with respect to nomenclature, and that is where differences can hurt us as a profession.
- For over 50 years, the Standards have provided consensus definitions.

Remember, threats to validity boil down to:
- Construct underrepresentation
- Construct-irrelevant variance
- "Tests are imperfect measures of constructs because they either leave out something that should be included…or else include something that should be left out, or both" (Messick, 1989, p. 34)

Adhering to and Improving the Standards
- I don't agree with everything in the Standards.
- But I find it easier to work within the framework than against it.

Advice for the remainder of the conference, and for your future validity endeavors
- If you criticize, have specific improvements to contribute (e.g., evidence-centered design).
- Consider different perspectives on validity when evaluating the use of a test for a particular purpose.
  – If one statistical analysis is offered as "validation," be suspicious.
  – Look for evidence in test construction.

Thank you for your attention
And thanks to Bob Lissitz and UMD for the invitation and for holding this conference. I look forward to continuing the conversation. There is certainly a lot more to hear, and to say.

Sireci@acad.umass.edu