Validity of Assessments

Exercise: An Officer's Question
You say that your MCCSSS course teaches students to (reason well) in (a specific domain). What evidence can you provide that it actually does so?

Three Key Concepts in Judging the Quality of an Assessment
- Validity
- Reliability
- Usability

Why should you be bothered with these concepts, anyway?
- Appreciate why all assessments contain error
- Know the various sources of error
- Understand that different kinds of assessments are prone to different kinds of error
- Build assessments with less error
- Know how to measure error, if need be
- Know what is safe (and not safe) to conclude from assessment results
- Decide when certain assessments should not be used

Validity
Definition: the appropriateness of how scores are interpreted [and used].*
That is, to what extent does your assessment measure what you say it does [and is it as useful as you claim]?
Stated another way: to what extent are the interpretations and uses of a test justified by evidence about its meaning and consequences?
*Appropriate "use" of tests is a controversial recent addition to the definition of "validity." That is probably why your textbook is inconsistent in how it defines it.

Validity
Very important points. Validity interpretations:
1. are a matter of degree ("how valid")
2. are always specific to a particular purpose ("validity for…")
3. form a unitary concept (four kinds of evidence feed one judgment: "how valid?")
4. must be inferred from evidence; they cannot be measured directly

Validity Evidence
Four interrelated kinds of evidence:
1. content
2. construct
3. criterion
4. consequences

Questions Guiding Validation
1. What are my learning objectives?
   o Did my test really address those particular objectives?
2. Do the students' test scores really mean what I intended?
   o What may have influenced their scores? (growth, instruction, intelligence, cheating, etc.)
3. Did testing have the intended effects?
   o What were the consequences of the testing process and of the scores obtained?

What is an achievement domain?
A carefully specified set or range of learning outcomes; in short, your set of instructional objectives.

Content-Related Evidence
Definition: the extent to which an assessment's tasks provide a relevant and representative sample of the domain of outcomes you intend to measure.
The evidence:
1. the most useful type of validity evidence for classroom tests
2. the domain is defined by the learning objectives
3. items are chosen with a table of specifications (see the sketch below)

Content-Related Evidence
Important points:
- it is an attempt to build validity into the test rather than assess it after the fact
- the sample can be faulty in many ways:
  a. inappropriate vocabulary
  b. unclear directions
  c. omits higher-order skills
  d. fails to reflect the content or weight of what was actually taught
- "face validity" (superficial appearance) or a label does not provide evidence of validity
- it assumes that test administration and scoring were proper

What is a construct?
A hypothetical quality (e.g., extraversion, intelligence, mathematical reasoning ability) that we use to explain some pattern of behavior (e.g., good at making new friends, learns quickly, good in all math courses).
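Here is the sketch promised above under content-related evidence: a minimal, hypothetical table of specifications. The content areas, cognitive levels, and item counts are invented purely for illustration; in practice each row would come from your own instructional objectives.

    # Hypothetical table of specifications: rows are content areas drawn from
    # the learning objectives, columns are cognitive levels, and each cell is
    # the number of items planned for that combination. All names and counts
    # below are made up for illustration only.
    blueprint = {
        "radio fundamentals": {"recall": 4, "apply": 4, "analyze": 2},
        "antenna theory":     {"recall": 3, "apply": 5, "analyze": 2},
        "troubleshooting":    {"recall": 2, "apply": 4, "analyze": 4},
    }

    total_items = sum(sum(levels.values()) for levels in blueprint.values())

    # Report each area's share of the test, so you can check that the sample
    # of items reflects the weight each objective received in instruction.
    for area, levels in blueprint.items():
        area_items = sum(levels.values())
        print(f"{area:20s} {levels}  weight = {area_items / total_items:.0%}")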
Construct-Related Evidence
Definition: the extent to which an assessment measures the construct (e.g., reading ability, intelligence, anxiety) that the test purports to measure.

Construct-Related Evidence
Some kinds of evidence:
- see whether the items behave the same way (if the test is meant to measure a single construct)
- analyze the mental processes the tasks require
- compare the scores of known groups
- compare scores before and after treatment (do they change in the ways your theory says they will, and not in the ways it says they will not?)
- correlate scores with other constructs (do they correlate well, and poorly, in the pattern expected?)

Construct-Related Evidence
Important points:
- usually assessed after the fact
- usually requires test scores
- it is a complex, extended logical process; it cannot be quantified

What is a criterion?
A valued performance or outcome (e.g., scoring high on a standardized achievement test in math, later doing well in an algebra class) that we believe might, or should, be related to what we are measuring (e.g., knowledge of basic mathematical concepts).

Criterion-Related Evidence
Definition: the extent to which a test's scores correlate with some valued performance outside the test (the criterion).
The evidence:
- concurrent correlations (relate scores to a different, current performance)
- predictive correlations (predict a future performance)
Clarification: the word "criterion" is used in a second sense in testing, so don't confuse the two. In this context it means an outcome we want to predict. In the other sense it is a performance standard against which students' scores are compared; that sense is what distinguishes "criterion-referenced" interpretations of test scores from "norm-referenced" ones. "Susan reads at the proficient level" is a criterion-referenced interpretation; "she reads better than 65% of other students" is a norm-referenced interpretation.

What is a correlation?
A statistic that indicates the degree of relationship between any two sets of scores obtained from the same group of individuals (e.g., the correlation between height and weight). It is called a:
- validity coefficient when used as criterion-related evidence of validity
- reliability coefficient when used to estimate the reliability of test scores

Criterion-Related Evidence
Important points:
- always requires test scores
- is quantified (i.e., it yields a number)
- must be interpreted cautiously, because irrelevant factors (unreliability, the spread of scores, etc.) can raise or lower validity coefficients
- a good "criterion" is often hard to find
- can be used to create "expectancy tables"

What is a consequence?
Any effect that your assessment has, or fails to have, that is important to you or to the other people involved.

Consequences-Related Evidence
Definition: the extent to which the assessment serves its intended purpose (e.g., improves performance) and avoids negative side effects (e.g., distorting the curriculum).
Possible types of evidence:
- did it improve performance? motivation? independent learning?
- did it distort the focus of instruction?
- did it encourage or discourage creativity? exploration? higher-level thinking? etc.

Consequences-Related Evidence
Important points:
- usually gathered after the assessment is given
- scores may be interpreted correctly and yet the test can still have negative side effects
- you have to weigh the consequences of not using the assessment (even if it has negative side effects): is the alternative any better, or maybe worse?
- judging consequences is a matter of values, not psychometrics

Sources of Threats to Validity
Can you give examples of each?
1. the tests themselves
2. the teaching
3. administration and scoring
4. the students
5. the nature of the group or criterion
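The last of these threats, the nature of the group, is easy to show numerically. Below is a minimal sketch, using invented scores, of how a predictive validity coefficient (see criterion-related evidence above) is computed and how restricting the group to high scorers shrinks it even though the test itself has not changed. The score distributions and the cutoff are assumptions made only for this illustration.

    import random
    import statistics


    def pearson_r(xs, ys):
        """Pearson correlation between two equal-length lists of scores."""
        mx, my = statistics.mean(xs), statistics.mean(ys)
        sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
        return cov / (sx * sy)


    random.seed(1)

    # Invented data: classroom test scores and a later criterion performance
    # (e.g., an end-of-course exam) that the test is supposed to predict.
    test = [random.gauss(70, 10) for _ in range(200)]
    criterion = [0.8 * score + random.gauss(0, 8) for score in test]

    # Predictive validity coefficient for the full group.
    print("full group:      ", round(pearson_r(test, criterion), 2))

    # "Nature of the group" threat: keep only the high scorers and the
    # coefficient drops, even though nothing about the test has changed.
    kept = [(t, c) for t, c in zip(test, criterion) if t >= 75]
    print("restricted group:", round(pearson_r([t for t, _ in kept],
                                                [c for _, c in kept]), 2))

This is one reason a validity coefficient must be interpreted cautiously: the same test can look weaker or stronger depending on who happens to be in the group that supplies the scores.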