Tests for Higher Standards
Provide Focus + Facilitate Achievement

The Many Threats to Test Validity
David Mott, Tests for Higher Standards and Reports Online Systems
Presentation at the Virginia Association of Test Directors (VATD) Conference, Richmond, VA, October 28, 2009

In order for a test or assessment to have any value whatsoever, it must be possible to make reasonable inferences from the score. This is much harder than it seems. The test instruments, the testing conditions, the students, the score interpreters, and perhaps Fate ALL need to be working together to produce data worth using. Many specific threats will be delineated, a number of solutions suggested, and audience participation is strongly encouraged.

Validity = Value
Validity and Value come from the same Latin root. The word has to do with being strong, well, good.

Initial Attitude Adjustment

Amassing Statistics
The government are very keen on amassing statistics – they collect them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But what you must never forget is that every one of those figures comes in the first instance from the village watchman, who just puts down what he damn well pleases.
(J. C. Stamp (1929). Some Economic Factors in Modern Life. London: P. S. King and Son)

Distance from Data
I have noticed that the farther one is from the source of data, the more likely one is to believe that the data could be a good basis for action.
(D. E. W. Mott (2009). Quotations.)

The Examination
as shown by the Ghost of Testing Past

Validity: Older Formulations
1950s through 1980s
  content validity
  concurrent validity
  predictive validity
  construct validity
Lee J. Cronbach

Content Validity
Refers to the extent to which a measure represents all facets of a given social construct, such as Reading Ability, Math Computation Proficiency, Optimism, or Driving Skill. It is a more formal term than face validity, which refers not to what the test actually measures but to what it appears to measure. Face validity is whether a test "looks valid" to the examinees who take it, the administrative personnel who decide on its use, and others.

Concurrent Validity
Refers to a demonstration of how well a test correlates with a measure that has previously been validated. The two measures may be for the same construct, or for different, but presumably related, constructs.

Predictive Validity
Refers to the extent to which a score on a scale or test predicts scores on some criterion measure. For example, how well do your final benchmarks predict scores on the state SOL Tests?

Construct Validity
Refers to whether a scale measures or correlates with the theorized underlying psychological construct (e.g., "fluid intelligence") that it claims to measure. It is related to the theoretical ideas behind the trait under consideration, i.e., the concepts that organize how aspects of personality, intelligence, subject-matter knowledge, etc. are viewed.

Validity: New Formulation
1990s through now
Six aspects or views of Construct Validity
  content aspect
  substantive aspect
  structural aspect
  generalizability aspect
  external aspect
  consequential aspect
Samuel Messick

Validity: New Formulation
Six aspects or views of Construct Validity
Content aspect – evidence of content relevance, representativeness, and technical quality
Substantive aspect – theoretical rationales for consistency in test responses, including process models, along with evidence that the processes are actually used in the assessment tasks
Structural aspect – judges the fidelity of scoring to the actual structure of the construct domain
Generalizability aspect – the extent to which score properties and interpretations generalize to related populations, settings, and tasks
External aspect – includes convergent and discriminant evidence from multitrait-multimethod comparisons as well as proof of relevance and utility
Consequential aspect – shows the values of score interpretation as a basis for action and the actual and potential consequences of test use, especially in regard to invalidity related to bias, fairness, and distributive justice

Administration Validity
Administration Validity is my own term. A test administration or a test session is valid if nothing happens that causes a test, an assessment, or a survey to fail to reflect the actual situation. Test-session validity is an alternate term.

Many things can come between the initial creation of an assessment from valid materials and the final uses of the scores that come from that assessment. Imagine a chain that is only as strong as its weakest link. If any link breaks, the value of the whole chain is lost.
This session deals with some of those weak links.

Areas of Validity Failure
We create a test out of some “valid” items.
Discuss some of the realities most of us face: we either have some “previously validated” tests or we have a “validated” item bank we make tests from. Let’s assume that they really are valid, that is, the materials have good content matches with the Standards/Curriculum Frameworks/Blueprints, and so on.

Here are some examples of things that can creep in within the supposedly “mechanical” aspects of creating a test from a bank. These are two items from a Biology benchmark test we recently made for a client:

Two Biology Items
Standard BIO.3b: The student will investigate and understand the chemical and biochemical principles essential for life. Key concepts include b) the structure and function of macromolecules.

Bio.3b 5. Which organic compound is correctly matched with the subunit that composes it?
  A maltose – fatty acids
  B starch – glucose
  C protein – amino acids *
  D lipid – sucrose

Bio.3b 6. Which organic compounds are the building blocks of proteins?
  A sugars
  B nucleic acids
  C amino acids *
  D polymers

A Life Science Item
LS.6c 12. In this energy pyramid, which letter would represent producers?
[Figure: energy pyramid with levels labeled A, B, C, D]
  A  A
  B  B
  C  C
  D  D

The same Life Science Item “Randomized”
LS.6c 12. In this energy pyramid, which letter would represent producers?
[Figure: energy pyramid with levels labeled A, B, C, D]
  A  C
  B  D
  C  A
  D  B

Moving from test creation to test administration

What Can Fail in the Test Administration Process
Students aren’t properly motivated
  Random responding
  Patterning responses
  Unnecessary guessing
  Cheating
Let’s look at what some of these look like:

One Student's Item Analysis
[Answer grid with columns A-D; per-item scores: 1 = correct, 0 = incorrect]
Item    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Score   1  0  0  0  0  1  0  0  0  0  0  0  0  0  1  0  0  1  0  1  0  0  0  0  1
Total correct: 6
What happened here?

Another Student's Item Analysis
[Answer grid with columns A-D; per-item scores: 1 = correct, 0 = incorrect]
Item    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Score   0  0  0  1  0  1  0  0  0  0  0  0  1  0  0  1  0  1  1  0  0  0  1  0  0
Total correct: 7
What happened here?

Yet Another Student's Item Analysis
[Answer grid with columns A-D; per-item scores: 1 = correct, 0 = incorrect]
Item    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Score   1  1  1  1  1  1  1  1  1  1  1  1  1  0  1  1  1  1  0  0  0  1  0  0  1
Total correct: 19
What happened here?

What Can Fail in the Test Administration Process
Students or teachers make mistakes.
  Stopping before the end of the test
  Getting off position on answer sheets
  Giving a student the wrong answer sheet
  Scoring a test with the wrong key
Let’s look at what some of these look like:

Yet, Yet Another Student's Item Analysis
[Answer grid with columns A-D; per-item scores: 1 = correct, 0 = incorrect, - = blank]
Item    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Score   1  1  1  1  1  1  -  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  -  -  -  -
Total correct: 8
What happened here?

Test results interpretation – Many dangers
Diagnosing students’ needs

What is the obvious conclusion about these test results?
Results for a Three-Standard Test
Standard         4.1       4.2       4.3
            Subtotal  Subtotal  Subtotal     Total
Student 1          3         2         4         9
Student 2          4         3         5        12
Student 3          3         1         5         9
Student 4          5         0         5        10
Student 5          5         0         4         9
Student 6          5         2         5        12
Student 7          4         1         4         9
Student 8          4         0         3         7
Student 9          4         1         3         8
Student 10         5         2         4        11
Student 11         5         1         4        10
Student 12         5         0         5        10
Average          .87       .22       .85       .64

What do you think now?
Results for a Three-Standard Test
Standard    4.1  4.1  4.1  4.1  4.1  4.3  4.3  4.3  4.3  4.3  4.2  4.2  4.2  4.2  4.2  Total
Item          1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Student 1     1    1    0    1    0    1    1    1    0    1    1    1    0    0    0      9
Student 2     1    0    1    1    1    1    1    1    1    1    1    1    1    0    0     12
Student 3     1    1    1    0    0    1    1    1    1    1    1    0    0    0    0      9
Student 4     1    1    1    1    1    1    1    1    1    1    0    0    0    0    0     10
Student 5     1    1    1    1    1    1    1    1    0    1    0    0    0    0    0      9
Student 6     1    1    1    1    1    1    1    1    1    1    1    1    0    0    0     12
Student 7     0    1    1    1    1    1    1    1    0    1    1    0    0    0    0      9
Student 8     1    1    1    0    1    0    1    1    1    0    0    0    0    0    0      7
Student 9     1    1    1    1    0    1    1    0    1    0    1    0    0    0    0      8
Student 10    1    1    1    1    1    1    1    1    0    1    1    1    0    0    0     11
Student 11    1    1    1    1    1    1    1    0    1    1    1    0    0    0    0     10
Student 12    1    1    1    1    1    1    1    1    1    1    0    0    0    0    0     10
Average     .92  .92  .92  .83  .75  .92 1.00  .83  .67  .83  .67  .33  .08  .00  .00    .64

The chain has many links
Nearly any of them can break
Try to find the weakest links in your organization’s assessment efforts
Fix them – one by one

What are some of my solutions to all of this?
To the problems of mistakes in test creation:
  Use test blueprints
  Be very careful of automatic test construction
  Read the test carefully yourself and answer the questions
  Have someone else read the test carefully and answer the questions
  Use “Kid-Tested” items *
* Future TfHS initiative
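
One way to look past the subtotals in the three-standard results is to compute, in test order, the proportion of students who answered each item correctly. The sketch below is illustrative only: the scores are the twelve students' item scores from the item-by-item results, while the names `ROWS`, `STANDARDS`, `item_p_values`, and `mean_by_standard` are hypothetical and not part of any TfHS or ROS tool.

```python
from collections import defaultdict

# Standard measured by items 1-15, in the order the items appear on the test.
STANDARDS = ["4.1"] * 5 + ["4.3"] * 5 + ["4.2"] * 5

# One row per student (Students 1-12); each character is an item score
# (1 = correct, 0 = incorrect). Spaces only separate the three standards.
ROWS = [
    "11010 11101 11000",
    "10111 11111 11100",
    "11100 11111 10000",
    "11111 11111 00000",
    "11111 11101 00000",
    "11111 11111 11000",
    "01111 11101 10000",
    "11101 01110 00000",
    "11110 11010 10000",
    "11111 11101 11000",
    "11111 11011 10000",
    "11111 11111 00000",
]


def item_p_values(rows):
    """Return the proportion of students answering each item correctly."""
    matrix = [[int(ch) for ch in row.replace(" ", "")] for row in rows]
    return [sum(column) / len(matrix) for column in zip(*matrix)]


def mean_by_standard(p_values, standards):
    """Average the item proportions within each standard."""
    grouped = defaultdict(list)
    for p, standard in zip(p_values, standards):
        grouped[standard].append(p)
    return {s: sum(ps) / len(ps) for s, ps in grouped.items()}


if __name__ == "__main__":
    p_values = item_p_values(ROWS)
    for position, (standard, p) in enumerate(zip(STANDARDS, p_values), start=1):
        print(f"item {position:2d}  standard {standard}  p = {p:.2f}")
    for standard, mean in sorted(mean_by_standard(p_values, STANDARDS).items()):
        print(f"standard {standard}: mean p = {mean:.2f}")
```

Printed in test order, the proportions range from .67 to 1.00 through item 10 and then fall from .67 to .00 across items 11 to 15, all of which measure standard 4.2 and sit at the end of the test. That pattern would be consistent with students running out of time or effort as much as with a weakness in 4.2 itself. The slides pose the question without stating an answer, so treat this reading as an assumption.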
Be careful when reading reports – look past the obvious
For problems of careless, unmotivated test taking by students (even cheating):
  Make the test less of a contest between the system/teacher and the student and more of a communication device between them
  Watch the students as they take the test and realize that proctoring rules necessary for high-stakes tests are possibly not best for formative or semi-formative assessments
  Look for/flag pattern marking and rapid responding *
* Future TfHS/ROS initiative

Here is a graph showing the timing of student responses to an item
[Graph: Number of Responses Over Time to a Rather Easy Item. Horizontal axis: Time (in sec), 0 to 6.5. Vertical axis: Number of Responses, 0 to 16.]

For online tests it is possible to screen for rapid responding *
* Future TfHS/ROS initiative

So what do we do about this?
We can show a dialog box asking the student to really answer the item.
Or
We can leave those items unscored and calculate the overall test score (% correct) on the number of items the student actually took.
Or
Something else.

Let’s look at tests in an entirely different way!
Instead of thinking Evaluation, think Communication.
A test can be a way for students to communicate to teachers:
  What they know,
  What they think they know, and
  What they know they don’t know.

A major new way of communicating!
Let the students tell you when they don’t know or understand something – eliminate guessing
New multiple-choice scoring scheme: *
  1 point for each correct answer
  0 points for each wrong answer
  ⅓ point for each unanswered question
Students mark where they run out of time
* Future TfHS/ROS initiative

Continued: A major new way of communicating!
  Students have to be taught the new rules
  Students need one or two tries to get the hang of it
  Students need to know when the new scoring applies
  It is better for students to admit not knowing than to guess

One Student's Item Analysis
[Answer grid with columns A-D; per-item scores: 1 = correct, 0 = incorrect, - = blank]
Item    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Score   1  1  1  1  -  1  1  1  1  1  -  1  -  0  -  -  1  -  -  0  0  -  -  -  -

Answering under the new scoring scheme
                            Regular   1, 1/3, 0
Test Length 25                   11       13.00
                             44.00%      52.00%
Corrected for Test Length        11       13.00
                             55.00%      65.00%

Scored under the new scoring scheme
                            Regular   1, 1/3, 0
Test Length 25                   11       13.00
                             44.00%      52.00%
Corrected for Test Length        11       11.00
                             55.00%      84.62%

Humor
Time flies like an arrow; fruit flies like a banana.
We sometimes need to take a 90° turn in our thinking.
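
The arithmetic behind the 1, ⅓, 0 scoring scheme shown a few slides back can be sketched as follows. This is a minimal illustration, not the TfHS/ROS implementation: the function names are hypothetical, the count of six omitted items is inferred from the slide's 13.00 total (11 correct plus 2 points of omit credit), and the corrected test length of 20 is inferred from 11 correct being reported as 55.00%.

```python
from fractions import Fraction

OMIT_CREDIT = Fraction(1, 3)  # credit for an unanswered question under the new scheme


def regular_points(correct: int) -> Fraction:
    """Conventional scoring: 1 point per correct answer, nothing else."""
    return Fraction(correct)


def new_scheme_points(correct: int, omitted: int) -> Fraction:
    """1 point per correct answer, 1/3 per omitted item, 0 per wrong answer."""
    return Fraction(correct) + OMIT_CREDIT * omitted


def percent(points: Fraction, length: int) -> float:
    """Points as a percentage of the given test length, rounded to 2 places."""
    return round(float(points * 100 / length), 2)


if __name__ == "__main__":
    correct, omitted = 11, 6         # omitted count is an inferred assumption
    full_length, attempted = 25, 20  # attempted length is an inferred assumption

    regular = regular_points(correct)
    new = new_scheme_points(correct, omitted)
    print(float(new))                                                # 13.0
    print(percent(regular, full_length), percent(new, full_length))  # 44.0 52.0
    print(percent(regular, attempted), percent(new, attempted))      # 55.0 65.0
```

Assuming four answer options, as in the A-D grids, a blind guess is worth 1/4 point on average, which is less than the ⅓ point for leaving the item blank. Under this scheme an honest omission therefore beats a random guess.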

My contact information
David Mott – dem@rosworks.com
866.724.7997
804.282.3111
TfHS website – www.tfhs.net
ROS website – rosworks.com