Uses and Misuses of Subscale Scores
Garron Gianopulos
Test Development Section ~ Division of Accountability Services ~ NCDPI
Collaborative Conference for Student Success, Pebble Beach
Tuesday, April 19th, 8:00 to 9:30 a.m.

Purpose
• This presentation will describe when it is appropriate to use subscale scores to identify learning needs in students.
• Case studies will be presented that contrast proper and improper uses of subscale scores.
• Guidelines for interpreting scores will be provided.

Overview
• A primer on reliability
• Interpreting Scores
– Standard errors of measurement
– Confidence intervals
– Subscale score profiles
• Recommendations

Reliability and Decision Consistency
• Reliability is a measure of consistency: high-quality tests produce similar scores upon re-administration.
• Correlation coefficients are used to describe reliability.
• Observed score = true score + error
– Error: distraction, lack of sleep, missed breakfast, etc.
• Reliability is a prerequisite to validity.

Reliability as Correlated Scores
• Test-retest
• Split halves
• Coefficient alpha

Different Definitions of Reliability (Brennan, 2005)
• The correlation between parallel tests
• The squared correlation between observed score and true score
• The ratio of true-score variance to observed-score variance

[Scatter plots of scores on a test at time 1 vs. time 2, illustrating reliabilities of r = 1 (a perfect measure), .87, .82, .77, .61, .21, and .05]

Reliability
• Reliability is a property of a score, not a test.
• Reliability arises from the interaction of a population of students and a set of items.
– If the population changes, reliability changes.
– If the set of items changes, reliability changes.
• Reliability drops quickly as the discrimination of the items diminishes.
• Reliability drops quickly as the number of items diminishes.

Test Length and Reliability
[Line graph: reliability as a function of test length]

Reliability and Classification Accuracy

Classification Accuracy
• Consistency: the percent of students classified in the same manner across test administrations
• Accuracy: the percent of students correctly classified

Classification Accuracy
| Example | Classification | Above cut? | Above cut according to subscore? | Intervention? | Result |
| 1 | True negative | No | No | Yes | Learning need addressed |
| 2 | True positive | Yes | Yes | No | Good decision |
| 3 | False positive | No | Yes | No | Missed training need |
| 4 | False negative | Yes | No | Yes | Chasing error; wasted resources |
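The four outcomes in this table are driven by reliability: as measurement error grows, more students land on the wrong side of the cut. A minimal simulation sketch of that relationship under classical test theory, with the cut score at the mean; the function name, sample size, and normality assumptions are illustrative, not NCDPI's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def classification_rates(reliability, n_students=100_000):
    """Simulate true scores and error-laden observed scores, then
    cross-tabulate classifications against a cut score at the mean."""
    true = rng.normal(0.0, 1.0, n_students)
    # Classical test theory: observed = true + error, with the error
    # variance chosen so that var(true) / var(observed) = reliability.
    error_sd = np.sqrt((1.0 - reliability) / reliability)
    observed = true + rng.normal(0.0, error_sd, n_students)
    cut = 0.0  # cut score at the mean
    above_true, above_obs = true >= cut, observed >= cut
    correct = np.mean(above_true == above_obs)
    false_pos = np.mean(~above_true & above_obs)  # missed training need
    false_neg = np.mean(above_true & ~above_obs)  # chasing error
    return correct, false_pos, false_neg

for r in (1.0, .87, .79, .77, .61, .27, .03):
    correct, fp, fn = classification_rates(r)
    print(f"r={r:.2f}: {correct:.0%} correct ({fp:.0%} FP, {fn:.0%} FN)")
```

Treating r as the reliability coefficient, the simulated accuracy falls from 100% toward chance as r approaches zero, the same pattern shown in the classification plots that follow.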
[Scatter plots of observed score vs. true score with a cut score at the mean, shading the true positive, true negative, false positive, and false negative regions as reliability declines:]
• r = 1.00: 100% correctly classified (perfect measure)
• r = .87: 88% correct, 12% incorrect
• r = .79: 87% correct, 13% incorrect
• r = .77: 86% correct, 14% incorrect
• r = .61: 80% correct, 20% incorrect
• r = .27: 70% correct, 30% incorrect
• r = .03: 50% correct, 50% incorrect
Note: The cut score was set at the mean. Classification accuracy changes as the cut score changes.

Classification Accuracy and Consistency
• . . . is influenced by the location of the cut score and the shape of the score distribution.
• . . . is positively associated with reliability.
• . . . is a property of a score, not a test.
• . . . arises from the interaction of a population of students and a set of items.
– If the population changes, accuracy changes.
– If the set of items changes, accuracy changes.
• . . . drops quickly as the discrimination of the items diminishes.
• . . . drops quickly as the number of items diminishes.

Test Length, Reliability, and Classification Accuracy
[Line graph: classification accuracy as a function of test length]
Note: This line graph is a function of the location of the cut score, the shape of the score distribution, and the discrimination of the test items. If any of these factors change, the line chart will change.

Overview
• A primer on reliability
• Interpreting Scores
– Standard errors of measurement
– Confidence intervals
– Subscale score profiles
• Recommendations

Standard Error of the Measure (SEM)
[Histogram of scale scores (percent by scale score, 0 to 20) with the standard deviation marked]
• SEM = the standard deviation of scores across repeated test administrations

Standard Error of the Measure (SEM)
[A score with its 68% confidence interval band]
• 68% confidence interval = score ± 1 standard error
• 95% confidence interval = score ± 2 standard errors

Standard Error of the Measure (SEM)
[A score x with its confidence interval, plotted against the true score]
• 68% of confidence intervals will contain the true score.

Comparing Two Scores
[Two scores with non-overlapping confidence intervals]
• Non-overlapping confidence intervals are taken as evidence that the true scores differ.

Comparing Two Scores
[Two scores with overlapping confidence intervals]
• Overlapping confidence intervals are taken as evidence that the true scores do not differ.
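The overlap rule above is easy to state in code. A minimal sketch on the ISR subscale metric; the scores, SEMs, and subscale names are hypothetical, chosen only for illustration:

```python
def confidence_interval(score, sem, level=68):
    """Return (low, high) for a score given its standard error of
    measurement: +/- 1 SEM for 68%, +/- 2 SEM for 95%."""
    z = 1 if level == 68 else 2
    return score - z * sem, score + z * sem

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any points."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical subscale scores and SEMs, not actual EOG values.
reading = confidence_interval(12.0, sem=1.5)
math    = confidence_interval(8.0,  sem=1.5)
if intervals_overlap(reading, math):
    print("Overlap: no evidence the true scores differ.")
else:
    print("No overlap: evidence the true scores differ.")
```

The rule is deliberately conservative: with ±1 SEM (68%) intervals, two scores must be more than two standard errors apart before a difference is claimed.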
Simulated Score Distribution across 100 Replications
• 5 items, r = .40
• 10 items, r = .55
• 20 items, r = .75
• 40 items, r = .85

Overview
• A primer on reliability
• Interpreting Scores
– Standard errors of measurement
– Confidence intervals
– Subscale score profiles
• Recommendations

Subscale Score Profiles
• What are important features of score profiles?
– Case studies from the ISR
• What are common problems with interpreting subscale score profiles on the ISR?
• What are best-practice guidelines to follow when interpreting subscale scores?

The Individual Student Report

How ISR Subscales Are Created
• Item Response Theory
• Scale:
– Min = 0
– Max = 20
– Mean = 10
– SD = 3

Features of Subscore Profiles
• Scatter of the profile: the distance between each score
• Elevation of the profile: the distance between each score and the maximum score

[Illustrations of profile scatter and profile elevation]

[ISR examples, Math Grade 4: a high-scatter profile (rare), a low-scatter and low-elevation profile, and a low-scatter and high-elevation profile]

What Proportion* of Profiles Are Scattered?
• 70% to 75% of profiles contain confidence intervals that completely overlap.
• 15% to 20% of profiles contain two (out of five) confidence intervals that do not overlap.
• 5% to 10% of profiles contain more than two (out of five) confidence intervals that do not overlap.
*Note: These percentages were based on a sample of three EOG assessments and may not apply to all EOG assessments in all subject areas in all grades.

Subscale Score Profiles
• What are important features of score profiles?
• What are common problems with interpreting subscale score profiles on the ISR?
• What are best-practice guidelines to follow when interpreting subscale scores?

Common Problems with Reporting, Interpreting, and Using Subscale Scores
• Objective level:
– Reliability for nearly all scores is low
– Misclassification rates are very high
• Goal level:
– Reliability is low to moderate
– Misclassification rates are high
– Feedback is not detailed enough to be useful
• Higher level:
– Reliability and classification error are acceptable
– Feedback is not detailed enough to be useful

Test Length, Reliability, and Classification Accuracy
[Line graph of reliability and classification accuracy by test length, locating objective-level scores at short test lengths and goal-level scores (e.g., calculator active/inactive, literary/informational reading) at longer lengths]

Purpose of EOG/EOC Tests
• To accurately classify students into achievement levels.
• The test is not diagnostic.

Purpose of NC EOC/EOG Tests
The North Carolina End-of-Course Tests are required by General Statute 115C-174.10 as a component of the North Carolina Annual Testing Program. "The purposes of North Carolina state-mandated tests are: (i) to assure that all high school graduates possess those minimum skills and that knowledge thought necessary to function as a member of society, (ii) to provide a means of identifying strengths and weaknesses in the education process* in order to improve instructional delivery, and (iii) to establish additional means for making the education system at the State, local, and school levels accountable to the public for results."
*Not strengths and weaknesses within a student per se.

Misuses of Subscale Scores
• Judging teacher effectiveness based on unreliable subscale scores.
• Deciding a student has a weakness on the basis of an unreliable subscale and nothing else.
• Deciding a student has a strength on the basis of an unreliable subscale and nothing else.
• Inferring growth through unreliable subscale scores.

Common Misuses of Subscale Score Profiles
• Chasing error: false negatives are given additional, unnecessary intervention or training.
– Can happen if the particular sample of items on the test does not adequately cover the subdomain (i.e., too few items)
– Perceived strengths and weaknesses can simply be noise
• Missed training need: false positives are believed to be at a high level of proficiency and are not given remediation.
– Can happen on short tests containing items that are susceptible to guessing
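"Chasing error" can be reproduced by simulation. A minimal sketch, assuming a student with a perfectly flat true profile on five subscales (scale mean 10, SD 3), a subtest reliability of .70, and the confidence-interval non-overlap rule from earlier; the flagging rule, seed, and replication count are assumptions for illustration, not the ISR's actual method:

```python
import numpy as np

rng = np.random.default_rng(1)

# A student whose true score is 10 (the scale mean) on all five
# subscales: a perfectly flat profile with nothing to diagnose.
true_profile = np.full(5, 10.0)
reliability = 0.70                       # assumed subtest reliability
sem = 3.0 * np.sqrt(1 - reliability)     # SEM = SD * sqrt(1 - reliability)

flagged = 0
replications = 10_000
for _ in range(replications):
    observed = rng.normal(true_profile, sem)
    # Flag an apparent strength/weakness when the lowest subscale's
    # 68% CI sits entirely below the highest subscale's 68% CI,
    # i.e., the two intervals do not overlap.
    gap = observed.max() - observed.min()
    flagged += gap > 2 * sem

print(f"Flat profiles showing a spurious strength/weakness: "
      f"{flagged / replications:.0%}")
```

Even though there is nothing to diagnose, a large share of the simulated profiles show at least one pair of non-overlapping 68% intervals, so an apparent strength or weakness can be pure noise.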
Example of Chasing Error
68% confidence intervals for a 20-item subtest (flat profile near the mean)
• Reliability: .70
• False positives: false weaknesses 5%, false strengths 1%

Did the confidence interval capture the true score?
| Subtest | Yes (%) | No (%) |
| 1 | 68 | 32 |
| 2 | 64 | 36 |
| 3 | 69 | 31 |
| 4 | 73 | 27 |
| 5 | 60 | 40 |
| Mean | 66.8 | 33.2 |
Note: 100 replications across 100 different but similar forms

Probabilities of Capturing the True Score across X Profile Scores
[Cumulative probability curves for 68% vs. 95% confidence intervals]
• Using 95% confidence intervals increases the probability that the true score will be captured.
• If reliability is low, 95% confidence intervals span such a large portion of the scale that they lose the ability to show differences.

Math Grade 4: Example of High Scatter (Rare)
[ISR profile showing both the 68% and the 95% confidence interval for each subscale]

Subscale Score Profiles
• What are important features of score profiles?
• What are common problems with interpreting subscale score profiles on the ISR?
• What are best-practice guidelines to follow when interpreting subscale scores?

Recommendations

Recommendation 1: Do not depend solely on subscale score profiles that contain small numbers of items to make important instructional or intervention decisions.

Recommendations
• Classification error will be high.
• Seek additional evidence and collateral information on a student's ability in a given subscale.
• Be especially aware of subscale scores that have very few items when interpreting an ISR. (Use Test Information Sheets to identify short subtests: http://www.ncpublicschools.org/accountability/testing/eog/archives/tisarchives)

Recommendation 2: If possible, augment the results of the ISR score profiles with reliable, diagnostic assessments that are well aligned to the curriculum to assess student strengths and weaknesses.

What Kind of Test Will Produce Good Diagnostic Information?
| | Subtests correlate highly with total score | Subtests do NOT correlate highly with total score |
| Short subtests | Unreliable subscores, and little or no value is added to the total score | Low reliability limits the utility of the subtests |
| Long linear subtests or short adaptive subtests | Reliable, but no value is added to the total score | Ideal diagnostic assessment |
• EOG and EOC subtests administered operationally produce subscale scores that correlate highly with the total score (r = .70 to .95).

Guidelines
• What is the purpose of the test?
• How were the scores intended to be used by the test developers?
• How reliable is the test score being interpreted?
• Is the reliability of the test adequate for the interpretation you are making? What are the consequences of false positives and false negatives?
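Reliability's dependence on test length, which underlies Recommendation 1 and the item-count examples earlier in the session, can be projected with the Spearman-Brown prophecy formula. A minimal sketch; the 40-item, r = .85 anchor is taken from the simulated-distribution slide, and the assumption of comparable items is the formula's usual caveat:

```python
def spearman_brown(reliability, length_factor):
    """Spearman-Brown prophecy: projected reliability when a test is
    lengthened (or shortened) by `length_factor` with comparable items."""
    k, r = length_factor, reliability
    return (k * r) / (1 + (k - 1) * r)

# If a 40-item test has reliability .85, comparable subtests of
# 5, 10, and 20 items would be expected to have:
for n_items in (5, 10, 20):
    r = spearman_brown(0.85, n_items / 40)
    print(f"{n_items:>2} items: r = {r:.2f}")
```

The projected values (.41, .59, .74) land close to the .40, .55, and .75 listed for the 5-, 10-, and 20-item simulations.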
Recommended Reliabilities (Alpha)
• .85 and above for high-stakes decisions
• .75 to .85 for moderate stakes
• .65 to .75 for low stakes
• Below .65: disregard

Other Approaches to Diagnostic Feedback
• Subscales in conjunction with computer adaptive assessments
• Item maps in conjunction with nonadaptive or adaptive assessments
• Cognitive diagnostic modeling
• Formative assessment

Guidelines for Identifying Learning Needs
• Use diagnostic tests
• Use aligned tests
• Consider using released EOG/EOC assessments
– Consider augmenting with additional items to increase the number of items in each goal

Recommendation 3: In the absence of reliable subscale scores, focus training on all subscale scores, with proportionately more time allocated to longer subtests.

Testing Materials
Test Information Sheets, 1998 SCS
• http://www.ncpublicschools.org/accountability/testing/eog/
• http://www.ncpublicschools.org/accountability/testing/eoc/
Interpretive Guide for Winscan32
• http://www.ncpublicschools.org/accountability/testing/shared/abriefs/eoc
• http://www.ncpublicschools.org/accountability/testing/shared/abriefs/eog (soon to appear online)

Contact Information
• Garron Gianopulos
• ggianopulos@dpi.state.nc.us

Questions?

References
• Brennan, R. (2005). Some test theory for the reliability of individual profiles. CASMA Research Report No. 12.
• Dorans, N., & Walker, M. (2007). Sizing up linkages. In Linking and Aligning Scores and Scales. New York: Springer Science+Business Media.

Appendix

Standard 11.1
• Prior to the use of a published test, the test user should study and evaluate the materials provided by the test developer. Of particular importance are those that summarize the test's purposes, specify the procedures for test administration, define the intended population of test takers, and discuss the score interpretations for which validity and reliability data are available.

Standard 11.20
• Test takers' scores should not be interpreted in isolation; collateral information that may lead to alternative explanations for performance should be considered.