Uses and Misuses of Subscale
Scores
Garron Gianopulos
Test Development Section ~ Division of Accountability Services ~ NCDPI
Collaborative Conference for Student Success
Pebble Beach
Tuesday, April 19th
8am to 9:30am
Purpose
• This presentation will describe when it is
appropriate to use subscale scores to
identify learning needs in students.
• Case studies will be presented that contrast
proper and improper uses of subscale
scores.
• Guidelines for interpreting scores will be
provided.
2
Overview
• A primer on reliability
• Interpreting Scores
– Standard errors of measurement
– Confidence intervals
– Subscale score profiles
• Recommendations
3
Reliability and Decision Consistency
• Reliability is a measure of consistency;
• High quality tests produce similar scores
upon re-administration.
• Correlation coefficients are used to describe
reliability.
• Observed score = true score + error
– Error: distraction, lack of sleep, missed
breakfast, etc.
• Reliability is a prerequisite to validity.
4
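To make the observed = true + error relationship concrete, here is a minimal simulation sketch (illustrative Python, not NCDPI code), assuming normally distributed true scores on the 0-20 ISR scale and independent error at each sitting:

import numpy as np

rng = np.random.default_rng(0)
n_students = 10_000
true_scores = rng.normal(loc=10, scale=3, size=n_students)  # ISR-style scale (assumed)

# Independent error at each administration (distraction, lack of sleep, etc.)
error_sd = 1.5
time1 = true_scores + rng.normal(0, error_sd, n_students)
time2 = true_scores + rng.normal(0, error_sd, n_students)

# Test-retest reliability: correlation between the two administrations.
# Equivalently, true-score variance over observed variance: 9 / (9 + 2.25) = .80
print(round(np.corrcoef(time1, time2)[0, 1], 2))  # ~ .80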
Reliability as Correlated Scores
• Test-retest
• Split-halves
• Coefficient Alpha
5
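As an illustration of how Coefficient Alpha is computed, here is a hedged sketch using a simulated students-by-items response matrix; the simple logistic response model and all values are assumptions, not NC test data:

import numpy as np

def cronbach_alpha(responses):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(1)
ability = rng.normal(size=(5_000, 1))
difficulty = rng.normal(size=(1, 25))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))  # simple 1PL-style model
responses = (rng.random((5_000, 25)) < p_correct).astype(int)
print(round(cronbach_alpha(responses), 2))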
Different Definitions of Reliability
(Brennan 2005)
• Correlations between parallel tests
• Squared correlation between observed
score and true score
• The ratio of true score variance to
observed score variance
6
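Under the classical model these definitions coincide; in symbols (with X the observed score, T the true score, E the error, and X' a parallel form):

\[
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \rho_{XT}^2
\]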
[Slides 7-13: scatterplots illustrating reliability as correlated scores. Slide 7: a perfect measure, where the observed score equals the true score (r = 1). Slides 8-13: test scores at Time 1 plotted against Time 2, with r = .87, .82, .77, .61, .21, and .05.]
Reliability
• Reliability is a property of a score, not a test.
• Reliability arises from the interaction of a
population of students and a set of items.
– If the population changes, reliability changes.
– If the set of items changes, reliability changes.
• Reliability drops quickly as the discrimination of
items diminishes.
• Reliability drops quickly as the number of items
diminishes.
14
Test Length and Reliability
[Line graph: reliability as a function of the number of items in the test]
15
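The graph itself is not reproduced here, but the relationship it plots is conventionally described by the Spearman-Brown prophecy formula; a small sketch (a standard classical test theory result, not NCDPI code):

def spearman_brown(reliability, n):
    """Projected reliability when a test is lengthened by a factor of n."""
    return n * reliability / (1 + (n - 1) * reliability)

# Doubling a test with r = .55:
print(round(spearman_brown(0.55, 2), 2))  # ~ .71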
Reliability and Classification
Accuracy
16
Classification Accuracy
• Consistency: the percent of students classified in the same manner across test administrations
• Accuracy: the percent of students correctly classified
17
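A minimal sketch of how these two percentages could be computed from simulated scores; the cut score, error size, and scale are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(2)
true = rng.normal(10, 3, 10_000)
obs1 = true + rng.normal(0, 1.5, 10_000)   # first administration
obs2 = true + rng.normal(0, 1.5, 10_000)   # second administration

cut = 10.0  # cut score at the mean, as in the plots that follow
accuracy = np.mean((obs1 >= cut) == (true >= cut))      # observed vs. true classification
consistency = np.mean((obs1 >= cut) == (obs2 >= cut))   # classification across sittings
print(f"accuracy ~ {accuracy:.2f}, consistency ~ {consistency:.2f}")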
r = 1, Percent correct: 100
[Scatterplot: a perfect measure, where every classification against the cut score is a true positive or a true negative]
18
Classification Accuracy

Example | Classification | Above cut? | Above cut according to subscore? | Intervention? | Result
1       | True negative  | no         | no                               | yes           | Learning need addressed
2       | True positive  | yes        | yes                              | no            | Good decision

19
r = .87
Classification: 88% correct and 12% incorrect
[Scatterplot: an imperfect measure, with true-positive, true-negative, false-positive, and false-negative regions]
Note: Cut score was set at the mean. Classification accuracy changes as the cut score changes.
20
Classification Accuracy

Example | Classification | Above cut? | Above cut according to subscore? | Intervention? | Result
1       | True negative  | no         | no                               | yes           | Learning need addressed
2       | True positive  | yes        | yes                              | no            | Good decision
3       | False positive | no         | yes                              | no            | Missed training need
4       | False negative | yes        | no                               | yes           | Chasing error; wasted resources

21
[Slides 22-26: scatterplots of increasingly unreliable measures, each marked with true-positive, true-negative, false-positive, and false-negative regions. r = .79: 87% correct, 13% incorrect. r = .77: 86% correct, 14% incorrect. r = .61: 80% correct, 20% incorrect. r = .27: 70% correct, 30% incorrect. r = .03: 50% correct, 50% incorrect. Note: the cut score was set at the mean; classification accuracy changes as the cut score changes.]
Classification Accuracy and Consistency
• . . . is influenced by the location of the cut score and the shape of the score distribution.
• . . . is positively associated with reliability.
• . . . is a property of a score, not a test.
• . . . arises from the interaction of a population of students and a set of items.
– If the population changes, accuracy changes.
– If the set of items changes, accuracy changes.
• . . . drops quickly as the discrimination of items diminishes.
• . . . drops quickly as the number of items diminishes.
27
Test Length, Reliability, and Classification Accuracy
[Line graph: classification accuracy as a function of test length and reliability]
Note: This line graph is a function of the location of the cut score, the shape of the score distribution, and the discrimination of the test items. If any of these factors change, the line chart will change.
28
Overview
• A primer on reliability
• Interpreting Scores
– Standard errors of measurement
– Confidence intervals
– Subscale score profiles
• Recommendations
29
Standard Error of Measurement (SEM)
[Histogram: percent of repeated administrations at each scale score (0-20); the spread of this distribution is the SEM]
• SEM = standard deviation of scores across repeated test administrations
30
Standard Error of Measurement (SEM)
[Diagram: a score with its 68% confidence interval band]
• 68% confidence interval = score +/- 1 * standard error
• 95% confidence interval = score +/- 2 * standard error
31
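In classical test theory the SEM can be derived from the score standard deviation and the reliability; a small sketch using the ISR scale SD of 3 and an assumed reliability of .85:

import math

sd, reliability = 3.0, 0.85
sem = sd * math.sqrt(1 - reliability)   # SEM = SD * sqrt(1 - reliability) ~ 1.16
score = 12.0
ci68 = (score - sem, score + sem)
ci95 = (score - 2 * sem, score + 2 * sem)
print(round(sem, 2), ci68, ci95)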
Standard Error of Measurement (SEM)
[Diagram: confidence intervals around observed scores, with the true score marked]
• 68% of confidence intervals will contain the true score
33
Comparing Two Scores
[Diagram: two scores whose confidence intervals do not overlap]
• Non-overlapping confidence intervals are taken as evidence that the true scores differ.
34
Comparing Two Scores
[Diagram: two scores whose confidence intervals overlap]
• Overlapping confidence intervals are taken as evidence that the true scores do not differ.
35
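The overlap rule from the two slides above, written as a small helper (illustrative only; the scores and SEM are made up):

def intervals_overlap(score_a, score_b, sem, z=1.0):
    """True if the z*SEM confidence intervals around two scores overlap."""
    return abs(score_a - score_b) <= 2 * z * sem

print(intervals_overlap(9.0, 12.0, sem=1.16))         # 68% CIs: False (scores differ)
print(intervals_overlap(9.0, 12.0, sem=1.16, z=2.0))  # 95% CIs: True (overlap)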
Simulated Score Distribution across 100 Replications
[Histograms: observed score distributions for tests of increasing length]
• 5 items, r = .40
• 10 items, r = .55
• 20 items, r = .75
• 40 items, r = .85
36
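A sketch of the kind of simulation this slide summarizes: for a single examinee with an assumed probability correct of .6, the spread of observed proportion-correct scores across 100 replications shrinks as items are added (item counts from the slide):

import numpy as np

rng = np.random.default_rng(3)
p_correct = 0.6  # assumed for this examinee
for n_items in (5, 10, 20, 40):
    scores = rng.binomial(n_items, p_correct, size=100) / n_items
    print(n_items, "items: SD of observed proportion =", round(scores.std(ddof=1), 3))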
Overview
• A primer on reliability
• Interpreting Scores
– Standard errors of measurement
– Confidence intervals
– Subscale score profiles
• Recommendations
37
Subscale Score Profiles
• What are important features of score profiles?
– Case studies from the ISR.
• What are common problems with interpreting
subscale score profiles on the ISR?
• What are best-practice guidelines to follow when
interpreting subscale scores?
The Individual Student Report
[Image: a sample Individual Student Report]
39
How ISR Subscales Are Created
• Item Response Theory
• Scale:
– Min = 0
– Max = 20
– Mean = 10
– SD = 3
Features of Subscore Profiles
• Scatter of the profile: distance between
each score
• Elevation of the profile: distance between
each score and the maximum score
41
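A hedged sketch of these two features for a hypothetical five-subscore profile on the 0-20 ISR scale; the function names are mine, and scatter is implemented here simply as the range of the profile:

import numpy as np

def scatter(profile):
    """Spread among the subscores (implemented as max minus min)."""
    return max(profile) - min(profile)

def elevation(profile, max_score=20):
    """Average distance of each subscore from the maximum scale score."""
    return float(np.mean([max_score - s for s in profile]))

profile = [8, 11, 9, 14, 10]  # five hypothetical subscale scores
print(scatter(profile), elevation(profile))  # 6, 9.6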
Scatter
42
Elevation
43
Math Grade 4: Example of High
Scatter (Rare)
44
Math Grade 4: Example of High
Scatter (Rare)
45
Math Grade 4: Example of Low
Scatter, Low Elevation
46
Math Grade 4: Example of Low
Scatter, Low Elevation
47
Math Grade 4: Example of Low
Scatter, High Elevation
48
What proportion* of profiles are
scattered?
• 70% to 75% of profiles contain confidence
intervals that completely overlap.
• 15% to 20% of profiles contain two (out of five)
confidence intervals that do not overlap.
• 5% to 10% of profiles contain more than two
(out of five) confidence intervals that do not
overlap.
*Note: these percentages were based on a sample of three EOG assessments and may not
apply to all EOG assessments in all subject areas in all grades.
Subscale Score Profiles
• What are important features of score profiles?
• What are common problems with interpreting
subscale score profiles on the ISR?
• What are best-practice guidelines to follow when
interpreting subscale scores?
Common Problems with Reporting, Interpreting, and Using Subscale Scores
• Objective level:
– Reliability for nearly all scores is low
– Misclassification rates are very high
• Goal level:
– Reliability is low to moderate
– Misclassification rates are high
– Feedback is not detailed enough to be useful
• Higher level:
– Reliability and classification error are acceptable
– Feedback is not detailed enough to be useful
51
Test Length, Reliability, and Classification Accuracy
[Line graph annotated with example score types: calculator active/inactive and literary/informational reading scores at the goal level, and scores at the objective level]
52
Purpose of EOG/EOC Tests
• To accurately classify students into
achievement levels.
• The test is not diagnostic.
53
Purpose of NC EOC/EOG Tests
The North Carolina End-of-Course Tests are required by General Statute 115C-174.10 as a component of the North Carolina Annual Testing Program. The purposes of North Carolina state-mandated tests are:
(i) to assure that all high school graduates possess those minimum skills and that knowledge thought necessary to function as a member of society,
(ii) to provide a means of identifying strengths and weaknesses in the education process* in order to improve instructional delivery, and
(iii) to establish additional means for making the education system at the State, local, and school levels accountable to the public for results.
*Not strengths and weaknesses within a student per se.
54
Misuses of Subscale Scores
• Judging teacher effectiveness based on
unreliable subscale scores.
• Deciding a student has a weakness on the
basis of an unreliable subscale and nothing
else.
• Deciding a student has a strength on the basis
of an unreliable subscale and nothing else.
• Inferring growth through unreliable subscale
scores.
55
Common Misuses of Subscale Score Profiles
• Chasing error: false negatives are given additional, unnecessary intervention or training.
– Can happen if the particular sample of items in a test does not adequately cover the subdomain (i.e., sample size)
– Perceived strengths/weaknesses can simply be noise
• Missed training need: false positives are believed to be at a high level of proficiency and are not given remediation.
– Can happen on short tests containing items that are susceptible to guessing
56
Example of chasing error
57
68% Confidence Intervals for a 20-Item Subtest (flat profile near the mean)

Statistic: Reliability = .70
False positives: false weaknesses 5%; false strengths 1%

Did the confidence interval capture the true score?

Subtest | Yes  | No
1       | 68   | 32
2       | 64   | 36
3       | 69   | 31
4       | 73   | 27
5       | 60   | 40
Mean    | 66.8 | 33.2

100 replications across 100 different but similar forms
58
Probabilities of Capturing the True Score across X Profile Scores
[Chart: cumulative probability of capturing the true score as the number of profile scores grows, for 68% vs. 95% confidence intervals]
• Using 95% confidence intervals increases the probability that the true score will be captured (68% vs. 95%).
• If reliability is low, 95% confidence intervals span such a large portion of the scale that they lose the ability to show differences.
69
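The arithmetic behind such cumulative probabilities, under the idealizing assumption that each of k subscale intervals captures its true score independently (real subscores are correlated, so this is only approximate):

for k in range(1, 6):
    print(k, "scores:", round(0.68**k, 2), "vs.", round(0.95**k, 2))
# With five subscores, 0.68^5 ~ .15 but 0.95^5 ~ .77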
Math Grade 4: Example of High
Scatter (Rare)
= 68% Confidence Interval
= 95% Confidence Interval
70
Subscale Score Profiles
• What are important features of score profiles?
• What are common problems with interpreting
subscale score profiles on the ISR?
• What are best-practice guidelines to follow when
interpreting subscale scores?
Recommendations
Recommendation 1: Do not depend solely
on subscale score profiles that contain
small numbers of items to make important
instructional or intervention decisions.
72
Recommendations
• Classification error will be high
• Seek additional evidence and collateral
information on a student’s ability in a given
subscale.
• Be especially aware of subscale scores that have very few items when interpreting an ISR. (Use Test Information Sheets to identify short subtests: http://www.ncpublicschools.org/accountability/testing/eog/archives/tisarchives)
73
Recommendation 2: If possible, augment
the results of the ISR score profiles with
reliable, diagnostic assessments that are
well aligned to the curriculum to assess
student strengths and weaknesses.
74
What Kind of Test Will Produce Good Diagnostic Information?

                                                | Subtests correlate highly with total score                               | Subtests do NOT correlate highly with total score
Short subtests                                  | Unreliable subscores, and little or no value is added to the total score | Low reliability limits the utility of subtests
Long linear subtests or short adaptive subtests | Reliable, but no value is added to the total score                       | Ideal diagnostic assessment

• EOG and EOC subtests administered operationally produce subscale scores that correlate highly with total score (r = .70 to .95)
75
Guidelines
• What is the purpose of the test?
• How were the scores intended to be used
by the test developers?
• How reliable is the test score being
interpreted?
• Is the reliability of the test adequate for the interpretation you are making? What are the consequences of false positives and false negatives?
76
Recommended Reliabilities (Alpha)
• .85 and above for high-stakes decisions
• .75 to .85 for moderate stakes
• .65 to .75 for low stakes
• Below .65: disregard
77
Other approaches to diagnostic
feedback
• Subscales in conjunction with Computer
Adaptive Assessments
• Item Maps in conjunction with nonadaptive or adaptive assessments
• Cognitive diagnostic modeling
• Formative Assessment
78
Guidelines for Identifying Learning
Needs
• Use diagnostic tests
• Use aligned tests
• Consider using released EOG/EOC
assessments
– Consider augmenting with additional items
to increase the number of items in each
goal
79
Recommendation 3: In the absence of
reliable subscale scores, focus training on
all subscale scores with proportionately
more time allocated to longer subtests.
80
Testing Materials
Test Information Sheet 1998 SCS
• http://www.ncpublicschools.org/accountability/testing/
eog/
• http://www.ncpublicschools.org/accountability/testing/
eoc/
Interpretive Guide for Winscan32
• http://www.ncpublicschools.org/accountability/te
sting/shared/abriefs/eoc
• http://www.ncpublicschools.org/accountability/te
sting/shared/abriefs/eog (soon to appear online)
81
Contact Information
• Garron Gianopulos
• ggianopulos@dpi.state.nc.us
Questions?
References
• Brennan, R. (2005). Some test theory for the reliability of individual profiles. CASMA Research Report No. 12.
• Dorans, N., & Walker, M. (2007). Sizing up linkages. In Linking and aligning scores and scales. New York: Springer Science+Business Media, LLC.
84
Appendix
Standard 11.1
• Prior to the use of a published test, the test
user should study and evaluate the
materials provided by the test developer.
Of particular importance are those that
summarize the test’s purposes, specify the
procedures for test administration, define
the intended population of test takers, and
discuss the score interpretations for which
validity and reliability data are available.
86
Standard 11.20
• Test takers’ scores should not be interpreted in isolation; collateral information that may lead to alternative explanations for performance should be considered.
87