CREATE - National Evaluation Institute
Annual Conference – October 5-7, 2012
Educational Accountability and Teacher Evaluation:
Real Problems, Practical Solutions
Oversight of test administration:
Respect for educators, respect for standardization,
working together to focus
on measurement
Eliot Long
A*Star Audits, LLC
Brooklyn, NY
eliotlong@astaraudits.com
www.astaraudits.com
Talking about oversight
1.
Why?
Why is oversight necessary?
What are the problems and how significant are they?
Is this all worth the trouble?
2.
How?
How will we measure compliance?
How will we collect/analyze data?
What range of error – what is the misjudgment potential?
3.
What?
What will we do with the results of the oversight?
Costly, time consuming investigations?
Educator sanctions?
Lower test scores????
4.
Focus
What is our focus in choosing our methods and practices?
MEASUREMENT
requires
RELIABILITY
requires
STANDARDIZATION
Need for Oversight:
What are the problems we are seeking to fix?
Problems in test administration:
Confusion – misdirection
Misunderstanding, lack of preparation
Special directions for some, but not all, students
Inappropriate strategies for test-taking
Guessing strategies (Choose a middle size answer …)
Hurry, skim, look for easy questions first
Improper assistance
Hints (Remember when to invert and multiply …)
Rereading, clarifying test questions
Brief instruction on test content
Cheating …
You wouldn’t want to be caught doing it.
See, for example, 19 ways to raise student scores:
Amrein-Beardsley, A., Berliner, D.C. & Rideau, S. (2010)
“Cheating in the first, second, and third degree: Educators’ responses to high-stakes testing”.
Education Policy Analysis Archives, 18(14). Retrieved from http://epaa.asu.edu/ojs/article/view/714
Outcomes of Improper Influence:
Undermines the usefulness of test scores
for program evaluation and accountability
Encouraged guessing is found to:
1. Reduce the range of measure with respect
to the true range of test-taker achievement.
Reduce general pop. variance by 30%
2. Create widely varying classroom to
classroom test score reliability.
Reduce general pop. reliability by ?
3. Create a test score modulator – with changes
in achievement and proctor influence moving
in the opposite, offsetting, direction.
Mask true achievement gains by 35%
for students at Basic Performance
[Figure: Distribution of Number Correct Scores. Estimated Skills Based and Observed Number Correct Scores (Est. True Score Distribution vs. Observed Score Distribution). X-axis: Grade 5 Reading Test - Number Correct Score. Y-axis: Frequency - Percent of All Students.]
[Figure: True vs. Observed Gains at Min. Score for Passing (Observed vs. True). Y-axis: Percent Gain or Decline.]
Evaluating student response patterns
for evidence of improper influence
Student response patterns:
Taken as a group, students in a classroom succeed or fail with the individual test
questions in a pattern that:
1. Reflects the difficulty of
the test questions and
2. Follows the same fluctuations
as the patterns for other
classrooms at the same
achievement level.
A band of normative deviations
may be established around the
norm for any one achievement
level.
Evaluating a classroom
against its achievement level (peer group) norm
Skill level norm:
All classrooms at the same achievement level
set a peer group or ‘skill level’ norm.
P-value correlation:
One method of comparison
is a correlation of the class
and skill level p-values.
Here, for a 50 item test,
n = 50; r = .95
Percent attempted:
The line with stars indicates
the percent of students who
answer each item.
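The comparison described above can be sketched in a few lines. This is only an illustration, not the A*Star implementation: the response matrix, the norm values, and the function names p_values and pearson_r are all hypothetical, and p-values are computed here as percent correct of attempts.

```python
from math import sqrt

def p_values(responses):
    """Per-item percent correct of attempts and percent attempted.

    responses: one list per student; each entry is 1 (correct),
    0 (wrong) or None (item not attempted).
    """
    n_items = len(responses[0])
    p_vals, attempted = [], []
    for i in range(n_items):
        answers = [student[i] for student in responses if student[i] is not None]
        attempted.append(len(answers) / len(responses))
        p_vals.append(sum(answers) / len(answers) if answers else 0.0)
    return p_vals, attempted

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical 4-student, 5-item class and a hypothetical skill level norm.
class_responses = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, None],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, None],
]
norm_p = [0.95, 0.80, 0.30, 0.70, 0.10]

class_p, pct_attempted = p_values(class_responses)
r = pearson_r(class_p, norm_p)
print(class_p, pct_attempted, round(r, 3))
```

On a real 50-item test the same two functions would be applied to the full class response matrix and to the norm for the class's achievement level.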
Consistency within
and across Assessment Settings
Consistency of Group Test Administrations
When all group profiles are correlated with their appropriate skill
level norms, the distribution of correlation coefficients indicates
the level of test administration consistency
within the assessment setting.
Consistency of Educational
vs. Industrial Assessment
A comparison of classroom
groups with job applicant groups
(tested by employers) indicates
lower consistency in classroom
test administrations.
Schools median: r = .900
Classrooms median: r = .907
Employers median: r = .958
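As a rough sketch of how such medians might be obtained, the snippet below summarizes a set of profile-to-norm correlations for each assessment setting. The correlation values and the summarize helper are hypothetical; they are not the data behind the medians reported here.

```python
from statistics import median

# Hypothetical correlations of group profiles with their skill level norms.
classroom_r = [0.95, 0.91, 0.74, 0.88, 0.93, 0.58, 0.90]
employer_r = [0.97, 0.95, 0.96, 0.92, 0.98]

def summarize(correlations):
    """Return the median and the lowest correlation in a setting."""
    return {"median": median(correlations), "min": min(correlations)}

classroom_summary = summarize(classroom_r)
employer_summary = summarize(employer_r)
print(classroom_summary, employer_summary)
```

A lower median, or a long tail of low correlations, would point to less consistent test administration within that setting.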
Proctor Influence to Guess
Two Low Performing Classrooms
Two Different Forms of Encouragement to Guess
Encouragement to
Guess at Random
Encouragement to
Guess by choosing ‘C’
Guessing Effects within the Classroom
Teacher encouragement to guess is challenged by the variation in student
achievement and need to guess, potentially resulting in under assessment.
The class below has a poor
correlation with the norm (.74).
Guessing by some students and
teacher actions to encourage it
contradict norm patterns.
Full class: n = 18; RS = 22.3; r Corr. = .74
Subgroup: n = 8; RS = 29.4; r Corr. = .80
Subgroup (low performance due to guessing?): n = 10; RS = 16.6; r Corr. = .44
[Figure reference line: 25% Correct]
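A quick arithmetic check, using only the figures reported for this class, confirms that the two subgroups account for the full class: their sizes sum to 18 and their size-weighted mean raw score matches the full-class RS.

```python
# (n, mean raw score) for the two subgroups reported for this class.
subgroups = [(8, 29.4), (10, 16.6)]

total_n = sum(n for n, _ in subgroups)
full_class_rs = sum(n * rs for n, rs in subgroups) / total_n
print(total_n, round(full_class_rs, 1))
```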

Proctor Effects on Summer School Gain
Indications from Summer School
Two major patterns of test work behavior are inferred from the results.
1. Proctor directions:
Answer the questions in order,
take time, work carefully,
and reserve guessing
until the end of the session.
2. Proctor direction to either:
a) First skim the test to
look for the easy questions,
then guess, or
b) Hurry from the beginning,
don’t waste time on difficult
questions, guess and move
on.
Improper Proctor Influence
Proctor influence ranges from positive to moderately negative
to a serious undermining of the assessment. Significant improper
influence leads to measurable deviations in classroom response
patterns.
Moderate scoring class (n=17) with improbable elevated p-values
Response pattern probability: P < 0.01
Correlation with the norm: r = .713
Skill Level Norm RS 27.5
Average scoring class (n=19) with improbable elevated p-values
Response pattern probability: P < 0.001
Correlation with the norm: r = .579
Skill Level Norm RS 30.5
[Figures: A*Star® P-value Profiles, one per class, each showing the Class Profile against the Skill Level Norm. X-axis: Grade 6 Reading Test - Test Question - Item Number. Y-axes: Percent Correct of Attempts - P-value; Percent Attempted.]
Subject Group Analysis
Identify those test-takers most likely to have been
the subject of improper proctor influence
- Determine the expected frequency at each answer alternative for each
total test score based on observed frequencies in the state population
- Identify item responses consistent with the group pattern irregularity
- Identify the joint likelihood of these item responses at each score level
- Apply the likelihood estimate to each score level in the group
- Sum the observed and expected frequencies for the group
- Compare the observed to the expected frequency via the binomial
probability distribution.
Determine the # of students and the # test answers involved
and the likelihood of this combination occurring in classrooms
of the same size and at the same achievement level.
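The final comparison step can be illustrated with a simplified binomial calculation. This is a sketch under stated assumptions, not the A*Star procedure: the function names and the example figures are hypothetical, and the responses are treated as independent trials that share the group's average expected probability.

```python
from math import comb

def binomial_upper_tail(observed, trials, p):
    """P(X >= observed) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(observed, trials + 1)
    )

def subject_group_probability(expected_probs, observed_matches):
    """Compare observed to expected matching responses.

    expected_probs: hypothetical per-response likelihoods, one for each
    student-by-item response in the group, taken from population
    frequencies at each student's score level.
    observed_matches: how many of those responses matched the irregular
    group pattern.
    Simplifying assumption: responses are independent trials sharing
    the average expected probability.
    """
    trials = len(expected_probs)
    mean_p = sum(expected_probs) / trials
    return binomial_upper_tail(observed_matches, trials, mean_p)

# Hypothetical group: 8 students x 5 items, each response expected to
# match the pattern 20% of the time, with 30 of 40 observed matches.
expected = [0.20] * 40
p_value = subject_group_probability(expected, 30)
print(f"{p_value:.2e}")
```

Summing per-response expectations keeps the calculation tied to each student's own score level, which matters because stronger students are expected to choose different answer alternatives than weaker ones.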
Subject Group Analysis
Identify those most likely subject to improper influence
Most often, improper teacher influence is unplanned and disorganized.
Yet, where the influence is persistent, subsets of students will be identified
with matching, unlikely response patterns.
Subject Group: n = 8 of 17
Response pattern probability P = 3.43e-9
Correlation with the norm: r = .498
Subject Group: n = 12 of 19
Response pattern probability: P = 6.68e-15
Correlation with the norm: r = .342
[Figures: A*Star® P-value Profiles comparing each Subject Group Profile to its Class Skill Level Norm.
Subject Group (n=8): subset of students likely to have received assistance; response pattern probability P = 3.43e-9.
Subject Group (n=12): subset of students likely to have received assistance; response pattern probability P = 6.68e-15.
Skill Level Norms shown: RS 29.3 and RS 33.2.
X-axis: Grade 6 Reading Test - Test Question - Item Number. Y-axis: Percent Correct of Attempts - P-value.]
School Administrator Involvement
School administrator involvement is indicated
when a highly irregular (highly improbable)
response pattern is found that crosses over classrooms.
Full school (n=69) response pattern
reveals a substantial irregularity over
the early test items.
Subset group (n=30) includes students
from several classrooms, indicating an
influence from outside the classroom.
Confirmation of Improper Influence
How do we know that irregular response patterns
indicate improper test administration?
Confirmation for statistical analysis
Testing program: “Ability-To-Benefit” testing: Basic reading and math skills
Program of the Office of Federal Student Aid
Analyses by:
Most major test publishers
Reviewed by:
U. S. Dept. of Education, Office of Inspector General
OIG Report: Final Management Information Report, Jan. 25, 2010
Available at: www2.ed.gov/about/offices/list/oig/
alternativeproducts/x11j0002.pdf
Summary:
OIG data analytics project investigated 106 test
administrators indicated by the A*Star method;
83 were identified by the OIG while an unspecified
number of others were not investigated due to their
small number of test administrations after applying
the statute of limitations.
Defining “Significant” Misadministration
What constitutes “significant” cases of misadministration (cheating)?
Number of test items affected
Improper influence on any test item is wrong, but influence on only a few
items is more likely an effort to facilitate the test administration rather than
to materially raise test scores.
Number of students involved
My sense is that a large number of items for a few students is a greater
problem than a few items for a large number of students: the latter may
reflect a perceived problem with the items, while the former suggests an
effort to raise the scores of lower performing students.
Improbability of response pattern
Any probability less than 1 in 10,000 is significant, but common, wrong
answers create unusually low probabilities that may overshadow more
important problems. A “six sigma” approach is conservative.
Definition used here:
Minimum 10% of test items
Minimum #SGA students times #SGA items = 5% of all responses
Probability less than 1 in 100,000 (less than 10 in one million)
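The three-part definition can be written as a simple check. This is a minimal sketch: the function name, its parameters, and the example figures are hypothetical, and "all responses" is assumed to mean every student-by-item response in the group under review.

```python
def is_significant(n_students, n_items, sga_students, sga_items, probability):
    """Apply the three-part definition of significant misadministration."""
    item_share = sga_items / n_items
    response_share = (sga_students * sga_items) / (n_students * n_items)
    return (
        item_share >= 0.10          # minimum 10% of test items
        and response_share >= 0.05  # minimum 5% of all responses
        and probability < 1e-5      # less than 1 in 100,000
    )

# Hypothetical class of 19 students on a 50-item test.
flagged = is_significant(19, 50, sga_students=12, sga_items=10, probability=6.68e-15)
not_flagged = is_significant(19, 50, sga_students=2, sga_items=4, probability=1e-7)
print(flagged, not_flagged)
```

Requiring all three conditions together keeps a very small probability on only a handful of items from being treated as a significant case.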
Audit Summary of Large Urban District - 2001
2001 Urban District
Elementary Schools
Grade 5 Math
Chart of schools by the A*Star
analysis of each school’s Grade 5
Math test response pattern – plot
by the volume of test responses
potentially subject to improper
influence and by the improbability
of the pattern occurring in a
normative test administration.

Response Pattern         Pct. of Schools
Consistent with norms    85%
Modest irregularity      12%
Severe irregularity      3%

[Chart: schools plotted by Pct. of Group Responses (0.0% to 25.0%) against SGA Probability Exponent (n/s, E-01 through E-13-45; labeled Probable, Improbable, Highly Improbable). Total: 667 schools.]

Schools by SGA Probability Exponent:
SGA Prob.    Freq.    Percent
n/s          181      27.1%
E-01         7        1.0%
E-02         104      15.6%
E-03         152      22.8%
E-04         96       14.4%
E-05         53       7.9%
E-06         19       2.8%
E-07         15       2.2%
E-08         12       1.8%
E-09         7        1.0%
E-10         7        1.0%
E-11         3        0.4%
E-12         2        0.3%
E-13-45      4        0.6%
Ability-To-Benefit Testing - 2002-2005
Reviewed by Office of Inspector General
2002-2005
Nationally distributed
Occupational Training Schools
Basic Math Skills
Chart of schools by the A*Star
analysis of each school’s student
applicants’ test response pattern.
Plot by the volume of test responses
potentially subject to improper
influence and by the improbability
of the pattern occurring in a
normative test administration.

Response Pattern         Pct. of Schools
Consistent with norms    67%
Modest irregularity      19%
Severe irregularity      14%

[Chart: schools plotted by Pct. of Group Responses (0.0% to 25.0%) against SGA Probability Exponent (n/s, E-01 through E-13-45; labeled Probable, Improbable, Highly Improbable). Total: 750 schools.]

Schools by SGA Probability Exponent:
SGA Prob.    Freq.    Percent
n/s          199      26.5%
E-01         22       2.9%
E-02         100      13.3%
E-03         85       11.3%
E-04         75       10.0%
E-05         54       7.2%
E-06         26       3.5%
E-07         20       2.7%
E-08         11       1.5%
E-09         15       2.0%
E-10         8        1.1%
E-11         13       1.7%
E-12         10       1.3%
E-13-45      112      14.9%
Total        750      100.0%
East Coast State - 2008
East Coast Statewide Review
Elementary Schools
Grade 4 Math
Chart of schools by the A*Star
analysis of each school’s Grade 4
Math test response pattern – plot
by the volume of test responses
potentially subject to improper
influence and by the improbability
of the pattern occurring in a
normative test administration.

Response Pattern         Pct. of Schools
Consistent with norms    37%
Modest irregularity      41%
Severe irregularity      22%

[Chart: schools plotted by Pct. of All Responses (0.0% to 25.0%) against SGA Probability Exponent (n/s, E-01 through E-13-45; labeled Probable, Improbable, Highly Improbable). Total: 1,311 schools.]

Schools by SGA Probability Exponent:
SGA Prob.    Freq.    Percent
n/s          148      11.3%
E-01         15       1.1%
E-02         165      12.6%
E-03         279      21.3%
E-04         231      17.6%
E-05         147      11.2%
E-06         120      9.2%
E-07         67       5.1%
E-08         38       2.9%
E-09         41       3.1%
E-10         20       1.5%
E-11         8        0.6%
E-12         5        0.4%
E-13-45      27       2.1%
Total        1,311    100.0%
Addressing the problem
There are significant deviations
from standardized test administration
that materially undermine the usefulness of test scores.
What do we do about it?
- Comprehensive instructions for school administrators and teachers
- Regular review of test results for misadministration
- Communication with educators when irregularities arise
- Reserve sanctions for persistent cases of misadministration.
Case in point: Whole test manipulation
Same Teacher – Two Successive Years
The first year
- begins normally and becomes increasingly irregular.
The second year
- begins irregular and continues over the entire test
suggesting a preplanned intent to control the testing outcome.
First Year - Grade 5 Reading: correlation with the norm r = .750
First Year - Grade 5 Math: correlation with the norm r = .527
Second Year - Grade 5 Reading: correlation with the norm r = .487
Not caught by erasure analysis due to
low pct. wrong to right (40%).
Self Correction - following notice
Oversight improves proctoring
Following the second year reading test, the teacher was notified that
her testing practices were under investigation. Three weeks later, her
administration of the math test was remarkably improved.
Note:
The teacher was not given
any instruction on how to
change her test administration
practices. She was only told
that irregularities had been
found in her students’ test
answers. The next test
administration resulted in an
essentially perfect response
pattern.
Grade 5 Math – Second Year
Pattern correlation with the norm: r = .947
Oversight Program Steps
Step 1: Directions for Test Administration
Provide comprehensive directions for
managing the classroom
and conducting the test administration
Address all issues that arise in test administration
For example:
1. How students should respond when they do not know the answer.
e.g. Guess? Use ‘test wise’ guessing strategies? Leave blank?
2. How to deal with apparent misalignment of the test with the curriculum.
i.e. When there is test material not included in classroom instruction.
3. What should the teacher do when she/he notices a student mark a
wrong answer when she/he knows the student knows the correct answer?
Test directions should be rewritten following meetings with school
administrators and teachers to review response pattern research
and address all issues challenging standardized test administration.
Oversight Program Steps
Step 2: Conduct Regular Review of Test Response Patterns
Conduct annual review of test response patterns,
for each test, by classroom and by school.
The review should:
Identify:
1. Problems in the test construction or in the directions for
test administration.
2. Locations with irregular results and likely misadministration.
Provide:
3. Trend analysis for each classroom and school
Have past irregularities been cured? Is there a sudden
change in the quality of test administration?
4. Resource to evaluate complaints or allegations of misconduct.
5. Resource to evaluate test score input to teacher evaluation.
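The trend analysis in item 3 could be sketched as a year-to-year comparison of each classroom's correlation with the norm. The function name, the 0.15 threshold, and the example history are hypothetical choices for illustration only; an actual review would set its threshold from the observed distribution of correlations.

```python
def flag_sudden_changes(history, threshold=0.15):
    """Flag year-to-year changes in correlation with the norm.

    history: list of (year, r) pairs for one classroom, oldest first.
    threshold: hypothetical minimum absolute change worth a closer look.
    """
    flags = []
    for (_, prev_r), (year, r) in zip(history, history[1:]):
        change = r - prev_r
        if abs(change) >= threshold:
            direction = "improved" if change > 0 else "declined"
            flags.append((year, direction, round(change, 3)))
    return flags

# Hypothetical classroom history.
history = [(2009, 0.93), (2010, 0.75), (2011, 0.49), (2012, 0.95)]
flags = flag_sudden_changes(history)
print(flags)
```

Both directions are flagged: a decline may signal a new irregularity, while a sharp improvement may show that a past irregularity has been cured.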
Oversight Program Steps
Step 3: Report review results to administrators & educators
Report results and recommendations
Prepare a written report for each administrator and teacher/test
administrator.
1. Measure of consistency with norms for the
same achievement level.
2. To the extent possible, indicate areas
of test administration procedures
for future improvement.
Oversight Program Steps
Step 4: Remedial steps for instances of
particularly severe or repeated irregularities
Remedial Steps:
1. Meet with assessment liaison or trainer to review areas of
needed improvement.
2. Assign to a professional development test administration program.
3. Assign a monitor for the next assessment session.
4. Provide a substitute test administrator for the next test session.
5. Conduct investigation leading to potential sanctions.
Summary of Oversight of Test Administration
A focus on measurement through standardized assessment
Oversight of Test Administration
1. Provide a comprehensive set of written directions for school
administrators and teachers.
2. Conduct annual reviews of test response patterns.
3. Provide timely test administration reports to all involved.
4. Provide a series of steps to inform, train, and motivate test
administrators to improve practices where necessary.
5. Provide sanctions for test administrators who fail to improve
practices over multiple test administrations.
6. Use experience to improve directions, methods of analysis and
interpretation, and methods of communication and training.