Developing the Vertical Scale for the Connecticut Mastery Test

Hariharan Swaminathan
University of Connecticut

Thanks to:
Barbara Beaudin
Bill Congero
Mohammed Dirir
Gil Andrada
Steve Martin
H. Jane Rogers
CT State Department of Education
WWW.STATE.CT.GOV/SDE
Connecticut's Testing Context

• The Connecticut Mastery Test (CMT)
• Criterion‐referenced test
• First administered in 1985 to Grades 4, 6, and 8
• Subjects: Mathematics, Reading, and Writing
• Performance levels within each grade: Below Basic, Basic, Proficient, Goal, and Advanced
Connecticut's Testing Context

• In 2006, the test was expanded to Grades 3 through 8
• Science was added in Grades 5 and 8
• Within each grade and subject area, test forms are equated across each generation of the test.
Testing Context (cont.)

• The content of the Reading and Mathematics tests differs across grades, and the standards (cut scores on a scale from 100 to 400) establishing the performance levels were developed independently for each grade.
• Results are reported for the individual child, by grade within school, grade within district, and grade within the state.
• Reports are also disaggregated by gender, race/ethnicity, poverty, special education, and ELL status.
Testing Context (cont.)

• Districts could use the results to answer questions like: How well are this year's fifth grade students doing in Mathematics?
• If the characteristics of the student populations, such as SES and race/ethnicity, are similar across schools within the district, they could compare performance across schools.
Testing Context (cont.)

• Districts could compare the performance of subgroups of students: Are our fifth grade female students performing as well as or better than their male counterparts in Mathematics?
• If last year's fifth graders are demographically similar to this year's, for comparative purposes they could ask: Are this year's students doing as well as or better in Mathematics than last year's fifth graders?
• If the demographics of the district's student population are similar to other districts in the state, they could ask: Are we doing as well as or better in grade five Mathematics than these other districts with similar populations?
Testing Context (cont.)
NCLB AYP

• STATUS MODEL
  States are required to set a target, the ANNUAL MEASURABLE OBJECTIVE (AMO): the percent of students achieving Proficiency or above. The AMO must be set in such a way that ALL students achieve proficiency in ALL subjects by 2014.
• "GROWTH" MODEL
  Performance of students in the same grade is compared across years. This requires that the tests are "equated" from year to year: HORIZONTAL EQUATING.
Testing Context (cont.)

• NCLB Requirements:
  With NCLB's heightened emphasis on accountability, school administrators felt that the status measure (% Proficient) used to determine AYP was limited in conveying their students' performance and the academic progress their students were making.
• This was particularly true for districts with large proportions of low‐income students.
Testing Context (cont.)
GROWTH ACROSS GRADES

• Districts were eager to answer questions such as: How much have our students progressed or grown in Mathematics from Grade four to Grade five?
• As a result, district staff, advocacy groups, and the media often used inappropriate comparisons like "an average increase of 23 scale score points from Grade four to Grade five" or "an increase of six percentage points in the proportion of students performing at the goal level from Grade four to five."
Testing Context (cont.)

• Reporting the scale score increase would be inappropriate:
  • the fifth grade scale is independent of the fourth grade scale
  • the two scales are not a single scale.
Testing Context (cont.)

• Reporting an increase in the percentage of students scoring at or above the 'goal' level would be inappropriate:
  • the fifth grade content is different from the fourth grade content
  • the standards are independent of each other
  • 'goal' does not mean the same thing for both grades
  • students could have grown between Grade four and five but not met the goal standard.
VERTICAL SCALE

• In 2005, Connecticut instituted a State Assigned Student Identifier (SASID) for every public school student in the state.
• This provided a key for linking student performance across grades and linking information about students across data files.
• 2006 was the first year of the new generation of the CMT, administered to students in Grades 3 through 8.
VERTICAL SCALE

• The structures were now in place to create a vertical scale for Mathematics and Reading, providing the Connecticut public with a vehicle to measure "growth" or progress over time.
What are the Merits of a Vertical Scale?

The new vertical scale
• ranges from 200 to 700
• allows schools to measure student progress across a body of content knowledge ranging from Grade 3 to Grade 8 for Mathematics and Reading.
What are the Merits of a Vertical Scale?

• Each point on the scale represents a level of student understanding, independent of the grade the student is in.
• So, a score of 490 on the Mathematics scale for a fourth grader represents the same level of knowledge as a 490 for a fifth grader.
• 'Growth' can be measured across grades as the difference between the score in Year (n+1) and the score in Year (n).
What Practical Solutions Does the Vertical Scale Provide?

• For accountability, GROWTH
  • provides an additional indicator of school performance (Value Added Models)
  • supplements the status measure (performance level)
  • provides an indicator for measuring the effectiveness of new curriculum or professional development initiatives (Value Added Models).
STATES THAT HAVE VERTICAL SCALES

• ARIZONA (3‐8)
• ARKANSAS (3‐8)
• FLORIDA (3‐10)
• IOWA ITBS (3‐8)
• NORTH CAROLINA (3‐8)*
• TENNESSEE (3‐8)*
Developing the Vertical Scale

• Framework for Developing the Vertical Scale
• Data Collection Design
• Scaling Procedures
• Adjusting the Tails of the Scale
• Value Added Models
Framework

• Constructing a vertical scale requires a different measurement framework from "classical" methods

THE "OLD" WAY: CLASSICAL TEST THEORY

OBSERVED SCORE (X) = TRUE SCORE (T) + ERROR (E)
CLASSICAL TRUE SCORE MODEL
• Based on weak assumptions
• Both T and e are unobserved
• Intuitively appealing for its simplicity
• Model can be extended to study different sources of error
INDICES USED IN TRADITIONAL TEST CONSTRUCTION
Item Indices:
• Item difficulty – Proportion of examinees answering the item correctly
• Item discrimination – Correlation between Item Score and Total Score
Examinee Index:
• Test score
CLASSICAL ITEM AND EXAMINEE INDICES

• Item difficulty and item discrimination depend on the group of examinees
• Test score depends on the items
What's wrong with these indices?

They are SAMPLE DEPENDENT, i.e., they change as the sample of people or the sample of items changes ‐ NOT INVARIANT
And what's wrong with that?

• We cannot compare item characteristics for items whose indices were computed on different subgroups of examinees
• We cannot compare the test scores of individuals who have taken different subsets of test items
Wouldn't it be nice if ….?

• Our item indices did not depend on the characteristics of the sample of individuals on which the item data was obtained
• Our examinee measures did not depend on the characteristics of the sample of items that was administered
ITEM RESPONSE THEORY solves the problem!*

* Certain restrictions apply. Individual results may vary. IRT is not for everyone, including those with small samples. Side effects include yawning, drowsiness, difficulty swallowing, and nausea. If symptoms persist, consult a psychometrician.

For more information, see Hambleton and Swaminathan (1985), and Hambleton, Swaminathan and Rogers (1991).
ITEM RESPONSE THEORY

postulates that the probability of a correct response to an item depends on the ability of the examinee and the characteristics of the item:

ABILITY > DIFFICULTY: PROBABILITY OF CORRECT ANSWER IS HIGH
ABILITY < DIFFICULTY: PROBABILITY OF CORRECT ANSWER IS LOW

The mathematical relationship between the probability of a response, the ability of the examinee, and the characteristics of the item is specified by the ITEM RESPONSE MODEL.
Item Characteristics

An item may be characterized by its
• DIFFICULTY level (usually denoted as b),
• DISCRIMINATION level (usually denoted as a),
• "PSEUDO‐CHANCE" level (usually denoted as c).
One‐Parameter Model

P(correct response) = e^(θ − b) / (1 + e^(θ − b))

θ = trait value
b = item difficulty
IRT ITEM DIFFICULTY

• Differs from classical item difficulty
• b is the θ value at which the probability of a correct response is .5 (see the sketch after this list)
• The harder the item, the higher the b
• Hence, b is on the same scale as θ and does not depend on the characteristics of the sample
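Below is a minimal sketch, not part of the original slides, of the one‐parameter ICC in Python. It checks the property noted in the list above: an examinee whose θ equals the item's b answers correctly with probability .5. All parameter values are illustrative.

```python
import math

def p_1pl(theta, b):
    """One-parameter (Rasch) model: probability of a correct response."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

# An examinee whose trait value equals the item difficulty answers
# correctly with probability 0.5, wherever the item sits on the scale.
print(p_1pl(theta=-1.5, b=-1.5))   # 0.5
print(p_1pl(theta=0.5,  b=-1.5))   # ~0.88 -- theta well above b
print(p_1pl(theta=-3.0, b=-1.5))   # ~0.18 -- theta well below b
```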
[Figure: One-Parameter (Rasch) Model – item characteristic curve for b = -1.5; probability of correct response plotted against theta (proficiency), -5 to 5.]
[Figure: One-Parameter (Rasch) Model – ICCs for b = -1.5 and b = 0.]
[Figure: One-Parameter (Rasch) Model – ICCs for b = -1.5, b = 0, and b = 1.5.]
Two‐Parameter Model

P(correct response) = e^(a(θ − b)) / (1 + e^(a(θ − b)))

θ = trait value
b = item difficulty
a = item discrimination
IRT ITEM DISCRIMINATION

• Differs from classical item discrimination
• a is proportional to the slope of the ICC at θ = b (illustrated numerically in the sketch below)
• The slope of the item response function indicates how much the probability of a correct response changes for individuals with slightly different θ values, i.e., how well the item discriminates between them
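The following short numerical check is illustrative rather than part of the source. Under this parameterization the slope of the 2P curve at θ = b works out to a/4, so a larger a means the item separates examinees near its difficulty more sharply.

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def slope_at_b(a, b, h=1e-5):
    """Numerical slope of the ICC at theta = b (central difference)."""
    return (p_2pl(b + h, a, b) - p_2pl(b - h, a, b)) / (2 * h)

for a in (0.5, 2.0):
    print(a, round(slope_at_b(a, b=0.0), 3))   # ~0.125 and ~0.500, i.e. a/4
```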
[Figure: Two-Parameter Model – ICC annotated with the discrimination a (slope) and the difficulty b.]
[Figure: Two-Parameter Model – ICC for a = 0.5, b = -0.5.]
[Figure: Two-Parameter Model – ICCs for a = 0.5, b = -0.5 and a = 2.0, b = 0.0.]
[Figure: Two-Parameter Model – ICCs for a = 0.5, b = -0.5; a = 2.0, b = 0.0; and a = 0.8, b = 1.5.]
Three‐Parameter Model

P(correct response) = c + (1 − c) e^(a(θ − b)) / (1 + e^(a(θ − b)))

c = pseudo‐chance level ("guessing") parameter
IRT ITEM GUESSING PARAMETER
• No analog in classical test theory
• c is the probability that an examinee with very low θ will answer the item correctly
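A hedged sketch of the three‐parameter model, using one of the parameter sets from the plots that follow (a = 2.0, b = 0.0, c = 0.25); it shows that c is the floor the curve approaches for very low θ.

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter model: c is the lower asymptote ('guessing')."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

print(round(p_3pl(-5.0, a=2.0, b=0.0, c=0.25), 3))  # ~0.250: the floor is c
print(round(p_3pl( 0.0, a=2.0, b=0.0, c=0.25), 3))  # 0.625: (1 + c)/2 at theta = b
print(round(p_3pl( 5.0, a=2.0, b=0.0, c=0.25), 3))  # ~1.000
```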
[Figure: Three-Parameter Model – ICC annotated with the lower asymptote c.]
[Figure: Three-Parameter Model – ICC for a = 0.5, b = -0.5, c = 0.0.]
[Figure: Three-Parameter Model – ICCs for a = 0.5, b = -0.5, c = 0.0 and a = 2.0, b = 0.0, c = 0.25.]
[Figure: Three-Parameter Model – ICCs for a = 0.5, b = -0.5, c = 0.0; a = 2.0, b = 0.0, c = 0.25; and a = 0.8, b = 1.5, c = 0.1.]
BUT WAIT, there's more!
IRT Models for Polytomous Item Responses

• Polytomous IRT models specify the probability that an examinee will respond in (or be scored in) a given response category
Graded Response Model

• For each response category, the probability that an examinee will respond in that category or higher vs. a lower category is given by:

P(u ≥ k) = e^(a(θ − b_k)) / (1 + e^(a(θ − b_k)))
Graded Response Model

• The probability that the examinee will choose a given response category (or be scored in it) is thus

P(u = k) = P(u ≥ k) − P(u ≥ k + 1)
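An illustrative sketch of the graded response computation just described; the discrimination and thresholds are hypothetical, not CMT item parameters.

```python
import math

def p_at_or_above(theta, a, b_k):
    """GRM boundary curve: P(response in category k or higher)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b_k)))

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for an item with ordered thresholds b_1 < ... < b_m.
    By definition P(u >= lowest category) = 1 and P(u >= m + 1) = 0."""
    bounds = [1.0] + [p_at_or_above(theta, a, b) for b in thresholds] + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical 4-category item
probs = grm_category_probs(theta=0.5, a=1.2, thresholds=[-1.0, 0.0, 1.5])
print([round(p, 3) for p in probs], round(sum(probs), 3))  # probabilities sum to 1
```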
Partial Credit Model

• For every pair of adjacent categories, specify the probability that the examinee will choose the higher of the two (or be scored in the higher):

P(u = k | u = k − 1 or k) = e^(a(θ − b_k)) / (1 + e^(a(θ − b_k)))
Partial Credit Model

• This results in the model

P(u = k) = exp(Σ_{r=1}^{k} a(θ − b_r)) / (1 + Σ_{s=1}^{m−1} exp(Σ_{r=1}^{s} a(θ − b_r)))
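A matching sketch for the partial credit model, again with hypothetical step difficulties. Category 0 corresponds to the empty sum in the numerator, i.e. exp(0) = 1.

```python
import math

def pcm_category_probs(theta, a, steps):
    """Partial credit model: probability of being scored in category k = 0..m,
    where steps[r] is the step difficulty for moving up into category r + 1."""
    numerators = [1.0]           # category 0: empty sum, exp(0) = 1
    cum = 0.0
    for b_r in steps:
        cum += a * (theta - b_r)
        numerators.append(math.exp(cum))
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical 0-3 point item
probs = pcm_category_probs(theta=0.0, a=1.0, steps=[-1.0, 0.5, 1.0])
print([round(p, 3) for p in probs], round(sum(probs), 3))  # probabilities sum to 1
```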
[Figure: Polytomous Item Response Model – category response curves (probability plotted against proficiency, -6 to 6).]
MORE?
No, that’s enough for now
ASSUMPTIONS UNDERLYING IRT

• For the most commonly used models, unidimensionality of the latent trait is assumed
• The mathematical model is assumed to correctly specify the relationship between the trait and item characteristics and performance on the item
FEATURES OF IRT

When the model fits the data,
1. Item parameters are independent of the population of examinees, i.e., invariant across sub‐groups
2. Ability parameters are independent of the subsets of items administered
IMPLICATIONS

1. Different examinees can take different items. Their ability values will be comparable!
2. Item parameter values will not be affected when different samples of examinees respond to the item.
IMPLICATIONS FOR VERTICAL SCALE

1. We can compare the performance of examinees who take two different tests (e.g., Grade 4 and Grade 5) if …..
2. We can somehow LINK the items in the two tests and place them on the same scale.
How do we do this?

1. We give Grade 4 and Grade 5 examinees some common items.
2. Since the item parameter values of these common items will be the same for Grade 4 and Grade 5 examinees, we can use this information to place all the Grade 4 and Grade 5 items on a common scale (a minimal linking sketch follows this list).
3. Once the item parameter values are on a common scale, we can administer any subset of items to examinees in Grades 4 and 5.
4. Since the ability values based on different subsets of items are on the same scale, we can compare the ability values of examinees in the two grades.
5. This design, where examinees in two grades are given common items, is known as the ANCHOR TEST DESIGN.
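One common way to carry out the placement step is mean/sigma linking on the anchor‐item difficulty estimates. The sketch below is illustrative only (the anchor‐item values are made up), and it is not claimed to be the procedure used for the CMT.

```python
import statistics

def mean_sigma_link(b_source, b_target):
    """Mean/sigma linking constants that put source-scale estimates on the
    target scale: b_on_target = A * b_on_source + B."""
    A = statistics.stdev(b_target) / statistics.stdev(b_source)
    B = statistics.mean(b_target) - A * statistics.mean(b_source)
    return A, B

# Hypothetical difficulties of the common (anchor) items, estimated
# separately in the Grade 4 and Grade 5 calibrations
b_grade4 = [-0.8, -0.2, 0.4, 1.1, 1.6]
b_grade5 = [-1.5, -0.9, -0.3, 0.5, 1.0]

A, B = mean_sigma_link(b_grade4, b_grade5)
print(round(A, 3), round(B, 3))
# All Grade 4 item and ability estimates can then be transformed as A * x + B,
# placing the two grades on one scale.
```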
VERTICAL SCALING

• IRT is used for all vertical scaling implementations
• Decisions to be made in carrying out vertical scaling with IRT:
  • Data collection design
  • Choice of IRT model
  • Estimation procedure
  • Scaling procedure
Data Collection Design

• Scaling test design
  • A test containing items that span all grades is administered to all students
• Common item design
  • Examinees in each grade take the on‐grade test and some items from adjacent grade tests
Choice of Model

• 1P model used in some state assessments (e.g., CT, DE)
• 3P model used in other states (e.g., MI, FL)
Estimation Procedure

• Different calibration programs use different estimation procedures
• WINSTEPS fits a 1P/PCM using an unconditional maximum likelihood procedure
• BILOG‐MG, PARSCALE, and MULTILOG use marginal maximum likelihood estimation
Scaling Procedures
• Different IRT procedures are available:
• Separate group calibration
• Concurrent calibration
• Fixed parameter calibration
Data Collection Design

            ITEMS
STUDENTS    G3      G4      G5      G6      G7      G8
G3          OP33    SU34
G4          SU43    OP44    SU45
G5                  SU54    OP55    SU56
G6                          SU65    OP66    SU67
G7                                  SU76    OP77    SU78
G8                                          SU87    OP88

(OP = operational on‐grade items; SU = supplemental items from an adjacent grade; the subscripts give the student grade and the item grade.)
[Figure: Theta distributions for Mathematics across grades.]
[Figure: Theta distributions for Reading across grades.]
Vertically Scaled Scores

• The theta values are like z scores – the mean is zero and the SD is 1.
• The theta values can be linearly transformed (like z scores) to span any range or to have a certain mean and SD (a small sketch follows this list).
• The theta scale is transformed to have a score range from 200 to 700.
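A small sketch of such a linear transformation. The slope and intercept below are placeholders; the slides give only the reporting range, not the actual CMT scaling constants.

```python
def theta_to_vss(theta, slope=50.0, intercept=450.0, lo=200, hi=700):
    """Linear transformation of theta to a 200-700 reporting scale.
    slope and intercept are illustrative placeholders, not the CMT constants;
    results are truncated so reported scores stay within the range."""
    vss = slope * theta + intercept
    return int(max(lo, min(hi, round(vss))))

print(theta_to_vss(0.0))    # 450 under these placeholder constants
print(theta_to_vss(1.5))    # 525
print(theta_to_vss(-8.0))   # 200 -- truncated at the floor
```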
[Figure: Cut scores for proficiency levels – Mathematics. Scaled score (roughly 350-650) plotted against Grades 3-8, with lines for the Basic, Proficient, Goal, and Advanced cut scores.]
[Figure: Cut scores for proficiency levels – Reading. Scaled score (roughly 350-600) plotted against Grades 3-8, with lines for the Basic, Proficient, Goal, and Advanced cut scores.]
The Tale of Two Tails: Adjustments to the Scale

[Figure 1: Relationship between raw score (0-80) and θ (-6 to 6).]
Relationship of the Vertically Scaled Score (VSS) to the Raw Score r

The VSS is a linear transformation of θ. Hence

r = Σ_{j=1}^{n} P_j(VSS)
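In other words, the expected raw score at a given θ (and hence at a given VSS) is the sum of the item response probabilities, i.e. the test characteristic curve. A small illustrative sketch with made-up item parameters:

```python
import math

def expected_raw_score(theta, items):
    """Test characteristic curve: the expected raw score at theta is the sum
    of the item probabilities (3P form; a 1P item has a = 1 and c = 0)."""
    return sum(c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
               for a, b, c in items)

# A toy five-item 'test' (parameters are illustrative, not CMT values)
items = [(1.0, -1.0, 0.0), (1.0, -0.5, 0.0), (1.0, 0.0, 0.0),
         (1.0, 0.5, 0.0), (1.0, 1.5, 0.0)]
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(expected_raw_score(theta, items), 2))
```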
[Figure: Relationship between raw score (0-80) and vertically scaled score (0-680).]
Problem (?)

Small changes in raw score at the extremes result in large changes in the Vertically Scaled Scores.

This may cause concern to users when a child progresses from one grade to the next.
[Figure 2: Relationship between vertical scale score (90-420) and raw score (0-50) for children progressing from Grade 3 to Grade 4 (lower end – Reading).]
[Figure 3: Relationship between vertical scale score (360-650) and raw score (35-85) for children progressing from Grade 3 to Grade 4 (upper end – Reading).]
Table 1. Gain in Unadjusted Vertical Scale Score Corresponding to an Increase in Raw Score (Lower End)

Grade 3     Gain in Unadjusted Vertical Scale Score (Grade 4)
Raw Score   +1     +2     +3     +4
0           87     120    139    154
1           57     76     91     102
2           44     58     70     79
3           39     50     60     68
4           36     46     54     61
5           34     42     50     56
6           33     40     47     53
7           32     39     45     50
8           31     37     43     48
9           31     36     41     46
10          30     35     40     45
11          30     35     39     44
12          29     34     38     43
13          29     33     38     42
14          29     33     37     41
15          28     32     36     40
Is this a Real Problem?

• This is a natural consequence of a nonlinear transformation
• This is not a problem inherent in the vertical scale
• Increases (or decreases) at the extremes affect only a small percent of test takers
• However, it is an issue of FACE VALIDITY that should be addressed

Solution

• Adjustments are made to the two tails
• Linear interpolation at the extreme ends resolves the problem
[Figure (shown in three stages): Adjustment to Grades 3 and 4 (lower end) – vertical scale score (90-420) plotted against raw score (0-50).]
Steps

1. Choose a suitable region where the curve is almost linear – where the rate of change is almost constant.
2. Fit a regression line (or an analytical line) and extend this line (a simplified sketch follows these steps).
3. Check to see if the growth (rate of change) is similar to that in the middle of the score range.
4. Carry out the procedure across two grades, e.g., from Grade 3 to 4, checking to make sure the growth from one grade to the next is almost uniform across the raw score range.
5. Keeping the Grade 4 line fixed, carry out the procedure across Grades 4 and 5.
6. If the change from Grade 4 to 5 is not uniform, reexamine the growth across grades and modify the line(s) until the growth from Grades 3 to 4 and from 4 to 5 is "reasonable".
7. Repeat this process across all the grades.
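A simplified sketch of the idea behind Steps 1 and 2: fit a straight line to a region where the gain per raw-score point is roughly constant, then extend it through the tail. The raw-score-to-VSS values here are synthetic; the actual adjustment was carried out grade by grade as the steps describe.

```python
import numpy as np

def adjust_tail(raw, vss, linear_region, tail):
    """Replace VSS values in the tail with values from a line fitted to a
    region of the raw-score-to-VSS curve where growth is nearly constant."""
    slope, intercept = np.polyfit(raw[linear_region], vss[linear_region], 1)
    adjusted = vss.astype(float)
    adjusted[tail] = slope * raw[tail] + intercept
    return adjusted

# Synthetic curve that drops steeply below a raw score of 5
raw = np.arange(0, 21)
vss = np.where(raw < 5, 300 + 40 * (raw - 5), 300 + 10 * (raw - 5))

adjusted = adjust_tail(raw, vss, linear_region=list(range(5, 15)),
                       tail=list(range(0, 5)))
print(adjusted[:6])   # the lower tail now continues the ~10-points-per-item trend
```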
Table 2. Gain in Adjusted Vertical Scale Score Corresponding to an Increase in Raw Score (Lower End)

Grade 3     Gain using Adjusted Vertical Scale Score (Grade 4)
Raw Score   +1     +2     +3     +4
0           30     34     39     43
1           29     34     39     43
2           29     34     39     43
3           29     34     38     43
4           29     34     38     43
5           29     34     38     43
6           29     33     38     43
7           29     33     38     42
8           29     33     38     43
9           28     33     38     43
10          28     33     38     43
11          28     33     38     42
12          29     33     38     42
13          28     33     37     41
14          28     32     37     40
15          28     32     36     40
[Figure (shown in three stages): Adjustments to Grades 3 and 4 (upper end) – vertical scale score (360-650) plotted against raw score (35-85).]
Table 4. Gain in Adjusted Vertical Scale Score Corresponding to an Increase in Raw Score (Upper End)

Grade 3     Gain using Adjusted Vertical Scale Score at the upper end of the scale
Raw Score   +1     +2     +3     +4
70          11     15     20     24
71          11     15     20     24
72          11     16     20     24
73          11     16     20     24
74          11     16     20     25
75          11     16     20     25
76          11     16     20     25
77          11     16     20     25
78          12     16     20     25
79          12     16     20     25
80          12     16     21     25
81          12     16     21
82          12     16
83          12
84*
A Difficulty

• A difficulty arises when the test in one grade has a different number of items than the test in another grade
• What happens when the Grade 4 test has 84 items but the Grade 5 test has 98 items?
• The curves will cross
[Figure 4: Relationship between vertical scale score (420-710) and raw score (50-100): Grade 4 to Grade 5 (upper end – Reading).]
• The curves cross at the raw score value of 56.
• A student with a raw score of 70 in Grade 4 and 70 in Grade 5 will show a decline of 15 VSS points (475 − 490 = −15).
• A score of 70 in Grade 4 corresponds to a percent correct score of 83.3.
• A score of 70 in Grade 5 corresponds to a percent correct score of 71.4.
A Difficulty? – Not really!

Report Percent Correct Scores
Recommendations

• IRT scaling results in a nonlinear relationship between raw score and Vertically Scaled Score
• Small changes in raw scores at the extremes result in large changes in VSS at the extremes
• This is not a theoretical drawback of the VSS, but a perception problem for the test users
• This is an issue of face validity
• The problem is addressed by simple adjustments at the extremes
Recommendations

• We do not recommend describing growth with reference to raw scores
• If observed scores on the test have to be reported, use Percent Correct Scores. This avoids the appearance of problems when tests have different numbers of raw score points
PROBLEMS WITH VERTICAL SCALING

• If the construct or dimension being measured changes across grades/years/forms, scores on different forms mean different things and we cannot reasonably place scores on a common scale
• Martineau (2006) warns against using vertical scales in the presence of construct shift. Including above‐ and below‐grade items in the scales for adjacent grades may reduce construct shift.
PROBLEMS WITH VERTICAL SCALING

• Both common person and common item designs have practical problems; items may be too easy for one group and too hard for the other
• We must ensure that examinees have had exposure to the content of common items or the off‐level test
PROBLEMS WITH VERTICAL SCALING

• Scaled scores are not interpretable in terms of what a student knows or can do
• Comparison of scores on scales that extend across several years is particularly risky and not recommended
References

Auty, W. (2008). Implementer's guide to growth models. Washington, DC: The Council of Chief State School Officers.

Lissitz, R. W., & Huynh, H. (2003). Vertical equating for state assessments. Practical Assessment, Research, & Evaluation, 8(10).

Goldstein, J., & Behuniak, P. (2005). Growth models in action: Selected case studies. Practical Assessment, Research, & Evaluation, 10(11).

Martineau, J. (2006). Distorting value added: The use of longitudinal, vertically scaled student achievement data for growth‐based, value‐added accountability. Journal of Educational and Behavioral Statistics, 31(1).

Schafer, W. (2006). Growth scales as an alternative to vertical scales. Practical Assessment, Research, & Evaluation, 11(4).
VALUE ADDED ASSESSMENT
MODELS AND ISSUES
Accountability

• Schools and districts are being held accountable for the performance of students
• State laws are being proposed to evaluate teachers, principals, and superintendents on the basis of student achievement
Accountability

This trend is not new!
• Treviso, Italy, early 18th century: teachers were not paid if students did not learn!
• Shogun: ultimate accountability! The teachers (the entire village) were to be put to death if the student (Blackthorne) did not learn Japanese.

Approaches to Accountability
• STATUS ASSESSMENT: Achievement of a pre‐set target (AYP)
• GROWTH ASSESSMENT:
  * Cross‐sectional (e.g., comparison of the performance of students in Year 1 with that of students in the SAME GRADE in Year 2)
  * Longitudinal (assessment of growth of students from Year 1 to Year 2)

Value‐Added Assessment
• Is basically an assessment of the effectiveness of a "treatment" by examining growth
• Looks at the growth to determine if it is the result of the "value added" by the "treatment"
• Adjusts for "initial status" when assessing the "effect" of the treatment
• In simple terms, it compares actual growth with "expected" growth.
Teacher Effectiveness: Value Added Assessment

The "treatment": the teacher

Can Value‐Added Assessment Models untangle teacher effects from non‐educational effects?

Value Added Models
• Gain Score Models
• Covariate Adjustment Models
• Layered/Complete Persistence Model
• Variable Persistence Model (Lockwood et al., 2007)
Sources & Resources

• JEBS (2004), Volume 29, Number 1
• Braun & Wainer (2007). Value added modeling. In C. R. Rao & S. Sinharay (Eds.), Handbook of Psychometrics. Amsterdam, The Netherlands: Elsevier.
• McCaffrey, Lockwood, Koretz, & Hamilton (2003). Evaluating Value‐Added Models for Teacher Accountability. Santa Monica, CA: The RAND Corporation.
Layered Model / Complete Persistence Model

y_i1 = μ_1 + u_1 + e_i1
y_i2 = μ_2 + u_1 + u_2 + e_i2

(y_i2 − y_i1) = (μ_2 − μ_1) + u_2 + (e_i2 − e_i1)

y_ik : student score in Grade k
μ_k : district mean for Grade k (FIXED)
u_k : teacher "effect" (RANDOM)
e_ik : residual
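A toy simulation, not from the slides, of the layered model: because u_1 appears in both years it cancels in the gain, so the average gain reflects (μ_2 − μ_1) plus the current teacher's effect u_2. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layered (complete persistence) model:
#   y_i1 = mu_1 + u_1 + e_i1
#   y_i2 = mu_2 + u_1 + u_2 + e_i2
n_students = 200
mu1, mu2 = 450.0, 480.0          # hypothetical district means
u1, u2 = 5.0, 12.0               # hypothetical teacher effects
e1 = rng.normal(0, 15, n_students)
e2 = rng.normal(0, 15, n_students)

y1 = mu1 + u1 + e1
y2 = mu2 + u1 + u2 + e2
gain = y2 - y1

# The prior teacher's effect cancels in the gain; the mean gain estimates
# (mu2 - mu1) + u2 = 42 here, apart from sampling error.
print(round(gain.mean(), 1))
```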
Layered Model (cont'd)

• Teacher effect persists
• Teacher effect is common to all students (class effect?)
• Not necessary to include student variables since each student serves as its own control
• Gains are weakly correlated with background variables
• Model can be expanded to include student characteristics and information on other subject areas

Layered Model 2 / Persistence Model (McCaffrey et al., 2004)
y_i3 = μ_3 + y_i2 + u_3 + e_i3
y_i4 = μ_4 + y_i3 + α_43 u_3 + u_4 + e_i4
y_i5 = μ_5 + y_i4 + α_53 u_3 + α_54 u_4 + u_5 + e_i5

y_tk : student score in Grade k, Year t
μ_tk : district mean in Grade k, Year t (FIXED)
u_tk : teacher "effect" (RANDOM)
e_tk : error
Persistence Model

• Teacher effects diminish
• Teacher effect is common to all students (class effect?)
• Choice of model (complete persistence vs. variable persistence) is an empirical question?
• The complete persistence model may be less sensitive to model misspecification – but may be inferior if teacher effects dampen quickly.
Gain Score Model

y_ik − y_i,k−1 = μ_k + β(Demo)_i + u_k + e_ik

Covariate Adjustment Model

y_ik = μ_k + γ_k y_i,k−1 + β(Demo)_i + u_k + e_ik
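A hedged sketch of fitting a covariate adjustment model with a random teacher effect, assuming pandas and statsmodels are available. The data are simulated and the specification is only one of many possibilities; it is not the model used by any particular state system.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated data: prior-year score, a demographic indicator, a teacher id,
# and a current-year score built from a known teacher effect.
n, n_teachers = 300, 10
df = pd.DataFrame({
    "teacher": rng.integers(0, n_teachers, n),
    "prior": rng.normal(450, 30, n),
    "demo": rng.integers(0, 2, n),
})
true_teacher_effect = rng.normal(0, 5, n_teachers)
df["score"] = (30 + 0.95 * df["prior"] - 8 * df["demo"]
               + true_teacher_effect[df["teacher"]] + rng.normal(0, 15, n))

# Covariate adjustment model: score ~ prior + demo, with teacher as a
# random effect (u_k in the notation above).
model = smf.mixedlm("score ~ prior + demo", df, groups=df["teacher"])
result = model.fit()
print(result.params)           # fixed effects: intercept, prior, demo
print(result.random_effects)   # estimated teacher "effects"
```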
Issues

• It is well known that the two analyses may produce different results
• Lord's Paradox
• Care must be exercised in choosing a model
Multivariate Approaches

• The previous analyses are done within years
• Longitudinal data "beg" for a multivariate approach
• Growth models spanning all the time periods can be incorporated within the multivariate framework
• Although more general, this is difficult and is not guaranteed to improve estimation.
Issues to be Considered in Value Added Modeling

• Database structure – number of cohorts, number of years, number of subjects
• Number of levels (hierarchical model)
• Use of covariates/background information
• Problems with omitted variables
• Missing data treatment
• Teacher effects – random or fixed
• Causal inference

Random vs. Fixed Effects
• Random effects – empirical/pure Bayes estimation of teacher effects
• Draws strength from collateral information when the sample size is not adequate

Attribution of Cause
• The most troublesome aspect
• Can Value‐Added Assessment untangle teacher effects from non‐educational effects?
• Are estimated teacher effects causal effects?
• While the modeling is sophisticated, it cannot take into account differential interaction between teachers and students
Other Issues

• Use of achievement test scores: not valid for assessing teacher effectiveness
• Estimates of teacher effects must be linked to other indicators of good teaching
• Effect of scaling – raw scores and IRT scaled scores lead to different estimates of teacher effects
• While vertical scales are not necessary, scores used to assess gains must be scaled.

Other Issues (Cont'd)
• Sensitivity of teacher effects to the achievement test scores used
• Not all subject areas provide comparable estimates of teacher effects (Lockwood et al., 2007)
Conclusion

• We have come a long way with respect to addressing the problem of teacher effectiveness. Considerable advances have been made with statistical modeling.
• The issues that have yet to be addressed are:
  Do the achievement test scores have sufficient depth and breadth to serve as indicators of teacher effectiveness?
  How can we establish the valid interpretation of teacher effectiveness measures/indices?