Developing the Vertical Scale for the Connecticut Mastery Test
Hariharan Swaminathan, University of Connecticut

Thanks to Barbara Beaudin, Bill Congero, Mohammed Dirir, Gil Andrada, Steve Martin, and H. Jane Rogers, CT State Department of Education
WWW.STATE.CT.GOV/SDE

Connecticut's Testing Context
• The Connecticut Mastery Test (CMT)
• Criterion-referenced test
• First administered in 1985 to Grades 4, 6, and 8
• Subjects: Mathematics, Reading, and Writing
• Performance levels within each grade: Below Basic, Basic, Proficient, Goal, and Advanced

Connecticut's Testing Context
• In 2006, the test expanded to Grades 3 through 8
• Science was added in Grades 5 and 8
• Within each grade and subject area, test forms are equated across each generation of the test.

Testing Context (cont.)
• The content of the Reading and Mathematics tests differs across grades, and the standards (cut scores on a scale from 100 to 400) establishing the performance levels were developed independently for each grade.
• Results are reported for the individual child, by grade within school, grade within district, and grade within the state.
• Reports are also disaggregated by gender, race/ethnicity, poverty, special education, and ELL status.

Testing Context (cont.)
• Districts could use the results to answer questions like: How well are this year's fifth grade students doing in Mathematics?
• If the characteristics of the student populations, such as SES and race/ethnicity, are similar across schools within the district, they could compare performance across schools.

Testing Context (cont.)
• Districts could compare the performance of subgroups of students: Are our fifth grade female students performing as well as or better than their male counterparts in Mathematics?
• If last year's fifth graders are demographically similar to this year's, for comparative purposes they could answer: Are this year's students doing as well as or better in Mathematics than last year's fifth graders?
• If the demographics of the district's student population are similar to other districts in the state, they could answer the question: Are we doing as well as or better in Grade 5 Mathematics than these other districts with similar populations?

Testing Context (cont.): NCLB AYP
• STATUS MODEL: States are required to set a target, the ANNUAL MEASURABLE OBJECTIVE (AMO) - the percent of students achieving Proficiency or above. The AMO must be set in such a way that ALL students achieve proficiency in ALL subjects by 2014.
• "GROWTH" MODEL: Performance of students in the same grade is compared across years. This requires that the tests are "equated" from year to year - HORIZONTAL EQUATING.

Testing Context (cont.)
• NCLB Requirements: With NCLB's heightened emphasis on accountability, school administrators felt that the status measure (% Proficient) used to determine AYP was limited in conveying their students' performance and the academic progress their students were making.
• This was particularly true for districts with large proportions of low-income students.

Testing Context (cont.): GROWTH ACROSS GRADES
• Districts were eager to answer questions such as: How much have our students progressed or grown in Mathematics from Grade 4 to Grade 5?
• As a result, district staff, advocacy groups, and the media often used inappropriate comparisons like "an average increase of 23 scale score points from Grade 4 to Grade 5" or "an increase of six percentage points in the proportion of students performing at the goal level from Grade 4 to 5."

Testing Context (cont.)
• Reporting the scale increase would be inappropriate:
• the fifth grade scale is independent of the fourth grade scale
• they are not on a single scale.

Testing Context (cont.)
• Reporting an increase in the percentage of students scoring at or above the 'goal' level would be inappropriate:
• the fifth grade content is different from the fourth grade content
• the standards are independent of each other
• 'goal' does not mean the same thing for both grades
• students could have grown between Grades 4 and 5 but not met the goal standard.

VERTICAL SCALE
• In 2005, Connecticut instituted a State Assigned Student Identifier (SASID) for every public school student in the state.
• This provided a key for linking student performance across grades and linking information about students across data files.
• 2006 was the first year of the new generation of the CMT, administered to students in Grades 3 through 8.

VERTICAL SCALE
• The structures were now in place to create a vertical scale for Mathematics and Reading, providing the Connecticut public with a vehicle to measure "growth" or progress over time.

What are the Merits of a Vertical Scale?
The new vertical scale
• ranges from 200 to 700
• allows schools to measure student progress across a body of content knowledge ranging from Grade 3 to Grade 8 for Mathematics and Reading.

What are the Merits of a Vertical Scale?
• Each point on the scale represents a level of student understanding, independent of the grade the student is in.
• So, a score of 490 on the Mathematics scale for a fourth grader represents the same level of knowledge as 490 for a fifth grader.
• "Growth" can be measured across grades as the difference between the score in Year (n+1) and Year (n).

What Practical Solutions Does the Vertical Scale Provide?
• For accountability, GROWTH
• provides an additional indicator of school performance (Value Added Models)
• supplements the status measure (performance level)
• provides an indicator for measuring the effectiveness of new curriculum or professional development initiatives (Value Added Models).

STATES THAT HAVE VERTICAL SCALES
• ARIZONA (3-8)
• ARKANSAS (3-8)
• FLORIDA (3-10)
• IOWA ITBS (3-8)
• NORTH CAROLINA (3-8)*
• TENNESSEE (3-8)*

Developing the Vertical Scale
• Framework for Developing the Vertical Scale
• Data Collection Design
• Scaling Procedures
• Adjusting the Tails of the Scale
• Value Added Models

Framework
• Constructing a vertical scale requires a different measurement framework from "classical" methods.

THE "OLD" WAY: CLASSICAL TEST THEORY
OBSERVED SCORE X = TRUE SCORE T + ERROR E

CLASSICAL TRUE SCORE MODEL
• Based on weak assumptions
• Both T and E are unobserved
• Intuitively appealing for its simplicity
• The model can be extended to study different sources of error

INDICES USED IN TRADITIONAL TEST CONSTRUCTION
Item indices:
• Item difficulty - proportion of examinees answering the item correctly
• Item discrimination - correlation between item score and total score
Examinee index:
• Test score
(These indices are illustrated in a short computational sketch below.)

CLASSICAL ITEM AND EXAMINEE INDICES
• Item difficulty and item discrimination depend on the group of examinees
• The test score depends on the items

What's wrong with these indices?
They are SAMPLE DEPENDENT, i.e., they change as the sample of people or the sample of items changes - they are NOT INVARIANT.

And what's wrong with that?
• We cannot compare item characteristics for items whose indices were computed on different subgroups of examinees
• We cannot compare the test scores of individuals who have taken different subsets of test items

Wouldn't it be nice if ...?
• Our item indices did not depend on the characteristics of the sample of individuals on which the item data were obtained
• Our examinee measures did not depend on the characteristics of the sample of items that was administered

ITEM RESPONSE THEORY solves the problem!*
* Certain restrictions apply. Individual results may vary. IRT is not for everyone, including those with small samples. Side effects include yawning, drowsiness, difficulty swallowing, and nausea. If symptoms persist, consult a psychometrician. For more information, see Hambleton and Swaminathan (1985), and Hambleton, Swaminathan and Rogers (1991).

ITEM RESPONSE THEORY postulates that the probability of a correct response to an item depends on the ability of the examinee and the characteristics of the item.
• If ABILITY > DIFFICULTY, the probability of a correct answer is HIGH.
• If ABILITY < DIFFICULTY, the probability of a correct answer is LOW.

The mathematical relationship between the probability of a response, the ability of the examinee, and the characteristics of the item is specified by the ITEM RESPONSE MODEL.

Item Characteristics
An item may be characterized by its
• DIFFICULTY level (usually denoted as b),
• DISCRIMINATION level (usually denoted as a),
• "PSEUDO-CHANCE" level (usually denoted as c).
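Before turning to the specific IRT models, here is the small sketch promised above: it computes the classical item indices (proportion correct and item-total correlation) from a simulated 0/1 response matrix. The data and the resulting numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 0/1 response matrix: 500 examinees by 10 items (illustrative only).
n_examinees, n_items = 500, 10
ability = rng.normal(size=(n_examinees, 1))
difficulty = np.linspace(-1.5, 1.5, n_items)
responses = (rng.random((n_examinees, n_items)) <
             1 / (1 + np.exp(-(ability - difficulty)))).astype(int)

# Classical item difficulty: proportion of examinees answering each item correctly.
p_values = responses.mean(axis=0)

# Classical item discrimination: correlation between item score and total score.
total = responses.sum(axis=1)
discrimination = np.array(
    [np.corrcoef(responses[:, j], total)[0, 1] for j in range(n_items)]
)

# Examinee index: the raw test score.
print("p-values:", np.round(p_values, 2))
print("item-total correlations:", np.round(discrimination, 2))
```

Because these indices are computed from one particular sample and one particular set of items, they change whenever either changes - the sample dependence that the IRT models below are intended to overcome.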
One-Parameter Model

P(correct response) = exp(θ - b) / (1 + exp(θ - b))

θ = trait value
b = item difficulty

IRT ITEM DIFFICULTY
• Differs from classical item difficulty
• b is the θ value at which the probability of a correct response is .5
• The harder the item, the higher the b
• Hence, b is on the same scale as θ and does not depend on the characteristics of the sample

[Figure: One-parameter (Rasch) model item characteristic curves for items with b = -1.5, b = 0, and b = 1.5; probability of a correct response plotted against θ (proficiency), built up over three slides.]

Two-Parameter Model

P(correct response) = exp(a(θ - b)) / (1 + exp(a(θ - b)))

θ = trait value
b = item difficulty
a = item discrimination

IRT ITEM DISCRIMINATION
• Differs from classical item discrimination
• a is proportional to the slope of the ICC at θ = b
• The slope of the item response function indicates how much the probability of a correct response changes for individuals with slightly different θ values, i.e., how well the item discriminates between them

[Figure: Two-parameter model item characteristic curves. The first panel labels the discrimination a (slope) and the difficulty b; later panels show items with a = 0.5, b = -0.5; a = 2.0, b = 0.0; and a = 0.8, b = 1.5.]

Three-Parameter Model

P(correct response) = c + (1 - c) · exp(a(θ - b)) / (1 + exp(a(θ - b)))

c = pseudo-chance level ("guessing") parameter

IRT ITEM GUESSING PARAMETER
• No analog in classical test theory
• c is the probability that an examinee with very low θ will answer the item correctly

[Figure: Three-parameter model item characteristic curves. The first panel labels c as the lower asymptote; later panels show items with a = 0.5, b = -0.5, c = 0.0; a = 2.0, b = 0.0, c = 0.25; and a = 0.8, b = 1.5, c = 0.1.]
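As a compact summary of the three dichotomous models above, here is a minimal sketch of their item response functions (the curves in the preceding figures). The parameter values are the illustrative ones used in the plots, not operational item parameters.

```python
import numpy as np

def icc(theta, b, a=1.0, c=0.0):
    """Three-parameter logistic ICC: P = c + (1 - c) * exp(a(theta - b)) / (1 + exp(a(theta - b))).
    Setting c = 0 gives the two-parameter model; setting a = 1 and c = 0 gives the Rasch model."""
    z = a * (theta - b)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

theta = np.linspace(-5, 5, 11)

# One-parameter (Rasch) items: only the difficulty b varies.
for b in (-1.5, 0.0, 1.5):
    print(f"1P, b={b:+.1f}:", np.round(icc(theta, b), 2))

# Two-parameter item: a is proportional to the slope of the curve at theta = b.
print("2P, a=2.0, b=0.0:", np.round(icc(theta, b=0.0, a=2.0), 2))

# Three-parameter item: c is the lower asymptote ("pseudo-chance" level).
print("3P, a=2.0, b=0.0, c=0.25:", np.round(icc(theta, b=0.0, a=2.0, c=0.25), 2))
```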
BUT WAIT, there's more!

IRT Models for Polytomous Item Responses
• Polytomous IRT models specify the probability that an examinee will respond in (or be scored in) a given response category

Graded Response Model
• For each response category, the probability that an examinee will respond in that category or higher vs. a lower category is given by:

P(u ≥ k) = exp(a(θ - b_k)) / (1 + exp(a(θ - b_k)))

Graded Response Model
• The probability that the examinee will choose a given response category (or be scored in it) is thus

P(u = k) = P(u ≥ k) - P(u ≥ k+1)

Partial Credit Model
• For every pair of adjacent categories, specify the probability that the examinee will choose the higher of the two (or be scored in the higher):

P(u = k | u = k or k+1) = exp(a(θ - b_k)) / (1 + exp(a(θ - b_k)))

Partial Credit Model
• This results in the model

P(u = k) = exp( Σ_{r=1}^{k} a(θ - b_r) ) / [ 1 + Σ_{s=1}^{m-1} exp( Σ_{r=1}^{s} a(θ - b_r) ) ]

[Figure: Category response curves for a polytomous item response model; the probability of each response category plotted against proficiency.]

MORE? No, that's enough for now.

ASSUMPTIONS UNDERLYING IRT
• For the most commonly used models, unidimensionality of the latent trait is assumed
• The mathematical model is assumed to correctly specify the relationship between the trait and item characteristics and performance on the item

FEATURES OF IRT
When the model fits the data,
1. Item parameters are independent of the population of examinees, i.e., invariant across subgroups
2. Ability parameters are independent of the subsets of items administered

IMPLICATIONS
1. Different examinees can take different items. Their ability values will be comparable!
2. Item parameter values will not be affected when different samples of examinees respond to the item.

IMPLICATIONS FOR THE VERTICAL SCALE
1. We can compare the performance of examinees who take two different tests (e.g., Grade 4 and Grade 5) if ...
2. We can somehow LINK the items in the two tests and place them on the same scale.

How do we do this?
1. We give Grade 4 and Grade 5 examinees some common items.
2. Since the item parameter values of these common items will be the same for Grade 4 and Grade 5 examinees, we can use this information to place all the Grade 4 and Grade 5 items on a common scale (a sketch of one such linking procedure follows this list).
3. Once the item parameter values are on a common scale, we can administer any subset of items to examinees in Grades 4 and 5.
4. Since the ability values based on different subsets of items are on the same scale, we can compare the ability values of examinees in the two grades.
5. This design, where examinees in two grades are given common items, is known as the ANCHOR TEST DESIGN.
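As an illustration of step 2 above, here is a minimal sketch of mean/sigma linking, one common way of placing separately calibrated anchor-item parameters on a single scale. It is a sketch of the general technique, not necessarily the procedure used for the CMT, and the difficulty values below are hypothetical.

```python
import numpy as np

# Hypothetical difficulty estimates for the same anchor items, obtained from
# separate Grade 4 and Grade 5 calibrations (each on its own 0/1 theta metric).
b_grade4 = np.array([-0.8, -0.2, 0.1, 0.6, 1.2])    # anchor items on the Grade 4 scale
b_grade5 = np.array([-1.5, -0.9, -0.6, -0.1, 0.5])  # the same items on the Grade 5 scale

# Mean/sigma linking: find the linear transformation theta* = A * theta + B that
# maps the Grade 5 metric onto the Grade 4 (base) metric using the anchor items.
A = b_grade4.std(ddof=1) / b_grade5.std(ddof=1)
B = b_grade4.mean() - A * b_grade5.mean()

# Any Grade 5 difficulty or theta can now be expressed on the Grade 4 scale
# (a discrimination parameter would be divided by A).
b_grade5_on_grade4 = A * b_grade5 + B
theta_grade5 = 0.3
theta_on_grade4 = A * theta_grade5 + B

print("A =", round(A, 3), " B =", round(B, 3))
print("linked anchor difficulties:", np.round(b_grade5_on_grade4, 2))
print("Grade 5 theta of 0.3 on the Grade 4 scale:", round(theta_on_grade4, 2))
```

Chaining the same step grade by grade (5 onto 4, 6 onto the linked 5, and so on) places all six grades on one common scale, which is what the anchor test design makes possible.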
VERTICAL SCALING
• IRT is used for all vertical scaling implementations
• Decisions to be made in carrying out vertical scaling with IRT:
• Data collection design
• Choice of IRT model
• Estimation procedure
• Scaling procedure

Data Collection Design
• Scaling test design
• A test containing items that span all grades is administered to all students
• Common item design
• Examinees in each grade take the on-grade test and some items from adjacent grade tests

Choice of Model
• 1P model used in some state assessments (e.g., CT, DE)
• 3P model used in other states (e.g., MI, FL)

Estimation Procedure
• Different calibration programs use different estimation procedures
• WINSTEPS fits a 1P/PCM using an unconditional maximum likelihood procedure
• BILOG-MG, PARSCALE, and MULTILOG use marginal maximum likelihood estimation

Scaling Procedures
• Different IRT procedures are available:
• Separate group calibration
• Concurrent calibration
• Fixed parameter calibration

Data Collection Design
Rows are student grades and columns are item grade levels: OPkk is the grade-k operational (on-grade) test, and SUjk is the set of grade-k items taken by grade-j students.

STUDENTS   ITEMS: G3     G4     G5     G6     G7     G8
G3                OP33   SU34
G4                SU43   OP44   SU45
G5                       SU54   OP55   SU56
G6                              SU65   OP66   SU67
G7                                     SU76   OP77   SU78
G8                                            SU87   OP88

[Figure: Theta distributions for Mathematics across grades.]

[Figure: Theta distributions for Reading across grades.]

Vertically Scaled Scores
• The theta values are like z-scores - the mean is zero and the SD is 1.
• The theta values can be linearly transformed (like z-scores) to span any range or to have a certain mean and SD.
• The theta scale is transformed to have a score range from 200 to 700.

[Figure: Cut scores for proficiency levels - Mathematics. Scaled score (350-650) by grade (3-8) for the Basic, Proficient, Goal, and Advanced cuts.]

[Figure: Cut scores for proficiency levels - Reading. Scaled score (350-600) by grade (3-8) for the Basic, Proficient, Goal, and Advanced cuts.]

The Tale of Two Tails: Adjustments to the Scale

[Figure 1. Relationship between raw score (0-80) and θ (-6 to 6).]

Relationship of the Vertically Scaled Score (VSS) to r
• The VSS is a linear transformation of θ.
• Hence r = Σ_{j=1}^{n} P_j(VSS), i.e., the expected raw score r is the sum of the item probabilities evaluated at the θ corresponding to the VSS.

[Figure: Relationship between raw score (0-80) and vertically scaled score (0-680).]

Problem (?)
• Small changes in raw score at the extremes result in large changes in the vertically scaled scores.
• This may cause concern to users when a child progresses from one grade to the next.

[Figure 2. Relationship between vertical scaled score and raw score for children progressing from Grade 3 to Grade 4 (lower end - Reading).]
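A minimal sketch of the relationship just described: the expected raw score is the sum of the item probabilities at a given θ, the VSS is a linear transformation of θ, and inverting the raw-score-to-θ curve shows why VSS gains are much larger at the extremes than in the middle. The item difficulties and the transformation constants (A = 50, B = 450) are hypothetical, not the operational CMT values.

```python
import numpy as np

rng = np.random.default_rng(1)
b = rng.normal(0.0, 1.0, size=60)            # hypothetical Rasch difficulties, 60-item test
theta = np.linspace(-6, 6, 1201)

# Test characteristic curve: expected raw score r(theta) = sum_j P_j(theta).
tcc = (1.0 / (1.0 + np.exp(-(theta[:, None] - b)))).sum(axis=1)

# Hypothetical linear transformation of theta to the vertical scale (200-700 range).
A_vss, B_vss = 50.0, 450.0

# Invert the TCC to get the theta (and hence the VSS) associated with each raw score.
raw = np.arange(1, 60)
theta_at_raw = np.interp(raw, tcc, theta)    # valid because the TCC is increasing
vss = A_vss * theta_at_raw + B_vss

# VSS gain per additional raw-score point: largest near the floor and ceiling.
gain = np.diff(vss)
print("gain for raw 1 -> 2:  ", round(gain[0], 1))
print("gain for raw 30 -> 31:", round(gain[29], 1))
print("gain for raw 58 -> 59:", round(gain[-1], 1))
```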
[Figure 3. Relationship between vertical scaled score and raw score for children progressing from Grade 3 to Grade 4 (upper end - Reading).]

Table 1. Gain in Unadjusted Vertical Scale Score Corresponding to an Increase in Raw Score (Lower End)

Grade 3      Grade 4 raw score relative to Grade 3:
raw score    +1      +2      +3      +4
0            87      120     139     154
1            57      76      91      102
2            44      58      70      79
3            39      50      60      68
4            36      46      54      61
5            34      42      50      56
6            33      40      47      53
7            32      39      45      50
8            31      37      43      48
9            31      36      41      46
10           30      35      40      45
11           30      35      39      44
12           29      34      38      43
13           29      33      38      42
14           29      33      37      41
15           28      32      36      40

Is this a Real Problem?
• This is a natural consequence of a nonlinear transformation
• This is not a problem inherent in the vertical scale
• Increases (or decreases) at the extremes affect only a small percent of test takers
• However, it is an issue of FACE VALIDITY that should be addressed

Solution
• Adjustments are made to the two tails
• Linear interpolation at the extreme ends resolves the problem

[Figure 3. Adjustment to Grades 3 and 4 (lower end): vertical scale score vs. raw score, built up over three slides.]

Steps
1. Choose a suitable region where the curve is almost linear - where the rate of change is almost constant.
2. Fit a regression line (or an analytical line) and extend this line (see the sketch after this list).
3. Check to see whether the growth (rate of change) is similar to that in the middle of the score range.
4. Carry out the procedure across two grades, e.g., from Grade 3 to 4, checking to make sure the growth from one grade to the next is almost uniform across the raw score range.
5. Keeping the Grade 4 line fixed, carry out the procedure across Grades 4 and 5.
6. If the change from Grade 4 to 5 is not uniform, reexamine the growth across grades and modify the line(s) until the growth from Grades 3 to 4 and 4 to 5 is "reasonable".
7. Repeat this process across all the grades.
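Here is a minimal sketch of steps 1-3: choose a region of the raw-score-to-VSS conversion where the gain per raw-score point is roughly constant, fit a line there, and extend that line into the lower tail. The unadjusted conversion below is synthetic (a logit-shaped curve standing in for an IRT conversion table), so the numbers are illustrative only.

```python
import numpy as np

# Synthetic unadjusted conversion for a 50-point test: logit-shaped, like the IRT
# raw-score-to-VSS curve, with steep jumps at the extremes (illustrative only).
raw = np.arange(0, 51)
p = (raw + 0.5) / 51.0
vss_unadj = np.round(450 + 50 * np.log(p / (1 - p)))

# Step 1: pick a middle region where the curve is nearly linear.
region = (raw >= 15) & (raw <= 35)

# Step 2: fit a regression line to that region and extend it downward.
slope, intercept = np.polyfit(raw[region], vss_unadj[region], deg=1)
vss_adj = vss_unadj.astype(float).copy()
low = raw < 15
vss_adj[low] = slope * raw[low] + intercept

# Step 3: check that the gain per raw-score point is now roughly uniform at the low end.
print("unadjusted gains, raw 0-5:", np.diff(vss_unadj[:6]))
print("adjusted gains,   raw 0-5:", np.round(np.diff(vss_adj[:6]), 1))
```

The same fit-and-extend idea applies to the upper tail, after which the adjusted conversions are checked across adjacent grades (steps 4-7) so that grade-to-grade growth stays roughly uniform across the raw-score range.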
Table 2. Gain in Adjusted Vertical Scale Score Corresponding to an Increase in Raw Score (Lower End)

Grade 3      Grade 4 raw score relative to Grade 3:
raw score    +1      +2      +3      +4
0            30      34      39      43
1            29      34      39      43
2            29      34      39      43
3            29      34      38      43
4            29      34      38      43
5            29      34      38      43
6            29      33      38      43
7            29      33      38      42
8            29      33      38      43
9            28      33      38      43
10           28      33      38      43
11           28      33      38      42
12           29      33      38      42
13           28      33      37      41
14           28      32      37      40
15           28      32      36      40

[Figure 4. Adjustments to Grades 3 and 4 (upper end): vertical scale score vs. raw score, built up over three slides.]

Table 4. Gain in Adjusted Vertical Scale Score Corresponding to an Increase in Raw Score (Upper End)

Grade 3      Grade 4 raw score relative to Grade 3:
raw score    +1      +2      +3      +4
70           11      15      20      24
71           11      15      20      24
72           11      16      20      24
73           11      16      20      24
74           11      16      20      25
75           11      16      20      25
76           11      16      20      25
77           11      16      20      25
78           12      16      20      25
79           12      16      20      25
80           12      16      21      25
81           12      16      21      -
82           12      16      -       -
83           12      -       -       -
84*          -       -       -       -

A Difficulty
• A difficulty arises when the test in one grade has a different number of items than the test in another grade.
• What happens when the Grade 4 test has 84 items but the Grade 5 test has 98 items?
• The curves will cross.

[Figure 4. Relationship between vertical scaled score and raw score: Grade 4 to Grade 5 (upper end - Reading); raw scores 50-100, vertically scaled scores 420-710.]

• The curves cross at the raw score value of 56.
• A student with a raw score of 70 in Grade 4 and 70 in Grade 5 will show a decline of 15 VSS points (from 490 in Grade 4 to 475 in Grade 5).
• A score of 70 in Grade 4 corresponds to a percent correct score of 83.3.
• A score of 70 in Grade 5 corresponds to a percent correct score of 71.4.

A Difficulty? - Not really! Report Percent Correct Scores.

Recommendations
• IRT scaling results in a nonlinear relationship between the raw score and the vertically scaled score
• Small changes in raw scores at the extremes result in large changes in VSS at the extremes
• This is not a theoretical drawback of the VSS, but a perception problem for the test users
• This is an issue of face validity
• The problem is addressed by simple adjustments at the extremes

Recommendations
• We do not recommend describing growth with reference to raw scores
• If observed scores on the test have to be reported, use percent correct scores. This avoids the appearance of problems when tests have different numbers of raw score points.
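A quick check of the percent correct comparison above, using the test lengths cited for the Grade 4 (84 items) and Grade 5 (98 items) Reading tests:

```python
# Percent correct puts raw scores from tests of different lengths on a common footing.
grade4_raw, grade4_items = 70, 84
grade5_raw, grade5_items = 70, 98

print(f"Grade 4: {100 * grade4_raw / grade4_items:.1f}% correct")  # 83.3
print(f"Grade 5: {100 * grade5_raw / grade5_items:.1f}% correct")  # 71.4
```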
PROBLEMS WITH VERTICAL SCALING
• If the construct or dimension being measured changes across grades/years/forms, scores on different forms mean different things and we cannot reasonably place scores on a common scale
• Martineau (2006) warns against using vertical scales in the presence of construct shift. Above- and below-grade items, with scales for adjacent grades, may reduce construct shift.

PROBLEMS WITH VERTICAL SCALING
• Both common person and common item designs have practical problems; items may be too easy for one group and too hard for the other
• We must ensure that examinees have had exposure to the content of the common items or off-level test

PROBLEMS WITH VERTICAL SCALING
• Scaled scores are not interpretable in terms of what a student knows or can do
• Comparison of scores on scales that extend across several years is particularly risky and not recommended

References
Auty, W. (2008). Implementer's guide to growth models. Washington, DC: The Council of Chief State School Officers.
Lissitz, R. W., & Huynh, H. (2003). Vertical equating for state assessments. Practical Assessment, Research, & Evaluation, 8(10).
Goldstein, J., & Behuniak, P. (2005). Growth models in action: Selected case studies. Practical Assessment, Research, & Evaluation, 10(11).
Martineau, J. (2006). Distorting value added: The use of longitudinal, vertically scaled student achievement data for growth-based, value-added accountability. Journal of Educational and Behavioral Statistics, 31(1).
Schaffer, W. (2006). Growth scales as an alternative to vertical scales. Practical Assessment, Research, & Evaluation, 11(4).

VALUE ADDED ASSESSMENT MODELS AND ISSUES

Accountability
• Schools and districts are being held accountable for the performance of students
• State laws are being proposed to evaluate teachers, principals, and superintendents on the basis of student achievement

Accountability
This trend is not new!
• Travisu, Italy, early 18th century: Teachers were not paid if students did not learn!
• Shogun: the ultimate accountability! The teachers (the entire village) were to be put to death if the student (Blackthorne) did not learn Japanese.

Approaches to Accountability
• STATUS ASSESSMENT: Achievement of a pre-set target (AYP)
• GROWTH ASSESSMENT:
  * Cross-sectional (e.g., comparison of the performance of students in Year 1 with that of students in the SAME GRADE in Year 2)
  * Longitudinal (assessment of the growth of students from Year 1 to Year 2)

Value-Added Assessment
• Is basically an assessment of the effectiveness of a "treatment" by examining growth
• Looks at the growth to determine whether it is the result of the "value added" by the "treatment"
• Adjusts for "initial status" when assessing the "effect" of the treatment
• In simple terms, it compares actual growth with "expected" growth.

Teacher Effectiveness: Value-Added Assessment
• The "treatment" is the teacher
• Can value-added assessment models untangle teacher effects from non-educational effects?
Value Added Models
• Gain score models
• Covariate adjustment models
• Layered/complete persistence model
• Variable persistence model (Lockwood et al., 2007)

Sources & Resources
• JEBS (2004), Volume 29, Number 1
• Braun, H., & Wainer, H. (2007). Value added modeling. In C. R. Rao & S. Sinharay (Eds.), Handbook of Psychometrics. Amsterdam, The Netherlands: Elsevier.
• McCaffrey, D., Lockwood, J. R., Koretz, D., & Hamilton, L. (2003). Evaluating Value-Added Models for Teacher Accountability. Santa Monica, CA: The RAND Corporation.

Layered Model / Complete Persistence Model

y_i1 = μ_1 + u_1 + e_i1
y_i2 = μ_2 + u_1 + u_2 + e_i2
(y_i2 - y_i1) = (μ_2 - μ_1) + u_2 + (e_i2 - e_i1)

where
y_ik = student i's score in grade k
μ_k = district mean for grade k (FIXED)
u_k = teacher "effect" (RANDOM)
e_ik = residual

Layered Model (cont'd)
• The teacher effect persists
• The teacher effect is common to all students (a class effect?)
• It is not necessary to include a student variable, since each student serves as his or her own control
• Gains are weakly correlated with background variables
• The model can be expanded to include student characteristics and information on other subject areas

Layered Model 2 / Persistence Model (McCaffrey et al., 2004)

y_i3 = μ_3 + u_3 + e_i3
y_i4 = μ_4 + α_43·u_3 + u_4 + e_i4
y_i5 = μ_5 + α_53·u_3 + α_54·u_4 + u_5 + e_i5

where
y_ik = student i's score in grade k (in a given year)
μ_k = district mean for grade k and year (FIXED)
u_k = teacher "effect" (RANDOM), with the α coefficients governing how much earlier teacher effects persist
e_ik = error

Persistence Model
• Teacher effects diminish over time
• The teacher effect is common to all students (a class effect?)
• The choice of model (complete persistence vs. persistence) is an empirical question
• The complete persistence model may be less sensitive to model misspecification, but may be inferior if teacher effects dampen quickly

Gain Score Model

y_ik - y_i,k-1 = μ_k + β(Demo)_i + u_k + e_ik

Covariate Adjustment Model

y_ik = μ_k + γ_k·y_i,k-1 + β(Demo)_i + u_k + e_ik

(Both specifications are illustrated in a short sketch below.)

Issues
• It is well known that the two analyses may produce different results
• Lord's paradox
• Care must be exercised in choosing a model

Multivariate Approaches
• The previous analyses are done within years
• Longitudinal data "beg" for a multivariate approach
• Growth models spanning all the time periods can be incorporated within the multivariate framework
• Although more general, this is difficult and not guaranteed to improve estimation

Issues to be Considered in Value Added Modeling
• Database structure - number of cohorts, number of years, number of subjects
• Number of levels (hierarchical model)
• Use of covariates/background information
• Problems with omitted variables
• Missing data treatment
• Teacher effects - random or fixed
• Causal inference

Random vs. Fixed Effects
• Random effects - empirical/pure Bayes estimation of teacher effects
• Draws strength from collateral information when the sample size is not adequate

Attribution of Cause
• The most troublesome aspect
• Can value-added assessment untangle teacher effects from non-educational effects?
• Are estimated teacher effects causal effects?
• While the modeling is sophisticated, it cannot take into account differential interaction between teachers and students
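To make the preceding models concrete - in particular the gain score model and the empirical Bayes estimation of random teacher effects - here is a minimal sketch that fits the gain score model as a mixed model on simulated data. The variable names, the simulated values, and the use of statsmodels are illustrative assumptions, not the estimation approach of any particular state system.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Simulated data: 400 students, 20 teachers, one demographic covariate, and a
# gain score built from a grade mean, a random teacher effect, and residual error.
n_students, n_teachers = 400, 20
teacher = rng.integers(0, n_teachers, n_students)
teacher_effect = rng.normal(0.0, 3.0, n_teachers)
demo = rng.integers(0, 2, n_students)          # e.g., a 0/1 background indicator
gain = 20 + teacher_effect[teacher] - 2.0 * demo + rng.normal(0.0, 8.0, n_students)

df = pd.DataFrame({"gain": gain, "demo": demo, "teacher": teacher})

# Gain score model: gain = mu + beta*(Demo) + u_teacher + e, with u_teacher random.
model = smf.mixedlm("gain ~ demo", data=df, groups=df["teacher"])
result = model.fit()

print(result.summary())
# Empirical Bayes predictions of the random teacher "effects":
print({k: round(v.iloc[0], 2) for k, v in result.random_effects.items()})
```

The covariate adjustment model differs only in the specification - the prior-year score enters as a predictor (e.g., "score ~ prior_score + demo") rather than being differenced away - and, as noted above, the two specifications can lead to different conclusions (Lord's paradox).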
Other Issues
• Use of achievement test scores: not valid, by themselves, for assessing teacher effectiveness
• Estimates of teacher effects must be linked to other indicators of good teaching
• Effect of scaling: raw scores and IRT scaled scores lead to different estimates of teacher effects
• While vertical scales are not necessary, the scores used to assess gains must be scaled

Other Issues (cont'd)
• Sensitivity of teacher effects to the achievement test scores used
• Not all subject areas provide comparable estimates of teacher effects (Lockwood et al., 2007)

Conclusion
• We have come a long way with respect to addressing the problem of teacher effectiveness. Considerable advances have been made with statistical modeling.
• The issues that have yet to be addressed are:
• Do the achievement test scores have sufficient depth and breadth to serve as indicators of teacher effectiveness?
• How can we establish the valid interpretation of teacher effectiveness measures/indices?