Detecting Item Parameter Drift
in a CAT program
using the Rasch Measurement Model
Mayuko Simon, David Chayer, Pam Hermann, and Yi Du
Data Recognition Corporation
April, 2012
How should banked item parameters
be checked?
• The idea for this study came about when
the authors were faced with a large
existing bank of CAT items with estimated
item parameters that needed
augmentation.
Re-calibration of banked item
parameters and item parameter drift
• Recalibration is recommended at periodic
intervals
• CAT item data form a sparse matrix, and the
range of student ability for each item is
limited
What would be a reasonable
way to recalibrate items?
• The methods can be applied to
– Maintenance of CAT item bank
– Detecting item parameter drift
– Calibration of field test items
How did other researchers
calibrate/re-calibrate CAT data?
• Impute missing responses to avoid sparseness
(Harmes, Parshall, and Kromrey, 2003)
• Calibrate field test (FT) items by anchoring
operational items (Wang and Wiley, 2004)
• Calibrate FT items by anchoring ability (Kingsbury, 2009)
• Use ability to calibrate item parameters in order
to detect drift (Stocking, 1988)
Simulation study
• 300 items in item bank
• 20,000 students’ simulated responses,
N(0,1)
• Known item parameter drift (10% of
item bank)
• Various drift sizes
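As an illustration only, the simulation setup can be sketched in a few lines of Python. This is not the authors' code: the function names, the N(0,1) distribution of banked difficulties, the random choice of drifted items, and the cycling through drift sizes are all assumptions made for the sketch. It also ignores the CAT item selection and any stratification of drifted items by difficulty.

```python
import math
import random


def simulate_drift_bank(n_items=300, drift_fraction=0.10,
                        drift_sizes=(0.1, 0.2, 0.3, 0.4, 0.5), seed=0):
    """Build a banked difficulty list and a 'true' list with known drift.

    Assumption: banked difficulties are drawn from N(0, 1); the slides do
    not state how the bank difficulties were distributed.
    """
    rng = random.Random(seed)
    banked = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    n_drift = round(n_items * drift_fraction)
    drift_ids = rng.sample(range(n_items), n_drift)
    true = list(banked)
    # Cycle through the drift sizes so each size is used about equally often.
    for k, item in enumerate(drift_ids):
        true[item] += drift_sizes[k % len(drift_sizes)]
    return banked, true, set(drift_ids)


def rasch_response(theta, b, rng):
    """Draw one 0/1 response under the Rasch model, P = 1 / (1 + exp(-(theta - b)))."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return 1 if rng.random() < p else 0
```

Responses would be generated from the drifted ("true") difficulties, while detection compares re-calibrated values against the banked ones.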
Design

Item difficulty          # of items  Drift size, Condition 1  Drift size, Condition 2                                Control condition
Easy (d < -1.5)          10          0.1, 0.2, 0.3, 0.4, 0.5  -0.1, -0.2, -0.3, -0.4, -0.5, 0.1, 0.2, 0.3, 0.4, 0.5  No change
Medium (-1.5 ≤ d ≤ 1.5)  10          0.1, 0.2, 0.3, 0.4, 0.5  -0.1, -0.2, -0.3, -0.4, -0.5, 0.1, 0.2, 0.3, 0.4, 0.5  No change
Difficult (d > 1.5)      10          0.1, 0.2, 0.3, 0.4, 0.5  -0.1, -0.2, -0.3, -0.4, -0.5, 0.1, 0.2, 0.3, 0.4, 0.5  No change
Four calibration methods in this study
1. Anchor person ability (AP)
2. Anchor person ability and anchor the
difficulty of 200 of the 300 items (API)
3. Use of Displacement value from
Winsteps output
4. Item by Item calibration (IBI)
IBI: Item by Item calibration
• A vector of responses for an item
• A vector of ability estimates for the students
who took the item
• Same concept as logistic regression,
but uses Winsteps to calibrate
• No sparseness involved
• Less data is needed (especially when
not all items in a bank need to be
checked)
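The item-by-item idea can be illustrated with a minimal Python sketch. The study itself used Winsteps; the function below is a hypothetical stand-in that estimates a single Rasch difficulty b by Newton-Raphson, holding each student's ability theta fixed, under P(correct) = 1 / (1 + exp(-(theta - b))). The function name, starting value, step cap, and tolerance are assumptions for illustration.

```python
import math


def calibrate_item_difficulty(responses, abilities, max_iter=50, tol=1e-6):
    """Estimate one Rasch item difficulty with person abilities held fixed.

    responses: list of 0/1 scores for a single item
    abilities: list of ability estimates for the students who took the item
    Returns (difficulty, standard_error).
    """
    if not responses or len(responses) != len(abilities):
        raise ValueError("responses and abilities must be non-empty and equal in length")
    score = sum(responses)
    if score == 0 or score == len(responses):
        # All-correct or all-incorrect items have no finite difficulty estimate.
        raise ValueError("no finite estimate for an extreme score")

    b = 0.0  # assumed starting value
    information = 0.0
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for theta in abilities]
        gradient = sum(probs) - score            # d(log-likelihood)/db
        information = sum(p * (1.0 - p) for p in probs)
        step = gradient / information            # Newton-Raphson update
        step = max(-1.0, min(1.0, step))         # cap the step to avoid overshooting
        b += step
        if abs(step) < tol:
            break
    # Standard error from the information at the last evaluated difficulty.
    return b, 1.0 / math.sqrt(information)
```

Because only one item's response vector and the matching ability vector enter the estimate, the sparse person-by-item matrix never has to be built.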
Evaluation
• One sample t-test with alpha 0.01 for AP, API,
and IBI
• Cutoff value 0.4 for Displacement method
• Type I error rate
• Type II error rate
• Sensitivity (Type II + Sensitivity = 1)
• RMSE (average difference from banked value for flagged items)
• BIAS (average bias from banked value for flagged items)
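To make the evaluation criteria concrete, here is a small Python sketch of how the rates could be computed for one replication. The definitions used are assumptions based on standard usage: Type I is the share of non-drifted items that were flagged, Type II is the share of drifted items that were missed, and RMSE and BIAS are taken over flagged items as differences from the banked value. The function and argument names are hypothetical.

```python
import math


def drift_detection_rates(flagged, drifted, all_items):
    """Type I rate, Type II rate, and sensitivity for one replication.

    Assumes at least one drifted and one non-drifted item.
    """
    flagged, drifted = set(flagged), set(drifted)
    stable = set(all_items) - drifted
    type_1 = len(flagged & stable) / len(stable)    # false flags among non-drifted items
    type_2 = len(drifted - flagged) / len(drifted)  # misses among drifted items
    return {"type_1": type_1, "type_2": type_2, "sensitivity": 1.0 - type_2}


def rmse_and_bias(recalibrated, banked, flagged):
    """RMSE and mean difference from the banked value over flagged items.

    recalibrated, banked: mappings from item id to difficulty.
    Assumes at least one flagged item.
    """
    diffs = [recalibrated[i] - banked[i] for i in flagged]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    bias = sum(diffs) / len(diffs)
    return rmse, bias
```

Averaging these values over replications gives summary figures of the kind reported in the results.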
Type I error rate
[Figure: Type I error rate for Control, Condition 1, and Condition 2; vertical axis from 0 to 0.035. * Average over 40 replications]
• Type I error for Control is also inflated
• Condition 1 had higher Type I error rate
Type II error rate
[Figure: Type II error rate for Condition 1 and Condition 2; vertical axis from 0 to 0.8. * Average over 40 replications]
• Type II error for Displacement method is too high.
• Condition 1 had higher Type II error rate
Sensitivity
[Figure: Sensitivity for Condition 1 and Condition 2; vertical axis from 0 to 0.8. * Average over 40 replications]
• Sensitivity for Displacement method is too low.
• Condition 1 had lower sensitivity rate
Items with small sample sizes and small
drift are difficult to flag correctly.
Type II errors occurred for items with small
sample sizes and/or small drift
[Figure: panels annotated "Items with large drift", "Items with small N", and "Items with small drift"; callouts mark the same items across panels]
Which method has re-calibrated item
difficulty closer to the banked value?
• The median RMSE is similar across the three methods
• IBI has less variance of RMSE than AP
Which method has less bias with the
re-calibrated item difficulty?
• All three methods have very small bias
• IBI has less variance of BIAS than AP
Conclusion
• Use caution with Displacement value to
identify item parameter drift.
• AP, API, and IBI worked reasonably well.
• Item parameter drift is difficult to detect for items
with small drift or small sample sizes.
• Compared to AP, IBI had less variance of
RMSE and BIAS
• Item parameter drift in one direction (Condition 1)
would cause more bias in the final ability
estimate, leading to higher Type I and Type II
errors.
Limitation and Future Study
• Proportion of items with item parameter drift was 10%
of the bank.
– How would the results change with different proportions?
What about the size of the drift?
• Used only Rasch model
– How about other models and software?
• Minimum sample size was 10
– How about different minimum sample sizes (e.g., 30, 50,
etc.)?
• No iterative procedure (no update of the item difficulty
with drift)
– Do the results improve if the procedure is iterated, updating the
difficulty after drift is detected?