
How to Interpret Effect Size in CBT–PBT Comparability Studies
Presented by Leah Tepelunde Kaira
Dr. Nambury Raju Summer Internship Program
Order of Presentation
• Introduction
• Purpose of study
• Review of Literature
• Method
• Results
• Concluding remarks
Introduction
• Use of computerized testing has increased
over the past decade
– immediate scoring and reporting of results
– more flexible test administration schedules
– greater test administration efficiency
• Due to limited resources, education systems
provide both computer-based (CBT) and
paper-based (PBT) tests
Introduction continued
• Standards (AERA et al., 1999) require a
“clear rationale and supporting evidence”
(Standard 4.10, p. 57) that scores obtained
from CBT and PBT can be used
interchangeably
• International Test Commission (ITC)
requires that testing agencies “provide
clear documented evidence of equivalence
…” (ITC, 2005, p. 21)
Introduction continued
• Although professional guidelines stipulate some
methods that could be employed to examine
comparability, they are silent with respect to how
to judge comparability
• The lack of criteria has resulted in educational
testing researchers using professional judgment or
guidelines employed in other fields
• Among the most commonly used guidelines are those
suggested by Cohen (1988)
– Problem: may be misleading because in some areas
(e.g., education), small effect sizes are more likely
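For reference, Cohen's conventions treat standardized mean differences of roughly 0.2, 0.5, and 0.8 as small, medium, and large. A minimal sketch of how such an effect size could be computed and labeled for CBT and PBT score groups is shown below; the function and variable names are illustrative and are not taken from any of the studies discussed here.

    import numpy as np

    def cohens_d(cbt_scores, pbt_scores):
        # Standardized mean difference (Cohen's d) using the pooled SD.
        cbt = np.asarray(cbt_scores, dtype=float)
        pbt = np.asarray(pbt_scores, dtype=float)
        n1, n2 = len(cbt), len(pbt)
        pooled_var = ((n1 - 1) * cbt.var(ddof=1) +
                      (n2 - 1) * pbt.var(ddof=1)) / (n1 + n2 - 2)
        return (cbt.mean() - pbt.mean()) / np.sqrt(pooled_var)

    def cohen_label(d):
        # Cohen's (1988) conventional benchmarks: 0.2 small, 0.5 medium, 0.8 large.
        d = abs(d)
        if d < 0.2:
            return "negligible"
        if d < 0.5:
            return "small"
        if d < 0.8:
            return "medium"
        return "large"

Applied mechanically, a CBT–PBT difference of, say, d = 0.15 would be labeled negligible even though, in an educational setting, it could still matter; that is the concern raised above.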
Purpose of study
• Provide guidelines for interpreting effect sizes in
comparability studies
• Questions:
– How should effect sizes in comparability
studies be interpreted?
– Does size of score scale have an impact on
effect size?
– Does sample size have an impact on effect
size?
– Does magnitude of effect size depend on the
score distribution?
Related Literature
– Choi and Tinkler (2002) compared CBT and
PBT scores from math and reading for grades
3 and 10.
• Compared item difficulty estimates and calculated
differences weighted by their standard errors
• Compared mean ability estimates across the
modes and grades to assess comparability.
• Reading items were coded based on their textual
focus to assess the relationship between textual
focus and item difficulty estimates.
Related literature continued
• More reading items were flagged compared to math.
• Mean differences in item difficulty estimates were higher for
3rd graders than for 10th graders, and larger in reading than in math.
• Within-grade comparisons showed that 3rd-grade reading items
became harder on computer than on paper; the difference was
negligible at 10th grade.
• Mode effect was larger for reading than for math
– Note that this study does not provide guidelines on how to
evaluate the size of the effect. In addition, no empirical evidence is
provided for using an absolute d-value of 2 to flag items that are
differentially difficult across the two administration modes.
Related literature continued
• Pearson (2007) evaluated comparability of online and paper field
tests
• Students were matched on reading, math, and writing scale scores,
gender, ethnic group, and field-test form.
• A standardized difference (Zdiff) was calculated for both the theta
and difficulty parameter estimates (a common formulation is sketched below).
• Cohen’s (1992) guidelines were used to interpret effect size.
• Standardized mean differences in theta were also small, except on
one form where larger standardized mean differences and effect
sizes were observed for White students, Hispanic students, and students
who indicated ‘other’ as their ethnicity. The observed effect sizes were
nonetheless small based on Cohen’s guidelines.
• Comparison of difficulty parameters resulted in 24 items being flagged
for standardized differences beyond ±1.96; however, the
associated effect sizes for all flagged items were 0.20 or less.
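The exact Zdiff formula used by Pearson is not reproduced here; a common formulation of such a standardized difference, under which a flag at ±1.96 corresponds to the conventional two-tailed 5% criterion, divides the difference between the mode-specific estimates by the standard error of that difference:

$Z_{\mathrm{diff}} = \dfrac{\hat{b}_{\mathrm{CBT}} - \hat{b}_{\mathrm{PBT}}}{\sqrt{SE_{\mathrm{CBT}}^{2} + SE_{\mathrm{PBT}}^{2}}}$

with $\hat{b}$ replaced by $\hat{\theta}$ for the ability comparison.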
Related literature continued
• Kim and Huynh (2007) investigated the
equivalence of scores from CBT and PBT
versions of Biology and Algebra end-of-course
exams.
• Results were analyzed by examining
differences in scale scores, item
parameters, and ability estimates at the
content-domain level.
• An effect size measure (g) was used to
evaluate the differences; Cohen’s criteria
were used to judge the magnitude of g.
Related Literature continued
• Items were recalibrated and the parameter estimates
were compared to the parameters in the bank. Robust Z
and average absolute difference (AAD) statistics, sketched
below, were used to examine significant differences.
• TCCs and TIFs of CBT and PBT were also
compared.
• Results showed small differences in scaled scores
as measured by the effect size. High correlations
were observed between recalibrated and bank item
parameters.
• The AAD statistic ranged from 0.29 to 0.37 with
small differences between CBT and PBT. TCCs and
TIFs for CBT and PBT were generally comparable in
both subjects.
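For reference, these two statistics are commonly defined as follows (a sketch of their usual definitions in item-parameter drift work; the exact variants used by Kim and Huynh may differ), where $d_i$ denotes the difference between the recalibrated and bank estimates for item $i$ of $n$:

$Z_{\mathrm{robust},i} = \dfrac{d_i - \mathrm{median}(d)}{0.74 \times \mathrm{IQR}(d)} \qquad \mathrm{AAD} = \dfrac{1}{n}\sum_{i=1}^{n}\lvert d_i \rvert$

The factor $0.74 \times \mathrm{IQR}$ serves as a robust estimate of the standard deviation of the differences under normality.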
Related Literature continued
• Criteria used in evaluating comparability
– Difference in mean scores
– Difference in item difficulty estimates
– Difference in ability parameter estimates
– Difference in TCCs and TIFs
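For the last two criteria, the test characteristic curve (TCC) and test information function (TIF) are the standard IRT aggregates over the $n$ items of a test:

$\mathrm{TCC}(\theta) = \sum_{i=1}^{n} P_i(\theta) \qquad \mathrm{TIF}(\theta) = \sum_{i=1}^{n} I_i(\theta)$

where $P_i(\theta)$ is the model-implied probability of a correct response to item $i$ and $I_i(\theta)$ is the item information function; comparability is judged by how closely the CBT and PBT curves track each other across $\theta$.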
Method
• Study conditions
– 2 score scale sizes
– 4 score distributions
– 4 sample sizes
Method
• Procedure
a. Compute baseline TCC using operational item parameters
and theta values
b. Simulate performance of CBT learners on the test by
manipulating the item difficulty parameter such that the
maximum difference in expected score between CBT and
PBT groups is 0.1. Compute a TCC.
c. Repeat the procedure in (b) above to reflect maximum
differences in expected scores of 0.2 to 3.0 in
increments of 0.1.
d. For each of the simulated TCCs, compute scaled scores
for various raw scores
e. Using the scaled scores computed in step d, compute
effect size between 2 TCCs.
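A minimal sketch of steps (a) through (e) is given below, assuming a 3PL item response model, a linear raw-to-scale conversion, and illustrative item parameters and ability values; none of these choices, nor the function names, are taken from the actual study.

    import numpy as np

    def p_correct(theta, a, b, c):
        # 3PL probability of a correct response (assumed model).
        return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta[:, None] - b[None, :])))

    def tcc(theta, a, b, c):
        # (a) Test characteristic curve: expected raw score at each theta value.
        return p_correct(theta, a, b, c).sum(axis=1)

    def shift_for_max_gap(theta, a, b, c, target_gap, step=0.005):
        # (b) Shift item difficulties until the maximum difference between the
        # baseline (PBT) TCC and the shifted (CBT) TCC reaches the target gap.
        base = tcc(theta, a, b, c)
        shift = 0.0
        while np.max(np.abs(base - tcc(theta, a, b + shift, c))) < target_gap:
            shift += step
        return b + shift

    def effect_size(x, y):
        # (e) Standardized mean difference between two sets of scaled scores.
        pooled_sd = np.sqrt((np.var(x, ddof=1) + np.var(y, ddof=1)) / 2)
        return (np.mean(x) - np.mean(y)) / pooled_sd

    rng = np.random.default_rng(0)
    n_items = 50
    a = rng.uniform(0.5, 2.0, n_items)     # discriminations (illustrative)
    b = rng.normal(0.0, 1.0, n_items)      # difficulties (illustrative)
    c = np.full(n_items, 0.2)              # guessing parameters (illustrative)
    theta = rng.normal(0.0, 1.0, 2000)     # ability values (illustrative)

    scale = lambda raw: 100 + 5 * raw      # (d) illustrative raw-to-scale conversion

    pbt_scaled = scale(tcc(theta, a, b, c))
    for gap in np.arange(0.1, 3.01, 0.1):  # (b)-(c) max expected-score gaps 0.1 to 3.0
        cbt_scaled = scale(tcc(theta, a, shift_for_max_gap(theta, a, b, c, gap), c))
        print(f"max gap {gap:.1f}: effect size {effect_size(pbt_scaled, cbt_scaled):.4f}")

The loop prints one effect size per simulated TCC, mirroring the 30 simulated conditions reported in the Results.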
Results
[Table: 30 effect size values, one per simulated condition, ranging from 0.0069 to 0.2053]
Results – Empirical distribution
[Figure: effect size (0 to 0.45) plotted against TCC (1 to 29) for four conditions, n = 1 to n = 4]
Results – Normal distribution
[Figure: effect size (0 to 0.45) plotted against TCC (1 to 29) for four conditions, n = 1 to n = 4]
Results – Negatively skewed distribution
[Figure: effect size (0 to 0.45) plotted against TCC (1 to 29) for four conditions, n = 1 to n = 4]
Results – Positively skewed distribution
[Figure: effect size (0 to 0.45) plotted against TCC (1 to 29) for four conditions, n = 1 to n = 4]
Results – Summary
• Both sample size and score distribution
have an impact on effect size
• Better results were obtained with roughly equal
sample sizes
• Larger effect sizes were observed with skewed
distributions than with the empirical and normal
distributions
Concluding remark
• Researchers evaluating comparability of
CBT and PBT scores may need to be
more cautious in using Cohen’s guidelines
to judge comparability
Thank You!
• Suggestions and comments are
welcome!