Assessment of Planning Ability:
Psychometric Analyses on the Unidimensionality and
Construct Validity of the Tower of London Task (TOL-F)
Supplementary Materials
S1
Predicting TOL-F Item Difficulties from Problem Structure (1PL Rasch Model)
We applied the Linear Logistic Test model (LLTM) to explain the item difficulty parameters estimated in
the 1PL Rasch model by the systematic manipulation of structural problem parameters in the construction
of the problem set (i.e. ‘minimum moves’, ‘search depth’, and ‘goal hierarchy’). We assumed differential
effects for four-, five-, and six-move problems and the four different combinations of goal hierarchy and
search depth (cf. Fig. 2 in main text). For the LLTM model, the fit to the data was worse (χ2[18] =
467.549, p < .001) than that of the 1PL Rasch model, indicating that the varied problem structure
parameters did not provide an overall exhaustive explanation of the measured item difficulty parameters.
Nonetheless, the item parameters predicted by the LLTM correlated at r = .922 with the item parameters
measured in the 1PL Rasch model (see Fig. S1 A). The resulting LLTM estimates of the effects of
problem structure, which reflect the influence of alterations in the problem structure on an item’s
difficulty, are presented in Table S1.
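As a reading aid, the relation between the two models can be written out as follows. This is a sketch in the usual textbook notation, which is an assumption and not taken from the main text: \theta_v denotes the ability of person v, \beta_i the difficulty of item i, q_{ij} the known weight of structural parameter j in item i, \eta_j the effect of that parameter, and c a normalization constant.

```latex
% 1PL Rasch model: probability that person v solves item i
P(X_{vi} = 1 \mid \theta_v, \beta_i)
  = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)}

% LLTM: item difficulty as a weighted sum of structural effects
\beta_i = \sum_{j=1}^{m} q_{ij}\,\eta_j + c
```

Under this reading, the LLTM is a restricted Rasch model: it replaces the freely estimated item difficulties by a smaller number of structural effects, which is why its fit can only be equal to or worse than that of the 1PL Rasch model.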
Figure S1. Predictions of item parameters by the Linear Logistic Test Model (LLTM) accounting for systematic manipulations of
the structural problem parameters ‘minimum moves’, ‘search depth’, and ‘goal hierarchy’. Panels depict the comparisons of the
item parameters predicted by the LLTM with (A) the item parameters measured by the 1PL Rasch model as well as (B) with the
empirically observed item difficulty of the individual TOL-F problems. Black dots represent individual problems, black lines and
light gray patches denote linearly fitted regression lines and corresponding non-simultaneous 95% confidence bands,
respectively.
Table S1. The Effects of Alterations of Problem Structure on Item Difficulty as Measured by the 1PL Rasch Model

Minimum Moves   Search Depth   Goal Hierarchy          Effect on Item Difficulty
four            high           unambiguous             0.000¹
four            low            partly ambiguous        0.272
four            high           partly ambiguous        0.549
four            low            completely ambiguous    0.749
five            high           unambiguous             1.910
five            low            partly ambiguous        2.182
five            high           partly ambiguous        2.458
five            low            completely ambiguous    2.655
six             high           unambiguous             2.828
six             low            partly ambiguous        3.101
six             high           partly ambiguous        3.377
six             low            completely ambiguous    3.574

Note. ¹ This combination of structural problem parameters serves as the
baseline to which the other effects are compared. For example, changing
the goal hierarchy from unambiguous to partly ambiguous (while keeping
a minimum number of four moves and a high search depth) would lead to
an increase in the expected item parameter of 0.549.
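The way the effects in Table S1 are read against the baseline can be sketched in a few lines of Python. The values are copied from Table S1; the dictionary layout and the helper name `effect_change` are illustrative assumptions, not part of the original analysis.

```python
# Effects on item difficulty copied from Table S1, keyed by
# (minimum moves, search depth, goal hierarchy).
EFFECTS = {
    ("four", "high", "unambiguous"): 0.000,  # baseline combination
    ("four", "low", "partly ambiguous"): 0.272,
    ("four", "high", "partly ambiguous"): 0.549,
    ("four", "low", "completely ambiguous"): 0.749,
    ("five", "high", "unambiguous"): 1.910,
    ("five", "low", "partly ambiguous"): 2.182,
    ("five", "high", "partly ambiguous"): 2.458,
    ("five", "low", "completely ambiguous"): 2.655,
    ("six", "high", "unambiguous"): 2.828,
    ("six", "low", "partly ambiguous"): 3.101,
    ("six", "high", "partly ambiguous"): 3.377,
    ("six", "low", "completely ambiguous"): 3.574,
}


def effect_change(source, target):
    """Expected change in the item parameter when the problem
    structure is altered from `source` to `target` (hypothetical helper)."""
    return round(EFFECTS[target] - EFFECTS[source], 3)


baseline = ("four", "high", "unambiguous")

# The example from the table note: unambiguous -> partly ambiguous.
print(effect_change(baseline, ("four", "high", "partly ambiguous")))  # 0.549

# Adding one move while keeping search depth and goal hierarchy fixed.
print(effect_change(baseline, ("five", "high", "unambiguous")))
print(effect_change(("five", "high", "unambiguous"), ("six", "high", "unambiguous")))
```

Comparing rows in this way shows that a change in the minimum number of moves shifts the expected item parameter far more than any change in search depth or goal hierarchy within a given move level.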
In summary, the data structure modeled in the context of the 1PL Rasch model by the present
LLTM can be regarded as an adequate approximation that accounts for 82% of the variation in the
empirically observed data on item difficulty (r = -.905; Fig. S1 B). By comparison, a reduced LLTM
that includes the commonly used ‘minimum number of moves’ as the sole predictor of problem
difficulty also provides a worse fit than the 1PL Rasch model (χ2[21] = 1200.366, p < .001) and is
outperformed by the LLTM including ‘minimum moves’, ‘search depth’, and ‘goal hierarchy’ for the
prediction of the empirically observed item difficulty (χ2[3] = 265.268, p < .001). However, the structural
parameter ‘minimum number of moves’ has the strongest influence of all parameters included in our
model, as is indicated by a comparison of their effects on the estimated item difficulty (cf. Table S1).
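The share of explained variation quoted above follows from squaring the reported correlation. A minimal check, using only the value of r given in the text:

```python
# Correlation between LLTM-predicted item parameters and the
# empirically observed item difficulty, as reported in the text.
r = -0.905

# Proportion of variation accounted for is the squared correlation.
r_squared = r ** 2
print(f"{r_squared:.3f}")  # 0.819
```

The negative sign of r drops out when squaring; it presumably reflects that higher item parameters go along with lower observed solution rates.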
S2
Testing the Stability of Estimated Item Parameters
The stability of the item difficulties was assessed using the Mantel-Haenszel χ² statistic. This method
tests for the presence of differential item functioning and can be applied when two groups, usually named
the reference and the focal group, have worked on the same set of items. Following Dorans and Holland (1993),
the Mantel-Haenszel χ² statistic tests the null hypothesis that the odds of obtaining a correct answer to a
specific item are the same in two distinct groups of respondents across all levels of a matching variable,
e.g. the raw score. As was stated by Holland and Thayer (1988), the Mantel-Haenszel χ² statistic can be
considered as the uniformly most powerful unbiased test of this null hypothesis. It has been pointed out
by some authors that it does not assume that the data fit a specific psychometric model (e.g. Magis,
Béland, Tuerlinckx, & De Boeck, 2010). In the analysis of the TOL-F data, the presence of differential
item functioning with respect to gender, age, and educational level was investigated. In this analysis, we
used male respondents, respondents who were neither qualified for university entry nor had obtained an
academic degree, and respondents below the age of 42 as the respective reference groups. The remaining
respondents served as the focal groups. Table S2 presents the results of the calculation of the Mantel-Haenszel
χ² statistic with continuity correction for these three comparisons. All calculations were carried out using the
difR software (Magis, Beland, & Raiche, 2013).
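The statistic described above can be sketched in Python as follows. This is a minimal illustration of the textbook Mantel-Haenszel formula with a continuity correction of 0.5, not the difR implementation; the function names and the input layout (one 2×2 table per level of the matching variable) are assumptions made for the example, and the counts in the usage line are invented.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def mantel_haenszel_chi2(strata):
    """Mantel-Haenszel chi-square with continuity correction.

    `strata` holds one tuple per level of the matching variable:
    (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    """
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in strata:
        n_ref, n_foc = a + b, c + d      # group sizes in this stratum
        m1, m0 = a + c, b + d            # correct / incorrect totals
        total = n_ref + n_foc
        if total < 2:
            continue                     # stratum carries no information
        sum_a += a
        sum_e += n_ref * m1 / total      # expected correct in reference group
        sum_v += n_ref * n_foc * m1 * m0 / (total ** 2 * (total - 1))
    if sum_v == 0:
        raise ValueError("no stratum with non-zero variance")
    stat = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v
    return stat, chi2_sf_df1(stat)


# Hypothetical single-stratum example (counts are invented).
stat, p = mantel_haenszel_chi2([(30, 10, 20, 20)])
print(round(stat, 3), round(p, 4))
```

Because the statistic has one degree of freedom, each p-value in Table S2 can be recovered from its χ² value with `chi2_sf_df1`.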
Table S2. Stability analyses with the Mantel-Haenszel χ² statistic with continuity correction and the
corresponding p-value for each item. p-Values below .05 are marked with an asterisk.

        Age                    Sex                    Education
Item    χ²         p           χ²         p           χ²         p
#01      0.5963    0.4400      0.7805     0.3770      0.4846     0.4864
#02      0.3389    0.5605      4.7918     0.0286*     0.2451     0.6205
#03      0.7016    0.4022      2.2626     0.1325      0.6087     0.4353
#04      0.0045    0.9466      2.1844     0.1394      0.6396     0.4238
#05      1.0104    0.3148      0.0275     0.8683      0.2206     0.6386
#06      4.5378    0.0332*     0.0179     0.8936      0.0098     0.9213
#07      0.0007    0.9786      0.2055     0.6503      0.7193     0.3964
#08      0.5420    0.4616      0.1760     0.6748      0.9608     0.3270
#09      3.3922    0.0655      0.0020     0.9639      0.9316     0.3344
#10      0.0332    0.8554      1.7828     0.1818      2.1235     0.1451
#11      9.3738    0.0022*     0.0324     0.8571      0.0512     0.8209
#12      0.1851    0.6670      0.7502     0.3864      0.3806     0.5373
#13      0.3248    0.5687      0.0012     0.9726      3.8202     0.0506
#14      0.0085    0.9267      0.8176     0.3659      0.0010     0.9748
#15      1.8024    0.1794      0.0233     0.8788      3.5057     0.0612
#16      1.4419    0.2298      0.2594     0.6105      0.4772     0.4897
#17      0.1580    0.6910      0.2072     0.6490      5.2475     0.0220*
#18      1.3005    0.2541      1.0645     0.3022      0.0942     0.7589
#19     13.9957    0.0002*     0.7981     0.3717      2.0952     0.1478
#20      0.0058    0.9393      1.1240     0.2891      0.0402     0.8411
#21      2.5512    0.1102      0.1402     0.7080      0.0011     0.9739
#22      3.5938    0.0580      0.4070     0.5235      1.9202     0.1658
#23      6.4740    0.0109*     0.0354     0.8508      1.0017     0.3169
#24      6.3668    0.0116*     2.5774     0.1084      0.0164     0.8981

Taken together, the results of the Mantel-Haenszel χ² statistic indicate that the item difficulties remain
stable for all focal and reference groups. The observed item difficulties can thus be considered as stable
across various sample characteristics.
The stability of the difficulty parameters in the 1PL or Rasch model was further assessed using Andersen
Likelihood Ratio Tests (LRTs, Andersen, 1973). Furthermore, its fit was assessed using infit and outfit
mean square statistics (Linacre, 2002). The LRTs relate the likelihood of the data for the item parameters
estimated in the total sample to the likelihoods of the data for the item parameters estimated in predefined
subsamples and thus compare the relative difficulty of the items across different subsamples. A
non-significant LRT thus indicates that the relative item difficulties remain stable across different populations
of interest, as it is assumed in the 1PL model. While Andersen (1973) initially described the calculation of
LRTs based on subsamples that differ with regard to their raw scores, current research also applies LRTs
to test the invariance of the difficulty parameters across populations of interest as additional evidence for
the fit of the 1PL model (for a theoretical justification and practical illustration, see Fischer, 2007).
Statistical tests to assess the stability of the 1PL model across relevant sub-populations were
calculated using the software package eRm (version 0.15-4; Mair, Hatzinger, & Maier, 2014) of the
statistical software R (version 3.1.1; R Core Development Team, 2013). The stability of the item
parameters obtained for the Rasch model was tested using Andersen Likelihood Ratio tests which used
age, sex, education level, and test performance as split criteria. The results suggest that the item
parameters are comparable for men and women (χ2 [23]= 23.76, p = .225) and across the two subsamples
consisting of participants who were qualified for university entry or had obtained a university degree and
of participants with a lower level of education (χ2 [23]= 31.60, p = .109). However, our results also imply
that the item parameters differ between old and young participants (χ2 [23]= 64.53, p < .001) and between
high- and low-performing participants (χ2 [23]= 51.92, p = .001). For both analyses, the median of the
sample’s age (42 years) and TOL-F raw score (16) was used as split criterion. In order to explore the
observed differences on the item level, additional z-tests according to Fischer and Scheiblechner (1970)
were carried out. Between the two groups that were split with regard to their test performance, significant
differences (|z| > 2.57, p < .01) were only found for one item (#16). Between the two age groups,
significant differences were found for four items (#06, #11, #19, #24) (Table S3).
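The two kinds of test used here can be sketched as follows. The functions show the general form of the Andersen likelihood ratio statistic, its degrees of freedom, and an item-wise z-test on the difference between two subsample estimates. They are an illustration under stated assumptions, not the eRm code: the function names are hypothetical, and the log-likelihoods and standard errors in the usage lines are invented, since neither is reported in this supplement.

```python
import math


def andersen_lr(loglik_total, logliks_subsamples):
    """Likelihood ratio statistic: twice the gain in conditional
    log-likelihood from estimating item parameters per subsample."""
    return 2.0 * (sum(logliks_subsamples) - loglik_total)


def andersen_df(n_items, n_groups):
    """Degrees of freedom: (groups - 1) * (free item parameters)."""
    return (n_groups - 1) * (n_items - 1)


def item_z(beta_1, se_1, beta_2, se_2):
    """z-statistic for the difference between two independent
    estimates of the same item parameter."""
    return (beta_1 - beta_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)


# 24 items split into two subsamples gives the 23 df reported above.
print(andersen_df(24, 2))  # 23

# Invented log-likelihoods and standard errors, for illustration only.
print(andersen_lr(-100.0, [-45.0, -50.0]))
print(item_z(1.0, 0.3, 0.0, 0.4))
```

An item would be flagged at the criterion used in the text when the absolute value returned by `item_z` exceeds 2.57.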
Table S3. Stability analyses for the 1PL parameter estimates.

        Item-Specific Test Statistics (z)
Item    Age      Sex      Education   Performance   Infit   Outfit
#01     -0.98    -0.99    -0.83       -0.42         0.94    0.86
#02      0.62     2.27     1.00        1.31         0.95    0.98
#03     -0.99     1.74    -1.03       -1.08         0.96    0.96
#04     -0.22     1.52    -0.96       -0.44         0.98    0.99
#05      1.13     0.28     0.58       -0.07         0.94    0.82
#06     -2.57     0.06     0.07        1.13         0.99    1.14
#07     -0.09     0.54     0.79       -0.35         0.92    1.12
#08     -0.51    -0.55    -1.47       -1.13         0.92    0.87
#09      1.73     0.14     0.83       -1.83         0.95    0.93
#10     -0.52    -1.95    -1.39        1.64         1.05    1.06
#11      3.09    -0.32     0.31       -0.09         0.97    0.94
#12     -0.46    -1.18     0.65        0.06         0.96    0.96
#13      0.66    -0.23     2.14        1.27         0.96    1.11
#14      0.41    -1.19    -0.19        0.14         0.95    1.06
#15     -1.69    -0.30    -1.75        2.42         1.06    1.11
#16     -1.47     0.12     1.18        3.22         1.07    1.09
#17      0.97     0.42    -2.54       -1.95         0.95    0.90
#18      1.60     0.90     0.07       -2.35         0.92    0.90
#19      3.89     0.91     1.04       -1.15         0.96    0.95
#20     -0.16    -1.45     0.25        1.55         1.01    1.01
#21     -1.19    -0.45    -0.48       -1.64         0.93    0.91
#22      2.13    -0.93     1.57        0.23         0.99    0.95
#23     -2.01     0.33    -1.50       -1.70         0.92    0.88
#24     -2.84    -2.21     0.56        1.95         1.06    1.06

Previous studies have also evaluated the differences in item parameter estimates between different
sub-populations by calculating their maximum difference and by assessing whether the observed maximum
difference was below the specific threshold of .5 (e.g. Smith et al., 2009). For old and young participants,
the maximum difference in the item parameters was .617, with this difference being greater than .5 for
only two items (#06, #19). For high- and low-performing participants, the maximum item parameter
difference was .49. The differences of the estimates for the item parameters in the resulting eight
subsamples are presented in Table S4.
Table S4. Item parameter estimates in the 1PL model for subsamples defined by age, sex, education, and
planning ability.

        Age                        Sex                 Education         TOL-F Raw Score
Item    ≥ 42 years   < 42 years    Male     Female     High     Low      ≥ 16     < 16
#01     -1.86        -2.12         -1.85    -2.11      -1.78    -2.05    -1.97    -2.12
#02     -2.38        -2.19         -2.77    -2.01      -2.64    -2.20    -2.36    -1.92
#03     -1.03        -1.22         -1.34    -0.98      -0.92    -1.17    -1.07    -1.35
#04     -0.70        -0.74         -0.88    -0.60      -0.54    -0.76    -0.70    -0.79
#05     -1.98        -1.69         -1.85    -1.78      -1.97    -1.77    -1.80    -1.83
#06     -1.23        -1.80         -1.52    -1.51      -1.53    -1.51    -1.58    -1.29
#07     -1.65        -1.67         -1.73    -1.60      -1.87    -1.61    -1.64    -1.74
#08     -1.82        -1.95         -1.81    -1.95      -1.53    -1.98    -1.82    -2.24
#09      0.20         0.46          0.33     0.35       0.20     0.36     0.44     0.14
#10      0.69         0.62          0.82     0.53       0.90     0.61     0.55     0.80
#11      0.94         1.41          1.20     1.15       1.11     1.18     1.17     1.16
#12      1.66         1.59          1.74     1.55       1.51     1.65     1.62     1.63
#13     -1.03        -0.90         -0.92    -0.97      -1.46    -0.86    -1.02    -0.76
#14     -0.75        -0.68         -0.59    -0.80      -0.67    -0.72    -0.72    -0.69
#15      0.31         0.05          0.21     0.16       0.48     0.12     0.04     0.42
#16      0.37         0.14          0.25     0.27       0.05     0.30     0.07     0.56
#17      1.32         1.47          1.37     1.43       1.91     1.32     1.56     1.25
#18      0.86         1.10          0.91     1.05       0.97     0.98     1.15     0.79
#19      1.16         1.78          1.38     1.52       1.27     1.49     1.55     1.37
#20      2.03         2.00          2.16     1.91       1.96     2.03     1.86     2.13
#21      0.79         0.62          0.75     0.68       0.79     0.69     0.81     0.56
#22      1.77         2.13          2.03     1.87       1.63     1.99     1.91     1.95
#23      1.22         0.92          1.05     1.10       1.34     1.02     1.20     0.93
#24      1.10         0.67          1.07     0.74       0.79     0.90     0.75     1.04

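The maximum-difference check for the age and performance splits can be retraced from Table S4. The sketch below copies the four relevant columns of the table; the helper names are illustrative assumptions. Because the table is rounded to two decimals, the results can differ in the third decimal from the values computed on unrounded estimates (e.g. .62 here versus the .617 reported in the text).

```python
# (item, age >= 42, age < 42, raw score >= 16, raw score < 16), from Table S4.
ROWS = [
    ("#01", -1.86, -2.12, -1.97, -2.12), ("#02", -2.38, -2.19, -2.36, -1.92),
    ("#03", -1.03, -1.22, -1.07, -1.35), ("#04", -0.70, -0.74, -0.70, -0.79),
    ("#05", -1.98, -1.69, -1.80, -1.83), ("#06", -1.23, -1.80, -1.58, -1.29),
    ("#07", -1.65, -1.67, -1.64, -1.74), ("#08", -1.82, -1.95, -1.82, -2.24),
    ("#09", 0.20, 0.46, 0.44, 0.14), ("#10", 0.69, 0.62, 0.55, 0.80),
    ("#11", 0.94, 1.41, 1.17, 1.16), ("#12", 1.66, 1.59, 1.62, 1.63),
    ("#13", -1.03, -0.90, -1.02, -0.76), ("#14", -0.75, -0.68, -0.72, -0.69),
    ("#15", 0.31, 0.05, 0.04, 0.42), ("#16", 0.37, 0.14, 0.07, 0.56),
    ("#17", 1.32, 1.47, 1.56, 1.25), ("#18", 0.86, 1.10, 1.15, 0.79),
    ("#19", 1.16, 1.78, 1.55, 1.37), ("#20", 2.03, 2.00, 1.86, 2.13),
    ("#21", 0.79, 0.62, 0.81, 0.56), ("#22", 1.77, 2.13, 1.91, 1.95),
    ("#23", 1.22, 0.92, 1.20, 0.93), ("#24", 1.10, 0.67, 0.75, 1.04),
]


def abs_diffs(col_a, col_b):
    """Absolute difference between two subsample columns, per item."""
    return {item: round(abs(vals[col_a] - vals[col_b]), 2) for item, *vals in ROWS}


def summarize(diffs, threshold=0.5):
    """Item with the largest difference, its size, and all items above threshold."""
    worst = max(diffs, key=diffs.get)
    flagged = sorted(item for item, d in diffs.items() if d > threshold)
    return worst, diffs[worst], flagged


age_summary = summarize(abs_diffs(0, 1))
performance_summary = summarize(abs_diffs(2, 3))
print(age_summary)
print(performance_summary)
```

For the age split this flags items #06 and #19, and for the performance split no item exceeds the threshold, in line with the description preceding the table.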
In order to assess the practical relevance of these partly significant differences, the person parameter
estimates based on the different item parameter estimations in both age groups and the two performance
groups were compared. In the TOL-F, the estimates for the person parameters based on the whole sample
typically varied between values of -3 and 4. For young versus old participants, the maximum difference in
the person parameter estimates amounted to .087, whereas for high- versus low-performing participants,
the maximum difference in person parameter estimates amounted to .012. However, these differences are
considered as practically insignificant given that they were considerably lower than the differences in
person parameter estimations that would result from an increase of 1 point in the TOL-F raw score
(corresponding to differences in person parameter estimations of .239 or higher).
Finally, infit and outfit mean square statistics revealed a reasonable fit to the 1PL Rasch model.
Infit and outfit mean square statistics reflect how well the observed data are predicted by the model.
Following Linacre (2002), infit mean square statistics are sensitive to unexpected responses to items for
which the probability of a correct answer is not too extreme, while outfit statistics are sensitive to
unexpected responses to items with extremely high or low difficulty. For both infit and outfit mean square
statistics, values near 1 are desirable and values between .7 and 1.3 still indicate a reasonable fit to the
Rasch model (e.g. Bond & Fox, 2001; Wright & Linacre, 1994). For the set of 24 four- to six-move
problems, infit mean square statistics ranged between .92 and 1.07 and outfit mean square statistics
ranged between .82 and 1.14 (Table S3), so all values lie well within this range.
Taken together, the results of the fit statistics and the model tests presented in this section indicated a
reasonable fit to the 1PL model with the estimated item difficulty parameters closely reflecting the
empirically observed item difficulties (r = -.992). Also in this framework, the obtained item difficulties
can thus be considered as stable across various sample characteristics.
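How infit and outfit mean squares are formed can be sketched for a single item. The code follows the usual textbook definitions (outfit as the unweighted mean of squared standardized residuals, infit as the information-weighted version) and assumes that person abilities and the item difficulty are already known. The function names and all numbers in the usage lines are invented for illustration and are not TOL-F estimates.

```python
import math


def rasch_p(theta, beta):
    """Probability of a correct response under the 1PL Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def item_fit(responses, thetas, beta):
    """Return (infit, outfit) mean squares for one item.

    responses: 0/1 answers, one per person
    thetas:    person ability estimates, same order
    beta:      item difficulty
    """
    sq_residuals, variances = [], []
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, beta)
        sq_residuals.append((x - p) ** 2)   # squared raw residual
        variances.append(p * (1.0 - p))     # model variance of the response
    # Outfit: mean of squared standardized residuals (outlier-sensitive).
    outfit = sum(r / v for r, v in zip(sq_residuals, variances)) / len(sq_residuals)
    # Infit: residuals weighted by their information (variance).
    infit = sum(sq_residuals) / sum(variances)
    return infit, outfit


# Invented example: a very able person unexpectedly fails an easy item.
infit, outfit = item_fit([0, 1, 0], [4.0, 0.0, 0.0], 0.0)
print(round(infit, 2), round(outfit, 2))
```

In the invented example the single surprising response from a far off-target person inflates the outfit value much more than the infit value, which is the difference in sensitivity described by Linacre (2002).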