Assessment of Planning Ability: Psychometric Analyses on the Unidimensionality and Construct Validity of the Tower of London Task (TOL-F)

Supplementary Materials

S1 Predicting TOL-F Item Difficulties from Problem Structure (1PL Rasch Model)

We applied the Linear Logistic Test Model (LLTM) to explain the item difficulty parameters estimated in the 1PL Rasch model by the systematic manipulation of structural problem parameters in the construction of the problem set (i.e. 'minimum moves', 'search depth', and 'goal hierarchy'). We assumed differential effects for four-, five-, and six-move problems and for the four different combinations of goal hierarchy and search depth (cf. Fig. 2 in main text). The fit of the LLTM to the data was worse than that of the 1PL Rasch model (χ²[18] = 467.549, p < .001), indicating that the varied problem structure parameters did not provide an exhaustive explanation of the estimated item difficulty parameters. Nonetheless, the item parameters predicted by the LLTM correlated at r = .922 with the item parameters estimated in the 1PL Rasch model (see Fig. S1 A). The LLTM estimates of the effects of problem structure, which reflect the influence of alterations in the problem structure on an item's difficulty, are presented in Table S1.

Figure S1. Predictions of item parameters by the Linear Logistic Test Model (LLTM) accounting for systematic manipulations of the structural problem parameters 'minimum moves', 'search depth', and 'goal hierarchy'. Panels depict the comparisons of the item parameters predicted by the LLTM with (A) the item parameters estimated by the 1PL Rasch model as well as (B) the empirically observed item difficulty of the individual TOL-F problems. Black dots represent individual problems; black lines and light gray patches denote linearly fitted regression lines and corresponding non-simultaneous 95% confidence bands, respectively.

Table S1. The Effects of Alterations of Problem Structure on Item Difficulty as Measured by the 1PL Rasch Model

Minimum Moves   Search Depth   Goal Hierarchy         Effect on Item Difficulty
four            high           unambiguous            0.000 ¹
four            low            partly ambiguous       0.272
four            high           partly ambiguous       0.549
four            low            completely ambiguous   0.749
five            high           unambiguous            1.910
five            low            partly ambiguous       2.182
five            high           partly ambiguous       2.458
five            low            completely ambiguous   2.655
six             high           unambiguous            2.828
six             low            partly ambiguous       3.101
six             high           partly ambiguous       3.377
six             low            completely ambiguous   3.574

Note. ¹ This combination of structural problem parameters serves as the baseline to which the other effects are compared. For example, changing the goal hierarchy from unambiguous to partly ambiguous (while keeping a minimum number of four moves and a high search depth) would lead to an increase in the expected item parameter of 0.549.

In summary, the data structure modeled in the context of the 1PL Rasch model by the present LLTM can be regarded as an adequate approximation that accounts for 82% of the variation in the empirically observed data on item difficulty (r = -.905; Fig. S1 B). By comparison, a reduced LLTM that includes only the commonly used 'minimum number of moves' as predictor of problem difficulty also provides a worse fit than the 1PL Rasch model (χ²[21] = 1200.366, p < .001) and is outperformed by the LLTM including 'minimum moves', 'search depth', and 'goal hierarchy' in predicting the empirically observed item difficulty (χ²[3] = 265.268, p < .001).
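As an illustration of this model comparison, the analysis described above could be set up with the eRm package in R as sketched below. The object names tolf_resp (a persons × 24 items matrix of dichotomous responses) and item_info (a data frame coding 'minimum moves' and the combined 'goal hierarchy'/'search depth' factor for the 24 problems) are hypothetical placeholders and do not stem from the original analysis scripts.

# Minimal sketch of the S1 analysis (eRm package); object names are placeholders.
library(eRm)

# Design matrix W: one row per item, one column per basic parameter, built from
# hypothetical factors min_moves (four/five/six) and goal_search (the four
# combinations of goal hierarchy and search depth); the intercept is dropped.
W <- model.matrix(~ min_moves + goal_search, data = item_info)[, -1]

rasch_fit <- RM(tolf_resp)        # 1PL Rasch model
lltm_fit  <- LLTM(tolf_resp, W)   # LLTM with the structural predictors

# Likelihood ratio test of the (more restrictive) LLTM against the Rasch model
lr_stat <- 2 * (rasch_fit$loglik - lltm_fit$loglik)
lr_df   <- rasch_fit$npar - lltm_fit$npar
pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Correlation of LLTM-predicted with Rasch-estimated item parameters (Fig. S1 A)
cor(lltm_fit$betapar, rasch_fit$betapar)

With five basic parameters for the 24 items, this likelihood ratio test has 23 - 5 = 18 degrees of freedom, corresponding to the χ²[18] statistic reported above.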
However, the structural parameter 'minimum number of moves' has the strongest influence of all parameters included in our model, as indicated by a comparison of their effects on the estimated item difficulty (cf. Table S1).

S2 Testing the Stability of Estimated Item Parameters

The stability of the item difficulties was assessed using the Mantel-Haenszel χ² statistic. This method can be applied to test for the presence of differential item functioning when two groups, usually termed the reference and the focal group, have worked on the same set of items. Following Dorans and Holland (1993), the Mantel-Haenszel χ² statistic tests the null hypothesis that the odds of obtaining a correct answer to a specific item are the same in two distinct groups of respondents across all levels of a matching variable, e.g. the raw score. As was stated by Holland and Thayer (1988), the Mantel-Haenszel χ² statistic can be considered the uniformly most powerful unbiased test of this null hypothesis. It has also been pointed out that the procedure does not assume that the data fit a specific psychometric model (e.g. Magis, Béland, Tuerlinckx, & De Boeck, 2010). In the analysis of the TOL-F data, the presence of differential item functioning with respect to gender, age, and educational level was investigated. In these analyses, male respondents, respondents below the age of 42, and respondents who were neither qualified for university entry nor had obtained an academic degree served as reference groups, respectively; the remaining respondents served as focal groups. Table S2 presents the results of the Mantel-Haenszel χ² statistic with continuity correction for these three comparisons. All calculations were carried out using the difR software (Magis, Béland, & Raîche, 2013).

Table S2. Stability analyses with the Mantel-Haenszel χ² statistic with continuity correction and the corresponding p-value for each item. p-values below .05 are marked with an asterisk

        Age                  Sex                  Education
Item    χ²        p          χ²        p          χ²        p
#01     0.5963    0.4400     0.7805    0.3770     0.3389    0.5605
#02     0.4846    0.4864     0.7016    0.4022     4.7918    0.0286*
#03     0.2451    0.6205     2.2626    0.1325     0.6087    0.4353
#04     0.0045    0.9466     2.1844    0.1394     0.6396    0.4238
#05     1.0104    0.3148     0.0275    0.8683     0.2206    0.6386
#06     4.5378    0.0332*    0.0179    0.8936     0.0098    0.9213
#07     0.0007    0.9786     0.2055    0.6503     0.7193    0.3964
#08     0.5420    0.4616     0.1760    0.6748     0.9608    0.3270
#09     3.3922    0.0655     0.0020    0.9639     0.9316    0.3344
#10     0.0332    0.8554     1.7828    0.1818     2.1235    0.1451
#11     9.3738    0.0022*    0.0324    0.8571     0.0512    0.8209
#12     0.1851    0.6670     0.7502    0.3864     0.3806    0.5373
#13     0.3248    0.5687     0.0012    0.9726     3.8202    0.0506
#14     0.0085    0.9267     0.8176    0.3659     0.0010    0.9748
#15     1.8024    0.1794     0.0233    0.8788     3.5057    0.0612
#16     1.4419    0.2298     0.2594    0.6105     0.4772    0.4897
#17     0.1580    0.6910     0.2072    0.6490     5.2475    0.0220*
#18     1.3005    0.2541     1.0645    0.3022     0.0942    0.7589
#19     13.9957   0.0002*    0.7981    0.3717     2.0952    0.1478
#20     0.0058    0.9393     1.1240    0.2891     0.0402    0.8411
#21     2.5512    0.1102     0.1402    0.7080     0.0011    0.9739
#22     3.5938    0.0580     0.4070    0.5235     1.9202    0.1658
#23     6.4740    0.0109*    0.0354    0.8508     1.0017    0.3169
#24     6.3668    0.0116*    2.5774    0.1084     0.0164    0.8981

Taken together, the results of the Mantel-Haenszel χ² statistic indicate that the item difficulties remain stable for all focal and reference groups. The observed item difficulties can thus be considered stable across various sample characteristics.
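The Mantel-Haenszel analyses reported in Table S2 can in principle be reproduced with the difR package; the following sketch shows the comparison for the age split. The objects tolf_items (a persons × 24 items matrix of dichotomous responses) and age_group (a vector coding respondents as "younger" vs. "older" relative to 42 years) are hypothetical placeholders and do not stem from the original analysis scripts.

# Minimal sketch of the Mantel-Haenszel DIF analysis (difR package);
# object names are placeholders.
library(difR)

mh_age <- difMH(Data = tolf_items,
                group = age_group,
                focal.name = "older",  # respondents below 42 years serve as reference group
                correct = TRUE)        # continuity correction, as in Table S2
mh_age  # prints the MH chi-square statistic and p-value for each of the 24 items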
The stability of the difficulty parameters in the 1PL (Rasch) model was further assessed using Andersen Likelihood Ratio Tests (LRTs; Andersen, 1973). Furthermore, its fit was assessed using infit and outfit mean square statistics (Linacre, 2002). The LRTs relate the likelihood of the data given the item parameters estimated in the total sample to the likelihoods of the data given the item parameters estimated in predefined subsamples and thus compare the relative difficulty of the items across different subsamples. A non-significant LRT thus indicates that the relative item difficulties remain stable across the populations of interest, as is assumed in the 1PL model. While Andersen (1973) initially described the calculation of LRTs based on subsamples that differ with regard to their raw scores, current research also applies LRTs to test the invariance of the difficulty parameters across populations of interest as additional evidence for the fit of the 1PL model (for a theoretical justification and practical illustration, see Fischer, 2007). Statistical tests to assess the stability of the 1PL model across relevant subpopulations were calculated using the software package eRm (version 0.15-4; Mair, Hatzinger, & Maier, 2014) for the statistical software R (version 3.1.1; R Core Development Team, 2013).

The stability of the item parameters obtained for the Rasch model was tested using Andersen Likelihood Ratio Tests with age, sex, educational level, and test performance as split criteria. The results suggest that the item parameters are comparable for men and women (χ²[23] = 23.76, p = .225) and across the two subsamples consisting of participants who were qualified for university entry or had obtained a university degree and of participants with a lower level of education (χ²[23] = 31.60, p = .109). However, our results also imply that the item parameters differ between older and younger participants (χ²[23] = 64.53, p < .001) and between high- and low-performing participants (χ²[23] = 51.92, p = .001). For these analyses, the medians of the sample's age (42 years) and of the TOL-F raw score (16) were used as split criteria. In order to explore the observed differences at the item level, additional z-tests according to Fischer and Scheiblechner (1970) were carried out. Between the two groups that were split with regard to their test performance, significant differences (|z| > 2.57, p < .01) were found for only one item (#16). Between the two age groups, significant differences were found for four items (#06, #11, #19, #24) (Table S3).
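The Andersen likelihood ratio tests and item-wise z-tests described above are implemented in the eRm package; a minimal sketch is given below, with tolf_resp (a persons × 24 items matrix of dichotomous responses) and age_split (a grouping vector for the median age split) as hypothetical placeholders that do not stem from the original analysis scripts.

# Minimal sketch of the Andersen LR tests and item-wise z-tests (eRm package).
library(eRm)

rasch_fit <- RM(tolf_resp)

# Andersen LR test with the median raw score as internal split criterion
summary(LRtest(rasch_fit, splitcr = "median"))

# Andersen LR test with an external split criterion, e.g. the median age of 42 years
summary(LRtest(rasch_fit, splitcr = age_split))

# Item-wise z-tests (Fischer & Scheiblechner, 1970) for the same split
Waldtest(rasch_fit, splitcr = age_split)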
Table S3. Stability analyses for the 1PL parameter estimates: item-specific test statistics (z) for the splits by age, sex, education, and performance, together with infit and outfit mean square statistics

        Item-Specific Test Statistics (z)
Item    Age      Sex      Education   Performance   Infit   Outfit
#01     -0.98    -0.99    -0.83       -0.42         0.94    0.86
#02      0.62     2.27     1.00        1.31         0.95    0.98
#03     -0.99     1.74    -1.03       -1.08         0.96    0.96
#04     -0.22     1.52    -0.96       -0.44         0.98    0.99
#05      1.13     0.28     0.58       -0.07         0.94    0.82
#06     -2.57     0.06     0.07        1.13         0.99    1.14
#07     -0.09     0.54     0.79       -0.35         0.92    1.12
#08     -0.51    -0.55    -1.47       -1.13         0.92    0.87
#09      1.73     0.14     0.83       -1.83         0.95    0.93
#10     -0.52    -1.95    -1.39        1.64         1.05    1.06
#11      3.09    -0.32     0.31       -0.09         0.97    0.94
#12     -0.46    -1.18     0.65        0.06         0.96    0.96
#13      0.66    -0.23     2.14        1.27         0.96    1.11
#14      0.41    -1.19    -0.19        0.14         0.95    1.06
#15     -1.69    -0.30    -1.75        2.42         1.06    1.11
#16     -1.47     0.12     1.18        3.22         1.07    1.09
#17      0.97     0.42    -2.54       -1.95         0.95    0.90
#18      1.60     0.90     0.07       -2.35         0.92    0.90
#19      3.89     0.91     1.04       -1.15         0.96    0.95
#20     -0.16    -1.45     0.25        1.55         1.01    1.01
#21     -1.19    -0.45    -0.48       -1.64         0.93    0.91
#22      2.13    -0.93     1.57        0.23         0.99    0.95
#23     -2.01     0.33    -1.50       -1.70         0.92    0.88
#24     -2.84    -2.21     0.56        1.95         1.06    1.06

Previous studies have also evaluated the differences in item parameter estimates between different subpopulations by calculating their maximum difference and by assessing whether the observed maximum difference was below a threshold of .5 (e.g. Smith et al., 2009). For older and younger participants, the maximum difference in the item parameters was .617, with this difference being greater than .5 for only two items (#06, #19). For high- and low-performing participants, the maximum item parameter difference was .49. The item parameter estimates in the resulting eight subsamples are presented in Table S4; a sketch of how such subsample comparisons can be computed follows the table.

Table S4. Item parameter estimates in the 1PL model for subsamples defined by age, sex, education, and planning ability (TOL-F raw score)

        Age                        Sex                 Education           TOL-F Raw Score
Item    ≥ 42 years   < 42 years    Male      Female    High      Low       ≥ 16      < 16
#01     -1.86        -2.12         -1.85     -2.11     -1.78     -2.05     -1.97     -2.12
#02     -2.38        -2.19         -2.77     -2.01     -2.64     -2.20     -2.36     -1.92
#03     -1.03        -1.22         -1.34     -0.98     -0.92     -1.17     -1.07     -1.35
#04     -0.70        -0.74         -0.88     -0.60     -0.54     -0.76     -0.70     -0.79
#05     -1.98        -1.69         -1.85     -1.78     -1.97     -1.77     -1.80     -1.83
#06     -1.23        -1.80         -1.52     -1.51     -1.53     -1.51     -1.58     -1.29
#07     -1.65        -1.67         -1.73     -1.60     -1.87     -1.61     -1.64     -1.74
#08     -1.82        -1.95         -1.81     -1.95     -1.53     -1.98     -1.82     -2.24
#09      0.20         0.46          0.33      0.35      0.20      0.36      0.44      0.14
#10      0.69         0.62          0.82      0.53      0.90      0.61      0.55      0.80
#11      0.94         1.41          1.20      1.15      1.11      1.18      1.17      1.16
#12      1.66         1.59          1.74      1.55      1.51      1.65      1.62      1.63
#13     -1.03        -0.90         -0.92     -0.97     -1.46     -0.86     -1.02     -0.76
#14     -0.75        -0.68         -0.59     -0.80     -0.67     -0.72     -0.72     -0.69
#15      0.31         0.05          0.21      0.16      0.48      0.12      0.04      0.42
#16      0.37         0.14          0.25      0.27      0.05      0.30      0.07      0.56
#17      1.32         1.47          1.37      1.43      1.91      1.32      1.56      1.25
#18      0.86         1.10          0.91      1.05      0.97      0.98      1.15      0.79
#19      1.16         1.78          1.38      1.52      1.27      1.49      1.55      1.37
#20      2.03         2.00          2.16      1.91      1.96      2.03      1.86      2.13
#21      0.79         0.62          0.75      0.68      0.79      0.69      0.81      0.56
#22      1.77         2.13          2.03      1.87      1.63      1.99      1.91      1.95
#23      1.22         0.92          1.05      1.10      1.34      1.02      1.20      0.93
#24      1.10         0.67          1.07      0.74      0.79      0.90      0.75      1.04
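A minimal sketch of the subsample comparison summarized in Table S4 is given below; the response matrix tolf_resp and the grouping vector age_group are hypothetical placeholders and do not stem from the original analysis scripts.

# Minimal sketch of the subsample comparison of item parameters (eRm package).
library(eRm)

# Separate 1PL models for the two age groups
fit_older   <- RM(tolf_resp[age_group == "older", ])
fit_younger <- RM(tolf_resp[age_group == "younger", ])

# Maximum absolute difference between the two sets of item parameters,
# evaluated against the .5 criterion (cf. Smith et al., 2009); the sign
# convention of eRm's betapar does not affect the absolute differences
max(abs(fit_older$betapar - fit_younger$betapar))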
In order to assess the practical relevance of these partly significant differences, the person parameter estimates based on the different item parameter estimations in the two age groups and the two performance groups were compared. In the TOL-F, the person parameter estimates based on the whole sample typically varied between values of -3 and 4. For younger versus older participants, the maximum difference in the person parameter estimates amounted to .087, whereas for high- versus low-performing participants, the maximum difference in person parameter estimates amounted to .012. However, these differences can be considered practically negligible given that they were considerably lower than the differences in person parameter estimates that would result from an increase of 1 point in the TOL-F raw score (corresponding to differences in person parameter estimates of .239 or higher).

Finally, infit and outfit mean square statistics revealed a reasonable fit to the 1PL Rasch model. Infit and outfit mean square statistics reflect how well the observed data are predicted by the model. Following Linacre (2002), infit mean square statistics are sensitive to unexpected responses to items for which the probability of a correct answer is not too extreme, while outfit statistics are sensitive to unexpected responses to items with extremely high or low difficulty. For both infit and outfit mean square statistics, values near 1 are desirable, and values between .7 and 1.3 still indicate a reasonable fit to the Rasch model (e.g. Bond & Fox, 2001; Wright & Linacre, 1994). In detail, for the set of 24 four- to six-move problems, infit mean square statistics ranged between .92 and 1.07 and outfit mean square statistics ranged between .82 and 1.14 (Table S3). These statistics therefore lie well within the range of 0.7 to 1.3 that is regarded as desirable in Rasch measurement (e.g. Bond & Fox, 2001; Wright & Linacre, 1994).

Taken together, the fit statistics and model tests presented in this section indicate a reasonable fit to the 1PL model, with the estimated item difficulty parameters closely reflecting the empirically observed item difficulties (r = -.992). Also in this framework, the obtained item difficulties can thus be considered stable across various sample characteristics.
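For completeness, the person parameter estimates and the infit and outfit mean square statistics reported above can be obtained in the eRm framework as sketched below; tolf_resp is again a hypothetical placeholder for the dichotomous response matrix and does not stem from the original analysis scripts.

# Minimal sketch of person parameter estimation and item fit statistics (eRm package).
library(eRm)

rasch_fit <- RM(tolf_resp)

# Person parameter estimates (in the TOL-F typically ranging from about -3 to 4)
ppar <- person.parameter(rasch_fit)

# Infit and outfit mean square statistics for the 24 items (cf. Table S3)
itemfit(ppar)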