RUNNING HEAD: Cross-Cultural Equivalence of an Inductive Reasoning Test

Inductive Reasoning in Zambia, Turkey, and The Netherlands: Establishing Cross-Cultural Equivalence

Fons J. R. van de Vijver
Tilburg University
The Netherlands

Mailing address:
Fons J. R. van de Vijver
Department of Psychology
Tilburg University
PO Box 90153
5000 LE Tilburg
The Netherlands
Phone: +31 13 466 2528
Fax: +31 13 466 2370
E-mail: fons.vandevijver@kub.nl

Acknowledgment. The help of Cigdem Kagitcibasi and Robert Serpell in making the data collection possible in Turkey and Zambia is gratefully acknowledged.

Abstract

Tasks of inductive reasoning and its component processes were administered to 704 Zambian, 877 Turkish, and 632 Dutch pupils from the highest two grades of primary and the lowest two grades of secondary school. All items were constructed using item-generating rules. Three types of equivalence were examined: structural equivalence (Does an instrument measure the same psychological concept in each country?), measurement unit equivalence (Do the scales have the same metric in each country?), and full score equivalence (full comparability of scores across countries). Structural and measurement unit equivalence were examined in two ways. First, a MIMIC (multiple indicators, multiple causes) structural equation model was fitted, with tasks for component processes as input and inductive reasoning tasks as output. Second, using a linear logistic model, the relationship between item difficulties and the difficulties of their constituent item-generating rules was examined in each country. Both analyses of equivalence provided strong evidence for structural equivalence, but only partial evidence for measurement unit equivalence; full score equivalence was not supported.

Equivalence of a Measure of Inductive Reasoning in Zambia, Turkey, and The Netherlands

Inductive reasoning has been a topic of considerable interest to cross-cultural researchers, mainly because of its strong relationship with general intelligence (Carroll, 1993; Gustafsson, 1984; Jensen, 1980). Many cultural populations have been studied using common tasks of inductive reasoning such as number series extrapolations (e.g., How should the following series be continued: 1, 4, 9, 16, ...?), figure series extrapolations such as Raven's Progressive Matrices, analogical reasoning (e.g., Complete the following: day : night :: white : ...?), and exclusion tasks (e.g., Mark the odd one out: (a) 21, (b) 14, (c) 28, (d) 63, (e) 32). Studies of inductive reasoning among nonwestern populations were reviewed by Irvine (1969, 1979; Irvine & Berry, 1988). He concluded that the structure found among western participants with exploratory factor-analytic techniques is usually replicated. More recent comparative studies, often based on comparisons of ethnic groups in the U.S.A., have confirmed this conclusion (e.g., Fan, Willson, & Reynolds, 1995; Geary & Whitworth, 1988; Hakstian & Vandenberg, 1979; Hennessy & Merrifield, 1976; Naglieri & Jensen, 1987; Ree & Carretta, 1995; Reschly, 1978; Sandoval, 1982; Sung & Dawis, 1981; Taylor & Ziegler, 1987; Valencia & Rankin, 1986; Valencia, Rankin, & Oakland, 1997). Major differences in structure (for instance as reported by Claassen & Cudeck, 1985) are exceptional.
Inductive reasoning provides a strong case for what Waitz, a nineteenth-century philosopher, called "the psychic unity of mankind" (Jahoda & Krewer, 1997), according to which the basic structure and operations of the cognitive system are universal while manifestations of these structures may vary across cultures, depending on what is relevant in a particular cultural context.

The validity of cross-cultural comparisons can be jeopardized by bias; examples of bias sources are country differences in stimulus familiarity (Serpell, 1979) and item translations (Ellis, 1990; Ellis, Becker, & Kimmel, 1993). Bias refers to the presence of score differences that do not reflect differences in the target construct. Much research has been reported on fair test use, which addresses the question whether a test predicts an external criterion such as job success equally well in different ethnic, age, or gender groups (e.g., Hunter, Schmidt, & Hunter, 1979). The present study does not address bias in test use but bias in test meaning; in other words, no reference is made here to social bias, unfairness, and differential predictive validity. The present study focuses on the question whether the same score, obtained in different cultural groups, has the same meaning across these groups. Such scores are unbiased.

Two types of approaches have been developed to deal with bias in cognitive tests. The first type, known under various labels such as culture-free, culture-fair, and culture-reduced testing (Jensen, 1980), attempts to eliminate or minimize the differential influence of cultural factors, like education, by adapting instrument features that may induce unwanted score differences across countries. Raven's Matrices Tests are often considered to exemplify this approach (e.g., Jensen, 1980). Despite the obvious importance of good test design, the approach has come under critical scrutiny; it has been argued that culture and test performance are so inextricably linked that a culture-free test does not exist (Frijda & Jahoda, 1966; Greenfield, 1997).

Second, various statistical procedures have been proposed to examine the appropriateness of psychological instruments in different ethnic groups. Examples are exploratory factor analysis followed by target rotations and the computation of factorial agreement between ethnic groups (Barrett, Petrides, Eysenck, & Eysenck, 1998; McCrae & Costa, 1997), simultaneous components analysis (Zuckerman, Kuhlman, Thornquist, & Kiers, 1991), item bias statistics (Holland & Wainer, 1993), and structural equation modeling (Little, 1997). It is remarkable that a priori and a posteriori approaches (test adaptations and statistical techniques, respectively) have almost never been combined, despite their common aim, mutual relevance, and complementarity.

The present paper attempts to integrate a priori and a posteriori approaches and takes equivalence as a starting point. Equivalence refers to the similarity of psychological meaning across cultural groups (i.e., the absence of bias). Three hierarchical types of equivalence can be envisaged (Van de Vijver & Leung, 1997a, b). At the lowest level the issue of similarity of a psychological construct, as measured by a test in different cultures, is addressed. An instrument shows structural (also called functional) equivalence if it measures the same construct in each cultural population studied. There is no claim that scores or measurement units are comparable across cultures.
In fact, instruments may be different across cultures; structural equivalence is supported if it can be shown that in each culture the same underlying construct (e.g., inductive reasoning) has been measured. The intermediate level refers to measurement unit equivalence, defined by equal scale units and unequal scale origins across cultural groups (e.g., the temperature scales in degrees Celsius and Kelvin). In practical terms, this type of equivalence is found when the same instrument has been administered in different groups but scores are not directly comparable across groups because of the presence of moderating variables with a bearing on group mean scores, such as intergroup differences in stimulus familiarity. Structural equation modeling is suitable to address measurement unit equivalence because it allows for a comparison of score metrics across cultural groups. The third and highest level is called full score equivalence and refers to identity of both scale units and origins. Only in the latter case can scores be compared both within and across cultures using techniques like t tests and analyses of (co)variance. Full score equivalence assumes the complete absence of bias in the measurement; score differences between and within cultures are then entirely due to differences in inductive reasoning.

There are no fully adequate statistical tests of full score equivalence, but some go a long way. The first is indirect and involves the use of additional variables to (dis)confirm a particular interpretation of cross-cultural score differences (Poortinga & Van de Vijver, 1987). Suppose that Raven's Standard Progressive Matrices Test is administered to adults in the U.S.A. and to illiterate Bushmen. It may well be that the test provides a good picture of inductive reasoning in both cultures. However, it is likely that differences between the countries are influenced by educational differences between the groups. Score differences within and across groups have a different meaning in this case. A measure of testwiseness or previous test exposure, administered to all participants, can be used to (dis)confirm that cross-cultural score differences are due to bias. Full score equivalence is then not demonstrated but assumed, and corollaries are tested.

Other tests of full score equivalence that have been proposed compare the patterning of cross-cultural score differences across items or subtests, often within the framework of structural equation modeling. An example is multilevel covariance structure analysis (Muthén, 1991, 1994), which compares the factor structure of pooled within-country data to between-country data. Such an analysis assumes that a sizable number of cultural groups is involved. Another example involves the modeling of latent means in a structural model (e.g., Little, 1997). A frequently employed approach, often based on item response theory, which is applicable when a small number of cultures have been studied, is the examination of differential item functioning or item bias (e.g., Holland & Wainer, 1993; Van der Linden & Hambleton, 1997). As long as the sources of bias (such as education) affect all items in a more or less uniform way, no statistical techniques will indicate that between-group differences are of a different nature than within-group differences. Only if bias affects some items can the proposed techniques identify it. In sum, the establishment of full score equivalence is an intricate issue.
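The temperature analogy can be made explicit. As an added illustration (not part of the original argument), the Kelvin and Fahrenheit scales relate to Celsius as

\[
K = C + 273.15 \qquad \text{and} \qquad F = 1.8\,C + 32 .
\]

The first conversion preserves the unit and only shifts the origin, which is the situation of measurement unit equivalence: difference scores transfer directly, since \(K_1 - K_2 = C_1 - C_2\), but raw levels do not. The second conversion changes the unit as well; only when two scales share both unit and origin, as in full score equivalence, are raw scores themselves directly comparable.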
In many empirical studies dealing with mental tests, full score equivalence is merely assumed. As a consequence, statements about the size of cross-cultural score differences often have an unknown validity. Sternberg and Kaufman's (1998) observation that we know that there are population differences in human abilities, but that their nature is elusive, is very pertinent.

In line with current thinking in validity theory (Embretson, 1983; Messick, 1988), the present study combines test design and statistical analyses to deal with bias (and equivalence). A distinction is made between internal and external procedures to establish equivalence, depending on whether the procedure is based on information derived from the scrutinized test itself (internal) or from additional tests (external). The present study examines the structural, measurement unit, and full score equivalence of a measure of inductive reasoning in three culturally widely divergent populations (Zambia, Turkey, and the Netherlands). Structural and measurement unit equivalence are studied using both an internal and an external procedure. The internal procedure to examine equivalence is based on item-generating rules that underlie the instruments. In the external procedure, equivalence is scrutinized by comparing the contribution of skill components to inductive reasoning across countries. Three components are presumably relevant in the types of inductive reasoning tasks studied here (Sternberg, 1977). The first is classification: treating stimuli as exemplars of higher order concepts (e.g., the set CDEF as four consecutive letters in the alphabet, as an instance of a group with one vowel, as a group with three consonants, etcetera). Individuals are more successful in inductive reasoning tasks when they can generate more of these classifications. Therefore, in addition to classification, the skill to generate underlying rules on the basis of a stimulus set was also tested. Finally, each generated rule has to be tested (e.g., Do other groups also have four consecutive letters?). The latter skill, labeled rule testing, was also assessed.

Method

Participants

An important consideration in the choice of countries was the presumed strong influence of schooling on test performance (Van de Vijver, 1997); the expenditure per head on education, a proxy for school quality, is strongly influenced by national affluence. Countries with considerable differences in school systems and educational expenditures per child were chosen. Furthermore, inclusion of at least three different cultural groups decreases the number of alternative hypotheses to explain cross-cultural differences (Campbell & Naroll, 1972). Zambia, Turkey, and the Netherlands show considerable differences in educational systems and GDP per capita; the 1995 GDP figures per capita were US$ 382, 2,814, and 25,635 for the three countries, respectively. School life expectancy in the three countries is 7.3, 9.7, and 15.5 years (United Nations, 1999). The choice of Zambia was also made because of its lingua franca in school; English is the school language in Zambia, which was convenient for developing and administering tasks. In each country pupils of four successive grades were involved. In the Netherlands these were the last two grades of primary school (Grades 5 and 6) and the first two grades of secondary school. The same procedure was applied in Zambia, where primary school has seven grades.
In a pilot study it was found that the tasks could not be adequately administered to pupils from Grade 5 because most of these children still have insufficient knowledge of English, which is the first language of few Zambians. Children start attending primary school in Turkey and the Netherlands at the age of six, while schooling starts one year later in Zambia; as a consequence, the Zambian pupils were on average two years older. The Zambian sample comprised more than 20 cultural groups (the three largest being Tonga, 21%; Bemba, 13%; and Nyanja, 11%); the Turkish group was 99% Turkish, while in the Dutch group 93% were Dutch, 2% Moroccan, and 2% Turkish. Primary schooling in Turkey has five grades; pupils from the fifth grade of primary school and the first three grades of secondary school were involved.

Secondary education is markedly different in the three countries. In Zambia a nationwide examination (with tests for reasoning and school achievement) at the end of the last grade of primary school, Grade 7, is utilized to select pupils for secondary school. After the seventh grade fewer than 20% of the pupils continue their education in either public or private secondary schools. Admittance to public schools is conditional on the score on the Grade 7 Examination. Cutoff scores vary per region and depend on the number of places available in secondary schools. In urban areas there are some private schools; admittance to these schools usually does not depend on examination results, but is mainly dependent on availability of places as well as the ability and willingness of parents to pay school fees. Participants from both public and private schools were included in our study. The tremendous dropout at the end of Grade 7 undoubtedly adversely affects the generalizability of the data to the Zambian population at large, and it also jeopardizes the comparability of the age cohorts, both within Zambia and across the three countries. In Turkey and the Netherlands secondary schooling is more or less intellectually streamed. An attempt was made to retain the intellectual heterogeneity of the primary school group at secondary school level by selecting various types of schools. The intellectual heterogeneity of the samples is clearly larger in Turkey and the Netherlands than in Zambia; yet, none of the samples may be fully representative of the age groups of their respective countries.

Insert Table 1 about here

Sample sizes are presented in Table 1; of the participants recruited, 56% came from urban and 44% from rural schools; 46% were female and 54% male.

Instruments

The battery consisted of eight tasks, four with figures and four with letters as stimuli. Each of these two stimulus modes had the same composition: a task of inductive reasoning and three tasks of skill components that are assumed to constitute important aspects of inductive reasoning. The first is rule classification, called encoding in Sternberg's (1977) model of analogical reasoning. The second is rule generating, a combination of inference and mapping. The third is rule testing, a combination of comparing and justification. All tasks are based on item-generating rules, schematically presented in Appendix A. All figure tasks are based on the following three item-generating rules: (a) The same number of figure elements is added to subsequent figures in a period (periods consist of either circles or squares, but never of both. A period defines the number of figures that belong together.
Examples of items of all tasks, in which the item-generating rules are illustrated, can be found in Appendix B). (b) The same number of elements is subtracted from subsequent figures in a period. (c) The same number of elements is alternately added to and subtracted from subsequent figures in a period.

The three item-generating rules are an example of a facet, a generic term for all item features that are systematically varied across items. Two more facets applied to all figure tasks. First, the number of figures in a period varies from two to four. Second, the number of elements that are added to or subtracted from successive figures of a period varied from one to three. Whenever possible, all facet levels were crossed. However, for some combinations of facet levels no item could be generated. For example, as each figure can have (in addition to a circle or a square that are present in all items) only five elements (namely a hat, arrow, dot, line, or bow), it is impossible to construct an item with two or three elements added to each of four figures in a period. (This crossing of facet levels and its feasibility constraints are illustrated in the sketch following the task descriptions below.)

Inductive Reasoning Figures is a task of 30 items. Each item has five rows of 12 figures, the first eight of which are identical. One of the rows has been composed according to a rule while in the other rows the rule has not been applied consistently. The pupil has to mark the correct row. Besides the common facets, two additional facets were used to generate the items of Inductive Reasoning Figures. First, the figure elements added or subtracted are either the same or different across periods. In the example of Appendix B there is a constant variation because in each period a dot is added first, followed by a dash and a hat. Second, periods do or do not repeat one another, meaning that the first figures of each period are identical (except for a possible swap of circle and square).

Each of the 36 items of Rule Classification Figures consists of eight figures. Below these figures the three item-generating rules were printed. In addition, the alternative "None of the rules applies" has been added. The pupil had to indicate which of the four alternatives applies to the eight figures above. In addition to the common facets, the task has three additional facets. The first two are the same as in Inductive Reasoning Figures. The third refers to the presence or absence of periodicity cues. These cues refer to the presence of both circles and squares in an item (as illustrated in the first item of Appendix B) or the presence of either squares or circles (if all circles of the example were changed into squares, no periodicity cues would be present). Whereas in Inductive Reasoning Figures the number of different elements of a figure could be either one, two, or three, Rule Classification Figures has another level of this facet, referring to a variable number of elements. For example, in the first period one element is added to subsequent figures and in the second period two elements.

Each of the 36 items of Rule Generating Figures consists of a set of six figures under which three lines with the numbers 1 to 6 are printed. In each item one, two, or three triplets (i.e., groups of three figures) have been composed according to one of the item-generating rules. Any of the six figures of an item can be part of one, two, or three triplets. Pupils were asked to indicate all triplets that constitute valid periods of figures. No information about the number of valid triplets in each particular item was given. The total number of triplets was 63; in the data analysis these were treated as separate, dichotomously scored items.
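As flagged above, the crossing of the common figure facets and the filtering of infeasible combinations can be sketched as follows. This is an added Python illustration; the feasibility rule is a simplifying assumption about the five-element limit, not the exact constraint used in test construction.

```python
from itertools import product

# Common facets of the figure tasks, with the levels described in the text.
rules = ["add", "subtract", "alternate"]   # item-generating rules (a)-(c)
period_lengths = [2, 3, 4]                 # number of figures in a period
elements_varied = [1, 2, 3]                # elements added/subtracted per step

MAX_EXTRA = 5  # each figure carries at most five extra elements (hat, arrow, dot, line, bow)

def feasible(rule, period, k):
    # Monotone addition (or subtraction) accumulates k elements at each of the
    # period - 1 steps, so the fullest figure needs k * (period - 1) extra
    # elements; alternation never drifts more than one step away.
    # This bound is an assumption for illustration only.
    if rule in ("add", "subtract"):
        return k * (period - 1) <= MAX_EXTRA
    return k <= MAX_EXTRA

cells = [c for c in product(rules, period_lengths, elements_varied) if feasible(*c)]
print(f"{len(cells)} of {3 * 3 * 3} facet combinations can yield an item")
```

Under this assumed bound, the combination excluded in the text (two or three elements added to each of four figures in a period) is indeed filtered out.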
Two facets, in addition to the common ones, were included. First, periodicity cues are either present or absent; the facet has the same meaning as in Rule Classification Figures. Second, the number of valid triplets is one, two, or three.

A verbal specification is given at the top of each item of Rule Testing Figures. In this specification three characteristics of the item are given, namely the periodicity, the item-generating rule, and the number of elements varied between subsequent figures of a period (e.g., "There are 4 figures in a group. 1 thing is subtracted from figures which come after each other in a group"). Below this specification four rows of eight figures have been drawn. One of the rows of eight figures has been composed completely according to the specification. In some items none of the four rows has been composed according to the specification. In this case a fifth response alternative, "None of the rows has been composed according to the specification," applies. This facet is labeled "None/one of the rules applies." The pupil has to mark the correct answer. The facets and facet levels of Rule Testing Figures and Rule Classification Figures were identical. In addition, the facet "Rows (do not) repeat each other" is included in the former task. In some items the rows are fairly similar to each other (except for minor variations that were essential for the solution), while in other items each row has a completely different set of eight figures.

The letter tasks were based on five item-generating rules: (a) Each group of letters has the same number of vowels. The vowels used in the task are A, E, I, O, and U. As the status of the letter Y can easily create confusion in English and Dutch, where it can be both a consonant and a vowel, the letter was never used in connection with the first item-generating rule; (b) Each group of letters has an equal number of identical letters that are the same across groups (e.g., BBBB BBBB); (c) Each group of letters has an equal number of identical letters that are not the same across groups (e.g., GGGG LLLL); (d) Each group of letters has a number of letters that appear the same (i.e., 1, 2, 3, or 4) number of positions after each other in the alphabet. The letters A and B have a difference of one position, the letters A and C a difference of two positions, etcetera; (e) Each group of letters has a number of letters that appear the same (i.e., 1, 2, 3, or 4) number of positions before each other in the alphabet.

A second facet refers to the number of letters to which the rule applies. The number could vary from 1 to 6. All items of the letter tasks are based on a combination of the two facets described (i.e., item rule and number of letters). As in the figure tasks, not all combinations of the facets are possible; for example, applications of the fourth and fifth rule assume an item rule that is based on at least two letters in a group.

Inductive Reasoning Letters bears resemblance to the Letter Sets Test in the ETS Kit of Factor-Referenced Tests (Ekstrom, French, & Harman, 1976). Each of the 45 items consists of five groups of six letters. Four out of these five groups are based on the same combination of the two facets (e.g., they all have two vowels). The pupil has to mark the odd one out.
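To make the two letter facets concrete, the following minimal Python sketch generates an Inductive Reasoning Letters style item for the first item-generating rule (number of vowels). The helper names and the simple one-extra-vowel distractor strategy are illustrative assumptions, not the study's actual generation procedure.

```python
import random
import string

VOWELS = "AEIOU"
CONSONANTS = [c for c in string.ascii_uppercase if c not in VOWELS and c != "Y"]

def group_with_vowels(n_vowels, size=6):
    # One group of letters containing exactly n_vowels vowels (facet levels:
    # item rule = "same number of vowels", number of letters = n_vowels).
    letters = random.sample(VOWELS, n_vowels) + random.sample(CONSONANTS, size - n_vowels)
    random.shuffle(letters)
    return "".join(letters)

def letter_sets_item(n_vowels=2, n_groups=5):
    # Four groups follow the facet combination; one distractor violates it
    # (here, hypothetically, by carrying one vowel too many).
    groups = [group_with_vowels(n_vowels) for _ in range(n_groups - 1)]
    odd_position = random.randrange(n_groups)
    groups.insert(odd_position, group_with_vowels(n_vowels + 1))
    return groups, odd_position

groups, key = letter_sets_item()
print(" ".join(groups), "| odd one out:", groups[key])
```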
Each of the 36 items of Rule Classification Letters consists of three groups of six letters. Most items have been constructed according to some combination of the two facets, while in some items the combination has not been used consistently. The five item-generating rules are printed under the groups of letters. The pupil has to indicate which item-generating rule underlies the item. If the rule has not been applied consistently, the sixth response alternative, "None of the rules applies," is the correct answer. As in Rule Classification Figures, the facet "None/one of the rules applies" is included.

Each of the 30 items of Rule Generating Letters consists of a set of 6 groups of letters (each made up of 1 to 9 letters). In each item up to five triplets of (subsets of) the groups of letters have been composed according to a combination of the two facets. The pupil is asked to indicate as many triplets as possible. Again, the pupils were not informed about the exact number of triplets in each item. The total number of triplets is 90, which are treated as separate items in the data analysis. As in Rule Generating Figures, a facet about the number of valid triplets (ranging here from 1 to 5) applies to the items, in addition to the common facets.

Rule Testing Letters consists of 36 items. Each item starts with a verbal specification. In the specification two characteristics of the item are given, namely the item-generating rule and the number of letters pertaining to the rule (e.g., "In each group of letters there are 3 vowels"). Below this specification four rows of three groups of six letters are printed. In most items one of the rows has been composed according to the specification. The pupil has to find this row. In some items none of the four rows has been composed according to the specification. In this case a fifth response alternative, "None of the rows has been composed according to the specification above the item," applies. The pupil has to indicate which of the five alternatives applies. The task had three facets, besides the common ones. As in Rule Classification Letters, there is a facet indicating whether or not one of the alternatives follows the rule. Also, as in Rule Testing Figures, there is a facet indicating whether or not rows repeat each other.

The Turkish and Roman alphabets are not entirely identical. The (Roman) letters Q, W, and X were not used here since they are uncommon in Turkish. The presence of specifically Turkish letters, such as Ç, Ö, and Ü, necessitated the introduction of small changes in the stimulus material (e.g., the sequence ABCD in the Zambian and Dutch stimulus materials was changed into ABCÇ in Turkish).

Administration

The tasks were administered without time limit to all pupils of a class; however, in the rural areas in Zambia the number of desks available was often insufficient for all pupils to work simultaneously, as each pupil had to have his or her own test booklet and answer sheet. The experimenter then randomly selected a number of participants. The tasks were administered by local testers. The numbers of testers in Zambia, Turkey, and the Netherlands were two, three, and two (the author being one of them), respectively. Five were psychologists and three were experienced psychological assistants. All testers followed a one-day training in the administration of the tasks. In Zambia English was used in the administration.
A supplementary sheet in Nyanja, the main language of the Lusaka region, that explained the item-generating rules was included in the test booklet. Turkish was the testing language in the Turkish group and Dutch in the Dutch group.

The administration of all tasks to all pupils would presumably have taken three school days. In order to avoid loss of motivation and test fatigue, two experimental conditions were introduced: the figure and the letter condition. The two tasks of inductive reasoning were always administered; in the figure condition the rule classification, rule generating, and rule testing tasks with figures were also included, while the three additional letter tasks were administered in the letter condition; sample sizes for each condition are given in Table 1. So, all pupils received five tasks: two tasks of inductive reasoning and three tasks of skill components (either the three figure or the three letter skill component tasks). The administration of the five tasks took place on two consecutive school days. The order of administration of the tasks was random, with the constraint that the two tasks of inductive reasoning were given on one day (either the first or the second testing day) and the remaining tasks on the other.

The administration of each of the eight instruments started with a one-page description of the task, which was read aloud by the experimenter to the pupils; the item-generating rules of the stimulus mode were specified. This instruction was included in the pupils' test booklets. Examples were then presented of each of the item-generating rules; explicit reference was made to which rule applied. Finally, the pupils were asked to answer a number of exercises that, again, covered all item-generating rules. After this instruction, the pupils were asked to answer the actual items. In each figure task the serial position of each figure was printed on top of the item in order to minimize the computational load of the task. The alphabet was printed at the top of each page of the letter tasks, with the vowels underlined. It was indicated to the pupils that they were allowed to look back at the instructions and examples (e.g., to consult the item-generating rules). Experience showed that this was infrequently done, probably because all tasks of a single stimulus mode utilized the same rules.

Results

The section begins with a description of preliminary analyses, followed by the main analyses. Per analysis, the hypothesis, statistical procedure, and findings are reported.

Preliminary Analyses

The internal consistencies of the instruments (Cronbach's alpha) were computed per culture and grade. Inductive Reasoning Figures showed an average of .86 (range: .79-.93), Rule Classification Figures .83 (.71-.90), Rule Generating Figures .89 (.84-.95), Rule Testing Figures .85 (.81-.89), Inductive Reasoning Letters .79 (.69-.88), Rule Classification Letters .83 (.73-.90), Rule Generating Letters .93 (.90-.95), and Rule Testing Letters .78 (.63-.85). Overall, the internal consistencies yielded adequate values.
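For reference, Cronbach's alpha for a person-by-item score matrix can be computed as in the following sketch; this is an added illustration, and the original analyses presumably relied on the standard statistical software of the time.

```python
import numpy as np

def cronbach_alpha(scores):
    # scores: (n_persons x n_items) array of dichotomously scored responses
    # for one culture-by-grade subsample.
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)
```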
Country differences in internal consistencies were examined in a procedure described by Hakstian and Whalen (1976); data of all grades were combined. The M statistic, which follows a chi-square distribution with two degrees of freedom, was significant for Inductive Reasoning Figures (M = 64.92, p < .001), Rule Classification Figures (M = 10.57, p < .01), Rule Generating Figures (M = 34.06, p < .001), Rule Classification Letters (M = 11.57, p < .01), and Rule Testing Letters (M = 12.40, p < .01). The Dutch group tended to have lower internal consistencies (a possible explanation is given later).

Insert Tables 2 and 3 about here

The average proportions of correctly solved items per country, grade, and task are given in Table 2. Differences in average scores were tested in a multivariate analysis of variance with country (3 levels: Zambia, Turkey, and the Netherlands), grade (4 levels: 5 through 8), and gender (2 levels). Separate analyses were carried out for the letter and figure mode. It may be noted that the present analysis is presented merely for exploratory purposes, to give insight into the relative contribution of each factor to the overall score variation; conclusions about country differences in inductive reasoning or its components are premature until full score equivalence of scores across countries has been shown. Table 3 gives the estimated effect sizes (proportion of variance accounted for). The results were essentially similar for the two modes. Country was highly significant (p < .001) in all tasks, usually explaining more than 10%. Zambian pupils tended to show the lowest scores and Dutch pupils the highest scores. Grade differences were as expected; as can be confirmed in Table 2, scores increased with grade. The effect sizes were substantial, usually larger than 10%, and highly significant for all tasks (p < .001). Gender differences were small; significant differences were found for Rule Testing Figures and Inductive Reasoning Letters (girls scored higher on both tasks), but gender differences did not explain more than 1% on any task. The country by grade interaction was significant in all analyses, explaining between 1 and 5%. As can be seen in Table 2, score increases with grade tended to be smaller in the Netherlands than in the other two countries. Country differences in scores were large in all grades but tended to become smaller with age. These results are in line with a meta-analysis (Van de Vijver, 1997) in which, in the age range examined here, there was no increase of cross-cultural score differences with age (contrary to what would be predicted from Jensen's, 1977, cumulative deficit hypothesis). Other interactions were usually smaller and often not significant.

Structural Equivalence in Internal Procedure

Hypothesis. The first hypothesis addresses equivalence in internal procedures by examining the decomposition of the item difficulties. The hypothesis states that facet levels provide an adequate decomposition of the item difficulties of each task in each country (Hypothesis 1a). See Table 4 for an overview of the hypotheses and their tests.

Statistical procedure. Structural equivalence of all tests is examined using the linear logistic model (LLM; Fischer, 1974, 1995). It is an extension of the Rasch model, which is one of the frequently employed models in item response theory. The Rasch model holds that the probability that a subject k (k = 1, ..., K) responds correctly to item i is given by

exp(θ_k − β_i) / [1 + exp(θ_k − β_i)],  (1)

in which θ_k represents the person's ability and β_i the item difficulty.
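Equation 1 translates directly into code; the following one-line illustration is added here and is not part of the original text (the logistic form is algebraically identical to Equation 1).

```python
import numpy as np

def rasch_probability(theta, beta):
    # Equation 1: probability of a correct response given ability theta
    # and item difficulty beta.
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

print(rasch_probability(0.0, 0.0))  # ability equal to difficulty -> 0.5
```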
An item is represented by only one parameter, namely its difficulty (unlike some other models in item response theory in which each item also has a discrimination parameter, sometimes in addition to a pseudo-guessing parameter). A sufficient statistic for estimating a person's ability is the total number of correctly solved items on the task. Analogously, the number of correct responses to an item provides a sufficient statistic for estimating the item difficulty. For our present purposes the main interest is in item parameters. The LLM imposes a constraint on the item parameter by specifying that the item difficulty is the sum of an intercept c (that is irrelevant here) and a sum of underlying facet level difficulties, η_j:

β_i = c + Σ_j q_ij η_j.  (2)

The second step aims at estimating the facet level difficulties (η). Suppose that the item is "BBBBNM BBBBKJ BBBBHJ BBFTHG BBBBHN". In terms of the facets, the item can be classified as involving (a) four letters (facet: number of letters) and (b) equal letters within and across groups of letters (facet: item rule). The above model equation (2) specifies that the item parameter will be the sum of an intercept, two facet level difficulties (namely the difficulty parameter of items dealing with four letters and the difficulty parameter of items dealing with equal letters within and across groups of letters), and a residual component. The matrix Q (with elements q_ij) defines the independent variables; the matrix has m rows (the number of items of the task) and n columns (the number of independent facet levels of the task). Entries of the Q matrix are zero or one, depending on whether the facet level is absent or present in the item (interactions of facets were not examined). In order to guarantee uniqueness of the parameter estimates in the LLM, linear dependencies in the design matrix were removed by leaving the first level of each facet out of the design matrix. This (arbitrary) choice implied that the first level of each facet has a difficulty level of zero and that the size and significance of other facet levels should be interpreted relative to this "anchor." The sufficient statistic for estimating the basic parameters is the number of correct answers to the items that make up the facet level. As a consequence, there will be a perfect rank order between this number of correct answers and η_j. Various procedures have been developed to estimate the basic parameters. In the present study conditional maximum likelihood estimation was used (details of this computationally rather involved procedure are given by Fischer, 1974, 1995). An important property of the LLM is the sample independence of its parameters; estimates of the item difficulty and the basic parameters are not influenced by the overall ability level of the pupils. This property is attractive here because it allows for their estimation even when average scores of cultural groups differ.

The LLM is a two-step procedure; the first step consists of a Rasch analysis, in which estimates of the item (β) and person (θ) parameters of equation (1) are obtained. In the second step the parameters of equation (2) are estimated; the item parameters are used in the evaluation of the fit of the model. The fit of an LLM can be evaluated in various ways. First, a likelihood ratio test can be computed, comparing the likelihood of the (unrestricted) Rasch model to the (restricted) LLM. The statistic is of limited value here, as the ratio is affected by guessing (Van de Vijver, 1986).
Because all tasks employed a multiple-choice format, it is unrealistic to assume that a Rasch model would hold. The usage of an LLM may seem questionable here because of the occurrence of guessing (pupils were instructed to answer all items). However, Van de Vijver (1986) has shown that guessing gives rise to a reduction of the variance of the estimated person and item parameters, but correlations of both estimated parameters with their true values are hardly affected. A useful heuristic to evaluate the degree of fit of the LLM is provided by the correlation between the Rasch parameters of the first step of the analysis and the item parameters reconstructed from the design matrix in the second step. It amounts to correlating the item parameters of the first step (β; the "unfaceted item difficulties") with the item parameters of the second step, computed as β_i* = Σ_j q_ij η_j (the "faceted item difficulties"). The latter vector gives the item parameters estimated on the basis of the estimated facet level difficulties. Higher correlations point to a better approximation of item level difficulties by facet level difficulties and, hence, to a better modelability of inductive reasoning. Every task has its own design matrix, consisting both of facets that were common to all tasks of a mode (e.g., the item-generating rules) and task-specific facets (e.g., the number of correct answers in the rule generating tasks). The analyses were carried out per country and grade. The item difficulties (with different values per country and grade) and the Q matrix (invariant across grades and countries for a specific task) were input to the analyses. This procedure was repeated for each task, making a total of 8 (tasks) × 4 (grades) × 3 (countries) = 96 analyses.

The LLM is applied here as one of two tests of structural equivalence. This type of equivalence addresses the relationship between measurement outcomes and the underlying construct. The facets of the tasks are assumed to influence the difficulty of the items. For example, it can be expected that rules in items of the letter tasks are easier when they involve more letters. The analysis of structural equivalence examines whether the facets exert an influence on item difficulty in each culture. In more operational terms, structural equivalence is supported if the correlation of each analysis is significantly larger than zero; a significant correlation indicates that the facet levels contribute to the prediction of item difficulties.

Insert Table 5 about here

Hypothesis test. As can be seen in Table 5, the correlations between the unfaceted Rasch item parameters (of equation 1) and the faceted item parameters (of equation 2) were high for all tasks in each grade in each country. These high correlations provide powerful evidence that the same facets influence item difficulty in each country. It can be concluded that Hypothesis 1a, according to which the item difficulty decomposition would be adequate in each country, was strongly supported.
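The fit heuristic just applied (correlating unfaceted with faceted item difficulties) can be illustrated with the following sketch, in which ordinary least squares stands in for the conditional maximum likelihood estimation actually used, and the data are simulated rather than the study's.

```python
import numpy as np

def llm_fit_correlation(beta, Q):
    # Regress the step-1 Rasch item difficulties (beta) on the facet design
    # matrix Q, then correlate the "unfaceted" difficulties with the
    # reconstructed "faceted" difficulties (Equation 2).
    X = np.column_stack([np.ones(len(beta)), Q])  # intercept + facet levels
    coef, *_ = np.linalg.lstsq(X, beta, rcond=None)
    beta_star = X @ coef
    return np.corrcoef(beta, beta_star)[0, 1]

# Toy data: 8 items, 3 facet-level indicator columns.
rng = np.random.default_rng(0)
Q = rng.integers(0, 2, size=(8, 3))
eta_true = np.array([0.5, 1.2, -0.8])       # illustrative facet level difficulties
beta = Q @ eta_true + rng.normal(0, 0.1, 8)  # noisy item difficulties
print(round(llm_fit_correlation(beta, Q), 3))
```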
Insert Table 6 about here

The second question involves the patterning of the correlations of Table 5. This question was addressed in an analysis of variance, with country (3 levels: Zambia, Turkey, and the Netherlands), stimulus mode (2 levels: figure and letter tasks), and type of skill (4 levels: inductive reasoning and each of the three skill component tasks) as independent variables; the correlation was the dependent variable. The four grades were treated as independent replications. As can be seen in Table 6, all main effects and first order interactions were significant. A significantly lower correlation (and hence a poorer fit of the data to the model) was found for figure tasks than for letter tasks (averages of .87 and .91, respectively), F(1, 72) = 24.85, p < .001. The effect was considerable, explaining 22% of the total score variation. About the same percentage was explained by skill components, F(3, 72) = 37.94, p < .001. The lowest correlation (.87) was obtained for rule classification and rule generating, followed by inductive reasoning (.89), while rule testing showed the highest value (.93). The high values of the latter may be due to a combination of a large number of items (long tests) and the presence of both very easy and very difficult facet levels in the rule testing tasks in each country; such facet levels increase the dispersion and will give rise to high correlations. Country differences explained about 10% of the score variation; the correlations of the Turkish and Zambian groups were very close to each other (.91 and .90, respectively), while the value for the Dutch group was .87. A closer inspection of the data revealed that the largest differences between the countries were found for tasks with relatively high proportions of correctly solved items. Therefore, the difference in fit may be due to ceiling effects in the Dutch group, which by definition reduce the correlation. The most important interaction, explaining 16% of the total score variation, was observed between country and stimulus mode, F(2, 72) = 19.92, p < .001. Whereas the correlations did not differ more than .03 between the two stimulus modes in Zambia and Turkey, the difference in the Dutch sample was .09. The interaction of stimulus mode and skill component was also significant, F(3, 72) = 9.61, p < .001. Correlations of rule generating and rule testing were on average .03 higher in the letter mode than in the figure mode, while a much larger difference of .08 was observed for rule classification. The interaction of country and skill was significant though less important (explaining only 5%). The country differences in correlations were relatively small for rule generating and rule testing and much larger for inductive reasoning and rule classification, mainly due to the relatively low values of the Dutch group. Again, ceiling effects may have induced the effect (not necessarily at the largest attainable score, because some facet levels remained beyond the reach of many pupils, even the highest scorers). In sum, the analysis of the correlations revealed high values for all tasks in the three countries. The observed country differences were presumably due more to ceiling effects than to country differences in modelability of inductive reasoning and its components. Ceiling effects may also explain the lower internal consistencies in the Dutch data, discussed before.

Insert Figures 1 and 2 about here

The estimated facet level difficulties (the estimated values of η; cf. equation 2) of all tasks have been drawn in Figure 1 (figure tasks) and Figure 2 (letter tasks).
Higher values of η refer to more difficult facet levels. The most striking finding of both Figures is the proximity of the three country curves; this points to the cross-cultural similarity in the pattern of difficult and easy facet levels, which yields further evidence for the structural equivalence of the instruments in the present samples. Furthermore, most facet levels behaved as expected. As for the figure tasks, the third item-generating rule (about alternating additions and subtractions) was invariably the most difficult. Items were more difficult when they dealt with shorter periods, when a variable number of elements was added or subtracted in subsequent figures of a period, when periodicity cues were absent, and when periods did not repeat each other. The number of valid triplets (only present in the rule-generating task) showed large variation. Pupils found it relatively easy to retrieve one correct solution, but relatively difficult to find all solutions when the item contained two or three valid triplets. The difficulty patterning of the letter tasks also followed expectation. Dealing with equal letters was easier than dealing with positions in the alphabet. Items about equal letters within and across groups (e.g., BBBBBB BBBBBB) were easier than items about letters that were equal within and unequal across groups (e.g., BBBBBB GGGGGG). Items were easier when the underlying rule involved more letters (which facilitates recognition). Items about positions in the alphabet (the last two item-generating rules of the letter mode) were easier when they involved smaller jumps (e.g., ABCD was easier to recognize as a group of letters in which the position of letters in the alphabet is important than ACEG, which in turn was easier to recognize than ADGJ). As in the generating task of the figure mode, a strong effect of the number of valid triplets was found. Finding all solutions turned out to be difficult, and valid triplets were often overlooked.

Measurement Unit Equivalence in Internal Procedure

Hypothesis. For each task the same facet level difficulties apply in each country (Hypothesis 1b; cf. Table 4).

Statistical procedure. The LLM parameters can also be used to test measurement unit equivalence. This type of equivalence goes beyond structural equivalence by assuming that the tasks as applied in the three countries have the same measurement units (but not necessarily the same scale origins). If the estimated parameters of equation 2 are invariant across countries except for random fluctuations, there is strong evidence for the invariance of the measurement units of the test scales. This invariance would imply that the estimated facet level difficulties in a particular country could be replaced by the difficulty of the same facet in another country without affecting the fit of the model. For these analyses the data for the grades in a country were combined because of the primary interest in country differences.

Hypothesis test. Standard errors of the estimated facet level difficulties ranged from 0.05 to 0.10. As can be derived from Figures 1 and 2, in each task there are facet levels that differ significantly across countries. It can be safely concluded that scores did not show complete measurement unit equivalence. Yet, it is also clear from these Figures that some facet levels are not significantly different across countries. So, the question arises to what extent facet levels are identical across countries. The question was addressed using intraclass correlations, measuring the absolute agreement of the estimated facet level difficulties in the three countries. The absolute agreement of the estimated basic parameters of a single task across countries was evaluated; per task the intraclass correlation of the country by facet level matrix was computed. The letter tasks showed consistently higher values than the figure tasks. The average agreement coefficient was .91 for the figure tasks and .96 for the letter tasks (all intraclass correlations were significantly above zero, p < .001). The high within-task agreement points to an overall strong agreement of facet levels across countries. The estimated facet level difficulties come close to being interchangeable across countries (despite the significant differences of some facet levels).
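The agreement index used here, a single-measure intraclass correlation for absolute agreement, can be computed from the country-by-facet-level matrix of estimated difficulties as in the following sketch. This is an added illustration using the standard two-way random effects formula; the software used in the original analyses is not specified.

```python
import numpy as np

def icc_absolute_agreement(X):
    # X: (n facet levels x k countries) matrix of estimated facet level
    # difficulties. Two-way random effects decomposition; the index penalizes
    # both disagreement in pattern and systematic country offsets.
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    ms_rows = k * ((X.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # facet levels
    ms_cols = n * ((X.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # countries
    resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
```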
A recurrent theme in the analysis is the better modelability of the letter tasks as compared to the figure tasks, due to the wider range of facet level difficulties in the letter mode than in the figure mode. The range differences may be a consequence of the choices made in the test design stage. One of the problems of many existing figure tests is their often implicit definition of permitted stimulus transformations (e.g., rotating and flipping). This lack of clarity, presumably an important source of cross-cultural score differences, was avoided in the present study by spelling out all permitted transformations in the test instructions. Apparently, the price to be paid for providing the pupils with this information is a small variation in facet level difficulties.

Structural Equivalence in External Procedure

Hypothesis. The skill components contribute to inductive reasoning in each country (Hypothesis 2a; cf. Table 4).

Insert Figure 3 about here

Statistical procedure. External procedures to establish equivalence scrutinize the relationships between inductive reasoning and its componential skills. A specific type of structural equation model was used, namely a MIMIC model (Multiple Indicators, MultIple Causes; see Van Haaften & Van de Vijver, 1996, for another cross-cultural application). A MIMIC model links input and output variables through one latent variable (see Figure 3). The core of the model is the latent variable, labeled inductive reasoning. This variable, η, is measured by the two tasks of inductive reasoning (the output variables). The input to the inductive reasoning factor comes from the skill components; the components are said to influence the latent factor, and this influence is reflected in the two tasks of inductive reasoning. In sum, the MIMIC model states that inductive reasoning is measured by two tasks (IRF and IRL) and is influenced by three components (classification, rule generating, and rule testing). The model equations are as follows:

y_1 = λ_1 η + ε_1,
y_2 = λ_2 η + ε_2,  (3)

in which y_1 and y_2 denote observed scores on the two tasks of inductive reasoning, λ_1 and λ_2 the factor loadings, and ε_1 and ε_2 error components. The latent variable, η, is linked to the skill components in a linear regression function:

η = γ_1 x_1 + γ_2 x_2 + γ_3 x_3 + ζ,  (4)

where the gammas are the regression coefficients, the x-variables refer to the skill components, and ζ is the error component. In order to make the estimates identifiable, the factor loading of IRF, λ_1, was fixed at one. An attractive feature of structural equation modeling is its allowance for multigroup analysis.
This means that the adequacy of the above model equations for the data can be evaluated for all 12 data sets (4 grades × 3 countries) at once. The fit statistics yield an overall assessment, covering all data sets.

The theoretical model underlying the study stipulates that the three skill components constitute essential elements of inductive reasoning. In terms of the MIMIC analysis, this means that structural equivalence would be supported by a good fit of a model with three input and two output variables as described. Nested models were analyzed. In the first step all parameters were held fixed across data sets, while in subsequent steps similarity constraints were lifted in the following order (cf. Table 7): the error variances (unreliability) of the tasks of inductive reasoning, the intercorrelations of the skill component tasks, the error variance of the latent variable, the regression coefficients, and the factor loadings. The order was chosen in such a way that invariance of relationships involving the latent variable (i.e., regression coefficients and factor loadings) was retained as long as possible. More precisely, structural equivalence would be supported when the MIMIC model with the fewest equality constraints across countries shows a good fit and all MIMIC parameters differ from zero (Hypothesis 2a). It would mean that the tasks of inductive reasoning constitute a single factor that is influenced by the same skill components in each analysis (the possibility that there is a good fit but that some regression coefficients or factor loadings are negative is not further considered here because no covariances were negative).

Insert Table 7 about here

Hypothesis test. The relationship of the skill components and inductive reasoning tasks was examined in a MIMIC model (see Table 7; more details are given in Appendix C). Nested models were fitted to the data of both stimulus modes. The choice of a MIMIC model was mainly based on the relatively large change of all fit statistics when constraints were imposed on the phi matrices (the covariance matrices of the component skills; see the figure of Appendix C); therefore, the model with equal factor loadings, regression coefficients, and error variances was chosen. Although the letter tasks showed a better fit than the figure tasks, the choice of a model was less straightforward. A MIMIC model with a similar pattern of free and fixed parameters in both stimulus modes was chosen, mainly because of parsimony (see the footnote to Table 7 for a more elaborate explanation). The standardized solution of the two models is given in Figure 3. As hypothesized, all loadings and regression coefficients were positive and significant (p < .01). It can be concluded that inductive reasoning with figure and letter stimuli involves the same components in each country. This supports structural equivalence, as predicted in Hypothesis 2a. The regression coefficients of the figure component tasks were unequal to each other: rule classification was least important, followed by rule generating, while rule testing showed the largest contribution to inductive reasoning. The letter mode did not show this patterning; the regression coefficients of the component tasks of the letter mode were rather similar to one another.
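As an illustration of model equations (3) and (4), the following sketch simulates data with the MIMIC structure. All parameter values are arbitrary choices for illustration, not the estimates reported in Figure 3, and the simulation is added here rather than taken from the original analyses.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Correlated skill component scores x1..x3 (classification, rule generating,
# rule testing); the covariances are illustrative values.
x = rng.multivariate_normal(np.zeros(3),
                            [[1.0, 0.5, 0.5],
                             [0.5, 1.0, 0.5],
                             [0.5, 0.5, 1.0]], size=n)

gamma = np.array([0.3, 0.4, 0.5])         # Equation 4: regression coefficients
eta = x @ gamma + rng.normal(0, 0.5, n)   # latent inductive reasoning + error zeta

lam = np.array([1.0, 0.9])                # Equation 3: loading of IRF fixed at 1
y = eta[:, None] * lam + rng.normal(0, 0.4, (n, 2))  # observed IRF and IRL scores

# Sanity check: every input component should correlate with both output tasks,
# which is the signature of a single latent factor linking inputs and outputs.
print(np.corrcoef(np.column_stack([x, y]).T).round(2))
```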
Measurement Unit Equivalence in External Procedure

Hypothesis. The skill components contribute in the same way to inductive reasoning in each country (Hypothesis 2b; cf. Table 4).

Statistical procedure. Measurement unit equivalence can be scrutinized by introducing and testing equality constraints in the MIMIC model. This type of equivalence would be supported when a single MIMIC model with identical parameter values holds in all countries. It may be noted that this test is stricter than the ones proposed in the literature. Whereas the latter tend to analyze all tasks in a single exploratory or confirmatory factor analysis, more specific relationships between the tasks are considered here.

Hypothesis test. The psychologically most salient elements of the MIMIC model, the factor loadings, regression coefficients, and explained variance of the latent variable, were found to be invariant across countries. However, measurement unit equivalence also requires the other parameter matrices to be invariant. In the figure mode the model with equality constraints for all matrices showed a rather poor fit, with an NNFI of .88, a GFI of .96, and an RMSEA of .045. An inspection of the delta chi-square values indicated that in particular the introduction of equality of the covariances of the skill components (Φ) reduced the fit significantly. The letter tasks showed a similar picture; the most restrictive model revealed values of .89 for the NNFI, .82 for the GFI, and .041 for the RMSEA, which can be interpreted as a rather poor fit. Again, equality of the Φ matrices led to a significant reduction of the fit. As in our internal procedure to examine measurement unit equivalence, we found some but inconclusive evidence for the measurement unit equivalence of the task scores across countries; Hypothesis 2b had to be rejected.

Full Score Equivalence

Hypothesis. Both tasks of inductive reasoning show full score equivalence (Hypothesis 3; cf. Table 4).

Statistical procedure. Full score equivalence can be examined in an item bias analysis. A logistic regression model was applied to analyze item bias (Rogers & Swaminathan, 1993). Advantages of the model are the possibility to include more than two groups and to examine both uniform and nonuniform bias (Mellenbergh, 1982). The combined samples of the three countries were used to determine cutoff scores that split up the sample into three score level groups (low, medium, and high) of about the same size. In the logistic regression procedure, culture (dummy coded), score level, and their interaction are the independent variables, while the item response is the dependent variable. A significant main effect of culture points to uniform bias: individuals from at least one country show an unexpectedly low or high score on the item across all score levels, as compared to individuals with the same test score from other cultures. A significant interaction points to nonuniform bias: the systematic difference of the scores depends here on the score level; for example, country differences in scores among low scorers are not found among high scorers. Alpha was set at a (low) level of .001 in the item bias analyses in order to prevent inflation of Type I errors due to multiple testing (although, obviously, the power of the procedure is adversely affected by this choice).
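The procedure can be sketched for a single item as follows. The data frame layout and column names are assumptions for illustration; in practice, likelihood ratio tests comparing nested models (with and without the country terms) would typically accompany the coefficient tests in the summary.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per pupil; 'correct' (0/1 response to the item under scrutiny),
# 'country', and 'total' (test score) are hypothetical column names.
df = pd.read_csv("item_responses.csv")

# Three roughly equal-sized score level groups from the pooled sample.
df["level"] = pd.qcut(df["total"], q=3, labels=["low", "medium", "high"])

# Main effect of country -> uniform bias; country x level interaction ->
# nonuniform bias (the score level effect itself is not of interest).
model = smf.logit("correct ~ C(level) + C(country) + C(country):C(level)", data=df)
print(model.fit(disp=False).summary())
```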
Insert Figure 4 about here

Hypothesis test. In the introduction two approaches were mentioned to examine full score equivalence that are based on structural equation modeling: multilevel covariance structure analysis and the modeling of latent means. The former could not be used due to the small number of countries involved, while the latter was precluded by the incomplete support for measurement unit equivalence. Strictly speaking, this lack of support prohibits any analysis of full score equivalence. Yet, because the bias analysis yielded interesting results, it is reported here for exploratory purposes. Of the 30 items of the IRF, 15 items were biased (13 items uniformly, 11 items nonuniformly), mainly involving the Dutch-Zambian comparison. The occurrence of bias was related to the difficulty of the items; both the easiest and the most difficult items showed the most bias. The correlation between the presence of bias (0 = absent, 1 = present) and the deviance of the item score from the mean (i.e., average item score - overall average) was .64 (p < .001). The correlation suggests a methodological artifact, such as floor and ceiling effects. This was confirmed by an inspection of the contingency tables underlying the logistic regression analyses. Figure 4 depicts empirical item characteristic curves of two items that showed both uniform and nonuniform bias. The upper panel shows a relatively easy item (with an overall mean of .79) and the lower panel a relatively difficult item (mean of .33). The bias for the easy item is induced by country differences at the lowest score level that are not reproduced at the higher levels. Analogously, the scores for the difficult item remain close to the guessing level (of .20) in the two lowest score levels, while there is more score differentiation in the highest scoring group. The score patterns of Figure 4 were found for several items. It appears that ceiling and floor effects led to item bias in the IRF. Three items were found to be biased in the IRL (one uniformly and two nonuniformly). The Zambian pupils showed relatively high scores on these items. Because the items were few and involved different facet levels, the reasons for the bias were not understood, a fairly common finding in item bias research (cf. Van de Vijver & Leung, 1997b). Floor and ceiling effects did not occur, which points to an important difference between the two tasks of inductive reasoning: whereas on the IRF pupils tended to answer items either with a very low or a very high level of accuracy, pupil scores on the IRL varied more gradually. Moreover, in the IRF there were no facet levels that were either too difficult or too easy for most of the sample, but both types of facet levels were present in the IRL.
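The two diagnostics used in this exploratory analysis, empirical item characteristic curves and the relation between bias and item extremeness, can be sketched as follows (hypothetical data frame and variable names as before).

import pandas as pd
from scipy.stats import pointbiserialr

def empirical_icc(df):
    # average item score per country and score level: the points plotted in
    # empirical item characteristic curves such as those of Figure 4
    return (df.groupby(["country", "level"])["correct"]
              .mean()
              .unstack("level"))

# Relation between bias and extremeness of the items: a point-biserial
# correlation between the presence of bias (0/1, one entry per item) and the
# absolute deviation of the item mean from the overall mean (one reading of
# the deviance measure described above); "bias_flags" and "item_means" are
# hypothetical per-item arrays.
# r, p = pointbiserialr(bias_flags, abs(item_means - item_means.mean()))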
Discussion

The equivalence of two tasks of inductive reasoning was examined in a cross-cultural study involving 632 Dutch, 877 Turkish, and 704 Zambian pupils from the highest two grades of primary and the lowest two grades of secondary school. Two stimulus modes were examined: letters and figures. In each mode, tasks for inductive reasoning and for each of its components (classification, generation, and testing) were administered. The structural, measurement unit, and full score equivalence of the instruments in these countries were studied. A MIMIC model was fitted to the data, linking the skill components to inductive reasoning through a latent variable labeled inductive reasoning (external procedure). A linear logistic model was utilized to examine to what extent item difficulties in each country could be adequately decomposed into the difficulties of the underlying rules that were used to generate the items (internal procedure). In keeping with past research, structural equivalence was strongly supported; yet, measurement unit equivalence was not fully supported. It is interesting to note that two different statistical models, item response theory (the linear logistic model) and structural equation modeling (the MIMIC model), looking at different aspects of the data (facet level difficulties in the former, covariances of tasks with componential skills in the latter), yielded the same conclusion about measurement unit equivalence.

Critics might argue that the emphasis on equivalence in the present study is a misnomer and detracts attention from the real cross-cultural differences observed with these carefully constructed instruments. In this line of reasoning the results would show massive differences in inductive reasoning across countries, with Zambian pupils having the lowest skill level, Turkish pupils occupying an intermediate position, and Dutch pupils having the highest level. The validity of this conclusion would, in this view, be underscored by the LLM analyses, in which it was found that most facet level difficulties are identical and interchangeable across the three countries while a small number are country dependent. Even if score comparisons are restricted to the facet levels with the same difficulties, at least some of the score differences between the countries are likely to remain. In this line of reasoning the current study has demonstrated the presence of at least some, and presumably large, differences in inductive reasoning, with Western subjects showing the highest skill levels.

In my view this interpretation is based on a simplistic and untenable view of country score differences. These differences are not just a matter of differences in inductive reasoning skills. It may well be that differences in country scores on the tasks are partly or entirely due to additional factors. Various educational factors may play a role here, as is often the case in comparisons of highly dissimilar cultural groups. In a meta-analysis, Van de Vijver (1997) found that educational expenditure is a significant predictor of country differences in mental test performance. Does quality of schooling have an influence on inductive reasoning? I concur with Cole (1996), who, after reviewing the available cross-cultural evidence, concluded that schooling does not have a formative influence on higher-order forms of thinking but tends to broaden the domains in which these skills can be successfully applied. Schooling facilitates the use of these skills through training and through exposure to psychological and educational tests (cf. Rogoff, 1981; Serpell, 1993). The educational differences between the populations of the current study are massive. For example, attending kindergarten is more common in Turkey and the Netherlands than in Zambia, and schools in Zambia have a fraction of the learning material that schools in Turkey and the Netherlands have at their disposal. The interpretation of the country differences observed in the present study as reflecting real differences is based on an underestimation of the impact of various context-related (educational) factors and an overestimation of the ability of the tasks employed here to measure inductive reasoning in all countries. Tasks that capitalize less on schooling and are derived more from everyday experiences may show a different patterning of country differences.
The present results replicate the findings of many studies on structural equivalence; strong support was found that the instruments measure inductive reasoning in all three countries. The present results make it very unlikely that there are major cross-cultural differences in the strategies and processes involved in inductive reasoning in the populations studied. These results extend the findings of numerous factor analytic studies in showing that the skill components contribute in a largely identical way to inductive thinking and that item difficulty is governed by complexity rules that are largely identical across cultures. The results also show that comparisons of scores obtained in different countries are not warranted, despite the careful item construction process. This negative finding on numerical comparability may be due to the large cultural distance between the countries involved here. However, it also points to the need to address measurement unit and full score equivalence in cross-cultural research.

Cross-cultural comparisons of data that have not been scrutinized for equivalence abound in the literature. In fact, it is rather difficult to find examples of data for which the equivalence was examined in an appropriate way. Scores are often numerically compared across cultures (assuming full score equivalence) when only structural equivalence has been demonstrated; examples can be found in the cross-cultural comparisons of the Eysenck personality scales (e.g., Barrett et al., 1998). It is difficult to defend the practice of comparing scores across cultures when equivalence has not been tested or when only structural equivalence has been observed. The present study underscores the need to study the equivalence of data before comparing test scores. A more prudent treatment of cross-cultural score differences is badly needed. We have firmly established the commonality of basic cognitive functions in several cultural and ethnic groups (Waitz's "psychic unity"), but we still have to come to grips with the question of how to design cognitive tests that allow for numerical score comparisons across a wide cultural range.

A final issue concerns the external validity of the present findings: To what populations can the present results be generalized? The three countries involved in the study differ widely in affluence. Given the strong findings on structural equivalence, it is realistic to assume that inductive reasoning is a universal with largely identical components in schooled populations, at least as of the end of primary school. Future studies should address the question of whether measurement unit equivalence would be fully supported when the cultural distance between the countries is smaller.

References

Barrett, P. T., Petrides, K. V., Eysenck, S. B. G., & Eysenck, H. J. (1998). The Eysenck Personality Questionnaire: An examination of the factorial similarity of P, E, N, and L across 34 countries. Personality and Individual Differences, 25, 805-819.

Campbell, D. T., & Naroll, R. (1972). The mutual methodological relevance of anthropology and psychology. In F. L. K. Hsu (Ed.), Psychological anthropology. Cambridge, MA: Schenkman.

Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge: Cambridge University Press.

Claassen, N. C., & Cudeck, R. (1985).
Die faktorstruktuur van die Nuwe Suid-Afrikaanse Groeptoets (NSAG) by verskillende bevolkingsgroepe [The factor structure of the New South African Group Test (NSAGT) in various population groups]. South African Journal of Psychology, 15, 1-10.

Cole, M. (1996). Cultural psychology: A once and future discipline. Cambridge, MA: Harvard University Press.

Ekstrom, R. B., French, J. W., & Harman, H. H. (1976). Kit of factor-referenced tests. Princeton, NJ: Educational Testing Service.

Ellis, B. B. (1990). Assessing intelligence cross-nationally: A case for differential item functioning detection. Intelligence, 14, 61-78.

Ellis, B. B., Becker, P., & Kimmel, H. D. (1993). An item response theory evaluation of an English version of the Trier Personality Inventory (TPI). Journal of Cross-Cultural Psychology, 24, 133-148.

Embretson, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.

Fan, X., Willson, V. L., & Reynolds, C. R. (1995). Assessing the similarity of the factor structure of the K-ABC for African-American and White children. Journal of Psychoeducational Assessment, 13, 120-131.

Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests [Introduction to the theory of psychological tests]. Bern: Huber.

Fischer, G. H. (1995). The linear logistic test model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications. New York: Springer.

Frijda, N., & Jahoda, G. (1966). On the scope and methods of cross-cultural research. International Journal of Psychology, 1, 109-127.

Geary, D. C., & Whitworth, R. H. (1988). Is the factor structure of the WISC-R different for Anglo- and Mexican-American children? Journal of Psychoeducational Assessment, 6, 253-260.

Greenfield, P. M. (1997). You can't take it with you: Why ability assessments don't cross cultures. American Psychologist, 52, 1115-1124.

Gustafsson, J.-E. (1984). A unifying model for the structure of intellectual abilities. Intelligence, 8, 179-203.

Hakstian, A. R., & Vandenberg, S. G. (1979). The cross-cultural generalizability of a higher-order cognitive structure model. Intelligence, 3, 73-103.

Hakstian, A. R., & Whalen, T. E. (1976). A k-sample significance test for independent alpha coefficients. Psychometrika, 41, 219-231.

Hennessy, J. J., & Merrifield, P. R. (1976). A comparison of the factor structures of mental abilities in four ethnic groups. Journal of Educational Psychology, 68, 754-759.

Holland, P. W., & Wainer, H. (Eds.) (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.

Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differential validity of employment tests by race: A comprehensive review and analysis. Psychological Bulletin, 86, 721-735.

Irvine, S. H. (1969). Factor analysis of African abilities and attainments: Constructs across cultures. Psychological Bulletin, 71, 20-32.

Irvine, S. H. (1979). The place of factor analysis in cross-cultural methodology and its contribution to cognitive theory. In L. Eckensberger, W. Lonner, & Y. H. Poortinga (Eds.), Cross-cultural contributions to psychology. Lisse, the Netherlands: Swets & Zeitlinger.

Irvine, S. H., & Berry, J. W. (1988). The abilities of mankind: A revaluation. In S. H. Irvine & J. W. Berry (Eds.), Human abilities in cultural context. Cambridge: Cambridge University Press.

Jahoda, G., & Krewer, B. (1997). History of cross-cultural and cultural psychology. In J. W. Berry, Y. H. Poortinga, & J.
Pandey (Eds.), Handbook of cross-cultural psychology (2nd ed., Vol. 1). Boston: Allyn & Bacon.

Jensen, A. R. (1977). Cumulative deficit in intelligence of Blacks in the rural South. Developmental Psychology, 13, 184-191.

Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.

Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53-76.

McCrae, R. R., & Costa, P. T. (1997). Personality trait structure as a human universal. American Psychologist, 52, 509-516.

Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7, 105-118.

Messick, S. (1988). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). Hillsdale, NJ: Erlbaum.

Muthén, B. O. (1991). Multilevel factor analysis of class and student achievement components. Journal of Educational Measurement, 28, 338-354.

Muthén, B. O. (1994). Multilevel covariance structure analysis. Sociological Methods & Research, 22, 376-398.

Naglieri, J. A., & Jensen, A. R. (1987). Comparison of Black-White differences on the WISC-R and the K-ABC: Spearman's hypothesis. Intelligence, 11, 21-43.

Poortinga, Y. H., & Van de Vijver, F. J. R. (1987). Explaining cross-cultural differences: Bias analysis and beyond. Journal of Cross-Cultural Psychology, 18, 259-282.

Ree, M. J., & Carretta, T. R. (1995). Group differences in aptitude factor structure on the ASVAB. Educational and Psychological Measurement, 55, 268-277.

Reschly, D. (1978). WISC-R factor structures among Anglos, Blacks, Chicanos, and Native-American Papagos. Journal of Consulting and Clinical Psychology, 46, 417-422.

Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105-116.

Rogoff, B. (1981). Schooling and the development of cognitive skills. In H. C. Triandis & A. Heron (Eds.), Handbook of cross-cultural psychology: Vol. 4. Developmental psychology. Boston: Allyn & Bacon.

Sandoval, J. (1982). The WISC-R factorial validity for minority groups and Spearman's hypothesis. Journal of School Psychology, 20, 198-204.

Serpell, R. (1979). How specific are perceptual skills? British Journal of Psychology, 70, 365-380.

Serpell, R. (1993). The significance of schooling: Life journeys in an African society. Cambridge: Cambridge University Press.

Sternberg, R. J. (1977). Intelligence, information processing, and analogical reasoning: The componential analysis of human abilities. New York: Wiley.

Sternberg, R. J., & Kaufman, J. C. (1998). Human abilities. Annual Review of Psychology, 49, 479-502.

Sung, Y. H., & Dawis, R. V. (1981). Level and factor structure differences in selected abilities across race and sex groups. Journal of Applied Psychology, 66, 613-624.

Taylor, R. L., & Ziegler, E. W. (1987). Comparison of the first principal factor on the WISC-R across ethnic groups. Educational and Psychological Measurement, 47, 691-694.

Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs, No. 1.

United Nations (1999). Indicators on education [On-line]. Available Internet: www.un.org/depts/unsd/social/education.htm.

Valencia, R. R., & Rankin, R. J. (1986). Factor analysis of the K-ABC for groups of Anglo and Mexican American children. Journal of Educational Measurement, 23, 209-219.

Valencia, R.
R., Rankin, R. J., & Oakland, T. (1997). WISC-R factor structures among White, Mexican American, and African American children: A research note. Psychology in the Schools, 34, 11-16.

Van de Vijver, F. J. R. (1986). The robustness of Rasch estimates. Applied Psychological Measurement, 10, 45-57.

Van de Vijver, F. J. R. (1997). Meta-analysis of cross-cultural comparisons of cognitive test performance. Journal of Cross-Cultural Psychology, 28, 678-709.

Van de Vijver, F. J. R., & Leung, K. (1997a). Methods and data analysis of comparative research. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.), Handbook of cross-cultural psychology (2nd ed., Vol. 1). Boston: Allyn & Bacon.

Van de Vijver, F. J. R., & Leung, K. (1997b). Methods and data analysis for cross-cultural research. Newbury Park, CA: Sage.

Van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. New York: Springer.

Van Haaften, E. H., & Van de Vijver, F. J. R. (1996). Psychological consequences of environmental degradation. Journal of Health Psychology, 1, 411-429.

Willemsen, M. E., & Van de Vijver, F. J. R. (under review). Context effects in logical reasoning in the Netherlands and Zambia.

Zuckerman, M., Kuhlman, D. M., Thornquist, M., & Kiers, H. A. L. (1991). Five (or three) robust questionnaire scale factors of personality without culture. Personality and Individual Differences, 12, 929-941.

Table 1
Sample Size per Culture, Grade, and Experimental Condition

                               Grade(a)
Country       Test condition     5     6     7     8   Total
Zambia        Figure            80    79    94   123     376
              Letter            81    81    87    79     328
Turkey        Figure           127    97    95   102     421
              Letter           139   107   110   100     456
Netherlands   Figure           117    74    51    77     319
              Letter            83    91    77    62     313
Total                          627   529   514   543    2213

(a) In Zambia the grades are 6, 7, 8, and 9, respectively.

Table 2
Average Proportion of Correctly Solved Items per Task, Grade, and Culture

Country       Grade   IRF   RCF   RGF   RTF   IRL   RCL   RGL   RTL
Zambia          6     .40   .44   .53   .39   .49   .40   .37   .41
                7     .55   .53   .55   .43   .56   .60   .47   .56
                8     .56   .64   .55   .53   .58   .61   .44   .58
                9     .62   .68   .64   .54   .61   .54   .48   .58
Turkey          5     .47   .51   .44   .42   .50   .54   .39   .49
                6     .48   .57   .56   .46   .52   .53   .42   .47
                7     .66   .73   .64   .58   .64   .70   .56   .65
                8     .65   .75   .69   .63   .64   .71   .52   .60
Netherlands     5     .67   .80   .65   .64   .60   .63   .51   .58
                6     .74   .73   .72   .67   .64   .68   .57   .66
                7     .70   .74   .70   .66   .68   .72   .63   .67
                8     .78   .84   .76   .77   .70   .74   .60   .69

Note. IRF: Inductive Reasoning Figures; RCF: Rule Classification Figures; RGF: Rule Generating Figures; RTF: Rule Testing Figures; IRL: Inductive Reasoning Letters; RCL: Rule Classification Letters; RGL: Rule Generating Letters; RTL: Rule Testing Letters.

Table 3
Effect Sizes of Multivariate Analyses of Variance of the Psychological Tests per Test Mode

                                                Skill component
Independent    Multi-       Inductive   Rule             Rule         Rule
variable       variate(a)   reasoning   classification   generating   testing

(a) Figure mode
Country (C)    .135***      .132***     .183***          .139***      .200***
Grade (G)      .073***      .108***     .149***          .132***      .125***
Sex (S)        .011*        .001        .002             .001         .010**
CG             .035***      .013*       .063***          .041***      .014*
CS             .012**       .016***     .017***          .014***      .014***
GS             .009**       .004        .006             .004         .011**
CGS            .009*        .010        .007             .008         .004

(b) Letter mode
Country        .102***      .078***     .113***          .130***      .113***
Grade          .061***      .122***     .114***          .088***      .125***
Sex            .014**       .010**      .002             .000         .001
CG             .030***      .035***     .051***          .028***      .051***
CS             .014***      .014**      .002             .013**       .016***
GS             .005         .007        .000             .003         .001
CGS            .014***      .017**      .012             .018**       .009
Note. Significance levels of the effect sizes refer to the probability level of the corresponding F ratio of the independent variable(s). (a) Wilks' lambda. *p < .05. **p < .01. ***p < .001.

Table 4
Overview of the Hypothesis Tests and the Statistical Models Used

Internal procedure (focus on tests of inductive reasoning)
Question examined: Are facet level difficulties and item difficulties related?
Statistical model used: linear logistic model
Condition for structural equivalence: correlations significant in each country (Hypothesis 1a)
Condition for measurement unit equivalence: correlations significant and identical across countries (Hypothesis 1b)

Internal procedure (focus on tests of inductive reasoning)
Question examined: Is there item bias?
Statistical model used: logistic regression
Condition for full score equivalence: absence of item bias (Hypothesis 3)

External procedure (focus on the relationship of skill components and inductive reasoning)
Question examined: Are tests of skill components and inductive reasoning related?
Statistical model used: structural equation modeling
Condition for structural equivalence: MIMIC parameters significant in each country (Hypothesis 2a)
Condition for measurement unit equivalence: MIMIC parameters significant and identical across countries (Hypothesis 2b)

Note. MIMIC = Multiple Indicators, MultIple Causes.

Table 5
Accuracy of the Design Matrices per Task and per Country: Means (and Standard Deviations) of Correlations

                          Figures                            Letters
Skill                     Zam        Tur        Net          Zam        Tur        Net
Inductive reasoning       .90 (.03)  .90 (.02)  .81 (.03)    .90 (.02)  .92 (.01)  .90 (.02)
Rule classification       .84 (.04)  .87 (.01)  .76 (.02)    .92 (.01)  .93 (.02)  .88 (.03)
Rule generating           .88 (.01)  .86 (.03)  .81 (.02)    .87 (.01)  .89 (.02)  .92 (.02)
Rule testing              .95 (.01)  .93 (.01)  .91 (.03)    .94 (.01)  .95 (.01)  .94 (.01)

Note. Net = Netherlands. Tur = Turkey. Zam = Zambia.

Table 6
Analysis of Variance of Correlations with Country, Stimulus Mode, and Skill as Independent Variables

Source                 df    F           Variance explained
Country (C)             2    24.85***    .10
Stimulus mode (S)       1    79.18***    .22
Skill (Sk)              3    37.94***    .21
C x S                   2    19.92***    .16
C x Sk                  6     3.70**     .05
S x Sk                  3     9.61***    .10
C x S x Sk              6     1.95       .03
Within-cell error      72    (.0006)     .14

*p < .05. **p < .01. ***p < .001.

Table 7
Fit Indices for Nested Multiple Indicators Multiple Causes Models of Figure and Letter Tasks

                                           Contribution to χ²
                                           per country (%)
Invariant matrices      χ² (df)            Zam  Tur  Net    NNFI   GFI   RMSEA   Δχ² (df)

(a) Figure mode
Λy, Γ, Ψ, Φ, Θε         533.97*** (167)     21   32   47    .88    .96   .045
Λy, Γ, Ψ, Φ             437.28*** (145)     19   35   47    .89    .96   .043     96.69*** (22)
Λy, Γ, Ψ                180.52*** (79)      20   27   53    .93    .98   .034    256.76*** (66)
Λy, Γ                   134.01*** (46)      20   19   61    .91    .98   .042     46.51 (33)
Λy                       98.72*** (35)      20   18   62    .90    .99   .041     35.29*** (11)

(b) Letter mode
Λy, Γ, Ψ, Φ, Θε         473.33*** (167)     33   33   34    .89    .82   .041
Λy, Γ, Ψ, Φ             364.63*** (145)     28   30   41    .91    .87   .037    108.70*** (22)
Λy, Γ, Ψ                180.60*** (79)      35   26   39    .92    .90   .034    184.03*** (66)
Λy, Γ                    82.07** (46)       56   25   19    .95    .94   .027     98.53*** (33)
Λy                       61.61** (35)       54   24   23    .96    .94   .027     20.46* (11)

Note. The choice of a MIMIC model for the figure tests was mainly based on the relatively large change of all fit statistics when constraints were imposed on the Φ matrices; therefore, the model with equal factor loadings, regression coefficients, and error variances was chosen. The same pattern of free, fixed, and constrained parameters also showed an adequate fit for the letter tests. Releasing the constraints on the regression coefficients revealed a significant increase in fit. The question was addressed as to whether this decrease of the statistic was due to systematic country differences in the regression coefficients. An inspection of the regression coefficients per country did not show a clear patterning of country differences. The same question was also addressed in two further analyses; in the first, the regression coefficients were allowed to vary across countries but not across grades, while in the second, variation was allowed across grades but not across countries. It was found that equality of regression coefficients across the four grades of a country yielded a poorer fit than equality across the three countries per grade (first analysis: χ²(73, N = 1094) = 168.40, p < .001; GFI = .91, NNFI = .92, RMSEA = .035; second analysis: χ²(70, N = 1094) = 131.54, p < .001; GFI = .90, NNFI = .95, RMSEA = .029). The two analyses confirmed that the choice of a model with equal regression coefficients across countries in the letter mode does not lead to the elimination of relevant country differences. Net = the Netherlands; Tur = Turkey; Zam = Zambia; NNFI = Nonnormed Fit Index; GFI = Goodness of Fit Index; RMSEA = Root Mean Square Error of Approximation; Δχ² = decrease of the χ² value relative to the model in the preceding row. *p < .05. **p < .01. ***p < .001.
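The Δχ² values of Table 7 are ordinary chi-square difference tests between nested models. A minimal sketch; the numbers in the example are the first two figure-mode rows of Table 7:

from scipy.stats import chi2

def delta_chi2(chi2_constrained, df_constrained, chi2_free, df_free):
    # chi-square difference test for nested models; the model with more
    # equality constraints has the larger chi-square and degrees of freedom
    d = chi2_constrained - chi2_free
    ddf = df_constrained - df_free
    return d, ddf, chi2.sf(d, ddf)

# releasing the invariance of the error variances (Θε) across the data sets
print(delta_chi2(533.97, 167, 437.28, 145))   # (96.69, 22, p < .001)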
Figure Captions

Figure 1. Estimated facet level difficulties per test and country of the figure mode.

Note. The first level of each facet (see Appendix A), arbitrarily set to zero, is not presented in the figure. Net = Netherlands. Tur = Turkey. Zam = Zambia. R2: Item rule: 2; R3: Item rule: 3; P3: Number of figures per period: 3; P4: Number of figures per period: 4; D2: Number of different elements of subsequent figures: 2; D3: Number of different elements of subsequent figures: 3; DV: Number of different elements of subsequent figures: variable; V: Variation across periods: variable; C: Periodicity cues: absent; PR: Periods repeat each other: no; V2: Number of valid triplets: 2; V3: Number of valid triplets: 3; F: One of the alternatives follows the rule: no; NR: Rows repeat each other: no.

Figure 2. Estimated facet level difficulties per test and country of the letter mode.

Note. The first level of each facet (see Appendix A), arbitrarily set to zero, is not presented in the figure. Net = Netherlands. Tur = Turkey. Zam = Zambia. R2: Item rule: 2; R3: Item rule: 3; R4: Item rule: 4; R5: Item rule: 5; L2: Number of letters: 2; L3: Number of letters: 3; L4: Number of letters: 4; L5: Number of letters: 5; L6: Number of letters: 6; LV: Number of letters: variable; D2: Difference in positions in alphabet: 2; D3: Difference in positions in alphabet: 3; D4: Difference in positions in alphabet: 4; V2: Number of valid triplets: 2; V3: Number of valid triplets: 3; V4: Number of valid triplets: 4; V5: Number of valid triplets: 5; F: One of the alternatives follows the rule: no; NR: Rows repeat each other: no.

Figure 3. Multiple Indicators Multiple Causes model (standardized solution).
Figure 4. Examples of biased items: (a) easy item; (b) difficult item.

[Figures 1 and 2: line graphs of the estimated facet level difficulties per facet level, with panels for (a) Inductive Reasoning, (b) Rule Classification, (c) Rule Generating, and (d) Rule Testing in the figure and letter modes, respectively, and separate curves for Zambia, Turkey, and the Netherlands.]

[Figure 3: path diagram of the MIMIC models of the figure and letter modes, with standardized factor loadings and regression coefficients.]

[Figure 4: empirical item characteristic curves, i.e., average item scores of the low, medium, and high score levels, for (a) an easy and (b) a difficult item, with separate curves for Zambia, Turkey, and the Netherlands; the lower panel also marks the guessing level of .20.]

Appendix A: Test Facets

The example items of the figure tests (see Appendix B) were constructed from the following facets (levels in parentheses):

- Item rule (1, 2, 3)
- Number of figures per period (2, 3, 4)
- Number of different elements of subsequent figures (1, 2, 3, variable)
- Variation across periods (constant, variable)
- Periodicity cues (present, absent)
- Periods repeat each other (yes, no)
- Number of valid triplets (1, 2, 3)
- One of the alternatives follows the rule (yes, no)
- Rows repeat each other (yes, no)

[Table: for each figure test (IRF, RCF, RGF, RTF), an asterisk indicates that the corresponding facet level applies to the example item; a dash indicates that the facet is not present in the test.]

Note. See the text for an explanation of the facets. The description of the RGF refers to the first correct answer (1-3-5). IRF: Inductive Reasoning Figures; RCF: Rule Classification Figures; RGF: Rule Generating Figures; RTF: Rule Testing Figures.

The example items of the letter tests were constructed from the following facets:

- Item rule (1, 2, 3, 4, 5)
- Number of letters (1, 2, 3, 4, 5, 6, variable)
- Difference in positions in alphabet (1, 2, 3, 4)
- Number of valid triplets (1, 2, 3, 4, 5)
- One of the alternatives follows the rule (yes, no)
- Rows repeat each other (yes, no)

[Table: for each letter test (IRL, RCL, RGL, RTL), an asterisk indicates that the corresponding facet level applies to the example item; a dash indicates that the facet is not present in the test.]

Note. See the text for an explanation of the facets. The description of the RGL refers to the first correct answer (2-4-6). IRL: Inductive Reasoning Letters; RCL: Rule Classification Letters; RGL: Rule Generating Letters; RTL: Rule Testing Letters.
Appendix B: Examples of Test Items

(a) Inductive Reasoning Figures. The subject is asked to indicate which row of figures consistently follows one of the item-generating rules. [Item graphic: rows of figures numbered 1-12, with answer alternatives 1-5.] (Correct answer: 3)

(b) Rule Classification Figures. The subject is asked to indicate which rule applies to the eight figures. [Item graphic: eight figures.]
1. One or more things are added to figures which come after each other in a group.
2. One or more things are subtracted from figures which come after each other in a group.
3. In turn, one or more things are added to figures which come after each other in a group and then the same number of things is subtracted.
4. None of the rules applies.
(Correct answer: 3)

(c) Rule Generating Figures. The subject is asked to find one or more groups of three figures that follow one of the item-generating rules. [Item graphic: six figures, each with a response scale 1-6.] (Correct answers: 3-4-6 and 2-3-4)

(d) Rule Testing Figures. The subject is asked to indicate which row of figures follows the rule at the top of the item. The rule is: There are 4 figures in a group. 1 thing is ADDED to figures which come after each other in a group. [Item graphic: rows of figures, with answer alternatives 1-5, the fifth being "None of these".] (Correct answer: 4)

(e) Inductive Reasoning Letters. The subject is asked to indicate which group of letters does not follow the rule of the other four.
1. MLKJIH
2. GFEDCB
3. UTSRQP
4. ONMLKH
5. XWVUTS
(Correct answer: 4)

(f) Rule Classification Letters. The subject is asked to indicate which rule applies to the three groups of letters.
SRRRTZ   VVVWXZ   KKKCDF
1. Each group of letters has the same number of vowels.
2. Each group of letters has an equal number of identical letters and these letters are the same across groups.
3. Each group of letters has an equal number of identical letters and these letters are not the same across groups.
4. Each group of letters has a number of letters which appear the same number of positions after each other in the alphabet.
5. Each group of letters has a number of letters which appear the same number of positions before each other in the alphabet.
6. None of the rules applies.
(Correct answer: 3)

(g) Rule Generating Letters. The subject is asked to find one or more groups of three boxes of letters that follow one of the item-generating rules. [Item: six boxes of letters, each with a response scale 1-6.] (Correct answers: 2-4-6, 1-2-6, and 1-4-5)

(h) Rule Testing Letters. The subject is asked to indicate which row of boxes follows the rule at the top of the item. The rule is: In each box there are four vowels.
1. AOUVWI   SRZEIO   VGAOUI
2. AOUVWI   SAREIO   VGAOUI
3. BOUVWI   SAREOI   VGAOUQ
4. AOUVWJ   SARDDO   VGAPUQ
5. None of these
(Correct answer: 2)

Appendix C: Parameter Estimates of the MIMIC Model per Mode and Cultural Group

A more detailed description of the MIMIC analyses is given here. In order to simplify the presentation and to reduce the number of figures to be presented, the covariance matrices of the four grades were pooled per country prior to the analyses (as a consequence, the numbers in this Appendix and in Table 7 are not directly comparable). The table presents an overview of the estimated parameters (top) and fit (bottom).
Going from left to right in the table, equality constraints are increased, starting with the "core parameters" of the model, the factor loadings (Λy), followed by the regression coefficients (Γ), the error variance of the latent construct, labeled Inductive Reasoning (Ψ), the covariances of the predictors (Φ), and the error variances of the tasks of inductive reasoning (Θε). Cells with three different numbers represent the parameter estimates for the Dutch, Turkish, and Zambian groups, respectively (e.g., the values 1.09, 0.79, and 0.69 were the factor loadings of the IRL task in these groups in the solution without any equality constraints across cultural groups); cells with one number contain parameter estimates that were set to be identical across countries; cells with "Same" contain values equal to those in the cell immediately to their left. All parameter estimates are significant (p < .05).

Schematic diagram of the MIMIC models: Rule Classification (Fig/Let), Rule Generating (Fig/Let), and Rule Testing (Fig/Let) → Inductive Reasoning (IR) → Inductive Reasoning Figures (IRF) and Inductive Reasoning Letters (IRL).

Invariant parameter matrices per model: (1) none; (2) Λy; (3) Λy, Γ; (4) Λy, Γ, Ψ; (5) Λy, Γ, Ψ, Φ; (6) Λy, Γ, Ψ, Φ, Θε.

(a) Figure mode

Parameter   (1)                      (2)                      (3)                      (4)                      (5)       (6)
λy2         1.09 / 0.79 / 0.69       0.83                     0.82                     0.83                     0.83      0.82
γ1          0.11 / 0.19 / 0.21       0.12 / 0.19 / 0.20       0.17                     0.17                     0.17      0.17
γ2          0.18 / 0.15 / 0.19       0.20 / 0.14 / 0.18       0.17                     0.17                     0.17      0.18
γ3          0.31 / 0.34 / 0.34       0.37 / 0.33 / 0.31       0.34                     0.33                     0.33      0.33
φ11         27.28 / 39.25 / 45.69    Same                     Same                     Same                     37.99     37.99
φ21         29.48 / 36.82 / 35.11    Same                     Same                     Same                     34.14     34.14
φ22         111.76 / 102.06 / 104.22 Same                     Same                     Same                     105.56    105.56
φ31         15.35 / 23.72 / 26.31    Same                     Same                     Same                     22.20     22.20
φ32         31.79 / 35.93 / 33.89    Same                     Same                     Same                     34.06     34.06
φ33         30.02 / 37.66 / 45.42    Same                     Same                     Same                     38.08     38.08
ψ           0.44 / 4.48 / 5.86       0.30 / 4.28 / 4.73       0.43 / 4.39 / 4.83       3.10                     3.10      3.41
θε1         17.39 / 18.05 / 23.25    17.61 / 18.27 / 24.53    17.42 / 18.25 / 24.35    15.27 / 19.25 / 25.80    Same      20.08
θε2         18.73 / 15.93 / 19.69    19.49 / 15.81 / 19.34    19.69 / 15.82 / 19.42    18.44 / 16.41 / 20.29    Same      18.12

Proportion of variance accounted for(a)
IR          0.97 / 0.79 / 0.79       0.98 / 0.79 / 0.80       0.97 / 0.80 / 0.79       0.86                     0.84      0.84
IRF         0.43 / 0.54 / 0.54       0.48 / 0.53 / 0.49       0.47 / 0.55 / 0.49       0.54 / 0.51 / 0.45       0.57 / 0.51 / 0.43   0.51
IRL         0.46 / 0.45 / 0.40       0.37 / 0.47 / 0.45       0.34 / 0.48 / 0.45       0.40 / 0.46 / 0.42       0.44 / 0.45 / 0.40   0.43

Fit indices
χ² (df)     13.25 / 8.32 / 3.58 (each df = 2)   38.35 (8)    43.42 (14)   49.67 (16)   101.92 (28)   123.82 (32)
prob.       .001 / .016 / .167                  .000         .000         .000         .000          .000
Δχ² (df)                                        12.20 (2)    5.07 (6)     6.25 (2)     52.25 (12)    21.90 (4)
prob.                                           .002         .535         .044         .000          .000
NNFI        0.91 / 0.96 / 0.99                  0.95         0.97         0.97         0.96          0.96
GFI         0.98 / 0.99 / 1.00                  0.99         0.99         0.99         0.97          0.96
RMSEA       0.13 / 0.09 / 0.05                  0.10         0.07         0.08         0.08          0.09
(b) Letter mode

Parameter   (1)                      (2)                      (3)                      (4)                      (5)       (6)
λy2         1.49 / 1.23 / 0.76       1.21                     1.18                     1.19                     1.19      1.12
γ1          0.11 / 0.29 / 0.20       0.13 / 0.30 / 0.15       0.23                     0.22                     0.22      0.22
γ2          0.11 / 0.07 / 0.11       0.12 / 0.07 / 0.09       0.09                     0.09                     0.09      0.10
γ3          0.31 / 0.17 / 0.31       0.36 / 0.18 / 0.21       0.23                     0.23                     0.23      0.24
φ11         30.24 / 37.06 / 38.53    Same                     Same                     Same                     35.55     35.55
φ21         38.14 / 49.81 / 49.59    Same                     Same                     Same                     46.41     46.41
φ22         161.85 / 199.72 / 196.20 Same                     Same                     Same                     187.85    187.85
φ31         15.25 / 17.54 / 18.99    Same                     Same                     Same                     17.32     17.32
φ32         29.74 / 39.36 / 46.45    Same                     Same                     Same                     38.77     38.77
φ33         19.14 / 25.90 / 32.63    Same                     Same                     Same                     25.97     25.97
ψ           2.38 / 1.78 / 7.62       2.85 / 1.81 / 4.41       3.27 / 1.98 / 4.54       2.69                     2.69      3.29
θε1         21.74 / 14.95 / 30.96    21.49 / 14.91 / 35.30    21.19 / 14.85 / 34.43    21.51 / 14.42 / 35.76    Same      22.28
θε2         11.03 / 15.49 / 21.21    12.22 / 15.55 / 19.36    12.78 / 15.66 / 20.33    13.34 / 14.95 / 22.53    Same      16.45

Proportion of variance accounted for(a)
IR          0.77 / 0.85 / 0.66       0.79 / 0.85 / 0.64       0.72 / 0.84 / 0.71       0.81                     0.79      0.76
IRF         0.32 / 0.44 / 0.42       0.39 / 0.44 / 0.26       0.35 / 0.45 / 0.31       0.34 / 0.47 / 0.28       0.37 / 0.46 / 0.26   0.38
IRL         0.68 / 0.53 / 0.38       0.62 / 0.53 / 0.48       0.56 / 0.52 / 0.52       0.53 / 0.55 / 0.47       0.57 / 0.54 / 0.44   0.51

Fit indices
χ² (df)     1.32 / 2.44 / 2.32 (each df = 2)    24.42 (8)    54.32 (14)   57.35 (16)   104.51 (28)   187.75 (32)
prob.       .517 / .295 / .313                  .002         .000         .000         .000          .000
Δχ² (df)                                        18.34 (2)    29.90 (6)    3.03 (2)     47.16 (12)    83.24 (4)
prob.                                           .000         .000         .220         .000          .000
NNFI        1.01 / 1.00 / 1.00                  0.97         0.96         0.96         0.96          0.93
GFI         1.00 / 1.00 / 1.00                  0.98         0.98         0.97         0.96          0.92
RMSEA       0.00 / 0.02 / 0.02                  0.07         0.09         0.08         0.08          0.12

Note. Values in the cells refer to the nonstandardized solution; λy1 is fixed at a value of one. (a) The rows IR, IRF, and IRL refer to the proportions of variance accounted for in the latent variable and the two inductive reasoning tasks, respectively. IR: Inductive Reasoning (latent construct). IRF: Inductive Reasoning Figures. IRL: Inductive Reasoning Letters. NNFI = Nonnormed Fit Index; GFI = Goodness of Fit Index; RMSEA = Root Mean Square Error of Approximation.
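The matrices of Appendix C determine the covariance matrix that the MIMIC model implies for the observed tasks: with η = γ'x + ζ and y = Λyη + ε, one obtains Var(η) = γ'Φγ + ψ, Cov(y) = Λy Var(η) Λy' + Θε, and Cov(y, x) = Λyγ'Φ. The following sketch computes the implied matrix from the rounded figure-mode estimates of the most constrained model above; it illustrates the algebra and is not a reanalysis.

import numpy as np

lam   = np.array([1.00, 0.82])                # loadings of IRF and IRL
gamma = np.array([0.17, 0.18, 0.33])          # regression coefficients
Phi   = np.array([[ 37.99,  34.14,  22.20],   # covariance matrix of the
                  [ 34.14, 105.56,  34.06],   # three component tasks
                  [ 22.20,  34.06,  38.08]])
psi   = 3.41                                  # error variance of the latent variable
theta = np.diag([20.08, 18.12])               # error variances of IRF and IRL

var_eta  = gamma @ Phi @ gamma + psi              # variance of inductive reasoning
Sigma_yy = np.outer(lam, lam) * var_eta + theta   # implied Cov(y)
Sigma_yx = np.outer(lam, gamma @ Phi)             # implied Cov(y, x)
Sigma = np.block([[Phi, Sigma_yx.T],
                  [Sigma_yx, Sigma_yy]])
print(Sigma.round(2))

Comparing such an implied matrix with the observed pooled covariance matrices is what the χ² statistics and the descriptive fit indices of this appendix summarize.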