The Construct Equivalence of a Measure of Inductive Reasoning
RUNNING HEAD: Cross-Cultural Equivalence of an Inductive Reasoning Test
Inductive Reasoning in Zambia, Turkey, and The Netherlands:
Establishing Cross-Cultural Equivalence
Fons J. R. van de Vijver
Tilburg University
The Netherlands
Mailing address:
Fons J. R. van de Vijver
Department of Psychology
Tilburg University
PO Box 90153
5000 LE Tilburg
The Netherlands
Phone: +31 13 466 2528
Fax: +31 13 466 2370
E-mail: fons.vandevijver@kub.nl
Acknowledgment. The help of Cigdem Kagitcibasi and Robert Serpell in making the
data collection possible in Turkey and Zambia is gratefully acknowledged.
Abstract
Tasks of inductive reasoning and its component processes were administered to 704
Zambian, 877 Turkish, and 632 Dutch pupils from the highest two grades of primary
and the lowest two grades of secondary school. All items were constructed using
item-generating rules. Three types of equivalence were examined: structural
equivalence (Does an instrument measure the same psychological concept in each
country?), measurement unit equivalence (Do the scales have the same metric in
each country?), and full score equivalence (full comparability of scores across
countries). Structural and measurement unit equivalence were examined in two
ways. First, a MIMIC (multiple indicators, multiple causes) structural equation model
was fitted, with tasks for component processes as input and inductive reasoning
tasks as output. Second, using a linear logistic model, the relationship between item
difficulties and the difficulties of their constituent item-generating rules was
examined in each country. Both analyses of equivalence provided strong evidence
for structural equivalence, but only partial evidence for measurement unit
equivalence; full score equivalence was not supported.
Equivalence of a Measure of Inductive Reasoning
in Zambia, Turkey, and The Netherlands
Inductive reasoning has been a topic of considerable interest to cross-cultural
researchers, mainly because of its strong relationship with general intelligence
(Carroll, 1993; Gustafsson, 1984; Jensen, 1980). Many cultural populations have
been studied using common tasks of inductive reasoning such as number series
extrapolations (e.g., How should the following series be continued: 1, 4, 9, 16,...?),
figure series extrapolations such as Raven’s Progressive Matrices, analogical
reasoning (e.g., Complete the following: day : night :: white : ...?), and exclusion
tasks (e.g., Mark the odd one out: (a) 21, (b) 14, (c) 28, (d) 63, (e) 32). Studies of
inductive reasoning among nonwestern populations were reviewed by Irvine (1969,
1979; Irvine & Berry, 1988). He concluded that the structure found among western
participants with exploratory factor-analytic techniques is usually replicated. More
recent comparative studies, often based on comparisons of ethnic groups in the
U.S.A., have confirmed this conclusion (e.g., Fan, Willson, & Reynolds, 1995; Geary
& Whitworth, 1988; Hakstian & Vandenberg, 1979; Hennessy & Merrifield, 1976;
Naglieri & Jensen, 1987; Ree & Carretta, 1995; Reschly, 1978; Sandoval, 1982;
Sung & Dawis, 1981; Taylor & Ziegler, 1987; Valencia & Rankin, 1986; Valencia,
Rankin, & Oakland, 1997). Major differences in structure (for instance as reported by
Claassen & Cudeck, 1985) are exceptional. Inductive reasoning provides a strong
case for what Waitz, a nineteenth-century philosopher, called “the psychic unity of
mankind” (Jahoda & Krewer, 1997), according to which the basic structure and
operations of the cognitive system are universal while manifestations of these
structures may vary across cultures, depending on what is relevant in a particular
cultural context.
The validity of cross-cultural comparisons can be jeopardized by bias;
examples of bias sources are country differences in stimulus familiarity (Serpell,
1979) and item translations (Ellis, 1990; Ellis, Becker, & Kimmel, 1993). Bias refers
to the presence of score differences that do not reflect differences in the target
construct. Much research has been reported on fair test use, which addresses the
question whether a test predicts an external criterion such as job success
equally well in different ethnic, age, or gender groups (e.g., Hunter, Schmidt, &
Hunter, 1979). The present study addresses not bias in test use but bias in test
meaning; in other words, no reference is made here to social bias, unfairness, and
differential predictive validity. The present study focuses on the question whether the
same score, obtained in different cultural groups, has the same meaning across
these groups. Such scores are unbiased. Two types of approaches have been
developed to deal with bias in cognitive tests. The first type, known under various
labels such as culture-free, culture-fair, and culture-reduced testing (Jensen, 1980),
attempts to eliminate or minimize the differential influence of cultural factors, like
education, by adapting instrument features that may induce unwanted score
differences across countries. Raven's Matrices Tests are often considered to
exemplify this approach (e.g., Jensen, 1980). Despite the obvious importance of
good test design, the approach has come under critical scrutiny; it has been argued
that culture and test performance are so inextricably linked that a culture-free test
does not exist (Frijda & Jahoda, 1966; Greenfield, 1997).
Second, various statistical procedures have been proposed to examine the
appropriateness of psychological instruments in different ethnic groups. Examples
are exploratory factor analysis followed by target rotations and the computation of
factorial agreement between ethnic groups (Barrett, Petrides, Eysenck, & Eysenck,
1998; McCrae & Costa, 1997), simultaneous components analysis (Zuckerman,
Kuhlman, Thornquist, & Kiers, 1991), item bias statistics (Holland & Wainer, 1993),
and structural equation modeling (Little, 1997). It is remarkable that a priori and a
posteriori approaches (test adaptations and statistical techniques, respectively) have
almost never been combined, despite their common aim, mutual relevance, and
complementarity.
The present paper attempts to integrate a priori and a posteriori approaches
and takes equivalence as a starting point. Equivalence refers to the similarity of
psychological meaning across cultural groups (i.e., the absence of bias). Three
hierarchical types of equivalence can be envisaged (Van de Vijver & Leung, 1997a,
b). At the lowest level the issue of similarity of a psychological construct, as
measured by a test in different cultures, is addressed. An instrument shows
structural (also called functional) equivalence if it measures the same construct in
each cultural population studied. There is no claim that scores or measurement units
are comparable across cultures. In fact, instruments may be different across
cultures; structural equivalence is supported if it can be shown that in each culture
the same underlying construct (e.g., inductive reasoning) has been measured. The
intermediate level refers to measurement unit equivalence, defined by equal scale
units and unequal scale origins across cultural groups (e.g., the temperature scales
in degrees of Celsius and Kelvin). In practical terms, this type of equivalence is
found when the same instrument has been administered in different groups but
scores are not directly comparable across groups because of the presence of
moderating variables with a bearing on group mean scores, such as intergroup
differences in stimulus familiarity. Structural equation modeling is suitable to address
measurement unit equivalence because it allows for a comparison of score metrics
across cultural groups. The third and highest level is called full score equivalence
and refers to identity of both scale units and origins. Only in the latter case can
scores be compared both within and across cultures using techniques like t tests and
analyses of (co)variance.
Full score equivalence assumes the complete absence of bias in the
measurement. Score differences between and within cultures are entirely due to
inductive reasoning. There are no fully adequate statistical tests of full score
equivalence, but some go a long way. The first is indirect and involves the use of
additional variables to (dis)confirm a particular interpretation of cross-cultural score
differences (Poortinga & Van de Vijver, 1987). Suppose that Raven’s Standard
Progressive Matrices Test is administered to adults in the U.S.A. and to illiterate
Bushmen. It may well be that the test provides a good picture of inductive reasoning
in both cultures. However, it is likely that differences between the countries are
influenced by educational differences between the groups. Score differences within
and across groups have a different meaning in this case. A measure of testwiseness or previous test exposure, administered to all participants, can be used to
(dis)confirm that cross-cultural score differences are due to bias. Full score
equivalence is then not demonstrated but assumed, and corollaries are tested.
Other tests of full score equivalence that have been proposed compare the
patterning of cross-cultural score differences across items or subtests, often within
the framework of structural equation modeling. An example is multilevel covariance
structure analysis (Muthén, 1991, 1994) that compares the factor structure of pooled
within-country data to between-country data. Such an analysis assumes a sizable
number of cultural groups. Another example involves the modeling of latent
means in a structural model (e.g., Little, 1997). A frequently employed approach,
often based on item response theory, which is applicable when a small number of
cultures have been studied, is the examination of differential item functioning or item
bias (e.g., Holland & Wainer, 1993; Van der Linden & Hambleton, 1997). As long as
the sources of bias (such as education) affect all items in a more or less uniform
way, no statistical techniques will indicate that between-group differences are of a
different nature than within-group differences. Only if bias affects some items can the
proposed techniques identify it. In sum, the establishment of full score
equivalence is an intricate issue. In many empirical studies dealing with mental tests,
this form of equivalence is merely assumed. As a consequence, statements about
the size of cross-cultural score differences often have an unknown validity.
Sternberg and Kaufman’s (1998) observation that we know that there are population
differences in human abilities, but that their nature is elusive, is very pertinent.
In line with current thinking in validity theory (Embretson, 1983; Messick,
1988), the present study combines test design and statistical analyses to deal with
bias (and equivalence). A distinction is made between internal and external
procedures to establish equivalence, depending on whether the procedure is based
on information derived from the scrutinized test itself (internal) or from additional
tests (external).
The present study examines the structural, measurement unit, and full score
equivalence of a measure of inductive reasoning in three, culturally widely divergent
populations (Zambia, Turkey, and the Netherlands). Structural and measurement
unit equivalence are studied using both an internal and external procedure. The
internal procedure to examine equivalence is based on item-generating rules that
underlie the instruments. In the external procedure, equivalence is scrutinized by
comparing the contribution of skill components to inductive reasoning across
countries. Three components are presumably relevant in the types of inductive
reasoning tasks studied here (Sternberg, 1977). The first is classification: treating
stimuli as exemplars of higher order concepts (e.g., the set CDEF as four
consecutive letters in the alphabet, as an instance of a group with one vowel, as a
group with three consonants, etcetera). Individuals are more successful in inductive
reasoning tasks when they can generate more of these classifications. Therefore, in
addition to classification, the skill to generate underlying rules on the basis of a
stimulus set was also tested. Finally, each generated rule has to be tested (e.g., Do
other groups also have four consecutive letters?). The latter skill, labeled rule
testing, was also assessed.
Method
Participants
An important consideration in the choice of countries was the presumed
strong influence of schooling on test performance (Van de Vijver, 1997); the
expenditure per head on education, a proxy for school quality, is strongly influenced
by national affluence. Countries with considerable differences in school systems and
educational expenditures per child were chosen. Furthermore, inclusion of at least
three different cultural groups decreases the number of alternative hypotheses to
explain cross-cultural differences (Campbell & Naroll, 1972). Zambia, Turkey, and
the Netherlands show considerable differences in educational systems and GDP
(per capita); the GDP figures per capita for 1995 were US$ 382, 2,814, and 25,635
for the three countries, respectively. School life expectancy of the three countries is
7.3, 9.7, and 15.5 years (United Nations, 1999). The choice of Zambia was also made
because of its lingua franca in school; English is the school language in Zambia,
which was convenient for developing and administering the tasks.
In each country pupils of four subsequent grades were involved. In the
Netherlands these were the last two grades of primary school (Grade 5 and 6) and
the first two grades of secondary school. The same procedure was applied in
Zambia, where primary school has seven grades. In a pilot study it was found that
the tasks could not be adequately administered to pupils from Grade 5 because
most of these children still have insufficient knowledge of English, which is the first
language of few Zambians. Children start attending primary school in Turkey and the
Netherlands at the age of six, while schooling starts one year later in Zambia; as a
consequence, the Zambian pupils were on average two years older. The Zambian
sample comprised more than 20 cultural groups (the three largest being Tonga,
21%; Bemba, 13%; and Nyanja, 11%); the Turkish group was 99% Turkish, while in
the Dutch group 93% were Dutch, 2% Moroccan, and 2% Turkish.
Primary schooling in Turkey has five grades; pupils from the fifth grade of
primary school and the first three grades of secondary school were involved.
Secondary education is markedly different in the three countries. In Zambia a nationwide examination (with tests for reasoning and school achievement) at the end of
the last grade of primary school, Grade 7, is utilized to select pupils for secondary
school. After the seventh grade, fewer than 20% of the pupils continue their education in
either public or private secondary schools. Admittance to public schools is
conditional on the score at the Grade 7 Examination. Cutoff scores vary per region
and depend on the number of places available in secondary schools. In urban areas
there are some private schools; admittance to these schools usually does not
depend on examination results, but is mainly dependent on availability of places as
well as the ability and willingness of parents to pay school fees. Participants both
from public and private schools were included in our study. The tremendous dropout
at the end of Grade 7 undoubtedly adversely affects the generalizability of the
data to the Zambian population at large, and it also jeopardizes the comparability of
the age cohorts, both within Zambia and across the three countries. In Turkey and
the Netherlands secondary schooling is more or less intellectually streamed. An
attempt was made to retain the intellectual heterogeneity of the primary school group
at secondary school level by selecting various types of schools. The intellectual
heterogeneity of the samples is clearly larger in Turkey and the Netherlands than
Zambia; yet, none of the samples may be fully representative of the age groups of
their respective countries.
Insert Table 1 about here
Sample sizes are presented in Table 1; of the participants recruited 56%
came from urban and 44% from rural schools; 46% were female and 54% male.
Instruments
The battery consisted of eight tasks, four with figures and four with letters as
stimuli. Each of these two stimulus modes had the same composition: a task of
inductive reasoning and three tasks of skill components that are assumed to
constitute important aspects of inductive reasoning. The first is rule classification,
called encoding in Sternberg’s (1977) model of analogical reasoning. The second is
rule generating, a combination of inference and mapping. The third is rule testing, a
combination of comparing and justification.
All tasks are based on item-generating rules, schematically presented in
Appendix A. All figure tasks are based on the following three item-generating rules:
(a) The same number of figure elements is added to subsequent figures in a
period (periods consist of either circles or squares, but never of both. A
period defines the number of figures that belong together. Examples of
items of all tasks, in which the item-generating rules are illustrated, can be
found in Appendix B).
(b) The same number of elements is subtracted from subsequent figures in a
period.
(c) The same number of elements is, alternatingly, added to and subtracted
from subsequent figures in a period.
The three item-generating rules are an example of a facet, a generic term for
all item features that are systematically varied across items. Two more facets
applied to all figure tasks. First, the number of figures in a period varies from two to
four. Second, the number of elements that are added to or subtracted from
successive elements of a period varied from one to three. Whenever possible, all
facet levels were crossed. However, for some combinations of facet levels no item
could be generated. For example, as each figure can have (in addition to a circle or
a square that are present in all items) only five elements (namely a hat, arrow, dot,
line, or bow), it is impossible to construct an item with two or three elements added
to each of four figures in a period.
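For illustration, the crossing of these common facets and the removal of impossible combinations can be sketched as follows (a Python sketch; the facet labels and the simplified feasibility rule are illustrative assumptions, not the authors' actual item-generation procedure):

```python
from itertools import product

# Common figure-task facets described above: three item-generating rules,
# period length of 2-4 figures, and 1-3 elements added or subtracted.
RULES = ("add", "subtract", "alternate")  # illustrative labels
MAX_EXTRA_ELEMENTS = 5  # hat, arrow, dot, line, or bow

def feasible(period: int, step: int) -> bool:
    # Simplified stand-in for the constraint in the text: the cumulative
    # change over the figures of one period may not exceed the number of
    # available elements (e.g., adding 2 or 3 elements to each of four
    # figures is impossible). The exact rule may differ per item type.
    return (period - 1) * step <= MAX_EXTRA_ELEMENTS

cells = [
    (rule, period, step)
    for rule, period, step in product(RULES, range(2, 5), range(1, 4))
    if feasible(period, step)
]
print(len(cells), "feasible combinations of the common facets")
```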
Inductive Reasoning Figures is a task of 30 items. Each item has five rows of
12 figures, of which the first eight are identical. One of the rows has been composed
according to a rule while in the other rows the rule has not been applied consistently.
The pupil has to mark the correct row.
Besides the common facets, two additional facets were used to generate the
items of Inductive Reasoning Figures. First, the figure elements added or subtracted
are either the same or different across periods. In the example of Appendix B there
is a constant variation because in each period a dot is added first, followed by a
dash and a hat. Second, periods do or do not repeat one another, meaning that the
first figures of each period are identical (except for a possible swap of circle and
square).
Each of the 36 items of Rule Classification Figures consists of eight figures. Below
these figures the three item-generating rules are printed. In addition, the alternative
"None of the rules applies" has been added. The pupil has to indicate which of the
four alternatives applies to the eight figures above.
In addition to the common facets, the task has three additional facets. The
first two are the same as in Inductive Reasoning Figures. The third refers to the
presence or absence of periodicity cues. These cues refer to the presence of both
circles and squares in an item (as illustrated in the first item of Appendix B) or the
presence of either squares or circles (if all circles of the example were changed
into squares, no periodicity cues would be present).
Whereas in Inductive Reasoning Figures the number of different elements of
a figure could be either one, two, or three, Rule Classification Figures has another
level of this facet, referring to a variable number of elements. For example, in the
first period one element is added to subsequent figures and in the second period two
elements.
Each of the 36 items of Rule Generating Figures consists of a set of six
figures under which three lines with the numbers 1 to 6 are printed. In each item
one, two, or three triplets (i.e., groups of three figures) have been composed
according to one of the item-generating rules. Any of the six figures of an item can
be part of one, two, or three triplets. Pupils were asked to indicate all triplets that
constitute valid periods of figures. No information about the number of valid triplets
in each particular item was given. The total number of triplets was 63. In the data
analysis these were treated as separate, dichotomously scored items.
Two facets, in addition to the common ones, were included. First, periodicity
cues are either present or absent; the facet has the same meaning as in Rule
Classification Figures. Second, the number of valid triplets is one, two, or three.
A verbal specification is given at the top of each item of Rule Testing Figures.
In this specification three characteristics of the item are given, namely the
periodicity, the item-generating rule, and the number of elements varied between
subsequent figures of a period (e.g., "There are 4 figures in a group. 1 thing is
subtracted from figures which come after each other in a group"). Below this
specification four rows of eight figures have been drawn. One of the rows of eight
figures has been composed completely according to the specification. In some items
none of the four rows has been composed according to the specification. In this
case a fifth response alternative, "None of the rows has been composed according
to the specification" applies. This facet is labeled “None/one of the rules applies”.
The pupil has to mark the correct answer.
The facets and facet levels of Rule Testing Figures and Rule Classification
Figures were identical. In addition, the facet “Rows (do not) repeat each other” is
included in the former task. In some items the rows are fairly similar to each other
(except for minor variations that were essential for the solution), while in other items
each row has a completely different set of eight figures.
The letter tasks were based on five item-generating rules:
(a) Each group of letters has the same number of vowels. The vowels used in
the task are A, E, I, O, and U. As the status of the letter Y can easily
create confusion in English and Dutch where it can be both a consonant
and a vowel, the letter was never used in connection with the first item-generating rule;
(b) Each group of letters has an equal number of identical letters that are the
same across groups (e.g., BBBB BBBB);
(c) Each group of letters has an equal number of identical letters that are not
the same across groups (e.g., GGGG LLLL);
(d) Each group of letters has a number of letters that appear the same (i.e., 1,
2, 3, or 4) number of positions after each other in the alphabet. The letters
A and B have a difference of one position, the letters A and C a difference
of two positions, etcetera;
(e) Each group of letters has a number of letters that appear the same (i.e., 1,
2, 3, or 4) number of positions before each other in the alphabet.
A second facet refers to the number of letters to which the rule applies. The number
could vary from 1 to 6. All items of the letter tasks are based on a combination of the
two facets described (i.e., item rule and number of letters). As in the figure tasks,
not all combinations of the facets are possible; for example, applications of the
fourth and fifth rule assume an item rule that is based on at least two letters in a
group.
Inductive Reasoning Letters bears resemblance to the Letter Sets Test in the
ETS-Kit of Factor-Referenced Tests (Ekstrom, French, & Harman, 1976). Each of
the 45 items consists of five groups of six letters. Four out of these five groups are
based on the same combination of the two facets (e.g., they all have two vowels).
The pupil has to mark the odd one out.
Each of the 36 items of Rule Classification Letters consists of three groups of
six letters. Most items have been constructed according to some combination of the
two facets, while in some items the combination has not been used consistently. The
five item-generating rules are printed under the groups of letters. The pupil has to
indicate which item-generating rule underlies the item. If the rule has not been
applied consistently, the sixth response alternative, "None of the rules applies," is
the correct answer. As in Rule Classification Figures, the facet “None/one of the
rules applies” is included.
Each of the 30 items of Rule Generating Letters consists of a set of 6 groups
of letters (each made up of 1 to 9 letters). In each item up to five triplets of (subsets
of) the groups of letters have been composed according to a combination of the two
facets. The pupil is asked to indicate as many triplets as possible. Again, the pupils
were not informed about the exact number of triplets in each item. The total number
of triplets is 90, which are treated as separate items in the data analysis.
As in Rule Generating Figures, a facet about the number of valid triplets
(ranging here from 1 to 5) applies to the items, in addition to the common facets.
Rule Testing Letters consists of 36 items. Each item starts with a verbal
specification. In the specification two characteristics of the item are given, namely
the item-generating rule and the number of letters pertaining to the rule (e.g., "In
each group of letters there are 3 vowels"). Below this specification four rows of three
groups of six letters are printed. In most items one of the rows has been
composed according to the specification. The pupil has to find this row. In some
items none of the four rows has been composed according to the specification. In
this case a fifth response alternative, "None of the rows has been composed
according to the specification above the item," applies. The pupil has to indicate
which of the five alternatives applies.
The task has three facets, besides the common ones. As in Rule
Classification Letters, there is a facet indicating whether or not one of the
alternatives follows the rule. Also, as in Rule Testing Figures, there is a facet
indicating whether or not rows repeat each other.
The Turkish and Roman alphabets are not entirely identical. The (Roman)
letters Q, W, and X were not used here since they are uncommon in Turkish. The
presence of specifically Turkish letters, such as Ç, Ö, and Ü, necessitated the
introduction of small changes in the stimulus material (e.g., the sequence ABCD in
the Zambian and Dutch stimulus materials was changed into ABCÇ in Turkish).
Administration
The tasks were administered without time limit to all pupils of a class;
however, in the rural areas in Zambia the number of desks available was often
insufficient for all pupils to work simultaneously as each pupil had to have his or her
own test booklet and answer sheet. The experimenter then randomly selected a
number of participants.
The tasks were administered by local testers. The number of testers in
Zambia, Turkey, and the Netherlands were two, three, and two (the author being one
of them), respectively. Five were psychologists and three were experienced
psychological assistants. All testers followed a one-day training in the administration
of the tasks.
In Zambia English was used in the administration. A supplementary sheet in
Nyanja, the main language of the Lusaka region, explaining the item-generating
rules was included in the test booklet. Turkish was the testing language in the
Turkish group and Dutch in the Dutch group.
The administration of all tasks to all pupils would presumably have taken
three school days. In order to avoid the loss of motivation and test fatigue, two
experimental conditions were introduced: the figure and the letter condition. The two
tasks of inductive reasoning were always administered; in the figure condition rule
classification, generating, and rule testing tasks with figures were also included,
while the three additional letter tasks were administered in the letter condition;
sample sizes for each condition are given in Table 1. So, all pupils received five
tasks: two tasks of inductive reasoning and three tasks of skill components (either
the three figure or the three letter skill component tasks). The administration of the
five tasks took place on two consecutive school days. The order of administration of
the tasks was random, with the constraint that the two tasks of inductive reasoning
were given on one day (either the first or the second testing day) and the remaining
on the other one.
The administration of all eight instruments started with a one-page description of
the task, which was read aloud by the experimenter to the pupils; item-generating
rules of the stimulus mode were specified. This instruction was included in the
pupils’ test booklets. Examples were then presented of each of the item-generating
rules; explicit reference was made to which rule applied. Finally, the pupils were
asked to answer a number of exercises that again covered all item-generating rules.
After this instruction, the pupils were asked to answer the actual items. In each
figure task the serial position of each figure was printed on top of the item in order to
minimize the computational load of the task. The alphabet was printed at the top of
each page of the letter tasks, with the vowels underlined. It was indicated to the
pupils that they were allowed to look back at the instructions and examples (e.g., to
consult the item-generating rules). Experience showed that this was infrequently
done, probably because all tasks of a single stimulus mode utilized the same rules.
Results
The section begins with a description of preliminary analyses, followed by the
main analyses. Per analysis, the hypothesis, statistical procedure and findings are
reported.
Preliminary Analyses
The internal consistencies of the instruments (Cronbach’s alpha) were
computed per culture and grade. Inductive Reasoning Figures showed an average
of .86 (range: .79-.93), Rule Classification Figures .83 (.71-.90), Rule Generating
Figures .89 (.84-.95), Rule Testing Figures .85 (.81-.89), Inductive Reasoning Letters
.79 (.69-.88), Rule Classification Letters .83 (.73-.90), Rule Generating Letters .93
(.90-.95), and Rule Testing Letters .78 (.63-.85). Overall, the internal consistencies
yielded adequate values. Country differences were examined in a procedure
described by Hakstian and Whalen (1976). Data of all grades were combined. The
M statistic, which follows a chi-square distribution with two degrees of freedom, was
significant for Inductive Reasoning Figures (M = 64.92, p < .001), Rule Classification
Figures (M = 10.57, p < .01), Rule Generating Figures (M = 34.06, p < .001), Rule
Classification Letters (M = 11.57, p < .01), and Rule Testing Letters (M = 12.40, p <
.01). The Dutch group tended to have lower internal consistencies (a possible
explanation is given later).
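For readers who wish to reproduce this reliability screening, a minimal sketch of the per-country, per-grade computation is given below (standard Cronbach's alpha on dichotomously scored items; the data layout and column names are hypothetical):

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    # Cronbach's alpha for a respondents x items matrix of 0/1 scores.
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# df: one row per pupil with 'country', 'grade', and one column per item
# (hypothetical wide-format layout).
def alpha_by_group(df: pd.DataFrame, item_cols: list) -> pd.Series:
    return df.groupby(["country", "grade"]).apply(
        lambda g: cronbach_alpha(g[item_cols])
    )
```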
Insert Tables 2 and 3 about here
The average proportions of correctly solved items per country, grade, and
task are given in Table 2. Differences in average scores were tested in a multivariate
analysis of variance with country (3 levels; Zambia, Turkey, and the Netherlands),
grade (4 levels; 5 through 8), and gender (2 levels). Separate analyses were carried
out for the letter and figure mode. It may be noted that the present analysis is
presented merely for exploratory purposes to give insight into the relative contribution
of each factor to the overall score variation; conclusions about country differences in
inductive reasoning or its components are premature until full score equivalence of
scores across countries has been shown. Table 3 gives the estimated effect sizes
(proportion of variance accounted for). The results were essentially similar for the
two modes. Country was highly significant (p < .001) in all tasks, usually explaining
more than 10%. Zambian pupils tended to show the lowest scores and Dutch pupils
the highest scores. Grade differences were as expected; as can be confirmed in
Table 2, scores increased with grade. The effect sizes were substantial, usually
larger than 10%, and highly significant for all tasks (p < .001). Gender differences
were small; significant differences were found for Rule Testing Figures and Inductive
Reasoning Letters (girls scored higher on both tasks), but gender differences did not
explain more than 1% on any task. The country by grade interaction was significant
in all analyses, explaining between 1 and 5%. As can be seen in Table 2, score
increases with grade tended to be smaller in the Netherlands than in the other two
countries. Country differences in scores were large in all grades but tended to
become smaller with age. These results are in line with a meta-analysis (Van de
Vijver, 1997) in which in the age range examined here, there was no increase of
cross-cultural score differences with age (contrary to what would be predicted from
Jensen’s, 1977, cumulative deficit hypothesis). Other interactions were usually
smaller and often not significant.
Structural Equivalence in Internal Procedure
Hypothesis. The first hypothesis addresses equivalence in internal
procedures by examining the decomposition of the item difficulties. The hypothesis
states that facet levels provide an adequate decomposition of the item difficulties of
each task in each country (Hypothesis 1a). See Table 4 for an overview of the
hypotheses and their tests.
Statistical procedure. Structural equivalence of all tests is examined using the linear
logistic model (LLM; Fischer, 1974, 1995). It is an extension of the Rasch model,
which is one of the frequently employed models in item response theory. The Rasch
model holds that the probability that a subject k (k = 1, …, K) responds correctly to
item i is given by
exp(k - i)/[1+ exp(k - i)],
(1)
in which k represents the person’s ability and i the item difficulty. An item is
represented by only one parameter, namely its difficulty (unlike some other models
in item response theory in which each item also has a discrimination parameter,
sometimes in addition to a pseudo-guessing parameter). A sufficient statistic for
estimating a person’s ability is the total number of correctly solved items on the task.
Analogously, the number of correct responses at an item provides a sufficient
statistic for estimating the item difficulty. For our present purposes the main interest
is in item parameters.
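For concreteness, equation (1) and its sufficient statistics can be sketched as follows (a generic Rasch formulation, not the estimation software used in the study):

```python
import numpy as np

def rasch_p(theta: np.ndarray, beta: np.ndarray) -> np.ndarray:
    # P(correct) for all persons (theta, length K) by items (beta, length m),
    # i.e., equation (1) in matrix form.
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))

def sufficient_stats(X: np.ndarray):
    # X is a K x m matrix of 0/1 responses. Person totals are sufficient
    # for ability, item totals for item difficulty.
    return X.sum(axis=1), X.sum(axis=0)
```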
The LLM imposes a constraint on the item parameter by specifying that the
item difficulty is the sum of an intercept c (that is irrelevant here) and a sum of
underlying facet level difficulties, η_j:

β_i = c + Σ_j q_ij η_j    (2)
The second step aims at estimating the facet level difficulties (η_j). Suppose that the
item is “BBBBNM BBBBKJ BBBBHJ BBFTHG BBBBHN”. In terms of the facets, the
item can be classified as involving (a) four letters (facet: number of letters); (b) equal
letters within and across groups of letters (facet: item rule). The above model
equation (2) specifies that the item parameter will be the sum of an intercept, two
facet level difficulties (namely the difficulty parameter of items dealing with four
letters and the difficulty parameter of items dealing with equal letters within and
across groups of letters), and a residual component.
The matrix Q (with elements q_ij) defines the independent variable; the matrix
has m rows (the number of items of the task) and n columns (the number of
independent facet levels of the task). Entries of the Q matrix are zero or one
depending on whether the facet level is absent or present in the item (interactions of
facets were not examined). In order to guarantee uniqueness of the parameter
estimates in the LLM, linear dependencies in the design matrix were removed by
leaving the first level of each facet out of the design matrix. This (arbitrary) choice
implied that the first level of each facet has a difficulty level of zero and that the size
and significance of other facet levels should be interpreted relative to this “anchor.”
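A minimal sketch of such a design matrix, assuming a hypothetical facet coding of a few figure items (dummy coding with the first level of each facet dropped mirrors the anchoring just described):

```python
import pandas as pd

# Each row describes one item by its level on two facets (labels invented
# for illustration).
items = pd.DataFrame({
    "rule":   ["add", "subtract", "alternate", "add", "alternate"],
    "period": [2, 3, 4, 3, 2],
})

# Dummy-code every facet, dropping one level per facet so that Q has full
# column rank; the dropped levels act as the zero-difficulty anchors.
Q = pd.get_dummies(items.astype(str), drop_first=True, dtype=float)
print(Q)  # m rows (items) x n columns (independent facet levels)
```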
The sufficient statistic for estimating the basic parameters is the number of
correct answers at the items that make up the facet level. As a consequence, there
will be a perfect rank order between this number of correct answers and η_j. Various
procedures have been developed to estimate the basic parameters. In the present
study conditional maximum likelihood estimation was used (details of this
computationally rather involved procedure are given by Fischer, 1974, 1995). An
important property of the LLM is the sample independence of its parameters;
estimates of the item difficulty and the basic parameters are not influenced by the
overall ability level of the pupils. This property is attractive here because it allows for
their estimation, even when average scores of cultural groups differ.
The LLM is a two-step procedure; the first consists of a Rasch analysis.
Estimates of item (β) and person (θ) parameters of equation (1) are obtained. In the
second step the parameters of equation (2) are estimated. The item parameters are
used in the evaluation of the fit of the model. The fit of an LLM can be evaluated in
various ways. First, a likelihood ratio test can be computed, comparing the likelihood
of the (unrestricted) Rasch model to the (restricted) LLM. The statistic is of limited
value here. The ratio is affected by guessing (Van de Vijver, 1986). Because all
tasks employed a multiple-choice format, it is unrealistic to assume that a Rasch
model would hold. The usage of an LLM may seem questionable here because of
the occurrence of guessing (pupils were instructed to answer all items). However,
Van de Vijver (1986) has shown that guessing gives rise to a reduction of the
variance of the estimated person and item parameters but correlations of both
estimated parameters with their true values are hardly affected. A useful heuristic to
evaluate the degree of fit of the LLM is provided by the correlation between the
Rasch item parameters of the first step of the analysis and the item parameters
reconstructed by means of the design matrix in the second step. It amounts to correlating
the item parameters of the first step (β; the “unfaceted item difficulties”) with the item
parameters of the second step, computed as β_i* = Σ_j q_ij η_j (the “faceted item difficulties”). The
latter vector gives the item parameters estimated on the basis of the estimated facet
level difficulties. Higher correlations point to a better approximation of item level
difficulties by facet level difficulties and hence, to a better modelability of inductive
reasoning.
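This fit heuristic can be sketched as follows; ordinary least squares is used here as a simple stand-in for the conditional maximum likelihood estimation described above, but the correlation logic is the same (beta is assumed to hold the step-1 Rasch difficulties and Q the task's design matrix):

```python
import numpy as np

def llm_fit_correlation(beta: np.ndarray, Q: np.ndarray) -> float:
    # Regress the unfaceted item difficulties (step 1) on the facet design
    # matrix, then correlate them with the reconstructed, faceted
    # difficulties (step 2).
    X = np.column_stack([np.ones(len(beta)), Q])  # intercept c + facet levels
    coefs, *_ = np.linalg.lstsq(X, beta, rcond=None)
    beta_faceted = X @ coefs
    return float(np.corrcoef(beta, beta_faceted)[0, 1])
```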
Every task has its own design matrix, consisting both of facets that were
common to all tasks of a mode (e.g., the item-generating rules) and task-specific
facets (e.g., the number of correct answers in the rule generating tasks). The
analyses were carried out per country and grade. The item difficulties (with different
values per country and grade) and the Q matrix (invariant across grades and
countries for a specific task) were input to the analyses. This procedure was
repeated for each task, making a total of 8 (tasks) x 4 (grades) x 3 (countries) = 96
analyses.
The LLM is applied here as one of two tests of structural equivalence. This
type of equivalence addresses the relationship between measurement outcomes
and the underlying construct. The facets of the tasks are assumed to influence the
difficulty of the items. For example, it can be expected that rules in items of the letter
tasks are easier when they involve more letters. The analysis of structural
equivalence examines whether the facets exert an influence on item difficulty in each
culture. In more operational terms, structural equivalence is supported if the
correlation of each analysis is significantly larger than zero; a significant correlation
indicates that the facet levels contribute to the prediction of the item difficulties.
Insert Table 5 about here
Hypothesis test. As can be seen in Table 5, the correlations between the
unfaceted Rasch item parameters (of equation 1) and the faceted item parameters
(of equation 2) were high for all tasks in each grade in each country. These high
correlations provide powerful evidence that the same facets influence item difficulty
in each country. It can be concluded that Hypothesis 1a, according to which the item
difficulty decomposition would be adequate in each country, was strongly supported.
Insert Table 6 about here
The second question involves the patterning of the correlations of Table 5.
This question was addressed in an analysis of variance, with country (3 levels:
Zambia, Turkey, and the Netherlands), stimulus mode (2 levels: figure and letters
tests), and type of skill (4 levels: inductive reasoning and each of the three skill
component tasks) as independent variables; the correlation was the dependent
variable. The four grades were treated as independent replications. As can be seen
in Table 6, all main effects and first order interactions were significant. A significantly
lower correlation (and hence a poorer fit of the data to the model) was found for
figure tasks than for letter tasks (averages of .87 and .91, respectively), F(2, 72) =
24.85, p < .001. The effect was considerable, explaining 22% of the total score
variation. About the same percentage was explained by skill components, F(3, 72) =
37.94, p < .001. The lowest correlations were obtained for rule classification and
rule generating (both .87), followed by inductive reasoning (.89), while rule testing
showed the highest value (.93). The high values of the latter may be due to a
combination of a large number of items (long tests) and the presence of both very
easy and difficult facet levels in the rule testing tasks in each country; such facet
levels increase the dispersion and will give rise to high correlations. Country
differences explained about 10% of the score variation; the correlations of the
Turkish and Zambian groups were very close to each other (.91 and .90,
respectively), while the value for the Dutch group was .87. A closer inspection of the
data revealed that the largest differences between the countries were found for
tasks with relatively high proportions of correctly solved items. Therefore, the
difference in fit may be due to ceiling effects in the Dutch group, which by definition
reduce the correlation. The most important interaction, explaining 16% of the total
score variation, was observed between country and stimulus mode, F(2, 72) = 19.92,
p < .001. Whereas the correlations did not differ more than .03 for both stimulus
modes in Zambia and Turkey, the difference in the Dutch sample was .09.
The interaction of stimulus mode and skill component was also significant,
F(3, 72) = 9.61, p < .001. Correlations of rule generating and rule testing were on
average .03 larger in the letter mode than in the figure mode, while a much
larger value of .08 was observed for rule classification. The interaction of country
and skill was significant though less important (explaining only 5%). The score
differences of the cultures were relatively small for rule generating and rule testing
and much larger for inductive reasoning and rule classification, mainly due to the
relatively low scores of the Dutch. Again, ceiling effects may have induced the effect
(the ceiling not necessarily being the largest attainable score, because some facet levels
remained beyond the reach of even the highest-scoring pupils). In sum, the analysis of
the correlations revealed high values for all tasks in the three countries. The
observed country differences were presumably more due to ceiling effects than to
country differences in modelability of inductive reasoning and its components.
Ceiling effects may also explain the lower internal consistencies in the Dutch data,
discussed before.
Insert Figures 1 and 2 about here
The estimated facet level difficulties (the estimated values of η; cf. equation
2) of all tests are drawn in Figure 1 (figure tests) and Figure 2 (letter tests).
Higher values of η refer to more difficult facet levels. The most striking finding of
both Figures is the proximity of the three country curves; this points to the cross-cultural similarity in the pattern of difficult and easy facet levels, which yields further
evidence for the structural equivalence of the instruments in the present samples.
Furthermore, most facet levels behaved as expected. As for the figure tasks, the
third item-generating rule (about alternating additions and subtractions) was
invariably the most difficult. Items were more difficult when they dealt with shorter
periods, when a variable number of elements were added or subtracted in
subsequent figures of a period, when periodicity cues were absent, and when
periods did not repeat each other. The number of valid triplets (only present in the
rule-generating task) showed large variation. Pupils found it relatively easy to
retrieve one correct solution, but relatively difficult to find all solutions when the item
contained two or three valid triplets.
The difficulty patterning of the letter tasks also followed expectation. Dealing
with equal letters was easier than dealing with positions in the alphabet. Items about
equal letters within and across groups (e.g., BBBBBB BBBBBB) were easier than
items about letters that were equal within and unequal across groups (e.g., BBBBBB
GGGGGG). Items were easier when the underlying rule involved more letters (which
facilitates recognition). Items about positions in the alphabet (the last two item-generating rules of the letter mode) were easier when they involved smaller jumps
(e.g., ABCD was easier to recognize as a group of letters in which the position of
letters in the alphabet is important than ACEG, which in turn was easier to recognize than
ADGJ). As in the generating task of the figure mode, a strong effect of the number
of valid triplets was found. Finding all solutions turned out to be difficult and valid
triplets were often overlooked.
Measurement Unit Equivalence in Internal Procedure
Hypothesis. For each task the same facet level difficulties apply in each
country (Hypothesis 1b; cf. Table 4).
Statistical procedure. The LLM parameters can also be used to test
measurement unit equivalence. This type of equivalence goes beyond structural
equivalence by assuming that the tasks as applied in the three countries have the
same measurement units (but not necessarily the same scale origins). If the
estimated η parameters of equation 2 are invariant across countries except for
random fluctuations, there is strong evidence for the invariance of the measurement
units of the test scales. This invariance would imply that the estimated facet level
difficulties in a particular country could be replaced by the difficulty of the same facet
in another country without affecting the fit of the model. For these analyses the data
for the grades in a country were combined because of the primary interest in country
differences.
Hypothesis test. Standard errors of the estimated facet level difficulties
ranged from 0.05 to 0.10. As can be derived from Figures 1 and 2, in each task there
are facet levels that differ significantly across countries. It can be safely concluded
that scores did not show complete measurement unit equivalence.
Yet, it is also clear from these Figures that some facet levels are not
significantly different across countries. So, the question arises to what extent facet
levels are identical across countries. The question was addressed using intraclass
correlations, measuring the absolute agreement of the estimated facet level
difficulties in the three countries. The absolute agreement of the estimated basic
parameters of a single task across countries was evaluated; per task the intraclass
correlation of the country by facet level matrix was computed. The letter tasks
showed consistently higher values than the figure tasks. The average agreement
coefficient was .91 for the figure tasks and .96 for the letter tasks (all intraclass
correlations were significantly above zero, p < .001). The high within-task agreement
points to an overall strong agreement of facet levels across countries. The estimated
facet level difficulties come close to being interchangeable across countries (despite
the significant differences of some facet levels).
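For reference, an absolute-agreement intraclass correlation of this kind can be computed from the country by facet level matrix as sketched below (ICC(2,1) in Shrout and Fleiss's terminology; it is assumed, not stated in the text, that this single-measure variant was used):

```python
import numpy as np

def icc_absolute_agreement(M: np.ndarray) -> float:
    # ICC(2,1): two-way random effects, absolute agreement, single measure.
    # M: n x k matrix, rows = facet levels, columns = countries.
    n, k = M.shape
    grand = M.mean()
    row_means, col_means = M.mean(axis=1), M.mean(axis=0)
    ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((col_means - grand) ** 2).sum() / (k - 1)
    resid = M - row_means[:, None] - col_means[None, :] + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```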
A recurrent theme in the analysis is the better modelability of the letter tasks
as compared to the figure tasks, due to the wider range of facet level difficulties in the
letter than in the figure mode. The range differences may be a consequence of the
choices made in the test design stage. One of the problems of many existing figure
tests is their often implicit definition of permitted stimulus transformations (e.g.,
rotating and flipping). This lack of clarity, presumably an important source of cross-cultural score differences, was avoided in the present study by spelling out all
permitted transformations in the test instructions. Apparently, the price to be paid for
providing the pupils with this information is a small variation in facet level difficulties.
Structural Equivalence in External Procedure
Hypothesis. The skill components contribute to inductive reasoning in each
country (Hypothesis 2a; cf. Table 4).
Insert Figure 3 about here
Statistical procedure. External procedures to establish equivalence scrutinize
the relationships between inductive reasoning and its componential skills. A specific
type of structural equation model was used, namely a MIMIC model (Multiple
Indicators, Multiple Causes; see Van Haaften & Van de Vijver, 1996, for another
cross-cultural application). A MIMIC is a model that links input and output through
one latent variable (see Figure 3). The core of the model is the latent variable,
labeled inductive reasoning. This variable, η, is measured by the two tasks of
inductive reasoning (the output variables). The input to the inductive reasoning factor
comes from the skill components; the components are said to influence the latent
factor and this influence is reflected in the two tasks of inductive reasoning. In sum,
the MIMIC model states that inductive reasoning is measured by two tasks (IRF and
IRL, i.e., Inductive Reasoning Figures and Letters) and is influenced by three
components (rule classification, rule generating, and rule
testing). The model equations are as follows:
y1 = 1  + 1;
(3)
y2 = 2  +  2,
in which y1 and y2 denote observed scores on the two tasks of inductive thinking, 1
and 2 the factor loadings, and 1 and 2 error components. The latent variable, , is
linked to the skill components in a linear regression function:
 = 1 x1 + 2 x2 +3 x3 + ,
(4)
where the gammas are the regression coefficients, the x-variables refer to the skill
components, and is the error component. In order to make the estimates
identifiable, the factor loading of IRF, 1, was fixed at one.
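A single-group version of this specification can be sketched in lavaan-style syntax with the Python package semopy (an assumed third-party tool, not the software used in the study; variable names are hypothetical column labels, and semopy, like lavaan, is assumed to fix the first loading at one by default). The multigroup equality constraints discussed next require additional machinery not shown here:

```python
import pandas as pd
from semopy import Model  # third-party SEM package (pip install semopy)

# Equations (3) and (4): one latent factor measured by the two inductive
# reasoning tasks and regressed on the three skill-component tasks.
DESC = """
eta =~ IRF + IRL
eta ~ classification + generating + testing
"""

def fit_mimic(df: pd.DataFrame):
    # df: one row per pupil with the five task scores as columns.
    model = Model(DESC)
    model.fit(df)
    return model.inspect()  # parameter estimates and standard errors
```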
An attractive feature of structural equation modeling is its allowance for
multigroup analysis. This means that the adequacy of the above model equations for
the data can be evaluated for all 12 data sets (4 grades x 3 countries) at once. The
fit statistics yield an overall assessment, covering all data sets.
The theoretical model underlying the study stipulates that the three skill
components constitute essential elements of inductive reasoning. In terms of the
MIMIC analysis, this means that structural equivalence would be supported by a
good fit of a model with three input and two output variables as described. Nested
models were analyzed. In the first step all parameters were held fixed across data
sets, while in subsequent steps similarity constraints were lifted in the following order
(cf. Table 7): the error variance (unreliability) of the tasks of inductive reasoning, the
intercorrelations of the skill component tasks, the error variance of the latent
variable, the regression coefficients, and the factor loadings. The order was chosen
in such a way that invariance of relationships involving the latent variable (i.e.,
regression coefficients and factor loadings) was retained as long as possible. More
precisely, structural equivalence would be supported when the MIMIC model with the
fewest equality constraints across countries shows a good fit and all MIMIC
parameters differ from zero (Hypothesis 2a). It would mean that the tasks of
inductive reasoning constitute a single factor that is influenced by the same skill
component in each analysis (the possibility that there is a good fit but that some
regression coefficients or factor loadings are negative is not further considered here
because no covariances were negative).
Insert Table 7 about here
Hypothesis test. The relationship of the skill components and inductive
reasoning tasks was examined in a MIMIC model (see Table 7; more details are
given in Appendix C). Nested models were fitted to the data of both stimulus modes.
The choice of a MIMIC model was mainly based on the relatively large change of all
fit statistics when constraints were imposed on the phi matrices (the covariance
matrices of the component skills; see the figure of Appendix B); therefore, the model
with equal factor loadings, regression coefficients, and error variances was chosen.
Although the letter tasks showed a better fit than the figure tasks, the choice of a
model was less straightforward. A MIMIC model with a similar pattern of free and
fixed parameters in both stimulus modes was chosen, mainly because of parsimony
(see footnote to Table 7 for a more elaborate explanation).
The standardized solution of the two models is given in Figure 3. As
hypothesized, all loadings and regression coefficients were positive and significant
(p < .01). It can be concluded that inductive reasoning with figure and letter stimuli
involves the same components in each country. This supports structural
equivalence, as predicted in Hypothesis 2a. The regression coefficients of the figure
component tasks were unequal to each other: rule classification was least important,
followed by rule generating, while rule testing showed the largest contribution to
inductive reasoning. The letter mode did not show this patterning; the regression
coefficients of the component tasks of the letter mode were rather similar to one
another.
Measurement Unit Equivalence in External Procedure
Hypothesis. The skill components contribute in the same way to inductive
reasoning in each country (Hypothesis 2b; cf. Table 4).
Statistical procedure. Measurement unit equivalence can be scrutinized by
introducing and testing equality constraints in the MIMIC model. This type of
equivalence would be supported when a single MIMIC model with identical
parameter values holds in all countries. It may be noted that this test is stricter than
the ones proposed in the literature. Whereas the latter tend to analyze all tasks in a
single exploratory or confirmatory factor analysis, more specific relationships
between the tasks are considered here.
Hypothesis test. The psychologically most salient elements of the MIMIC, the
factor loadings, regression coefficients, and the explained variance of the latent
variable, were found to be invariant across countries. However, measurement unit
equivalence also requires the other parameter matrices to be invariant. In the figure
mode the model with equality constraints for all matrices showed a rather poor fit,
with an NNFI of .88, a GFI of .96, and an RMSEA of .045. An inspection of the delta
chi square values indicated that in particular the introduction of equality of
covariances of the skill components (Φ) reduced the fit significantly. The letter tasks
showed a similar picture; the most restrictive model revealed values of .89 for the
NNFI, .82 for the GFI, and .041 for the RMSEA, which can be interpreted as a rather
poor fit. Again, equality of the matrices led to a significant reduction of the fit. Like
in our internal procedure to examine measurement unit equivalence, we found some
but inconclusive evidence for the measurement unit equivalence of the task scores
across countries; hypothesis 2b had to be rejected.
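These nested-model comparisons rest on delta chi-square (likelihood ratio) tests. As a minimal sketch (in Python; this is not the original analysis code, and the function name is made up), such a test can be computed directly from the fit statistics reported in Table 7:

from scipy.stats import chi2

def delta_chi_square(chisq_constrained, df_constrained, chisq_free, df_free):
    # The constrained model adds cross-country equality constraints and thus
    # has more degrees of freedom; a significant difference indicates that
    # the constraints significantly worsen the fit.
    delta = chisq_constrained - chisq_free
    ddf = df_constrained - df_free
    return delta, ddf, chi2.sf(delta, ddf)

# Figure mode (cf. Table 7): adding equality of the phi matrices.
print(delta_chi_square(437.28, 145, 180.52, 79))
# -> (256.76, 66, p < .001): the constraints on phi reduce the fit significantly.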
Full Score Equivalence
Hypothesis. Both tasks of inductive reasoning show full score equivalence
(Hypothesis 3; cf. Table 4).
Statistical procedure. Full score equivalence can be examined in an item bias analysis. A logistic regression model was applied to analyze item bias (Rogers & Swaminathan, 1993). Advantages of the model are that it can accommodate more than two groups and that it can detect both uniform and nonuniform bias (Mellenbergh, 1982). The combined samples of the three countries were used to determine cutoff scores that split the sample into three score-level groups (low, medium, and high) of about the same size. In the logistic regression procedure, culture (dummy coded), score level, and their interaction are the independent variables, while the item response is the dependent variable. A significant main effect of culture points to uniform bias: individuals from at least one country show an unexpectedly low or high score on the item across all score levels, as compared to individuals with the same test score from other cultures. A significant interaction points to nonuniform bias: the systematic difference in scores depends on the score level; for example, country differences in scores among low scorers are not found among high scorers. Alpha was set at a (low) level of .001 in the item bias analyses in order to prevent inflation of Type I errors due to multiple testing (although, obviously, the power of the procedure is adversely affected by this choice).
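As an illustration of this procedure (a minimal Python sketch, not the original analysis code; the column names 'response', 'level', and 'country' are hypothetical), uniform bias can be tested by the gain in fit from adding a culture main effect, and nonuniform bias by the gain from adding the culture-by-level interaction:

import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

def screen_item_for_bias(df: pd.DataFrame, alpha: float = 0.001) -> dict:
    # df holds one row per pupil: 'response' (0/1 item score), 'level'
    # (low/medium/high group based on the pooled total score), 'country'.
    m_base = smf.logit("response ~ C(level)", data=df).fit(disp=0)
    m_uniform = smf.logit("response ~ C(level) + C(country)", data=df).fit(disp=0)
    m_nonuniform = smf.logit("response ~ C(level) * C(country)", data=df).fit(disp=0)

    def lr_p(smaller, larger):
        # Likelihood ratio test between two nested logistic models.
        stat = 2.0 * (larger.llf - smaller.llf)
        return chi2.sf(stat, larger.df_model - smaller.df_model)

    return {
        "uniform_bias": lr_p(m_base, m_uniform) < alpha,           # culture main effect
        "nonuniform_bias": lr_p(m_uniform, m_nonuniform) < alpha,  # interaction
    }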
Insert Figure 4 about here
Hypothesis test. In the introduction two approaches based on structural equation modeling were mentioned to examine full score equivalence: multilevel covariance structure analysis and the modeling of latent means. The former could not be used because of the small number of countries involved, while the latter was precluded by the incomplete support for measurement unit equivalence. This lack of support in fact prohibits any analysis of full score equivalence. Yet, because the bias analysis yielded interesting results, it is reported here for exploratory purposes. Of the 30 items of the IRF, 15 were biased (13 uniformly and 11 nonuniformly; some items showed both types of bias), mainly involving the Dutch-Zambian comparison. The occurrence of bias was related to the difficulty of the items; both the easiest and the most difficult items showed the most bias. The correlation between the presence of bias (0 = absent, 1 = present) and the deviance of the item score from the mean (i.e., average item score - overall average) was .64 (p < .001). The correlation suggests a methodological artifact, such as floor and ceiling effects. This
was confirmed by an inspection of the contingency tables underlying the logistic
regression analyses. Figure 4 depicts empirical item characteristic curves of two
items that showed both uniform and nonuniform bias. The upper panel shows a
relatively easy item (with an overall mean of .79) and the lower panel a relatively
difficult item (mean of .33). The bias for the easy item is induced by country
differences at the lowest score level that are not reproduced at the higher levels.
Analogously, the scores for the difficult item remain close to the guessing level (of
.20) in the two lowest score levels, while there is more score differentiation in the
highest scoring group. The score patterns of Figure 4 were found for several items. It
appears that ceiling and floor effects led to item bias in the IRF.
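The reported relation between bias and item extremity is easy to reproduce in outline. The sketch below uses synthetic data and assumes the deviance is taken in absolute value (so that both very easy and very difficult items obtain large values); these assumptions and all numbers are illustrative only:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-item summaries for a 30-item test (illustrative only).
item_means = rng.uniform(0.15, 0.95, size=30)      # proportion correct per item
deviance = np.abs(item_means - item_means.mean())  # distance from overall average

# Under a floor/ceiling artifact, the extreme items are the ones flagged as biased.
bias_flag = (deviance > np.median(deviance)).astype(int)

# Point-biserial correlation between bias presence and absolute deviance.
r = np.corrcoef(bias_flag, deviance)[0, 1]
print(round(r, 2))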
Three items were found to be biased in the IRL (one uniform and two
nonuniform). The Zambian pupils showed relatively high scores on these items.
Because the items were few and involved different facet levels, the reasons for the
bias were not understood, a fairly common finding in item bias research (cf. Van de
Vijver & Leung, 1997b). Floor and ceiling effects did not occur, which points to an important difference between the two tasks of inductive reasoning: whereas on the IRF pupils tended to answer items with either a very low or a very high level of accuracy, pupil scores on the IRL varied more gradually. Similarly, in the IRF there were no facet levels that were either too difficult or too easy for most of the sample, whereas both types of facet levels were present in the IRL.
Discussion
The equivalence of two tasks of inductive reasoning was examined in a cross-cultural study involving 632 Dutch, 877 Turkish, and 704 Zambian pupils from the highest two grades of primary and the lowest two grades of secondary school.
Two stimulus modes were examined: letters and figures. In each mode tasks for
inductive reasoning and for each of its components, classification, generation, and
testing, were administered. The structural, measurement unit, and full score
equivalence of the instruments in these countries were studied. A MIMIC model was
fitted to the data, linking skill components to inductive reasoning through a latent
variable, labeled inductive reasoning (external procedure). A linear logistic model
was utilized to examine to what extent in each country item difficulties could be
adequately decomposed into the underlying rules that were used to generate the
items (internal procedure). In keeping with past research, structural equivalence was strongly supported; yet, measurement unit equivalence was not fully supported. It is interesting to note that two different statistical models, item response theory (the LLM) and structural equation modeling (the MIMIC model), looking at different aspects of the data (facet level difficulties in the LLM, covariances of the tasks with the componential skills in the MIMIC model), yielded the same conclusion about measurement unit equivalence.
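For reference, the decomposition underlying the internal procedure can be stated compactly. The following is a standard statement of Fischer's (1974, 1995) linear logistic test model, given here for the reader's convenience:

\beta_i = \sum_{j=1}^{m} q_{ij}\,\eta_j + c,

where \beta_i is the Rasch difficulty of item i, q_{ij} indicates how often facet level j enters the construction of item i, \eta_j is the basic difficulty of facet level j, and c is a normalization constant. In the internal procedure, measurement unit equivalence amounts to the \eta_j being identical across countries.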
Critics might argue that the emphasis on equivalence in the present study is misplaced and detracts attention from the real cross-cultural differences observed with these carefully constructed instruments. In this line of reasoning the results would show massive differences in inductive reasoning across countries, with Zambian pupils having the lowest skill level, Turkish pupils occupying an intermediate position, and Dutch pupils having the highest level. The validity of this conclusion would be underscored by the LLM analyses, in which it was found that most facet level difficulties are identical and interchangeable across the three countries while a small number is country dependent. Even if score comparisons are restricted to the facet levels with the same difficulties, at least some of the score differences between the countries are likely to remain. In this line of reasoning the current study has demonstrated the presence of at least some, and presumably large, differences in inductive reasoning, with Western subjects showing the highest skill levels. In my view this interpretation is based on a simplistic and untenable view of country score differences. These differences are not just a matter of differences in inductive reasoning skills. It may well be that differences in country scores on the tasks are partly or entirely due to additional factors. Various educational factors may play a role here, as is often the case in comparisons of highly dissimilar cultural groups. In a meta-analysis, Van de Vijver (1997) found that educational expenditure is a
significant predictor of country differences in mental test performance. Does quality
of schooling have an influence on inductive reasoning? I concur with Cole (1996),
who, after reviewing the available cross-cultural evidence, concluded that schooling
does not have a formative influence on higher-order forms of thinking but tends to
broaden the domains in which these skills can be successfully applied. Schooling
facilitates the use of these skills through training and through exposure to psychological and educational tests (cf. Rogoff, 1981; Serpell, 1993). The educational differences between
the populations of the current study are massive. For example, attending
kindergarten is more common in Turkey and the Netherlands than in Zambia, and
schools in Zambia have a fraction of the learning material that schools in Turkey and
the Netherlands have at their disposal. The interpretation of the country differences
observed in the present study as reflecting real differences is based on an
underestimation of the impact of various context-related (educational) factors and an
overestimation of the ability of the tasks employed here to measure inductive reasoning
in all countries. Tasks that capitalize less on schooling and are more derived from
everyday experiences may show a different patterning of country differences.
The present results replicate the findings of many studies on structural
equivalence; strong support was found that the instruments measure inductive
reasoning in the three countries. The present results make it very unlikely that there
are major cross-cultural differences in strategies and processes involved in inductive
reasoning in the populations studied. These results extend findings of numerous
factor analytic studies in showing that skill components contribute in a largely
identical way to inductive thinking and that item difficulty is governed by complexity rules
that are largely identical across cultures.
The results also show that comparisons of scores obtained in different countries are not justified, despite the careful item construction process. This
negative finding on the numerical comparability may be due to the large cultural
distance of the countries involved here. However, it also points to the need to
address measurement unit and full score equivalence in cross-cultural research.
Cross-cultural comparisons of data that have not been scrutinized for equivalence abound in the literature. In fact, it is rather difficult to find examples of data for which
the equivalence was examined in an appropriate way. Scores are often numerically
compared across cultures (assuming full score equivalence) when only structural
equivalence has been demonstrated; examples can be found in the cross-cultural
comparisons of the Eysenck personality scales (e.g., Barrett et al., 1998). It is
difficult to defend the practice of comparing scores across cultures when equivalence
has not been tested or when only structural equivalence has been observed. The
present study underscores the need to study equivalence of data before comparing
test scores. A more prudent treatment of cross-cultural score differences is badly
needed. We have firmly established the commonality of basic cognitive functions in
several cultural and ethnic groups (Waitz’s “psychic unity”), but we still have to come
to grips with the question of how to design cognitive tests that allow for numerical
score comparisons across a wide cultural range.
A final issue concerns the external validity of the present findings: To what
populations can the present results be generalized? The three countries involved in
the study differ considerably in affluence. Given the strong findings on structural equivalence, it is realistic to assume that inductive reasoning is a universal with largely identical components in schooled populations, at least from the end of
primary school. Future studies should address the question of whether
measurement unit equivalence would be fully supported when the cultural distance
between the countries is smaller.
References
Barrett, P. T., Petrides, K. V., Eysenck, S. B. G., & Eysenck, H. J. (1998). The
Eysenck Personality Questionnaire: An examination of the factorial similarity of P, E,
N, and L across 34 countries. Personality and Individual Differences, 25, 805-819.
Campbell, D. T., & Naroll, R. (1972). The mutual methodological relevance of
anthropology and psychology. In F. L. K. Hsu (Ed.), Psychological anthropology.
Cambridge, MA: Schenkman.
Carroll, J. B. (1993). Human cognitive abilities. A survey of factor-analytic
studies. Cambridge: Cambridge University Press.
Claassen, N. C., & Cudeck, R. (1985). Die faktorstruktuur van die Nuwe Suid-Afrikaanse Groeptoets (NSAG) by verskillende bevolkingsgroepe [The factor structure of the New South African Group Test (NSAGT) in various population groups]. South African Journal of Psychology, 15, 1-10.
Cole, M. (1996). Cultural psychology: A once and future discipline.
Cambridge, MA: Harvard University Press.
Ekstrom, R. B., French, J. W., & Harman, H. H. (1976). Kit of factor-referenced tests. Princeton, NJ: Educational Testing Service.
Ellis, B. B. (1990). Assessing intelligence cross-nationally: A case for
differential item functioning detection. Intelligence, 14, 61-78.
Ellis, B. B., Becker, P., & Kimmel, H. D. (1993). An item response theory
evaluation of an English version of the Trier Personality Inventory (TPI). Journal of
Cross-Cultural Psychology, 24, 133-148.
Embretson, S. E. (1983). Construct validity: Construct representation versus
nomothetic span. Psychological Bulletin, 93, 179-197.
Fan, X., Willson, V. L., & Reynolds, C. R. (1995). Assessing the similarity of
the factor structure of the K-ABC for African-American and White children. Journal of
Psychoeducational Assessment, 13, 120-131.
Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests
[Introduction to the theory of psychological tests]. Bern: Huber.
Fischer, G. H. (1995). The linear logistic test model. In G. H. Fischer & I. W.
Molenaar (Eds.), Rasch models. Foundations, recent developments and
applications. New York: Springer.
Frijda, N., & Jahoda, G. (1966). On the scope and methods of cross-cultural
research. International Journal of Psychology, 1, 109-127.
Geary, D. C., & Whitworth, R. H. (1988). Is the factor structure of the WISC-R
different for Anglo- and Mexican-American children? Journal of Psychoeducational
Assessment, 6, 253-260.
Greenfield, P. M. (1997). You can't take it with you: Why ability assessments
don't cross cultures. American Psychologist, 52, 1115-1124.
Gustafsson, J-E. (1984). A unifying model for the structure of intellectual
abilities. Intelligence, 8, 179-203.
Hakstian, A. R., & Vandenberg, S. G. (1979). The cross-cultural
generalizability of a higher-order cognitive structure model. Intelligence, 3, 73-103.
Hakstian, A. R., & Whalen, T. E. (1976). A k-sample significance test for
independent alpha coefficients. Psychometrika, 41, 219-231.
Hennessy, J. J., & Merrifield, P. R. (1976). A comparison of the factor
structures of mental abilities in four ethnic groups. Journal of Educational
Psychology, 68, 754-759.
Holland, P. W., & Wainer, H. (Eds.) (1993). Differential item functioning.
Hillsdale, NJ: Erlbaum.
Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differential validity of
employment tests by race: A comprehensive review and analysis. Psychological
Bulletin, 86, 721-735.
Irvine, S. H. (1969). Factor analysis of African abilities and attainments:
Constructs across cultures. Psychological Bulletin, 71, 20-32.
Irvine, S. H. (1979). The place of factor analysis in cross-cultural methodology
and its contribution to cognitive theory. In L. Eckensberger, W. Lonner, & Y. H.
Poortinga (Eds.), Cross-cultural contributions to psychology. Lisse, the Netherlands:
Swets & Zeitlinger.
Irvine, S. H., & Berry, J. W. (1988). The abilities of mankind: A revaluation. In
S. H. Irvine & J. W. Berry (Eds.), Human abilities in cultural context. Cambridge:
Cambridge University Press.
Jahoda, G., & Krewer, B. (1997). History of cross-cultural and cultural psychology. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.), Handbook of cross-cultural psychology (2nd ed., Vol. 1). Chicago: Allyn & Bacon.
Jensen, A. R. (1977). Cumulative deficit in intelligence of Blacks in the rural
South. Developmental Psychology, 13, 184-191.
Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Little, T. D. (1997). Mean and covariance structures (MACS) analyses of
cross-cultural data: Practical and theoretical issues. Multivariate Behavioral
Research, 32, 53-76.
McCrae, R. R., & Costa, P. T., (1997). Personality trait structure as a human
universal. American Psychologist, 52, 509-516.
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias.
Journal of Educational Statistics, 7, 105-118.
Messick, S. (1988). Validity. In R. L. Linn (Ed.), Educational measurement
(3rd ed). Hillsdale, NJ: Erlbaum.
Muthén, B. O. (1991). Multilevel factor analysis of class and student
achievement components. Journal of Educational Measurement, 28, 338-354.
Muthén, B. O. (1994). Multilevel covariance structure analysis. Sociological
Methods & Research, 22, 376-398.
Naglieri, J. A., & Jensen, A. R. (1987). Comparison of Black-White
differences on the WISC-R and the K-ABC: Spearman's hypothesis. Intelligence, 11,
21-43.
Poortinga, Y. H., & Van de Vijver, F. J. R. (1987). Explaining cross-cultural
differences: Bias analysis and beyond. Journal of Cross-Cultural Psychology, 18,
259-282.
Ree, M. J., & Carretta, T. R. (1995). Group differences in aptitude factor
structure on the ASVAB. Educational and Psychological Measurement, 55, 268-277.
Reschly, D. (1978). WISC-R factor structures among Anglos, Blacks,
Chicanos, and Native-American Papagos. Journal of Consulting and Clinical
Psychology, 46, 417-422.
Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression
and Mantel-Haenszel procedures for detecting differential item functioning. Applied
Psychological Measurement, 17, 105-116.
Rogoff, B. (1981). Schooling and the development of cognitive skills. In H. C.
Triandis & A. Heron (Eds.), Handbook of cross-cultural psychology: Volume 4,
Developmental psychology. Boston: Allyn & Bacon.
Sandoval, J. (1982). The WISC-R factorial validity for minority groups and
Spearman's hypothesis. Journal of School Psychology, 20, 198-204.
Serpell, R. (1979). How specific are perceptual skills? British Journal of
Psychology, 70, 365-380.
Serpell, R. (1993). The significance of schooling. Life journeys in an African
society. Cambridge: Cambridge University Press.
Sternberg, R. J. (1977). Intelligence, information processing, and analogical
reasoning: The componential analysis of human abilities. New York: Wiley.
Sternberg, R. J., & Kaufman, J. C. (1998). Human abilities. Annual Review of
Psychology, 49, 479-502.
Sung, Y. H., & Dawis, R. V. (1981). Level and factor structure differences in
selected abilities across race and sex groups. Journal of Applied Psychology, 66,
613-624.
Taylor, R. L., & Ziegler, E. W. (1987). Comparison of the first principal factor
on the WISC-R across ethnic groups. Educational and Psychological Measurement,
47, 691-694.
Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs,
No. 1.
United Nations (1999). Indicators on education [On-line]. Available Internet:
www.un.org/depts/unsd/social/education.htm.
Valencia, R. R., & Rankin, R. J. (1986). Factor analysis of the K-ABC for
groups of Anglo and Mexican American children. Journal of Educational
Measurement, 23, 209-219.
Valencia, R. R., Rankin, R. J., & Oakland, T. (1997). WISC-R factor
structures among White, Mexican American, and African American children: A
research note. Psychology in the Schools, 34, 11-16.
Van de Vijver, F. J. R. (1986). The robustness of Rasch estimates. Applied
Psychological Measurement, 10, 45-57.
Van de Vijver, F. J. R. (1997). Meta-analysis of cross-cultural comparisons of
cognitive test performance. Journal of Cross-Cultural Psychology, 28, 678-709.
Van de Vijver, F. J. R., & Leung, K. (1997a). Methods and data analysis of comparative research. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.), Handbook of cross-cultural psychology (2nd ed., Vol. 1). Chicago: Allyn & Bacon.
Van de Vijver, F. J. R., & Leung, K. (1997b). Methods and data analysis for
cross-cultural research. Newbury Park, CA: Sage.
Van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item
response theory. New York: Springer.
Van Haaften, E. H., & Van de Vijver, F. J. R. (1996). Psychological consequences of environmental degradation. Journal of Health Psychology, 1, 411-429.
Willemsen, M. E., & Van de Vijver, F. J. R. (under review). Context effects in
logical reasoning in the Netherlands and Zambia.
Zuckerman, M., Kuhlman, D. M., Thornquist, M., & Kiers, H. A. L. (1991). Five
(or three) robust questionnaire scale factors of personality without culture.
Personality and Individual Differences, 12, 929-941.
Table 1
Sample Size per Culture, Grade, and Experimental Condition

                                  Grade(a)
Country        Test condition      5      6      7      8    Total
Zambia         Figure             80     79     94    123      376
               Letter             81     81     87     79      328
Turkey         Figure            127     97     95    102      421
               Letter            139    107    110    100      456
Netherlands    Figure            117     74     51     77      319
               Letter             83     91     77     62      313
Total                            627    529    514    543     2213

(a) In Zambia the grades are 6, 7, 8, and 9, respectively.
Table 2
Average Proportion of Correctly Solved Items per Task, Grade, and Culture

Country       Grade   IRF   RCF   RGF   RTF   IRL   RCL   RGL   RTL
Zambia          6     .40   .44   .53   .39   .49   .40   .37   .41
                7     .55   .53   .55   .43   .56   .60   .47   .56
                8     .56   .64   .55   .53   .58   .61   .44   .58
                9     .62   .68   .64   .54   .61   .54   .48   .58
Turkey          5     .47   .51   .44   .42   .50   .54   .39   .49
                6     .48   .57   .56   .46   .52   .53   .42   .47
                7     .66   .73   .64   .58   .64   .70   .56   .65
                8     .65   .75   .69   .63   .64   .71   .52   .60
Netherlands     5     .67   .80   .65   .64   .60   .63   .51   .58
                6     .74   .73   .72   .67   .64   .68   .57   .66
                7     .70   .74   .70   .66   .68   .72   .63   .67
                8     .78   .84   .76   .77   .70   .74   .60   .69

Note. IRF: Inductive Reasoning Figures; RCF: Rule Classification Figures; RGF: Rule Generating Figures; RTF: Rule Testing Figures; IRL: Inductive Reasoning Letters; RCL: Rule Classification Letters; RGL: Rule Generating Letters; RTL: Rule Testing Letters.
Table 3
Effect Sizes of Multivariate Analyses of Variance of the Psychological Tests per Test Mode

Independent            Multi-       Inductive   Rule             Rule         Rule
variable               variate(a)   reasoning   classification   generating   testing

(a) Figure mode
Country (C)            .135***      .132***     .183***          .139***      .200***
Grade (G)              .073***      .108***     .149***          .132***      .125***
Sex (S)                .011*        .001        .002             .001         .010**
C x G                  .035***      .013*       .063***          .041***      .014*
C x S                  .012**       .016***     .017***          .014***      .014***
G x S                  .009**       .004        .006             .004         .011**
C x G x S              .009*        .010        .007             .008         .004

(b) Letter mode
Country                .102***      .078***     .113***          .130***      .113***
Grade                  .061***      .122***     .114***          .088***      .125***
Sex                    .014**       .010**      .002             .000         .001
C x G                  .030***      .035***     .051***          .028***      .051***
C x S                  .014***      .014**      .002             .013**       .016***
G x S                  .005         .007        .000             .003         .001
C x G x S              .014***      .017**      .012             .018**       .009

Note. Significance levels of the effect sizes refer to the probability level of the corresponding F ratio of the independent variable(s). (a) Wilks' lambda.
*p < .05. **p < .01. ***p < .001.
Table 4
Overview of the Hypothesis Tests and the Statistical Models Used

Internal procedure (focus on tests of inductive reasoning)
  Question examined: Are facet level difficulties and item difficulties related?
  Statistical model used: linear logistic model
  Condition for structural equivalence: correlations significant in each country (Hypothesis 1a)
  Condition for measurement unit equivalence: correlations significant and identical across countries (Hypothesis 1b)

Internal procedure (focus on tests of inductive reasoning)
  Question examined: Is there item bias?
  Statistical model used: logistic regression
  Condition for full score equivalence: absence of item bias (Hypothesis 3)

External procedure (focus on the relationship of skill components and inductive reasoning)
  Question examined: Are tests of skill components and inductive reasoning related?
  Statistical model used: structural equation modeling
  Condition for structural equivalence: MIMIC parameters significant in each country (Hypothesis 2a)
  Condition for measurement unit equivalence: MIMIC parameters significant and identical across countries (Hypothesis 2b)

MIMIC = Multiple Indicators, Multiple Causes.
Table 5
Accuracy of the Design Matrices per Task and per Country: Means (and Standard Deviations) of Correlations

                            Figures                              Letters
Skill                 Zam        Tur        Net        Zam        Tur        Net
Inductive reasoning   .90 (.03)  .90 (.02)  .81 (.03)  .90 (.02)  .92 (.01)  .90 (.02)
Rule classification   .84 (.04)  .87 (.01)  .76 (.02)  .92 (.01)  .93 (.02)  .88 (.03)
Rule generating       .88 (.01)  .86 (.03)  .81 (.02)  .87 (.01)  .89 (.02)  .92 (.02)
Rule testing          .95 (.01)  .93 (.01)  .91 (.03)  .94 (.01)  .95 (.01)  .94 (.01)

Note. Net = Netherlands; Tur = Turkey; Zam = Zambia.
Table 6
Analysis of Variance of Correlations with Country, Stimulus Mode, and Skill as Independent Variables

Source               df    F          Variance explained
Country (C)           2    24.85***   .10
Stimulus mode (S)     1    79.18***   .22
Skill (Sk)            3    37.94***   .21
C x S                 2    19.92***   .16
C x Sk                6     3.70**    .05
S x Sk                3     9.61***   .10
C x S x Sk            6     1.95      .03
Within-cell error    72    (.0006)    .14

*p < .05. **p < .01. ***p < .001.
Table 7
Fit Indices for Nested Multiple Indicators Multiple Causes Models of Figure and Letter Tasks

Columns: invariant matrices; chi-square with degrees of freedom; contribution to chi-square per country (%) for Zam, Tur, and Net; NNFI; GFI; RMSEA; decrease of chi-square with degrees of freedom.

Invariant matrices    chi2 (df)          Zam   Tur   Net   NNFI   GFI   RMSEA   Delta chi2 (df)

(a) Figure mode
Λy, Γ, Ψ, Φ, Θε       533.97*** (167)    21    32    47    .88    .96   .045    -
Λy, Γ, Ψ, Φ           437.28*** (145)    19    35    47    .89    .96   .043    96.69*** (22)
Λy, Γ, Ψ              180.52*** (79)     20    27    53    .93    .98   .034    256.76*** (66)
Λy, Γ                 134.01*** (46)     20    19    61    .91    .98   .042    46.51 (33)
Λy                    98.72*** (35)      20    18    62    .90    .99   .041    35.29*** (11)

(b) Letter mode
Λy, Γ, Ψ, Φ, Θε       473.33*** (167)    33    33    34    .89    .82   .041    -
Λy, Γ, Ψ, Φ           364.63*** (145)    28    30    41    .91    .87   .037    108.70*** (22)
Λy, Γ, Ψ              180.60*** (79)     35    26    39    .92    .90   .034    184.03*** (66)
Λy, Γ                 82.07** (46)       56    25    19    .95    .94   .027    98.53*** (33)
Λy                    61.61** (35)       54    24    23    .96    .94   .027    20.46* (11)

Note. The choice of a MIMIC model for the figure tests was mainly based on the relatively large change of all fit statistics when constraints were imposed on the phi matrices; therefore, the model with equal factor loadings, regression coefficients, and error variances was chosen. The same model of free, fixed, and constrained parameters also showed an adequate fit for the letter tests. Releasing constraints on the regression coefficients revealed a significant increase of fit. The question was addressed as to whether the decrease of the chi-square statistic was due to systematic country differences in the regression coefficients. An inspection of the regression coefficients per country did not show a clear patterning of country differences. The same question was also addressed by two more analyses; in the first the regression coefficients were allowed to vary across countries but not across grades, while in the second analysis variation was allowed across grades but not across countries. It was found that equality of regression coefficients across the four grades of a country yielded a poorer fit than equality across the three countries per grade (first analysis: chi2(73, N = 1094) = 168.40, p < .001; GFI = .91, NNFI = .92, RMSEA = .035; second analysis: chi2(70, N = 1094) = 131.54, p < .001; GFI = .90, NNFI = .95, RMSEA = .029). The two analyses confirmed that a choice of a model with equal regression coefficients of the letter mode across countries does not lead to the elimination of relevant country differences.
Net = the Netherlands; Tur = Turkey; Zam = Zambia; NNFI = Nonnormed Fit Index; GFI = Goodness of Fit Index; RMSEA = Root Mean Square Error of Approximation; Delta chi2 = decrease of chi-square value.
*p < .05. **p < .01. ***p < .001.
Figure Captions
Figure 1. Estimated facet level difficulties per test and country of the figure mode
Note. The first level of each facet (see Appendix A), arbitrarily set to zero, is not presented in the figure. Net = Netherlands. Tur = Turkey. Zam = Zambia. R2: Item rule: 2; R3: Item rule: 3; P3: Number of figures per period: 3; P4: Number of figures per period: 4; D2: Number of different elements of subsequent figures: 2; D3: Number of different elements of subsequent figures: 3; DV: Number of different elements of subsequent figures: variable; V: Variation across periods: variable; C: Periodicity cues: absent; PR: Periods repeat each other: no; V2: Number of valid triplets: 2; V3: Number of valid triplets: 3; F: One of the alternatives follows the rule: no; NR: Rows repeat each other: no.
Figure 2. Estimated facet level difficulties per test and country of the letter mode
Note. The first level of each facet (see Appendix A), arbitrarily set to zero, is not
presented in the figure. Net = Netherlands. Tur = Turkey. Zam = Zambia. R2: Item
rule: 2; R3: Item rule: 3; R4: Item rule: 4; R5: Item rule: 5; L2: Number of letters: 2;
L3: Number of letters: 3; L4: Number of letters: 4; L5: Number of letters: 5; L6:
Number of letters: 6; LV: Number of letters: variable; D2: Difference in positions in
alphabet: 2; D3: Difference in positions in alphabet: 3; D4: Difference in positions in
alphabet: 4; V2: Number of valid triplets: 2; V3: Number of valid triplets: 3; V4:
Number of valid triplets: 4; V5: Number of valid triplets: 5; F: One of the alternatives
follows the rule: no; NR: Rows repeat each other: no.
Figure 3. Multiple Indicators Multiple Causes model (standardized solution).
Figure 4. Examples of biased items: (a) easy item; (b) difficult item
[Figure 1 appears here: line graphs of estimated facet level difficulty (vertical axis: difficulty; horizontal axis: facet level) for the four figure-mode tests, (a) Inductive Reasoning Figures, (b) Rule Classification Figures, (c) Rule Generating Figures, and (d) Rule Testing Figures, with separate lines for Zambia, Turkey, and the Netherlands.]
[Figure 2 appears here: line graphs of estimated facet level difficulty (vertical axis: difficulty; horizontal axis: facet level) for the four letter-mode tests, (a) Inductive Reasoning Letters, (b) Rule Classification Letters, (c) Rule Generating Letters, and (d) Rule Testing Letters, with separate lines for Zambia, Turkey, and the Netherlands.]
[Figure 3 appears here: path diagrams of the standardized MIMIC solution. Figure mode: the paths from the rule classification, rule generating, and rule testing tasks to the latent variable inductive reasoning are .24, .39, and .46, respectively; the residual of the latent variable is .15, and its loadings on the inductive reasoning figures and inductive reasoning letters tasks are .73 and .67. Letter mode: the corresponding paths are .37, .34, and .34, the residual is .23, and the loadings are .63 and .74.]
[Figure 4 appears here: empirical item characteristic curves, plotting average score (vertical axis) per score level (low, medium, high) for Zambia, Turkey, and the Netherlands; panel (a) shows a relatively easy item, panel (b) a relatively difficult item with the guessing level of .20 marked.]
Appendix A: Test Facets
The figure tests were constructed from the following facets and levels (see the text for an explanation of the facets):

Item rule: 1, 2, 3
Number of figures per period: 2, 3, 4
Number of different elements of subsequent figures: 1, 2, 3, variable
Variation across periods: constant, variable
Periodicity cues: present, absent
Periods repeat each other: yes, no
Number of valid triplets: 1, 2, 3
One of the alternatives follows the rule: yes, no
Rows repeat each other: yes, no

Note. Not every facet is present in every test. The description of the Rule Generating Figures example refers to its first correct answer (1-3-5). IRF: Inductive Reasoning Figures; RCF: Rule Classification Figures; RGF: Rule Generating Figures; RTF: Rule Testing Figures.
The letter tests were constructed from the following facets and levels (see the text for an explanation of the facets):

Item rule: 1, 2, 3, 4, 5
Number of letters: 1, 2, 3, 4, 5, 6, variable
Difference in positions in alphabet: 1, 2, 3, 4
Number of valid triplets: 1, 2, 3, 4, 5
One of the alternatives follows the rule: yes, no
Rows repeat each other: yes, no

Note. Not every facet is present in every test. The description of the Rule Generating Letters example refers to its first correct answer (2-4-6). IRL: Inductive Reasoning Letters; RCL: Rule Classification Letters; RGL: Rule Generating Letters; RTL: Rule Testing Letters.
Appendix B: Examples of Test Items
(a) Inductive Reasoning Figures: Subject is asked to indicate which
row consistently follows one of the item generating rules.
[The rows of figures of this item are not reproduced here.]
(Correct answer: 3)
(b)
Rule Classification Figures: Subject is asked to indicate which rule
applies to the eight figures.
[The eight figures, numbered 1-8, are not reproduced here.]
1. One or more things are added to figures which come after each other
in a group.
2. One or more things are subtracted from figures which come after each
other in a group.
3. In turn, one or more things are added to figures which come after each
other in a group and then, the same number of things is subtracted.
4. None of the rules applies.
(Correct answer: 3)
(c) Rule Generating Figures: Subject is asked to find
one or more groups of three figures that follow one of the
item generating rules.
[Three rows of six figures, numbered 1-6, are not reproduced here.]
(Correct answers: 3-4-6 and 2-3-4)
(d) Rule Testing Figures: Subject is asked to indicate which
row of figures follows the rule at the top of the item.
The rule is:
There are 4 figures in a group. 1 thing is ADDED to
figures which come after each other in a group.
[The rows of figures, with "None of these" as the last alternative, are not reproduced here.]
(Correct answer: 4)
(e) Inductive Reasoning Letters: Subject is asked to indicate which
group of letters does not follow the rule of the other four.
1. MLKJIH
2. GFEDCB
3. UTSRQP
4. ONMLKH
5. XWVUTS
(Correct answer: 4)
(f) Rule Classification Letters: Subject is asked to indicate which
rule applies to the three groups of letters.
SRRRTZ
VVVWXZ
KKKCDF
1. Each group of letters has the same number of vowels.
2. Each group of letters has an equal number of identical letters
and these letters are the same across groups.
3. Each group of letters has an equal number of identical letters
and these letters are not the same across groups.
4. Each group of letters has a number of letters which appear the
same number of positions after each other in the alphabet.
5. Each group of letters has a number of letters which appear the
same number of positions before each other in the alphabet.
6. None of the rules applies.
(Correct answer: 3)
(g) Rule Generating Letters: Subject is asked to find one or
more groups of three boxes of letters that follow one of the item
generating rules.
The boxes (numbered 1-6) contain the following letters: FGHLL, OAILL, V, BCIDOU, PQR, LLEA. [The answer grid is not reproduced here.]
(Correct answers: 2-4-6, 1-2-6, and 1-4-5)
(h) Rule Testing Letters: Subject is asked to indicate which row of boxes of letters follows the rule at the top of the item.
The rule is:
In each box there are four vowels
1. AOUVWI  SRZEIO  VGAOUI
2. AOUVWI  SAREIO  VGAOUI
3. BOUVWI  SAREOI  VGAOUQ
4. AOUVWJ  SARDDO  VGAPUQ
5. None of these
(Correct answer: 2)
Appendix C:
Parameter Estimates of the MIMIC Model per Mode and Cultural Group
A more detailed description of the MIMIC analyses is given here. In order to simplify the presentation and reduce the number of figures to be presented, the covariance matrices of the four grades were pooled per country prior to the analyses (as a consequence, the numbers in this Appendix and in Table 7 are not directly comparable). The table presents an overview of the estimated parameters (top) and fit (bottom). Going from left to right in the table, equality constraints are added, starting with the "core parameters" of the model, the factor loadings (Λy), followed by the regression coefficients (Γ), the error variance of the latent construct labeled Inductive Reasoning (ψ), the covariances of the predictors (Φ), and the error variances of the tasks of inductive reasoning (Θε).
Cells with three different numbers represent the parameter estimates for the Dutch, Turkish, and Zambian group, respectively (e.g., the values 1.09, 0.79, and 0.69 were the factor loadings of the IRL task in these groups in the solution without any equality constraints across cultural groups); cells with one number contain parameter estimates that were set to be identical across countries; cells with an arrow and the word "Same" contain values equal to the left neighboring cell. All parameter estimates are significant (p < .05).
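In the LISREL notation used here, the MIMIC model can be written compactly (a standard formulation, added for reference):

\eta = \Gamma x + \zeta, \qquad y = \Lambda_y \eta + \varepsilon,

where x contains the scores on the three component tasks, \eta is the latent variable Inductive Reasoning, and y contains the scores on the two tasks of inductive reasoning; \Phi = Cov(x), \psi = Var(\zeta), and \Theta_\varepsilon = Cov(\varepsilon).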
[Schematic diagram of the MIMIC models: the observed predictors Rule Classification (Fig/Let), Rule Generating (Fig/Let), and Rule Testing (Fig/Let), with covariances Φ among them, have regression paths Γ to the latent variable Inductive Reasoning (IR), which has disturbance ψ; IR in turn has loadings Λy on Inductive Reasoning Figures (IRF) and Inductive Reasoning Letters (IRL), each with error variance Θε.]
(a) Figure mode

Columns (left to right): no equality constraints, then invariant parameters across countries Λy; Λy, Γ; Λy, Γ, Ψ; Λy, Γ, Ψ, Φ; Λy, Γ, Ψ, Φ, Θε. Cells with three values give the estimates for the Dutch, Turkish, and Zambian samples, respectively; "= left" marks values equal to the cell to the left.

Parameter   None                   Λy                     Λy,Γ                   Λy,Γ,Ψ                 Λy,Γ,Ψ,Φ              Λy,Γ,Ψ,Φ,Θε
λy2         1.09/0.79/0.69         0.83                   0.82                   0.83                   0.83                   0.82
γ1          0.11/0.19/0.21         0.12/0.19/0.20         0.17                   0.17                   0.17                   0.17
γ2          0.18/0.15/0.19         0.20/0.14/0.18         0.17                   0.17                   0.17                   0.18
γ3          0.31/0.34/0.34         0.37/0.33/0.31         0.34                   0.33                   0.33                   0.33
φ11         27.28/39.25/45.69      = left                 = left                 = left                 37.99                  37.99
φ21         29.48/36.82/35.11      = left                 = left                 = left                 34.14                  34.14
φ22         111.76/102.06/104.22   = left                 = left                 = left                 105.56                 105.56
φ31         15.35/23.72/26.31      = left                 = left                 = left                 22.20                  22.20
φ32         31.79/35.93/33.89      = left                 = left                 = left                 34.06                  34.06
φ33         30.02/37.66/45.42      = left                 = left                 = left                 38.08                  38.08
ψ           0.44/4.48/5.86         0.30/4.28/4.73         0.43/4.39/4.83         3.10                   3.10                   3.41
θε1         17.39/18.05/23.25      17.61/18.27/24.53      17.42/18.25/24.35      15.27/19.25/25.80      15.27/19.25/25.80      20.08
θε2         18.73/15.93/19.69      19.49/15.81/19.34      19.69/15.82/19.42      18.44/16.41/20.29      18.44/16.41/20.29      18.12

Proportion of variance accounted for(a)
IR          0.97/0.79/0.79         0.98/0.79/0.80         0.97/0.80/0.79         0.86                   0.84                   0.84
IRF         0.43/0.54/0.54         0.48/0.53/0.49         0.47/0.55/0.49         0.54/0.51/0.46         0.57/0.51/0.45         0.51
IRL         0.46/0.45/0.40         0.37/0.47/0.45         0.34/0.48/0.45         0.40/0.45/0.42         0.43/0.44/0.40         0.43

Fit indices
chi2 (df)   13.25 (2)/8.32 (2)/3.58 (2)   38.35 (8)   43.42 (14)   49.67 (16)   101.92 (28)   123.82 (32)
prob.       .001/.016/.167                .000        .000         .000         .000          .000
Dchi2 (df)  -                             12.20 (2)   5.07 (6)     6.25 (2)     52.25 (12)    21.90 (4)
prob.       -                             .002        .535         .044         .000          .000
NNFI        0.91/0.96/0.99                0.95        0.97         0.97         0.96          0.96
GFI         0.98/0.99/1.00                0.99        0.99         0.99         0.97          0.96
RMSEA       0.13/0.09/0.05                0.10        0.07         0.08         0.08          0.09
(b) Letter mode

Parameter   None                   Λy                     Λy,Γ                   Λy,Γ,Ψ                 Λy,Γ,Ψ,Φ              Λy,Γ,Ψ,Φ,Θε
λy2         1.49/1.23/0.76         1.21                   1.18                   1.19                   1.19                   1.12
γ1          0.11/0.29/0.20         0.13/0.30/0.15         0.23                   0.22                   0.22                   0.22
γ2          0.11/0.07/0.11         0.12/0.07/0.09         0.09                   0.09                   0.09                   0.10
γ3          0.31/0.17/0.31         0.36/0.18/0.21         0.23                   0.23                   0.23                   0.24
φ11         30.24/37.06/38.53      = left                 = left                 = left                 35.55                  35.55
φ21         38.14/49.81/49.59      = left                 = left                 = left                 46.41                  46.41
φ22         161.85/199.72/196.20   = left                 = left                 = left                 187.85                 187.85
φ31         15.25/17.54/18.99      = left                 = left                 = left                 17.32                  17.32
φ32         29.74/39.36/46.45      = left                 = left                 = left                 38.77                  38.77
φ33         19.14/25.90/32.63      = left                 = left                 = left                 25.97                  25.97
ψ           2.38/1.78/7.62         2.85/1.81/4.41         3.27/1.98/4.54         2.69                   2.69                   3.29
θε1         21.74/14.95/30.96      21.49/14.91/35.30      21.19/14.85/34.43      21.51/14.42/35.76      21.51/14.42/35.76      22.28
θε2         11.03/15.49/21.21      12.22/15.55/19.36      12.78/15.66/20.33      13.34/14.95/22.53      13.34/14.95/22.53      16.45

Proportion of variance accounted for(a)
IR          0.77/0.85/0.66         0.79/0.85/0.64         0.72/0.84/0.71         0.81                   0.79                   0.76
IRF         0.32/0.44/0.42         0.39/0.44/0.26         0.35/0.45/0.31         0.34/0.47/0.28         0.37/0.47/0.26         0.38
IRL         0.68/0.53/0.38         0.62/0.53/0.48         0.56/0.52/0.52         0.53/0.55/0.46         0.57/0.54/0.44         0.51

Fit indices
chi2 (df)   1.32 (2)/2.44 (2)/2.32 (2)   24.42 (8)   54.32 (14)   57.35 (16)   104.51 (28)   187.75 (32)
prob.       .517/.295/.313               .002        .000         .000         .000          .000
Dchi2 (df)  -                            18.34 (2)   29.90 (6)    3.03 (2)     47.16 (12)    83.24 (4)
prob.       -                            .000        .000         .220         .000          .000
NNFI        1.01/1.00/1.00               0.97        0.96         0.96         0.96          0.93
GFI         1.00/1.00/1.00               0.98        0.98         0.97         0.96          0.92
RMSEA       0.00/0.02/0.02               0.07        0.09         0.08         0.08          0.12

Note. Values in cells refer to the nonstandardized solution; λy1 is fixed at a value of one. (a)The IR, IRF, and IRL rows refer to the proportions of variance accounted for in the latent variable and the two inductive reasoning tasks, respectively.
IR: Inductive Reasoning (latent construct). IRF: Inductive Reasoning Figures. IRL: Inductive Reasoning Letters. NNFI = Nonnormed Fit Index; GFI = Goodness of Fit Index; RMSEA = Root Mean Square Error of Approximation.