AN ITEM RESPONSE THEORY REVISION OF THE INTERNAL CONTROL INDEX
A Thesis
Presented to the faculty of the Department of Psychology
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF ARTS
in
Psychology
(Industrial/Organizational Psychology)
by
Leanne M. Williamson
SPRING
2012
AN ITEM RESPONSE THEORY REVISION OF THE INTERNAL CONTROL INDEX
A Thesis
by
Leanne M. Williamson
Approved by:
__________________________________, Committee Chair
Lawrence S. Meyers, Ph.D.
__________________________________, Second Reader
Tim W. Gaffney, Ph.D.
__________________________________, Third Reader
Jianjian Qin, Ph.D.
____________________________
Date
Student: Leanne M. Williamson
I certify that this student has met the requirements for format contained in the University format
manual, and that this thesis is suitable for shelving in the Library and credit is to be awarded for
the thesis.
__________________________, Graduate Coordinator ___________________
Jianjian Qin, Ph.D.
Date
Department of Psychology
Abstract
of
AN ITEM RESPONSE THEORY REVISION OF THE INTERNAL CONTROL INDEX
by
Leanne M. Williamson
The Internal Control Index (ICI) is a 28-item measure of locus of control. The present study
utilized item response theory (IRT) to determine if the length of the instrument can be reduced
to improve its psychometric properties. Students at CSU Sacramento (N = 631) completed the
ICI for course credit. When the scale was reduced to 11 items, the graded response model
demonstrated a good fit, M2(484) = 823.34, p < .001, RMSEA = .03. Because summed scores
were strongly linearly related to IRT scale scores (θ), r(596) = .997, p < .001, they were
considered to be a good approximation of locus of control. Additionally, summed scores on the
original and new scoring strategies correlated highly, r(596) = .82, p < .001, and both versions
correlated similarly with related constructs. It therefore appears that the ICI-R may provide an
equally valid and more efficient measure of locus of control.
_______________________, Committee Chair
Lawrence S. Meyers, Ph.D.
_______________________
Date
ACKNOWLEDGEMENTS
Many individuals were instrumental in the completion of this thesis. I especially
appreciate the incredible time and expertise that Dr. Larry Meyers has contributed to the
development of my methodological and critical thinking skills, and for endlessly encouraging
me to pursue my passion for psychometric theory. Because of his influence, I have grown more
than I ever thought I could have in three years. My experience in the master’s program at Sac
State would not have been the same without him.
I also owe a large debt of gratitude to Tim Gaffney for the many hours we have spent
discussing the intricacies of psychometric theory and working through computations together.
Throughout the development and completion of my thesis, he has been a selfless and invaluable
resource. It has been absolutely wonderful to have access to someone with so much passion for
this subject matter.
I would also like to thank Dr. Jianjian Qin for his work as a member of my committee
and for his considerable role in inspiring my academic goals. His research methods and
statistics classes were part of what convinced me to pursue a career in research methodology,
and without that goal I never would have attempted this thesis. He has an amazing ability to
explain statistics in plain English and, I dare say, make the learning process fun.
Steve Reise personally suggested the methodology for the IRT scale revision. I have
also learned much of what I claim to know about IRT by reading his work.
Additionally, I am grateful to my research assistants for their dedication and hard work
on this project. I would especially like to thank Chereé Ramón, Ben Trowbridge, and Mike
Whitehead for their valuable contributions and good humor.
Last but certainly not least, I would like to thank my family and friends for their support
and understanding during the many times I seemingly disappeared from the face of the planet to
do my work. Special thanks go to my husband, Andrew Williamson, for encouraging me to
pursue my master’s degree and doing more than his share of dishes throughout the process.
Additionally, the friendship and insights of Kasey Stevens, Sanja Durman-Perez, Najia Nafiz,
and Lilly Aston have meant the world to me over the past few years.
TABLE OF CONTENTS
Page
Acknowledgements....................................................................................................................... v
List of Tables ............................................................................................................................... ix
List of Figures ............................................................................................................................... x
1. INTRODUCTION.................................................................................................................. 1
Overview of CTT ................................................................................................................... 2
Overview of IRT .................................................................................................................... 5
The 2PLM ........................................................................................................................ 7
The GRM ....................................................................................................................... 10
Assumptions .................................................................................................................. 12
Dimensionality ........................................................................................................ 13
Local Independence ................................................................................................ 14
Monotonicity/Functional Form ............................................................................... 15
Advantages of Using IRT in Test Revision ................................................................... 16
Test Information Curve ........................................................................................... 16
Test Standard Error Curve ...................................................................................... 17
Test Characteristic Curve ........................................................................................ 17
Present Study........................................................................................................................ 18
2. METHOD............................................................................................................................. 21
Participants and Procedure ................................................................................................... 21
Materials............................................................................................................................... 22
ICI .................................................................................................................................. 22
Study 1 Variables........................................................................................................... 24
Study 2 Variables........................................................................................................... 24
Software ............................................................................................................................... 24
3. RESULTS ............................................................................................................................ 26
Comparison of ICI Data from the Two Studies ................................................................... 26
CTT Psychometric Properties of the ICI .............................................................................. 27
Dimensionality Assessment of the ICI ................................................................................. 29
IRT Psychometric Properties of the ICI ............................................................................... 30
IRT-Based Revision of the ICI............................................................................................. 32
Dimensionality Assessment of the ICI-R ............................................................................. 33
Psychometric Properties of the ICI-R .................................................................................. 36
Comparison of the ICI and ICI-R......................................................................................... 39
4. DISCUSSION ...................................................................................................................... 41
References .................................................................................................................................. 44
LIST OF TABLES
Tables
Page
1. Participant Demographic Data ............................................................................................. 22
2. ICI Items .............................................................................................................................. 23
3. CTT Item Statistics for the ICI............................................................................................. 28
4. IRT Item Parameter Estimates for the 26-Item Version of the ICI ...................................... 32
5. IRT Item Parameter Estimates for the 11-Item ICI-R .......................................................... 36
6. Predicted Summed Score to Scale Score Conversion for the ICI-R .................................... 39
7. Correlations of the ICI and ICI-R with Other Constructs .................................................... 40
LIST OF FIGURES
Figures
Page
1. 2PLM trace line ...................................................................................................................... 8
2. 2PLM trace lines with different location (b) parameters ....................................................... 9
3. 2PLM trace lines with different slope (a) parameters .......................................................... 10
4. GRM trace lines ................................................................................................................... 12
5. Test information and standard error curves for the ICI-R .................................................... 37
6. Test characteristic curve for the ICI-R ................................................................................. 38
Chapter 1
INTRODUCTION
The purpose of most psychological tests is to estimate individuals’ standing on some
unobservable psychological construct (e.g., happiness). Test items are often developed based on
theory in order to assess that target construct, and responses to items on a test are implicitly
hypothesized to be manifest indicators of that construct. Validity evidence can be gathered to
support the hypothesized relationship between item responses and the construct of interest to
provide evidence for the utility of the test (Borsboom, 2005, 2006). In statistical terms, the
underlying construct is often referred to as a latent variable and the goal of a psychometric study
is to provide empirical evidence that either supports or refutes the argument that a set of items
measures the intended latent variable.
Classical test theory (CTT) and item response theory (IRT) are two somewhat loosely
defined theoretical frameworks that are routinely used to evaluate the psychometric properties of
tests, or the degree to which the statistical properties of a test support its intended interpretation
and use (Algina & Penfield, 2009). Traditionally, personality researchers have relied on CTT in
developing and scoring personality assessment instruments, whereas IRT has dominated in large-scale educational testing (Embretson & Reise, 2000). However, the use of IRT modeling is
becoming more common in personality research (e.g., Chernyshenko, Stark, Chan, Drasgow, &
Williams, 2001; Edwards, 2009; Maydeu-Olivares, 2005; Waller & Reise, 2010). Although CTT
and IRT subsume different measurement models and statistical methods, statistics generated from
each framework tend to be complementary in practical testing applications (Thissen & Orlando,
2001). In fact, some researchers have argued that CTT analyses should routinely precede IRT
analyses so that items of poor psychometric quality can be screened out prior to IRT analyses
(e.g., Morizot, Ainsworth, & Reise, 2007). Although CTT and IRT are routinely used to develop
personality measurement instruments, both can also be productively used to refine existent
inventories. In the present study, we primarily used IRT measurement models to evaluate and
shorten a measure of locus of control.
Overview of CTT
CTT has been ubiquitous in test development since at least the 1930s (Embretson &
Reise, 2000). Defining works in CTT include Harold Gulliksen’s (1950) Theory of Mental Tests
(Embretson & Reise, 2000; Haertel, 2006; Lord & Novick, 1968) and Lord and Novick’s (1968)
Statistical Theories of Mental Test Scores (Embretson & Reise, 2000; Haertel, 2006; Thissen &
Wainer, 2001). In his classic text, Gulliksen (1950) credited the bulk of the important CTT
formulas to the early papers of Charles Spearman. In these papers, Spearman (1904a, 1904b,
1907, 1910, 1913) addressed a fundamental problem in psychological testing: All test scores are
influenced by measurement error that obscures their interpretation, and methods are needed to
assess the impact of error on test scores in order to determine whether they measure anything
reliably.
A precise definition of CTT is difficult to pinpoint, but Nunnally and Bernstein (1994)
tentatively proposed that CTT comprises those methods by which test taker attributes are estimated from linear combinations of responses to the individual items on a test. For example, on a
measure of happiness with a standardized 5-point response scale for each item, an individual’s
happiness score can be expressed as the sum of that individual’s item responses. It is assumed
that item responses are on a continuous scale, and that item and test scores are linearly related to
the latent construct assessed by the test. Within a CTT framework, individuals are assumed to
have a theoretical true score on a test, which is the score that the test taker would achieve given
perfect measurement (Nunnally & Bernstein, 1994). Because tests measure latent constructs
imperfectly (i.e., they are influenced by measurement error), observed scores differ from true
scores. Classical true score theory can be elegantly expressed as:
X = T + E
where X is the observed score, T is the theoretical true score, and E is error. If it were possible to
administer a test an infinite number of times, the average observed score would closely
approximate the true score (Nunnally & Bernstein, 1994). A test cannot, of course, be administered an infinite number of times, but an estimated true score can be obtained using the
observed score and error estimates for the test. The above formula can also be represented in
terms of variance components such that:
σ²X = σ²T + σ²E
where σ²X is the variance of the observed scores, σ²T is the variance of the true scores, and σ²E is error variance. In CTT, error variances are assumed to be normally distributed and unsystematic
(Algina & Penfield, 2009).
Based on the above partitioning of variances, reliability can be expressed as:
ρ = σ²T / σ²X
That is, reliability (ρ) is conceptualized as the proportion of true score variance contained in a set of observed scores. A test is most reliable when the influence of error variance on observed scores is minimal; thus, a test measures a construct most precisely when reliability is high and
error is low. For most psychological tests, CTT measurement precision is typically expressed in
terms of global summary statistics such as coefficient α and the overall standard error of
measurement for a given test.
Test reliability is often conceptualized in terms of internal consistency, which means that
item responses caused by the same latent construct should covary (Zinbarg, Yovel, Revelle, &
McDonald, 2006). Coefficient α is a particularly common index in psychological testing and is widely regarded as a measure of internal consistency because it is a function of the item intercorrelations on a test. Coefficient α represents a lower bound on the population reliability
of a test score (McDonald, 1999). Within a CTT framework, tests with αs of .80 or greater are
generally considered to be reliable (Nunnally & Bernstein, 1994).
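To make these CTT computations concrete, the following minimal Python sketch (illustrative only; the response data are simulated and are not the data analyzed in this thesis) computes coefficient α and the corrected item-total correlations referred to later in the Results:

```python
import numpy as np

def coefficient_alpha(responses):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance of summed score)."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def corrected_item_total(responses):
    """Correlation of each item with the sum of the remaining items."""
    total = responses.sum(axis=1)
    return np.array([np.corrcoef(responses[:, i], total - responses[:, i])[0, 1]
                     for i in range(responses.shape[1])])

# Simulated 5-point responses (200 respondents, 10 items) driven by one trait.
rng = np.random.default_rng(0)
theta = rng.normal(size=(200, 1))
responses = np.clip(np.round(3 + theta + rng.normal(size=(200, 10))), 1, 5)
print(round(float(coefficient_alpha(responses)), 2))
print(corrected_item_total(responses).round(2))
```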
Coefficient α has some limitations and a number of measurement experts discourage
interpreting coefficient α as a measure of internal consistency (Bentler, 2009; McDonald, 1999;
Sijtsma, 2009). Unfortunately, coefficient α is influenced substantially by test length (due to the
method by which α is computed), with longer tests (all else equal) more likely to demonstrate
higher values of α (Nunnally & Bernstein, 1994). Additionally, tests with overly repetitive items
assessing substantively narrow constructs tend to have very high αs but may have little predictive
validity, so high values of α are not always indicative of high-quality tests (Horn, 2005).
CTT has been a popular framework for test development for decades because it is
relatively simple and the statistics it provides are relatively easy to interpret. However, CTT is
limited in that the information it provides is sample and test dependent; that is, test and item level
statistics are estimated for the sample of applicants for the specific test administered and apply
only to that sample. Additionally, test scores are only meaningful in relation to the specific test
completed and the population from which the sample that completed it was drawn. For this
reason, results of CTT analyses conducted with one group of test takers (e.g., college students)
who completed one test do not readily generalize to different populations (e.g., job applicants) or
other tests purported to measure the same construct (Yen & Fitzpatrick, 2006). Even within a
single population, changes made to a single item can alter the interpretation of the test score. As
more items on a test are removed or replaced, previous CTT analyses of the test become less
applicable to the new version of the test.
Overview of IRT
IRT consists of a collection of mathematical models and statistical methods used for item
analysis and test scoring (Thissen & Steinberg, 2009). David Thissen and his colleagues (Thissen
& Orlando, 2001; Thissen & Steinberg, 2009) have documented the development of IRT,
crediting its conceptual foundation to a 1925 paper by Louis Leon Thurstone. In his paper,
Thurstone (1925) demonstrated that Binet intelligence test items could be arranged along a
continuum representing mental age (M); specifically, the difficulty of an item could be regarded
as the point on the M continuum at which 50% of children of that mental age would answer the
item correctly. Thus, Thurstone contributed the idea that items and people could be located on the
same metric, a concept underlying all IRT models (Thissen & Steinberg, 2009). In 1950,
Lazarsfeld offered two other fundamental concepts underlying all IRT models: (a) item responses
are driven by a latent variable that (b) explains the observed relationships among a set of item
responses (Thissen & Steinberg, 2009). IRT was formally introduced in Lord and Novick’s
(1968) classic text within four chapters written by Allan Birnbaum (Thissen & Steinberg, 2009).
However, the widespread implementation of IRT was limited until R. Darrell Bock and Murray Aitkin (1981) provided an efficient method for estimating IRT parameters, maximum marginal likelihood estimation with an EM algorithm; this approach is currently the most widely implemented
estimation procedure in IRT software programs (Thissen & Steinberg, 2009).
IRT models account for the fact that psychological tests, such as personality inventories,
often utilize summative response scales with ordered categories rather than continuous
measurement scales (Wirth & Edwards, 2007). Individuals’ levels of the latent trait (θ) are typically estimated
directly from the patterns of item responses using nonlinear models (Edwards, 2009). In a
unidimensional IRT analysis, items and individuals are located on the θ scale based on
individuals’ patterns of responses to items assessing a common construct. Given appropriate
sampling, this allows test developers to evaluate the properties of items as well as the
measurement precision of a test relative to the trait levels of the population for which the test is
intended. However, the metric of θ is indeterminate and must be fixed before IRT item and
person parameters can be estimated. Typically, this is accomplished by setting the sample mean
to zero and the sample standard deviation to unity. If the underlying latent trait is assumed to be
normally distributed, then the interpretation of the resulting parameters is similar to the
interpretation of z scores (Edwards, 2009). For example, an estimated θ of zero indicates that an
individual is of average ability relative to the rest of the sample.
IRT provides test and item level statistics that are theoretically sample independent; that
is, the statistical parameters from different samples can be directly compared after they are placed
on a common metric (de Ayala, 2009). Given acceptable model-data fit, the use of IRT in test
development provides several advantages not available using CTT analyses, including more
complex test designs (von Davier, 2010). IRT can also provide several advantages for researchers
revising personality measures. For instance, when changes are made to a test, such as the deletion
or replacement of some items, IRT linking and equating procedures allow test developers to
meaningfully compare the new measure to older versions of the measure.
Although IRT models have many theoretically desirable features, most widely used IRT
models involve strong assumptions that must be met, at least to a certain extent, before item and
person parameters can be meaningfully interpreted (Embretson & Reise, 2000). Additionally,
large sample sizes are typically needed to obtain stable estimates. No definitive sample size
guidelines can be provided and expert recommendations depend upon such factors as test length,
sample characteristics, item properties, and the specific IRT model used, but many popular
models are believed to require sample sizes of 500 test takers or more for accurate estimation (de
Ayala, 2009).
A variety of IRT models exist, but the most commonly applied IRT models are
parametric logistic models (Thissen & Orlando, 2001). According to Thissen and Orlando (2001),
logistic models are prevalent because they tend to provide theoretically and practically
meaningful approaches to describing item response data. Item responses can be modeled as
probabilistic functions of the psychometric properties of items and latent trait scores, given that
the properties of the items and individuals have been estimated appropriately. The matter of
which IRT model, if any, is appropriate for a given application depends in part upon the number
of response options associated with each item. Items with two response options are called
dichotomous items (e.g., true/false scales). Items with more than two response options are called
polytomous items (e.g., Likert scales). Edwards (2009) identified two models that have been of
particular interest to psychologists: the 2 parameter logistic model (2PLM; Birnbaum, 1968) for
dichotomous items and the logistic graded response model (GRM; Samejima, 1969, 1997) for
polytomous items.
The 2PLM
Using notation similar to Thissen and Steinberg (2009), the probability of endorsing an
item under the 2PLM can be expressed as:
Pi(xi = 1 | θ) = 1 / (1 + e^−ai(θ − bi))
where Pi(xi = 1 | θ) is the probability of endorsing item i in the keyed direction (xi = 1) given an individual’s latent trait score (θ), ai is the item slope parameter, and bi is the item location
parameter. This definition assumes that all parameter values for individuals (θ) and items (ai, bi)
are known (i.e., they have been estimated). Figure 1 displays a 2PLM trace line, a graphical
depiction of the item properties of a 2PLM item. The location parameter is measured at the point
on the trace line at which examinees with the corresponding trait level (θ) have a 50/50 chance of
endorsing the keyed response. This occurs at the point of inflection on the 2PLM trace line. The
item in Figure 1 has a b value of 0. Based on the 2PLM, an individual of average ability (i.e., with
a θ of 0) would have approximately a 50% probability of endorsing this item.
Figure 1. 2PLM trace line. This function indicates the relationship between the latent trait (θ) and
the probability of choosing the keyed alternative.
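The 2PLM equation above is straightforward to evaluate directly. The short Python sketch below (the item parameters are hypothetical) reproduces the behavior described for Figure 1, including the .50 endorsement probability at θ = b:

```python
import numpy as np

def p_2plm(theta, a, b):
    """2PLM probability of endorsing the keyed response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# For the Figure 1 item (b = 0), an average respondent (theta = 0) has a
# .50 probability of endorsement; the slope value here is hypothetical.
print(p_2plm(np.array([-2.0, 0.0, 2.0]), a=1.0, b=0.0).round(3))
```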
Conceptually, the location parameter accounts for the fact that some items are easier or
more difficult to endorse than others (i.e., they have different b values). Under the 2PLM,
individuals with higher levels on the latent construct are more likely to respond to a difficult-to-endorse item in the keyed direction. Figure 2 provides an example of two items with different b
values. Individuals of average ability would have a 50% probability of endorsing item 1 (b = 0),
whereas individuals would need to be a standard deviation above average on θ (if a Gaussian
distribution were assumed) in order to have a 50% probability of endorsing item 2 (b = 1). Thus,
item 2 is more difficult to endorse than item 1. For example, on a scale with higher scores
indicative of greater happiness and a dichotomous response format (e.g., true/false), endorsing
item 2 would be indicative of greater happiness than endorsing item 1.
Figure 2. 2PLM trace lines with different location (b) parameters. Item 2 is more difficult to
endorse than item 1.
The slope (a) parameter is analogous to a factor loading in factor analysis (McDonald,
1999). However, because the 2PLM is a nonlinear model, the slope is not constant across θ
(whereas the relationship between an item factor loading and the latent trait is assumed to be
linear in most factor analytic models). The slope parameter is measured at the point of inflection
on the trace line, which is the point at which the trace line is the steepest. Conceptually, the slope
parameter indicates how well the item differentiates between test takers with different levels of θ.
More discriminating items provide more information about a test taker’s score on the latent
construct because these items are more related to the latent construct. Figure 3 illustrates two
trace lines with the same item locations but different slopes. Item 4 (a = 2) is more strongly
related to θ than item 3 (a = 1) because it has a steeper slope. Because item 4 is more strongly
related to θ, it provides more information about an individual’s location on the θ scale than
item 3. Continuing with our hypothetical measure of happiness, item 4 would be more indicative
of happiness than item 3.
Figure 3. 2PLM trace lines with different slope (a) parameters. Item 4 is more discriminating
than item 3.
The GRM
The GRM is an extension of the 2PLM for items with polytomous response scales. For
each item, the GRM incorporates a single slope (a) parameter and multiple location (b)
parameters. The number of estimated location parameters for an item with k categorical response
options is equal to k – 1, or one less than the number of response options. Under the GRM, item
response categories are assumed to be ordered, such that higher levels of θ correspond to
respondents selecting a higher response category (e.g., strongly agree rather than agree). By
convention, the first response option is set to 0. Thus, the options for an item with k response
options are scaled as 0, 1, … k – 1. Each location parameter corresponds to the probability of
selecting category k or higher. For example, b1 indexes the probability of selecting category 1 or
higher, given the conventional approach to scaling. An interesting feature of the GRM is that the
distance (on the θ scale) between the response options is freely estimated for each item; that is,
the CTT assumptions that response options are on an equal interval scale and the response scale
has a consistent meaning across items do not apply under the GRM. This is because each location
parameter is estimated as a separate 2PLM item, as displayed in the upper panel of Figure 4.
The GRM for an item with five response options can be expressed as:
Pi(xi = k | θ) = 1 / (1 + e^−ai(θ − bi,k)) − 1 / (1 + e^−ai(θ − bi,k+1))
or, more simply:
Pi(xi = k | θ) = Pi*(k) − Pi*(k + 1)
where all variables are as previously defined. The equations above make it possible to graph the probability of responding in each category as a function of θ. This is depicted in the lower panel of Figure 4, in which increased levels of θ correspond to selecting progressively higher options
on the response scale.
Figure 4. GRM trace lines. The upper panel illustrates the meaning of the b values for a 5 option
item. The lower panel illustrates the model-based probability of selecting each response option
for that same item.
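The category probabilities in the lower panel of Figure 4 follow directly from differencing adjacent cumulative curves, as in the equations above. The following Python sketch (hypothetical item parameters) implements this, using the boundary conventions P*(0) = 1 and P*(k) = 0:

```python
import numpy as np

def grm_category_probs(theta, a, b_thresholds):
    """GRM probability of each response category 0..k-1, where P*(k) is the
    cumulative probability of responding in category k or higher."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    p_star = [np.ones_like(theta)]                       # P*(0) = 1
    for b in b_thresholds:                               # k - 1 ordered thresholds
        p_star.append(1.0 / (1.0 + np.exp(-a * (theta - b))))
    p_star.append(np.zeros_like(theta))                  # P*(k) = 0
    p_star = np.array(p_star)
    return p_star[:-1] - p_star[1:]                      # P(x = k) = P*(k) - P*(k + 1)

# A 5-option item (four thresholds), evaluated at theta = 0.
probs = grm_category_probs(0.0, a=1.5, b_thresholds=[-2.0, -0.5, 0.5, 2.0])
print(probs.ravel().round(3), float(probs.sum()))        # probabilities sum to 1
```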
Assumptions
As with most statistical procedures, psychometric analyses based on IRT have several
underlying assumptions. Three primary assumptions of the 2PLM and GRM are
unidimensionality, local independence, and monotonicity/functional form. Although these
assumptions are never precisely met by real data sets, test developers can be confident in the
results of an IRT analysis to the degree that these assumptions hold (Embretson & Reise, 2000).
These assumptions are typically made more explicit and testable in IRT, but they can also be
related to CTT assumptions, as described below.
Dimensionality. In IRT modeling, it is assumed one or more individual difference
variables explain the relationships among individuals’ responses to items on a test (McDonald,
1999). Most IRT applications are predicated on the assumption that the observed relationships
among item responses can be fully explained by a single continuous (usually normally
distributed) individual difference construct. This means that the test is unidimensional. In both
CTT and IRT, test scores that represent a single identifiable common construct are the most
readily interpretable (Gustafsson & Åberg-Bengtsson, 2010; Thissen & Orlando, 2001).
Unfortunately, no real test is purely unidimensional, which presents a quandary for
researchers wishing to apply unidimensional IRT models. As a result, there has been much debate
in the IRT literature concerning the appropriateness of a wide variety of dimensionality
assessment methods (see Embretson & Reise, 2000; Gustafsson & Åberg-Bengtsson, 2010). The goal of
many of these methods is to help the investigator judge whether a test is “unidimensional
enough” to warrant the application and substantive interpretation of a unidimensional IRT model.
Generally speaking, observed data are often consistent with a number of different dimensional
structures and the true dimensionality is not known. However, some researchers have argued that
unidimensional IRT modeling may be acceptable if the researcher can first provide evidence of a
strong general factor in the data.
A common approach to addressing the dimensionality problem is to use a combination of
exploratory and confirmatory factor analytic methods designed for ordinal data prior to fitting an
IRT model (Ackerman, Gierl, & Walker, 2003; Wirth & Edwards, 2007). These methods utilize
tetrachoric correlations for dichotomous item responses, polychoric correlations for polytomous
item responses, and appropriate estimation methods. Many fit indices have been developed for
use in evaluating linear confirmatory factor analysis models designed for continuous variables,
and cutoff values have been proposed for these indices to aid researchers in evaluating model fit
(e.g., Browne & Cudeck, 1993; Hu & Bentler, 1999). Some IRT researchers have endorsed
evaluating confirmatory ordinal factor analytic models using these benchmark cutoffs in
assessing whether item response data are unidimensional enough for IRT modeling (e.g.,
Edwards, 2009). However, others have argued that this approach may be unsatisfactory and that alternative methods are needed to assess whether data are unidimensional enough for IRT
(Cook, Kallen, & Amtmann, 2009; Reise, Moore, & Haviland, 2010; Reise, under review).
One approach that has been proposed as a framework for addressing this problem is the
use of bifactor modeling (e.g., Reise et al., 2010; Reise, under review). In a bifactor model, all
items load onto a general factor that is expected to represent the construct of interest. In the most
interpretable type of bifactor structure, each item also loads onto one and only one group factor,
and the general and group factors are constrained to be orthogonal. The general factor is believed
to reflect the construct the researcher intends to measure, whereas the group factors represent
multidimensionality inherent in the data. The bifactor model can be compared to a
unidimensional model. If the data are unidimensional enough for IRT, the factor loadings on the
general factor should be similar to the factor loadings obtained from a unidimensional solution.
Substantial differences suggest that forcing multidimensional data into a unidimensional IRT
model may result in an invalid solution.
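As a small illustration of this comparison (all loading values below are hypothetical), one can inspect item-by-item differences between the loadings from a unidimensional solution and the general-factor loadings from a bifactor solution:

```python
import numpy as np

# Hypothetical loadings for six items: a unidimensional solution versus the
# general factor of a bifactor solution. Large discrepancies (here, item 4)
# flag items whose responses are driven substantially by group factors.
uni     = np.array([0.62, 0.55, 0.70, 0.48, 0.66, 0.59])
general = np.array([0.60, 0.52, 0.68, 0.30, 0.64, 0.57])
print(np.abs(uni - general).round(2))
```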
Local Independence. This assumption, also called conditional independence, is closely
related to the dimensionality assumption (Embretson & Reise, 2000). Local independence means
that, after accounting for the common construct (or constructs) that a test measures (i.e., after
fitting the IRT model), there is no remaining relationship among the items (de Ayala, 2009; Yen,
1993). Local independence is typically operationally defined such that, after controlling for , no
substantial relationship remains between any pair of items. This is analogous to the assumption in
CTT and factor analysis that, after controlling for the common factor, the residual item variances
or error variances are uncorrelated. If a set of items contains many locally dependent item pairs,
this indicates the presence of multidimensionality that is not accounted for by the model (Thissen
& Steinberg, 2010), which may suggest that a unidimensional model is not appropriate.
Conversely, the absence of local dependencies implies unidimensionality (Edwards, 2009).
Monotonicity/Functional Form. The monotonicity and functional form assumptions are
closely related, but distinct. Both refer to the relationship between the latent construct presumed
to underlie item responses and the probability of selecting a particular item response category. If
the monotonicity assumption is met for a 2PLM item, then the probability of endorsing an item
increases as a function of θ. For a GRM item, it is expected that individuals with greater levels of
θ will more strongly endorse an item measuring θ (e.g., select strongly agree rather than agree on
a summative response scale).
The functional form assumption is met to the extent that observed item response patterns
are consistent with the item response functions predicted by the CTT or IRT model used to
analyze a given data set. In CTT, the expected function is a straight line because CTT is based on
the general linear model. Under the 2PLM, the expected functional form assumed to fit the data is
the type of logistic “S”-shaped curve that was depicted in Figure 1. Because 2PLM trace lines
model a probabilistic relationship, the function is bounded by zero at the lower asymptote and
unity at the upper asymptote; that is, the endorsement probability cannot be less than zero or
greater than one. GRM item responses are expected to conform to the type of function that was
depicted in Figure 4.
Advantages of Using IRT in Test Revision
Given acceptable model-data fit, the methods and models of IRT can provide researchers
with powerful tools for revising personality tests. These advantages are due primarily to IRT
conceptualizations of measurement precision. As with IRT item-level functions (e.g., trace lines),
IRT measurement precision for a test is typically represented graphically as a function of θ and
allowed to vary across the θ continuum (as opposed to CTT, where overall summary statistics are
more commonly reported). The following test-level functions are related to IRT measurement
precision: the test information curve, test standard error curve, and test characteristic curve.
Test Information Curve. A test information curve depicts the amount of information
that a test, including all items, provides for estimating each individual’s location on the θ scale
(de Ayala, 2009). The shape of this function is directly related to the IRT parameters of a set of
test items; specifically, test information is greatest where the test items are concentrated (i.e.,
more b values are located) and the items are more related to θ (i.e., the a values are larger). A test
best differentiates between individuals of different θ levels at the point at which test information
is the highest. A test differentiates less well at θ levels where the test information function is
lower.
If a sufficient pool of items has been generated to assess a latent construct and then
appropriately calibrated using IRT, test information curves can be engineered for specific
purposes. For instance, if measurement precision is desired around a certain cut score (e.g., for a
pass/fail test), a test can be designed with item location parameters concentrated near the cut
score in order to optimally differentiate individuals who are above and below the cut score.
Alternatively, if precision is desired along the entire range of the latent construct, as is often the
case with personality measures, a test can be designed to include item location parameters that span
the θ continuum in order to measure precisely across the range of θ.
Test Standard Error Curve. A test standard error curve depicts the amount of
measurement error in a test, controlling for θ. It is computed as the inverse of the square root of
information at each θ level (Yen & Fitzpatrick, 2006). Where the curve is lower, there is less
error in estimating an individual’s location on the θ scale. Conversely, where the curve is higher,
there is more error and thus less precision in estimating individuals’ θ scores. The θ at which the
standard error function is the lowest corresponds to the peak of the test information function;
thus, IRT measurement precision for a test is maximized for the θ (or range of θs) in which
information is the highest and error is the lowest.
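These two curves are direct transformations of the item parameters. The sketch below (2PLM items with hypothetical parameters; GRM item information is computed analogously) sums item informations at each θ and converts the total to a standard error:

```python
import numpy as np

def item_info_2plm(theta, a, b):
    """Fisher information of a 2PLM item: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def test_info_and_se(theta, items):
    """Test information is the sum of item informations; the standard error
    of theta is the inverse square root of test information."""
    info = sum(item_info_2plm(theta, a, b) for a, b in items)
    return info, 1.0 / np.sqrt(info)

theta = np.linspace(-3, 3, 7)
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]   # hypothetical (a, b) pairs
info, se = test_info_and_se(theta, items)
print(info.round(2))
print(se.round(2))                               # lowest where information peaks
```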
Test Characteristic Curve. A test characteristic curve graphically illustrates the
relationship between θ and predicted summed scores on a test. Predicted summed scores are
analogous to CTT estimated true scores and are computed using the IRT properties of the items
on the test (Hambleton, Swaminathan, & Rogers, 1991). This computation is possible because
IRT models are probabilistic; more specifically, every level of θ is associated with some
probability of selecting each response option for an item, as depicted on trace lines. Additionally,
each option is associated with a specific value (the same type of whole-number value that is
typically used to compute observed summed scores). For a dichotomous item, the non-keyed and
keyed options are assigned values of 0 and 1, respectively. As previously discussed, when
polytomous items are analyzed with IRT the options are assigned values of 0, 1,… k – 1.
The test characteristic curve is a linear combination of the trace lines for all of the items
on a test (Yen & Fitzpatrick, 2006). Specifically, when θ is known, the probability of selecting
option k is multiplied by the numerical value of option k for every option and every test item.
These values are then summed, resulting in a predicted summed score for that θ. Predicted
summed scores look similar to actual summed scores, but are not limited to whole numbers due to
the method by which they are computed (i.e., they typically have decimal values). Predicted
summed scores should increase as θ increases, although this relationship is not necessarily linear.
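For dichotomous items this computation reduces to summing the keyed-response probabilities, as in the sketch below (hypothetical parameters); polytomous items extend the sum by weighting each category's probability by its option value:

```python
import numpy as np

def p_2plm(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def predicted_summed_score(theta, items):
    """Test characteristic curve: the model-expected summed score at each theta."""
    return sum(p_2plm(theta, a, b) for a, b in items)

theta = np.linspace(-3, 3, 7)
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]
print(predicted_summed_score(theta, items).round(2))   # increases with theta
```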
Present Study
Locus of control (Rotter, 1954, 1966) is conceptualized as a personality construct
encompassing an individual’s general beliefs about the causes of environmental reinforcement.
Those with a more internal locus of control believe their own actions can bring about desired
events, whereas those with a more external locus of control believe that whether they receive
reinforcement generally depends on luck, chance, fate, or other factors external to themselves
(Lefcourt, 1966, 1982, 1992; Rotter, 1975, 1990). Rotter (1990) conceptualized locus of control
as a theoretically broad construct with diverse and somewhat loosely associated behavioral
indicators. The first inventory that was designed to measure locus of control was the Rotter I-E
scale (Rotter, 1966).
Although the Rotter I-E scale has been widely used in psychological research, many
measures of locus of control have been developed. Some are general like the I-E, some are
conceptually narrow (e.g., work locus of control), and some are multidimensional (see Furnham
& Steele, 1993; Goodman & Waters, 1987; Kormanik & Rocco, 2009; Lefcourt, 1982). One
conceptually broad alternative to the I-E scale is the Internal Control Index (ICI; Duttweiler,
1984), a 28-item measure of locus of control for adults. A single score represents individuals’
standing on the construct, with higher scores indicating greater levels of internal locus of control.
This instrument has generally performed favorably in CTT psychometric analyses (Duttweiler,
1984; Jacobs, 1993; Maltby & Cope, 1996; Meyers & Wong, 1988). Values of coefficient α for
the ICI (.83 to .85) tend to be somewhat higher than values for the Rotter I-E scale (.75 to .77;
Duttweiler, 1984; Goodman & Waters, 1987; Meyers & Wong, 1988), but the two measures
appear to demonstrate comparable patterns of correlations with related constructs (Meyers &
Wong, 1988).
Archer (1979) reviewed research literature on locus of control and anxiety, and reported
that measures of internal locus of control tend to be negatively associated with measures of trait
anxiety. Additionally, previous empirical work on achievement goal orientation
has suggested that an internal locus of control is positively associated with intrinsic motivation to
learn (learning orientation) and unassociated with extrinsic goals to perform well in academics
(performance orientation; Heintz & Steele-Johnson, 2004; Phillips & Gully, 1997). After meta-analytically examining the relationships between subjective well-being (operationalized as
satisfaction with life, high positive affect, and low negative affect) and a myriad of other
variables, DeNeve and Cooper (1998) concluded that locus of control was one of the best
predictors of subjective well-being.
Although the ICI has performed reasonably well in previous research, some potential
weaknesses in its internal structure have been identified. For example, Goodman and Waters
(1987) reported that the ICI did not demonstrate acceptable levels of convergent validity with
scales of four other inventories that were purported to measure various dimensions of locus of
control, suggesting that locus of control might be a multidimensional construct. Additionally,
CTT item analyses (Duttweiler, 1984; Jacobs, 1993; Maltby & Cope, 1996; Meyers & Wong,
1988) have shown that some items within the ICI tend to correlate relatively weakly with the total
test score. Exploratory dimensionality analyses have been conducted on the ICI in two studies,
both utilizing orthogonal varimax rotation. Based on her principal axis factor solution, Duttweiler
(1984) suggested that the inventory was composed of two factors, whereas Meyers and Wong
(1988) utilized principal components analysis and suggested that the inventory assessed three
principal components, with multiple items not loading on a factor/component in both studies. We
hypothesized that the psychometric properties of the ICI could be clarified and improved using an
IRT measurement model and other methods appropriate for ordinal data. Because the underlying
theory specifies that general locus of control is a single (albeit conceptually broad) construct
(Rotter, 1975, 1990), we planned to identify a unidimensional subset of ICI items and then
evaluate the construct validity of the original and revised versions of the ICI by comparing their
correlation patterns with measures of related constructs.
Chapter 2
METHOD
Participants and Procedure
Data were gathered at California State University, Sacramento (CSUS) in two separate
studies. Study 1 was conducted during the Fall 2010 semester and Study 2 was conducted during
the Spring 2011 semester. Both studies incorporated the ICI, but no other variables overlapped.
Undergraduate students who were enrolled in introductory psychology courses at CSUS (N =
631) completed packets of inventories for course credit. Each participant packet in each study
contained a demographics sheet followed by the inventories. Inventories were presented in a
different random order for each participant. Studies 1 and 2 included 310 and 321 participants,
respectively. Participant demographic data for the two studies are provided in Table 1. In both
samples, participants were overwhelmingly female (about 80%), and both samples were
ethnically diverse. Average ages were 20.6 years (SD = 4.4, range = 17–57) in Study 1 and 20.5 years (SD = 3.6, range = 16–50) in Study 2.
Table 1
Participant Demographic Data

                              Study 1               Study 2
                        Frequency      %      Frequency      %
Sex
  Female                   240       77.4        261        81.3
  Male                      68       21.9         59        18.4
  Not Reported               2         .6          0          .0
Ethnicity
  American Indian            2         .6          3          .9
  African American          20        6.5         21         6.5
  Asian American            79       25.5         64        19.9
  European American        114       36.8        114        35.5
  Hispanic/Latino           43       13.9         57        17.8
  Pacific Islander          10        3.2         14         4.4
  Mixed Ethnicities         35       11.3         31         9.7
  Other                      6        1.9         15         5.0
  Not Reported               1         .3          1          .3
Materials
ICI
The ICI was administered during both studies. The 28 items of the ICI are provided in
Table 2. Participants utilize a 5-point response scale (1 = rarely, 2 = occasionally, 3 = sometimes,
4 = frequently, 5 = usually) to respond to the items. After recoding the 14 reverse scored items
(indicated in Table 2), a mean or summed score is computed on the 28 items. A single score is
used to represent individuals’ standing, with higher scores indicating greater levels of internal
locus of control. Values of coefficient α were similar for the two studies (Study 1, α = .85; Study
2, α = .81).
Table 2
ICI Items

Item  Content
1     When faced with a problem I try to forget it.a
2     I need frequent encouragement from others to keep working at a difficult task.a
3     I like jobs where I can make decisions and be responsible for my own work.
4     I change my opinion when someone I admire disagrees with me.a
5     If I want something I work hard to get it.
6     I prefer to learn the facts about something from someone else rather than having to dig
      them out myself.a
7     I will accept jobs that require me to supervise others.
8     I have a hard time saying “no” when someone tries to sell me something.a
9     I like to have a say in any decisions made by any group I’m in.
10    I consider the different sides of an issue before making any decisions.
11    What other people think has a great influence on my behavior.a
12    Whenever something good happens to me I feel it is because I earned it.
13    I enjoy being in a position of leadership.
14    I need someone else to praise my work before I am satisfied with what I’ve done.a
15    I am sure enough of my opinions to try to influence others.
16    When something is going to affect me I learn as much about it as I can.
17    I decide to do things on the spur of the moment.a
18    For me, knowing I’ve done something well is more important than being praised by
      someone else.
19    I let other people’s demands keep me from doing things I want to do.a
20    I stick to my opinions when someone disagrees with me.
21    I do what I feel like doing, not what other people think I ought to do.
22    I get discouraged when doing something that takes a long time to achieve results.a
23    When part of a group I prefer to let other people make all the decisions.a
24    When I have a problem I follow the advice of friends or relatives.a
25    I enjoy trying to do difficult tasks more than I enjoy doing easy tasks.
26    I prefer situations where I can depend on someone else’s ability rather than my own.a
27    Having someone important tell me I did a good job is more important to me than
      feeling I’ve done a good job.a
28    When I’m involved in something I try to find out all I can about what is going on, even
      when someone else is in charge.
a Item is reverse scored.
Study 1 Variables
Study 1 included measures designed to assess academic goals and related variables. The
inventories relevant to locus of control were the State-Trait Anxiety Inventory – Form Y (STAI;
Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983) and the Achievement Goal Questionnaire-Revised (AGQ-R; Elliot & Murayama, 2008). The STAI consists of two scales assessing state
and trait anxiety, with higher scores indicating greater levels of anxiety. The trait anxiety scale
(α = .92) was used with its 4-point response scale. The AGQ-R assesses four dimensions of goal
orientation utilizing a 5-point response scale. The mastery approach (α = .75) and performance
approach (α = .81) dimensions were of interest in the present study. A higher score on the mastery
approach dimension is indicative of a learning orientation, whereas a higher score on the
performance approach dimension is indicative of a performance orientation.
Study 2 Variables
Study 2 included inventories designed to assess psychological well-being and related
constructs. Of these, the inventories most relevant to locus of control were the Meaning in Life
Questionnaire (MLQ; Steger, Frazier, Oishi, & Kaler, 2006) and the Positive and Negative Affect
Schedule – Expanded Form (PANAS-X; Watson & Clark, 1994), both utilizing 5-point response
scales. The presence scale (α = .87) of the MLQ was used, with higher scores indicating greater
levels of subjective meaning. The positive (α = .86) and negative (α = .87) affect scales of the
PANAS-X were used, with higher scores indicating greater tendencies to experience positive and
negative moods, respectively.
Software
Data were analyzed using PASW (SPSS) Statistics version 18.0, CEFA (Browne,
Cudeck, Tateneni, & Mels, 2008), IRTPRO (Cai, du Toit, & Thissen, 2011), and LISREL
(Jöreskog & Sörbom, 2004). SPSS was used to prepare data files and perform all CTT analyses.
CEFA was used to generate polychoric correlations and to perform exploratory ordinal item
factor analyses. In IRTPRO, maximum marginal likelihood estimation with an EM algorithm
(Bock & Aitkin, 1981) was used to estimate IRT item and person parameters for unidimensional
models. Respondent scores for unidimensional models were computed in IRTPRO using expected
a posteriori (EAP) estimation. A Metropolis-Hastings Robbins-Monro algorithm was used to
estimate multidimensional factor analytic and IRT model parameters in IRTPRO. LISREL was
used to fit confirmatory ordinal factor models.
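As a sketch of the EAP scoring logic (simplified here to the 2PLM with a rectangular quadrature grid and hypothetical item parameters; IRTPRO's implementation differs in detail), the estimate is the posterior mean of θ given a response pattern and a standard normal prior:

```python
import numpy as np

def p_2plm(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_score(responses, items, grid=np.linspace(-4, 4, 81)):
    """Posterior mean of theta over a quadrature grid with a N(0, 1) prior."""
    prior = np.exp(-0.5 * grid**2)              # unnormalized normal density
    likelihood = np.ones_like(grid)
    for x, (a, b) in zip(responses, items):
        p = p_2plm(grid, a, b)
        likelihood *= p if x == 1 else 1.0 - p
    posterior = likelihood * prior
    return float((grid * posterior).sum() / posterior.sum())

items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]   # hypothetical (a, b) pairs
print(round(eap_score([1, 1, 0], items), 3))
```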
Chapter 3
RESULTS
Participants for the two studies were compared on ICI scores to determine if the ICI data
from the two studies could be combined. The following steps were then taken: the psychometric
properties of the ICI were evaluated; the ICI was revised using IRT, resulting in the Internal
Control Index-Revised (ICI-R); and the psychometric properties of the ICI-R were evaluated.
Finally, the ICI and ICI-R were compared and correlated with other psychological constructs to
provide evidence as to whether the revision changed the meaning of the latent construct (θ)
measured by the inventory.
Comparison of ICI Data from the Two Studies
A one-way between-subjects analysis of variance was used to compare ICI mean scores
from Study 1 and Study 2 to determine if the data could be combined. No statistically significant
difference was found on ICI score between the participants who completed Study 1, M = 3.66,
95% CI [3.61, 3.72], SD = .46, and those who completed Study 2, M = 3.62, 95% CI [3.58, 3.67],
SD = .43, F(1, 596) = 1.30, p = .25, η2 = .002. Cohen’s d was found to be .09, suggesting that the
data were not substantially influenced by context effects caused by their inclusion in different
studies. Thus, the data from the two studies were combined into a sample of 631 for all of the
remaining analyses of the ICI.
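The reported effect sizes can be recovered, at least approximately, from the summary statistics above (a simple arithmetic check; the pooled standard deviation below uses the full study ns, whereas the ANOVA itself was based on the complete cases):

```python
# Eta squared from F and its degrees of freedom: SSb / (SSb + SSw).
f, df1, df2 = 1.30, 1, 596
eta_sq = (f * df1) / (f * df1 + df2)                      # ~ .002

# Cohen's d from the group means and a pooled standard deviation.
m1, sd1, n1 = 3.66, 0.46, 310
m2, sd2, n2 = 3.62, 0.43, 321
pooled_sd = (((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)) ** 0.5
d = abs(m1 - m2) / pooled_sd                              # ~ .09
print(round(eta_sq, 3), round(d, 2))
```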
CTT Psychometric Properties of the ICI
The coefficient α of the scale using data from both studies combined was .83. CTT item
statistics for the ICI are provided in Table 3, including item means, corrected item-total
correlations, and the percentage of participants who endorsed each response option. Two items
with particularly low corrected item-total correlations were identified; items 17 (“I decide to do
things on the spur of the moment.”) and 24 (“When I have a problem I follow the advice of
friends or relatives.”) had corrected item-total correlations of .04 and .12, respectively, and did
not seem as relevant to locus of control as some other items on the scale; they were therefore
removed from all subsequent ICI scale analyses.
Item responses were generally evenly balanced across categories three through five.
However, for most items, the first (M = 4.4%, SD = 3.4%) and second (M = 11.4%, SD = 7.0%)
response options were selected relatively less frequently and so were collapsed to facilitate item
factor analyses and IRT analyses. The percentage of missing data by item was very low (NR ≤
1.1%). However, 33 individuals declined to respond to at least one ICI item. Because the missing
responses were so sparse and evenly distributed across items, we elected to simply use listwise
deletion, resulting in a sample size of 598 for all subsequent analyses.
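The sketch below (simulated responses standing in for the ICI data; the missingness rate is hypothetical) illustrates the two preprocessing steps just described, collapsing the two lowest response options and applying listwise deletion:

```python
import numpy as np

rng = np.random.default_rng(1)
ici = rng.integers(1, 6, size=(631, 26)).astype(float)   # simulated 1-5 responses
ici[rng.random(ici.shape) < 0.003] = np.nan              # sparse missingness

# Merge options 1 and 2 into the lowest category and rescale to 0..3,
# the conventional coding for a four-category IRT item.
collapsed = np.where(ici <= 2, 0, ici - 2)
complete = collapsed[~np.isnan(collapsed).any(axis=1)]   # listwise deletion
print(complete.shape)
```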
Table 3
CTT Item Statistics for the ICI

                 Corrected                Response Option (% Endorsement)
                 Item-Total
Item      M     Correlation       1       2       3       4       5      NRa
1        3.86       .21          3.3     5.5    28.2    26.1    36.5     .3
2        3.56       .41          4.9    12.0    30.0    26.9    26.0     .2
3        4.04       .39          1.0     4.8    19.5    38.4    36.3     .2
4        3.97       .41          1.4     5.9    26.8    26.0    39.6     .3
5        4.40       .43           .3     2.2    10.3    30.9    55.8     .5
6        3.38       .32          6.3    14.9    33.3    25.8    19.3     .3
7        3.72       .39          4.6    10.8    23.1    30.9    30.0     .6
8        3.64       .28          8.7    12.8    19.5    23.5    35.0     .5
9        3.92       .41          2.1     8.4    20.3    35.7    33.3     .3
10       4.10       .35           .8     4.4    17.6    37.4    39.6     .2
11       3.27       .43          6.8    18.2    32.3    26.0    16.2     .5
12       3.91       .25          2.2     5.1    22.5    39.1    30.9     .2
13       3.44       .43          7.6    14.7    27.6    25.5    24.4     .2
14       3.51       .31          6.0    14.3    25.4    29.6    24.2     .5
15       3.37       .34          4.8    12.5    36.8    33.0    12.7     .3
16       3.88       .39          1.6     8.1    22.5    36.1    31.4     .3
17       2.97       .04         12.4    21.6    33.4    21.1    10.5    1.1
18       3.90       .38          2.1     7.0    23.8    32.8    34.1     .3
19       3.82       .41          2.7     8.9    24.4    31.2    32.6     .2
20       3.86       .50          1.1     7.9    23.3    41.0    26.3     .3
21       3.70       .42          2.2     9.2    27.1    37.4    23.5     .6
22       3.21       .34          6.5    17.6    35.8    26.6    12.8     .6
23       3.80       .53          1.6     8.4    27.1    34.5    28.1     .3
24       2.57       .12         13.3    37.1    32.8    11.6     4.6     .6
25       3.12       .34          6.8    16.6    43.7    21.6    10.8     .5
26       3.94       .42          2.1     5.7    23.3    35.2    33.4     .3
27       3.53       .38          6.7    14.3    24.7    28.4    25.5     .5
28       3.63       .35          2.9    11.1    30.0    33.8    22.2     .2
Min.                              .3     2.2    10.3    11.6     4.6
Max.                            13.3    37.1    43.7    41.0    55.8
M                                4.4    11.4    26.6    30.2    27.0
SD                               3.4     7.0     6.8     6.6    11.0
a No response (missing data).
Dimensionality Assessment of the ICI
Exploratory item factor analyses were conducted on the 26 remaining items to investigate
the dimensional structure of the data. First, a matrix of polychoric correlations was computed in
CEFA. Upon examination of the correlation matrix, it was evident that several reverse scored
items appeared to be essentially uncorrelated with some of the other items on the inventory. Six
eigenvalues were greater than one (6.4, 2.8, 1.6, 1.4, 1.2, & 1.1, respectively). Exploratory item
factor analyses were performed on the polychoric correlations in CEFA using ordinary least
squares extraction and oblique quartimax rotation. Oblique quartimax rotation is equivalent to
direct quartimin rotation (Browne, 2001) and has been endorsed by several prominent factor
analysts (Browne, 2001; Edwards, 2009; Preacher & MacCallum, 2003). For the ICI, one, two,
and three dimensional factor solutions were explored. Factor loadings (pattern coefficients) of .40
or larger were considered to be meaningful. Residual correlations were examined to determine if
the factor solutions accounted reasonably well for the observed inter-item covariation. As
suggested by Morizot et al. (2007), residual correlations of .20 or greater were flagged as
evidence of a factor solution that did not adequately account for the structure of the data.
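The residual check itself is straightforward to express; the sketch below assumes R is the observed polychoric correlation matrix, L the rotated loading (pattern) matrix, and Phi the factor correlation matrix (the variable names are ours, for illustration only):

    import numpy as np

    def flag_large_residuals(R, L, Phi, cutoff=0.20):
        """Return index pairs whose residual correlation meets the cutoff."""
        R_hat = L @ Phi @ L.T           # model-implied correlations
        resid = R - R_hat
        np.fill_diagonal(resid, 0.0)    # the diagonal is not of interest
        rows, cols = np.where(np.triu(np.abs(resid) >= cutoff, k=1))
        return list(zip(rows.tolist(), cols.tolist()))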
Extraction of a single factor resulted in seven items (1, 6, 8, 12, 14, 22, & 27) not loading
meaningfully on the factor and 16 large residual correlations (r ≥ .20), suggesting an inadequate
solution. The two factor solution consisted primarily of straightforwardly worded items loading
onto Factor 1 and reverse scored items loading onto Factor 2. The two factors were moderately
intercorrelated (r = .32). However, five items (1, 6, 18, 20, & 23) did not load clearly on a factor
and four large residual correlations remained. In the three factor solution, eight items (1, 5, 6, 12,
20, 21, 23, & 25) did not load meaningfully onto a factor and the factors themselves did not
appear to be particularly interpretable, although only one large residual correlation remained. The
two and three factor solutions did not correspond well to the two factor orthogonal principal axis
factor solution reported by Duttweiler (1984) or the three factor orthogonal principal components
solution reported by Meyers and Wong (1988). It is unclear whether this was due to the
differences in the data analytic methods used, sampling error, or the unclear factor structure of the
ICI. Regardless, we determined that the ICI was in need of revision in order to clarify its factor
structure.
IRT Psychometric Properties of the ICI
The GRM was fit to the 26 items. The M2 goodness-of-fit statistic (Maydeu-Olivares &
Joe, 2005; Cai, Maydeu-Olivares, Coffman, & Thissen, 2006) is a χ2 distributed statistic and was
used to assess overall IRT model-data fit. Smaller values of M2 indicate better model-data fit, but
some misfit is generally expected when strong parametric IRT models such as the GRM are
applied to real data (Cai et al., 2011). For this reason, the M2 statistic was supplemented by a root
mean square error of approximation (RMSEA) statistic. This index is not as sensitive to sample size or
over-parameterization as χ2 distributed statistics (Browne & Cudeck, 1993) and is routinely used
to assess the fit of structural equation models. Maydeu-Olivares, Cai, and Hernández (2011)
demonstrated that this index tends to yield similar interpretations whether a confirmatory factor
analysis or IRT model is applied. Specific cutoff values for interpreting RMSEA are not fully
agreed upon; however, we chose to use a cutoff of .06 based on the work of Hu and Bentler
(1999). Thus, values less than .06 indicate acceptable fit and larger values indicate misfit. Some
misfit was indicated for the reduced 26-item version of the ICI, M2(247) = 1349.98, p < .001,
RMSEA = .09.
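As a worked check on the reported value, one common formulation of a chi-square-based RMSEA reproduces the .09 from the M2 statistic above; the exact variant IRTPRO uses is our assumption:

    import math

    def rmsea_from_m2(m2: float, df: int, n: int) -> float:
        """RMSEA from a chi-square distributed fit statistic."""
        return math.sqrt(max(m2 - df, 0.0) / (df * (n - 1)))

    print(round(rmsea_from_m2(1349.98, 247, 598), 2))  # -> 0.09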
Trace line fit and local independence were also assessed. The functional form assumption
was evaluated using the S-X2 statistic (Orlando & Thissen, 2000; 2003) and a nominal alpha level
of .05. For each item, the S-X2 statistic summarizes the relationship between the trace line(s)
predicted by the model and the empirical trace line(s) based on summed scores. A statistically
significant value indicates item-model misfit because the item response function predicted by the
model differs from the pattern observed in the data. Values of S-X2 indicated poor trace line fit for
six ICI items. The local item independence assumption was evaluated using a G2-based local
dependence diagnostic statistic (herein referred to as LD), which was developed by Chen and
Thissen (1997). Values greater than 10 on this index indicate substantial local item dependence
(Cai et al., 2011). A total of 325 item pairs were checked for LD in each sample. Six pairs (2 &
14, 4 & 11, 7 & 13, 14 & 18, 14 & 27, 16 & 28) demonstrated local dependence (LD values >
10).
The IRT item parameters for the 26-item version of the ICI are provided in Table 4.
According to de Ayala (2009), good values for item slope (a) parameters range from
approximately .8 to 2.5. According to DeMars (2010), item location (b) parameters for useful
items range from approximately -2 to 2. The ICI a parameter values ranged from .43 to 1.47 and
the b parameter values ranged from -5.72 to 3.33. Based on de Ayala’s guideline, some of the a
parameter values were quite low, indicating that those items likely do not assess the latent construct
(θ) underlying responses to the ICI as well as items with steeper slopes do. However, 18 of the 26 items had a
parameter values greater than .80. Additionally, the first location parameter value (b1) for item 1
was particularly low. Location parameter values for the remaining items were generally
reasonable.
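To make the parameters in Table 4 concrete, the sketch below computes GRM category response probabilities for item 3 (a = 1.18, b = -2.84, -1.15, .59, taken from Table 4). The logistic form is the standard graded response model (Samejima, 1969); the code itself is our illustration:

    import math

    def grm_probs(theta, a, bs):
        """Category probabilities under the graded response model."""
        # p_star[k] is the probability of responding in category k or higher.
        p_star = ([1.0]
                  + [1 / (1 + math.exp(-a * (theta - b))) for b in bs]
                  + [0.0])
        return [p_star[k] - p_star[k + 1] for k in range(len(bs) + 1)]

    # At theta = 0, the four category probabilities sum to 1.
    print(grm_probs(0.0, 1.18, [-2.84, -1.15, 0.59]))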
Table 4
IRT Item Parameter Estimates for the 26-Item Version of the ICI

Item    a (SE)        b1 (SE)         b2 (SE)         b3 (SE)
1        .43 (.09)    -5.72 (1.14)    -1.27 (.31)     1.41 (.34)
2        .74 (.09)    -2.38 (.30)      -.21 (.12)     1.59 (.23)
3       1.18 (.12)    -2.84 (.26)     -1.15 (.12)      .59 (.10)
4        .84 (.10)    -3.35 (.39)      -.95 (.14)      .54 (.13)
5       1.24 (.13)    -3.50 (.35)     -1.91 (.17)     -.26 (.08)
6        .67 (.09)    -2.19 (.30)       .25 (.13)     2.27 (.32)
7       1.12 (.11)    -1.88 (.18)      -.53 (.09)      .90 (.12)
8        .53 (.09)    -2.60 (.44)      -.75 (.19)     1.19 (.25)
9       1.22 (.12)    -2.24 (.20)      -.92 (.10)      .71 (.10)
10       .95 (.11)    -3.46 (.37)     -1.50 (.17)      .52 (.12)
11       .89 (.10)    -1.43 (.17)       .35 (.11)     2.07 (.24)
12       .73 (.09)    -3.75 (.48)     -1.34 (.19)     1.22 (.19)
13      1.21 (.12)    -1.31 (.13)      -.02 (.08)     1.18 (.13)
14       .53 (.09)    -2.76 (.46)      -.31 (.17)     2.32 (.40)
15       .97 (.10)    -1.91 (.20)       .14 (.10)     2.24 (.23)
16      1.10 (.11)    -2.41 (.23)      -.90 (.11)      .85 (.12)
18       .88 (.10)    -2.97 (.33)      -.98 (.14)      .86 (.14)
19       .85 (.10)    -2.68 (.30)      -.79 (.13)      .97 (.15)
20      1.47 (.13)    -2.17 (.17)      -.74 (.08)      .92 (.10)
21      1.13 (.11)    -2.14 (.20)      -.52 (.09)     1.24 (.14)
22       .62 (.09)    -2.00 (.30)       .70 (.17)     3.33 (.48)
23      1.31 (.12)    -2.11 (.18)      -.55 (.08)      .90 (.11)
25       .85 (.10)    -1.55 (.19)       .94 (.14)     2.72 (.31)
26       .98 (.11)    -2.97 (.31)     -1.01 (.13)      .82 (.13)
27       .68 (.09)    -2.18 (.30)      -.34 (.13)     1.67 (.25)
28       .92 (.10)    -2.33 (.25)      -.41 (.10)     1.55 (.19)
IRT-Based Revision of the ICI
The IRT portion of the revision process was accomplished iteratively. After the
preliminary analysis, the eight items with a parameters below .80 were deleted based on de
Ayala’s (2009) guideline. Additionally, only one item from each locally dependent pair was
retained because locally dependent items are, to some extent, psychometrically redundant
(Thissen & Steinberg, 2010) and LD can bias slope parameter estimation (Chen & Thissen, 1997;
Yen, 1993). The item retained from each pair was the one that appeared to be more clearly
written and subjectively more relevant to the core construct. Three of our research assistants, all
of whom had a basic understanding of locus of control theory, helped the author to make these
judgments. In some cases, an item flagged for substantial LD also had a low slope (two items after the
preliminary analysis), but such overlap was uncommon. The items with LD were judged to be
very similar in meaning. Items were pruned to remove those with low slopes (a < .80) and reduce
LD until only the 11 items of the ICI-R remained.
Dimensionality Assessment of the ICI-R
A dimensionality assessment was conducted on the 11-item ICI-R, involving a variety of
exploratory and confirmatory factor analyses. A matrix of polychoric correlations was computed
in CEFA and the first four eigenvalues were 4.0, 1.2, 1.0, and .83, respectively. One, two, and
three dimensional exploratory solutions were considered. Factor loadings for the one dimensional
solution ranged from .41 to .66 and all residual correlations were less than .20, suggesting that a
unidimensional model was plausible. The two dimensional solution resulted in two highly
correlated (r = .58) but interpretable factors. Factor 1 represented independence (items 5, 10, 16,
18, 21, & 25) and Factor 2 represented leadership (items 3, 9, 13, 15, & 23). The three
dimensional solution was not interpreted because it was not clearly structured. A two dimensional
full information exploratory factor analysis conducted in IRTPRO corroborated the two
dimensional factor structure from the CEFA solution, with similar factor loadings and a
correlation of .57 between the two factors.
Confirmatory factor analyses were conducted in LISREL to evaluate the unidimensional
and two dimensional solutions. Diagonally weighted least squares estimation was used on a
matrix of polychoric correlations, as recommended by Wirth and Edwards (2007). We planned to
evaluate the fit of the unidimensional solution using the RMSEA, the comparative fit index (CFI),
the goodness of fit index (GFI), and the root mean square residual (RMSR). Planned cutoffs for
judging fit were based on the recommendations of Hu and Bentler (1999), who suggested that
acceptable models have RMSEA < .06, CFI > .95, GFI > .95, and RMSR < .08. However, for all
tested models, RMSEA was estimated to be 0 and CFI was estimated to be 1.0, suggesting that these
statistics may not have been computed correctly. As a result, only the GFI and RMSR values
were interpreted. The fit of the unidimensional model was acceptable, GFI = .95, RMSR = .07.
The fit of the two dimensional solution was also acceptable, GFI = .97, RMSR = .05, and the two
factors were highly correlated, r = .73.
Because the use of such fit indices is controversial (e.g., Cook et al., 2009; Morizot et al.,
2007), bifactor analysis was used to further explore the viability of the unidimensional model.
The methods used to evaluate the bifactor analytic results were based primarily on the work of
Steve Reise and his colleagues (Reise, under review; Reise et al., 2010).
CEFA was used to conduct an exploratory bifactor analysis with an orthogonal target
rotation. Target rotation requires the researcher to specify which items are expected to load onto
which factors, as is done in a confirmatory factor analysis. The difference is that only on-factor
loadings are estimated in a confirmatory solution, whereas both on-factor and off-factor loadings
are estimated in a target rotated exploratory solution. Exploratory target rotation enables the
researcher to identify any substantial off-factor loadings that may cause problems in a
confirmatory solution. In the present case, the target matrix specified that all items would load
onto the general factor and one of two group factors. The group factors were based on the two
dimensional exploratory solution, with the factors representing independence and leadership,
respectively. No substantial off-factor loadings were identified (the largest equaled .14). Items
generally loaded more strongly onto the general factor than their respective group factors,
suggesting the presence of a dominant general factor.
LISREL was used to evaluate the bifactor model and the fit was acceptable, GFI = .97,
RMSR = .04. The results of this analysis were used to compute Sijtsma’s (2009)
unidimensionality index to determine the proportion of common variance accounted for by the
general factor. The obtained value of .68 suggested that more than two thirds of the common
variance could be attributed to the general factor, whereas the remainder of the common variance
was divided across the two group factors. There is no recommended cutoff for using this statistic
to determine if a dataset is unidimensional enough for IRT, but the obtained value provided
further evidence of a strong general factor. Coefficient omega hierarchical (ωH; McDonald, 1999)
is another somewhat controversial statistic that can be used to evaluate bifactor models and the
viability of dividing items into subscales (Reise, under review). Given an orthogonal solution, ωH
represents the proportion of the total item variance that is uniquely accounted for by a given
factor. Values of ωH for the general, independence, and leadership factors were .51, .10, and .18,
respectively, suggesting that the general factor accounted for about half of the total variance and
the group factors contributed little unique information. Based on these values, the general factor
was judged to be only somewhat reliable, whereas the group factors were not at all reliable after
controlling for the influence of the general factor.
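Both indices can be computed directly from an orthogonal bifactor loading matrix. The sketch below (our illustration, with L an items-by-factors loading matrix whose first column holds the general factor loadings) computes the proportion of common variance due to the general factor and ωH:

    import numpy as np

    def ecv_and_omega_h(L: np.ndarray):
        """Unidimensionality index and omega-hierarchical from an
        orthogonal bifactor loading matrix (general factor in column 0)."""
        common = (L ** 2).sum()                    # total common variance
        ecv = (L[:, 0] ** 2).sum() / common        # share due to general factor
        unique = (1 - (L ** 2).sum(axis=1)).sum()  # summed item uniquenesses
        gen = L[:, 0].sum() ** 2
        grp = sum(L[:, j].sum() ** 2 for j in range(1, L.shape[1]))
        omega_h = gen / (gen + grp + unique)       # general-factor saturation
        return ecv, omega_h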
Full information confirmatory factor analyses were conducted in IRTPRO. The
unidimensional and bifactor solutions were compared to determine the degree of distortion that
the secondary dimensions might cause if the data were forced into a unidimensional IRT model.
The factor loadings in the unidimensional solution were extremely similar to the general factor
loadings in the bifactor solution, indicating that fitting a unidimensional IRT model to these
data would not be expected to produce substantial distortion of IRT slope parameter estimates.
The results of the various dimensionality assessment methods collectively suggested that the data
were unidimensional enough for interpretation of the IRT model.
Psychometric Properties of the ICI-R
The IRT analysis was conducted on the 11-item ICI-R that remained after the revisions
and the GRM demonstrated a good fit, M2(484) = 823.34, p < .001, RMSEA = .03. The S-X2
statistics indicated no significant trace line misfit and the local independence criterion (LD values
< 10) was met for all item pairs. The IRT item parameters are provided in Table 5. The slope (a)
parameter values ranged from .81 to 1.57 and the location (b) parameter values ranged from -3.52
to 2.69. Thus, the a parameter values were all within the range suggested by de Ayala (2009).
Although some of the b parameter values fell outside the range recommended by DeMars (2010),
they were not considered so extreme as to be problematic.
Table 5
IRT Item Parameter Estimates for the 11-Item ICI-R

Item    a (SE)        b1 (SE)         b2 (SE)         b3 (SE)
3       1.31 (.13)    -2.66 (.24)     -1.07 (.11)      .56 (.09)
5       1.23 (.14)    -3.52 (.37)     -1.93 (.18)     -.26 (.08)
9       1.57 (.15)    -1.92 (.15)      -.80 (.08)      .62 (.08)
10      1.02 (.12)    -3.27 (.35)     -1.42 (.16)      .51 (.11)
13      1.41 (.13)    -1.20 (.11)      -.02 (.07)     1.09 (.11)
15      1.29 (.12)    -1.57 (.14)       .13 (.08)     1.86 (.17)
16      1.24 (.13)    -2.22 (.20)      -.82 (.10)      .79 (.11)
18       .81 (.10)    -3.17 (.38)     -1.03 (.15)      .91 (.15)
21       .94 (.11)    -2.45 (.26)      -.57 (.11)     1.43 (.17)
23      1.07 (.11)    -2.42 (.24)      -.63 (.10)     1.03 (.13)
25       .86 (.10)    -1.54 (.19)       .92 (.14)     2.69 (.31)
The IRT marginal reliability was .80 for the θ (EAP) scores. The test information and
standard error curves for the ICI-R are provided in the upper and lower panels of Figure 5,
respectively. Information is the highest and error is the lowest from a θ of approximately -2 to 1,
indicating that scores on the ICI-R most precisely differentiate between individuals with values of
locus of control within this range. Less information is available as θ increases, but the test does
provide some information at the upper end of the θ continuum.
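The two panels of Figure 5 are redundant by construction: in IRT, the standard error of measurement at a given θ is the reciprocal square root of the test information at that θ. For example, an information value of about 5 near the center of the scale implies a standard error of about .45:

    import math

    def se_from_information(info: float) -> float:
        """Standard error of theta implied by the test information."""
        return 1 / math.sqrt(info)

    print(round(se_from_information(5.0), 2))  # -> 0.45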
Figure 5. Test information and standard error curves for the ICI-R. The upper and lower panels
indicate the amount of information and error provided by the test at each level of locus of
control (θ), respectively.
The test characteristic curve for the ICI-R is presented in Figure 6. Possible summed
scores on the ICI-R range from 11 to 44. As expected, predicted summed scores increase as θ
increases. The function appears to be very nearly linear across the θ scale. A Pearson correlation
coefficient was computed to compare estimates of θ and observed summed scores on the ICI-R.
Because summed scores were nearly perfectly linearly related to IRT estimates of θ, r(596) =
.997, p < .001, they were determined to be a good approximation of the locus of control construct
using the 11-item ICI-R scale; however, because IRT estimates of θ are on an equal interval scale
and this is not necessarily true for summed scores, θ values were linked to CTT summed scores.
Figure 6. Test characteristic curve for the ICI-R. This function indicates the predicted summed
score for each level of locus of control (θ).
Table 6 presents the values of the IRT estimates of locus of control (θ) corresponding to
each possible summed score. This table can be used to score the ICI-R on an equal interval scale.
The marginal reliability of the scaled scores to summed score conversion was .79, suggesting that
estimates of θ for the ICI-R can be reliably converted to CTT summed scores.
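In applied use, Table 6 amounts to a simple lookup. The sketch below hard-codes a few of the tabled values to illustrate the idea; a complete implementation would include all 34 rows:

    # Excerpt of the Table 6 conversion (summed score -> EAP theta estimate).
    SUMMED_TO_THETA = {11: -3.36, 12: -3.08, 27: -0.61, 28: -0.47, 44: 2.43}

    def theta_from_summed(score: int) -> float:
        """Convert an ICI-R summed score (11-44) to its theta estimate."""
        return SUMMED_TO_THETA[score]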
Table 6
Predicted Summed Score to Scale Score Conversion for the ICI-R

Pred.                                     Pred.
Summed                     Modeled        Summed                     Modeled
Score      θ      SD(θ)   Proportion      Score      θ      SD(θ)   Proportion
11      -3.36      .55       -            28        -.47      .44      .05
12      -3.08      .53       -            29        -.33      .44      .06
13      -2.84      .51       -            30        -.19      .45      .06
14      -2.63      .49       -            31        -.05      .45      .06
15      -2.44      .48       -            32         .10      .45      .06
16      -2.26      .47       -            33         .24      .45      .06
17      -2.08      .47      .01           34         .39      .46      .06
18      -1.92      .46      .01           35         .55      .46      .06
19      -1.76      .46      .01           36         .70      .47      .05
20      -1.61      .45      .01           37         .87      .47      .05
21      -1.46      .45      .02           38        1.04      .48      .04
22      -1.31      .45      .02           39        1.23      .49      .03
23      -1.17      .44      .03           40        1.42      .51      .03
24      -1.03      .44      .03           41        1.63      .52      .02
25       -.89      .44      .04           42        1.86      .54      .01
26       -.75      .44      .04           43        2.11      .56      .01
27       -.61      .44      .05           44        2.43      .60       -
Note. A dash (-) indicates a value less than 1%.
Comparison of the ICI and ICI-R
The coefficient α of .78 for the 11-item ICI-R was reasonably comparable to the .83 value
for the original scale given the removal of nearly two thirds of the items. Additionally, summed
scores on the original and new scoring strategies correlated highly, r(596) = .82, p < .001. Table 7
contains Pearson correlations indexing the relationships of the ICI and ICI-R with trait anxiety,
learning orientation, performance orientation, subjective meaning, positive affect, and negative
affect. The same set of correlation coefficients was computed on θ scores for the ICI-R but
yielded correlations that were virtually identical (to the third decimal place) to those for
ICI-R summed scores, so they are omitted here. Table 7 also provides Cohen’s (1988) q statistic.
To compute q, correlation coefficients are transformed to z scores and then the absolute value of
the difference is taken. The resulting value represents the distance between the two correlation
coefficients on the z metric. As was done in Kastner, Sellbom, and Lilienfeld (2012), the q values
were evaluated based on Cohen’s (1988) guidelines for interpreting Pearson r values; that
is, values of .10, .30, and .50 were interpreted as small, medium, and large effects, respectively.
The patterns of correlations for the ICI and ICI-R were generally similar; however, when
combined with the correlation coefficients, the q values indicated that ICI-R scores were less
related to trait anxiety (q = .19) and negative affect (q = .17) compared to ICI scores.
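As a worked example of the computation (our code, using the trait anxiety correlations reported in Table 7):

    import math

    def cohens_q(r1: float, r2: float) -> float:
        """Cohen's q: absolute difference of Fisher z-transformed correlations."""
        return abs(math.atanh(r1) - math.atanh(r2))

    print(round(cohens_q(-0.49, -0.33), 2))  # -> 0.19, as reported above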
Table 7
Correlations of the ICI and ICI-R with Other Constructs

                                      ICI                       ICI-R
                                r       95% CI            r       95% CI        q
Study 1 Variablesa
  Trait Anxiety              -.49***  [-.57, -.40]     -.33***  [-.43, -.23]   .19
  Learning Orientation        .25***  [.14, .35]        .29***  [.18, .39]     .04
  Performance Orientation     .02     [-.09, .13]       .09     [-.02, .20]    .07
Study 2 Variablesb
  Subjective Meaning          .37***  [.27, .46]        .38***  [.28, .47]     .01
  Negative Affect            -.38***  [-.47, -.28]     -.23***  [-.34, -.12]   .17
  Positive Affect             .47***  [.38, .55]        .52***  [.43, .60]     .07
Note. 95% CI = 95% confidence interval.
aN = 296. bN = 295.
***p < .001.
Chapter 4
DISCUSSION
This study utilized a variety of psychometric methods, most of which are appropriate for
ordered categorical data, to evaluate and revise the ICI, which was designed to assess locus of
control in adults. The revisions, which resulted in the 11-item ICI-R, were guided by
psychometric principles and locus of control theory (Rotter, 1966, 1990), the latter of which
suggests that locus of control is a broadly defined individual difference trait with diverse
behavioral indicators. Thus, throughout the revision process, efforts were made to clarify the
dimensional structure of the inventory without needlessly narrowing the conceptual breadth of the
underlying latent trait (θ) assessed by the inventory.
The traditional ICI response scale consists of five ordered categories. The first response
option was generally not well-utilized by our respondents, so we combined the first two response
categories to maintain a consistent response metric across all of the items; thus, we recommend
that future researchers adopt a four-point response scale for the inventory (1 =
rarely/occasionally, 2 = sometimes, 3 = frequently, 4 = usually). Our exploratory item factor
analyses suggested an unclear dimensional structure for the inventory. This lack of clarity was
consistent with the findings of previous researchers who conducted exploratory dimensionality
assessments on the inventory (Duttweiler, 1984; Meyers & Wong, 1988). Based on our
exploratory item factor analyses and our preliminary IRT analysis, the multidimensionality in the
data appeared to be due to some items not loading on the latent trait (θ), a possible method effect
due to item scoring (straightforwardly worded vs. reverse scored), and some items that were so
semantically similar that they formed nuisance dimensions that were more conceptually narrow
than desired, given that locus of control (Rotter, 1966, 1990) is a theoretically broad construct.
Based on these observations, many ICI items were removed from the inventory, resulting in the
11-item ICI-R. We judged some of the removed items to be somewhat awkwardly written (e.g.,
“When I’m involved in something I try to find out all I can about what is going on, even when
someone else is in charge.”) or possibly even confusing to participants (e.g., “When faced with a
problem I try to forget it.”). The results of the IRT analysis on the 11-item ICI-R suggested that a
unidimensional model was acceptable, suggesting that the results of the IRT analysis can be
meaningfully interpreted and a single scale score for the ICI-R is an appropriate scoring strategy.
In order to evaluate the construct validity of the ICI-R, scores on the ICI and the ICI-R
were correlated with other constructs that were expected to be related to locus of control. The ICI
and ICI-R demonstrated similar patterns of correlations with measures of learning orientation,
performance orientation, and subjective meaning. These correlation coefficients were similar to
previously reported coefficients relating locus of control to learning and performance orientation
(Heintz & Steele-Johnson, 2004; Phillips & Gully, 1997) and subjective meaning (Zika &
Chamberlain, 1987) in college student samples. However, in comparison to ICI scores, ICI-R
scores were not as strongly related to trait anxiety or negative affect. Judged against the findings of
Archer (1979), the relationship between trait anxiety and the ICI-R was more consistent with
prior research than was the anxiety-ICI relationship. In terms of affect, the results are less
clearly interpretable, as empirical relationships between locus of control and dimensions of affect
have varied in the research literature (e.g., Christopher, Saliba, & Deadmarsh, 2009; Emmons &
Diener, 1985).
Several limitations of this study may attenuate the generalizability of the results. The
revisions that resulted in the ICI-R were inherently exploratory and based on only two very
similar samples of students at a large California university. Additionally, we did generate an a
priori hypothesis that the psychometric properties of the ICI could be improved by removing
some items, but specific decisions regarding which items to remove were admittedly post hoc.
Although efforts were made throughout the revision process to retain as much of the content of
the ICI as possible, it is true that the revisions were guided primarily by statistical analyses, and
thus some important content may have been lost. The ICI-R would benefit from an independent
content review by an expert on locus of control to determine whether the construct assessed is
consistent with locus of control theory. In terms of the psychometric properties of the ICI-R,
future research is needed to determine if the dimensionality of the ICI-R and the results of the
IRT analysis hold in an independent sample (i.e., cross-validation is needed). Additionally, the
ICI-R items have not been screened for differential item functioning for sex or ethnicity; it is
therefore unclear whether the results of the psychometric analyses apply equally to members of
different sex or ethnic groups. Future research may shed light on this issue.
REFERENCES
Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response
theory to evaluate educational and psychological tests. Educational Measurement: Issues
and Practice, 22, 37-53.
Algina, J., & Penfield, R. D. (2009). Classical test theory. In R. Millsap & A. Maydeu-Olivares
(Eds.), The Sage handbook of quantitative methods in psychology (pp. 93-122). Thousand
Oaks, CA: Sage.
Archer, R. P. (1979). Relationship between locus of control and anxiety. Journal of Personality
Assessment, 43, 617-626.
Bentler, P. M. (2009). Alpha, dimension-free, and model-based internal consistency reliability.
Psychometrika, 74, 137-143.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In
F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397-479).
Reading, MA: Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters:
Application of an EM algorithm. Psychometrika, 46, 443-459.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics.
New York, NY: Cambridge University Press.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71, 425-440.
Browne, M. W. (2001). An overview of analytic rotation in exploratory factor analysis.
Multivariate Behavioral Research, 36, 111-150.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen &
J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA:
Sage.
Browne, M. W., Cudeck, R., Tateneni, K. & Mels G. (2008). CEFA: Comprehensive Exploratory
Factor Analysis, Version 3.03 [Computer software and manual]. Retrieved from
http://faculty.psy.ohio-state.edu/browne/
Cai, L., du Toit, S. H. C., & Thissen, D. (2011). IRTPRO: Flexible, multidimensional, multiple
categorical IRT modeling [Computer software and manual]. Chicago, IL: Scientific
Software International.
Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited information
goodness-of-fit testing of item response theory models for sparse 2p tables. British
Journal of Mathematical and Statistical Psychology, 59, 173-194.
Chen, W-H, & Thissen, D. (1997). Local dependence indexes for item pairs using item response
theory. Journal of Educational and Behavioral Statistics, 22, 265-289.
Chernyshenko, O. S., Stark, S., Chan, K.-Y., Drasgow, F., & Williams, B. (2001). Fitting item
response theory models to two personality inventories: Issues and insights. Multivariate
Behavioral Research, 36, 523-562.
Christopher, A. N., Saliba, L., & Deadmarsh, E. J. (2009). Materialism and well-being: The
mediating effect of locus of control. Personality and Individual Differences, 46, 682-686.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.) Hillsdale, NJ:
Erlbaum.
Cook, K. F., Kallen, M. A., & Amtmann, D. (2009). Having a fit: Impact of number of items and
distribution of data on traditional criteria for assessing IRT’s unidimensionality
assumption. Quality of Life Research, 18, 447-460.
de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY:
Guilford.
DeMars, C. (2010). Item response theory. New York, NY: Oxford University Press.
DeNeve, K. M., & Cooper, H. (1998). The happy personality: A meta-analysis of 137 personality
traits and subjective well-being. Psychological Bulletin, 124, 197-229.
Duttweiler, P. C. (1984). The Internal Control Index: A newly developed measure of locus of
control. Educational and Psychological Measurement, 44, 209-221.
Edwards, M. C. (2009). An introduction to item response theory using the Need for Cognition
Scale. Social and Personality Psychology Compass, 3, 507-529.
Elliot, A. J., & Murayama, K. (2008). On the measurement of achievement goals: Critique,
illustration, and application. Journal of Educational Psychology, 100, 613-628.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum Associates.
Emmons, R. A., & Diener, E. (1985). Personality correlates of subjective well-being. Personality
and Social Psychology Bulletin, 11, 89-97.
Furnham, A., & Steele, H. (1993). Measuring locus of control: A critique of general, children’s,
health- and work-related locus of control questionnaires. British Journal of Psychology,
84, 443-479.
Goodman, S. H., & Waters, L. K. (1987). Convergent validity of five locus of control scales.
Educational and Psychological Measurement, 47, 743-747.
Gulliksen, H. (1950). Theory of mental tests. New York, NY: Wiley.
Gustafsson, J.-E., & Åberg-Bengtsson, L. (2010). Unidimensionality and interpretability of
psychological instruments. In S. E. Embretson (Ed.), Measuring psychological
constructs: Advances in model-based approaches (pp. 123-144). Washington, DC:
American Psychological Association.
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed.) (pp.
65-110). United States: American Council on Education and Praeger Publishers.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response
theory. Newbury Park, CA: Sage.
Heintz, Jr., P., & Steele-Johnson, D. (2004). Clarifying the conceptual definitions of goal
orientation dimensions: Competence, control, and evaluation. Organizational Analysis,
12, 5-19.
Horn, J. L. (2005). Neglected thinking about measurement models in behavioral science research.
In A. Maydeu-Olivares & J. J. McArdle (Eds.). Contemporary psychometrics: A
festschrift for Roderick P. McDonald (pp. 101-122). Mahwah, NJ: Lawrence Erlbaum.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria vs. new alternatives. Structural Equation Modeling, 6, 1-55.
Jacobs, K. W. (1993). Psychometric properties of the Internal Control Index. Psychological
Reports, 73, 251-255.
Jöreskog, K. G., & Sörbom, D. (2004). LISREL 8.7 for Windows [Computer software].
Lincolnwood, IL: Scientific Software International, Inc.
Kastner, R. M., Sellbom, M., & Lilienfeld, S. O. (2012). A comparison of the psychometric
properties of the Psychopathic Personality Inventory full-length and short-form versions.
Psychological Assessment, 24, 261-267.
Kormanik, M. B., & Rocco, T. S. (2009). Internal versus external control of reinforcement: A
review of the locus of control construct. Human Resource Development Review, 8, 463-483.
Lefcourt, H. M. (1966). Internal versus external control of reinforcement: A review.
Psychological Bulletin, 65, 206-220.
Lefcourt, H. M. (1982). Locus of control: Current trends in theory and research (2nd ed.).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Lefcourt, H. M. (1992). Durability and impact of the locus of control construct. Psychological
Bulletin, 112, 411-414.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.
Maltby, J., & Cope, C. D. (1996). Reliability estimates of the Internal Control Index among UK
samples. Psychological Reports, 79, 595-598.
Maydeu-Olivares, A. (2005). Further empirical results on parametric versus non-parametric IRT
modeling of Likert-type personality data. Multivariate Behavioral Research, 40, 261-279.
Maydeu-Olivares, A., Cai, L., & Hernández, A. (2011). Comparing the fit of item response theory
and factor analysis models. Structural Equation Modeling, 18, 333-356.
Maydeu-Olivares, A., & Joe, H. (2005). Limited and full information estimation and testing in 2n
contingency tables: A unified framework. Journal of the American Statistical
Association, 100, 1009-1020.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum
Associates.
Meyers, L. S., & Wong, D. T. (1988). Validation of a new test of locus of control: The Internal
Control Index. Educational and Psychological Measurement, 48, 753-761.
Morizot, J., Ainsworth, A. T., & Reise, S. P. (2007). Toward modern psychometrics: Application
of item response theory models in personality research. In R. W. Robins, R. C. Fraley, &
R. F. Krueger (Eds.), Handbook of research methods in personality (pp. 407-423). New
York, NY: The Guilford Press.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed). San Francisco, CA:
McGraw-Hill.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item
response theory models. Applied Psychological Measurement, 24, 50-64.
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X2: An item fit
index for use with dichotomous item response theory models. Applied Psychological
Measurement, 27, 289-298.
Phillips, J. M., & Gully, S. M. (1997). Role of goal orientation, ability, need for achievement, and
locus of control in the self-efficacy and goal-setting process. Journal of Applied
Psychology, 82, 792-802.
Preacher, K. J., & MacCallum, R., C. (2003). Repairing Tom Swift’s electric factor analysis
machine. Understanding Statistics, 2, 13-43.
Reise, S. P. (under review). The rebirth of bifactor measurement models.
Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the
extent to which multidimensional data yield univocal scale scores. Journal of Personality
Assessment, 92, 544-559.
Rotter, J. B. (1954). Social learning and clinical psychology. New York, NY: Prentice-Hall, Inc.
Rotter, J. B. (1966). Generalized expectancies for internal versus external control of
reinforcement. Psychological Monographs: General and Applied, 80 (1, Whole No. 609).
Rotter, J. B. (1975). Some problems and misconceptions related to the construct of internal versus
external control of reinforcement. Journal of Consulting and Clinical Psychology, 43, 56-67.
Rotter, J. B. (1990). Internal versus external control of reinforcement: A case history of a
variable. American Psychologist, 45, 489-493.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.
Psychometrika Monograph, 17.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.),
Handbook of modern item response theory (pp. 85-100). New York, NY: Springer-Verlag.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha.
Psychometrika, 74, 107-120.
Spearman, C. (1904a). The proof and measurement of association between two things. American
Journal of Psychology, 15, 72-101.
Spearman, C. (1904b). “General intelligence” objectively determined and measured. American
Journal of Psychology, 15, 201-292.
Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American
Journal of Psychology, 18, 161-169.
Spearman, C. (1910). Correlation calculated with faulty data. British Journal of Psychology, 3,
271-295.
Spearman, C. (1913). Correlation of sums and differences. British Journal of Psychology, 5, 417-426.
Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for
the State-Trait Anxiety Inventory (Form Y): (“Self-Evaluation Questionnaire”). Palo
Alto, CA: Consulting Psychologists Press.
Steger, M. F., Frazier, P., Oishi, S., & Kaler, M. (2006). The Meaning in Life Questionnnaire:
Assessing the presence of and search for meaning in life. Journal of Counseling
Psychology, 53, 80-93.
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D.
Thissen & H. Wainer (Eds.), Test scoring (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum
Associates.
Thissen, D., & Steinberg, L. (2009). Item response theory. In R. Millsap & A. Maydeu-Olivares
(Eds.), The Sage handbook of quantitative methods in psychology (pp. 148-177).
Thousand Oaks, CA: Sage.
Thissen, D., & Steinberg, L. (2010). Using item response theory to disentangle constructs at
different levels of generality. In S. E. Embretson (Ed.), Measuring psychological
constructs: Advances in model-based approaches (pp. 123-144). Washington, DC:
American Psychological Association.
Thissen, D., & Wainer, H. (2001). Overview of Test Scoring. In D. Thissen & H. Wainer (Eds.),
Test scoring (pp. 1-19). Mahwah, NJ: Lawrence Erlbaum Associates.
Thurstone, L. L. (1925). A method of scaling psychological and educational tests. Journal of
Educational Psychology, 16, 433-449.
von Davier, M. (2010). Mixture distribution item response theory, latent class analysis, and
diagnostic mixture models. In S. E. Embretson (Ed.), Measuring psychological
constructs: Advances in model-based approaches (pp. 11-34). Washington, DC:
American Psychological Association.
Waller, N. G., & Reise, S. P. (2010). Measuring psychopathology with nonstandard item response
theory models: Fitting the four-parameter model to the Minnesota Multiphasic
Personality Inventory. In S. E. Embretson (Ed.), Measuring psychological constructs:
Advances in model-based approaches (pp. 147-173). Washington, DC: American
Psychological Association.
Watson, D., & Clark, L. A. (1994). The PANAS-X: Manual for the Positive and Negative Affect
Schedule – Expanded Form.
Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future
directions. Psychological Methods, 12, 58-79.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item
dependence. Journal of Educational Measurement, 30, 187-213.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.),
Educational measurement (4th ed.) (pp. 111-153). United States: American Council on
Education and Praeger Publishers.
Zika, S., & Chamberlain, K. (1987). Relation of hassles and personality to subjective well-being.
Journal of Personality and Social Psychology, 53, 155-162.
Zinbarg, R. E., Yovel, I., Revelle, W., & McDonald, R. P. (2006). Estimating generalizability to a
latent variable common to all of a scale’s indicators: A comparison of estimators for ωh.
Applied Psychological Measurement, 30, 121-144.