AN ITEM RESPONSE THEORY REVISION OF THE INTERNAL CONTROL INDEX

A Thesis Presented to the faculty of the Department of Psychology, California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of MASTER OF ARTS in Psychology (Industrial/Organizational Psychology)

by Leanne M. Williamson

SPRING 2012

Approved by:
Lawrence S. Meyers, Ph.D., Committee Chair
Tim W. Gaffney, Ph.D., Second Reader
Jianjian Qin, Ph.D., Third Reader

Student: Leanne M. Williamson

I certify that this student has met the requirements for format contained in the University format manual, and that this thesis is suitable for shelving in the Library and credit is to be awarded for the thesis.

Jianjian Qin, Ph.D., Graduate Coordinator, Department of Psychology

Abstract of AN ITEM RESPONSE THEORY REVISION OF THE INTERNAL CONTROL INDEX by Leanne M. Williamson

The Internal Control Index (ICI) is a 28-item measure of locus of control. The present study utilized item response theory (IRT) to determine whether the length of the instrument could be reduced to improve its psychometric properties. Students at CSU Sacramento (N = 631) completed the ICI for course credit. When the scale was reduced to 11 items, the graded response model demonstrated a good fit, M2(484) = 823.34, p < .001, RMSEA = .03. Because summed scores were strongly linearly related to IRT scale scores (θ), r(596) = .997, p < .001, they were considered to be a good approximation of locus of control. Additionally, summed scores on the original and new scoring strategies correlated highly, r(596) = .82, p < .001, and both versions correlated similarly with related constructs. It therefore appears that the ICI-R may provide an equally valid and more efficient measure of locus of control.

ACKNOWLEDGEMENTS

Many individuals were instrumental in the completion of this thesis. I especially appreciate the incredible time and expertise that Dr. Larry Meyers has contributed to the development of my methodological and critical thinking skills, and his endless encouragement to pursue my passion for psychometric theory. Because of his influence, I have grown more than I ever thought I could have in three years. My experience in the master's program at Sac State would not have been the same without him.

I also owe a large debt of gratitude to Tim Gaffney for the many hours we have spent discussing the intricacies of psychometric theory and working through computations together. Throughout the development and completion of my thesis, he has been a selfless and invaluable resource. It has been absolutely wonderful to have access to someone with so much passion for this subject matter.

I would also like to thank Dr. Jianjian Qin for his work as a member of my committee and for his considerable role in inspiring my academic goals. His research methods and statistics classes were part of what convinced me to pursue a career in research methodology, and without that goal I never would have attempted this thesis. He has an amazing ability to explain statistics in plain English and, I dare say, make the learning process fun.
Steve Reise personally suggested the methodology for the IRT scale revision. I have also learned much of what I claim to know about IRT by reading his work. Additionally, I am grateful to my research assistants for their dedication and hard work on this project. I would especially like to thank Chereé Ramón, Ben Trowbridge, and Mike Whitehead for their valuable contributions and good humor.

Last but certainly not least, I would like to thank my family and friends for their support and understanding during the many times I seemingly disappeared from the face of the planet to do my work. Special thanks go to my husband, Andrew Williamson, for encouraging me to pursue my master's degree and doing more than his share of dishes throughout the process. Additionally, the friendship and insights of Kasey Stevens, Sanja Durman-Perez, Najia Nafiz, and Lilly Aston have meant the world to me over the past few years.

TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures
1. INTRODUCTION
   Overview of CTT
   Overview of IRT
      The 2PLM
      The GRM
      Assumptions
         Dimensionality
         Local Independence
         Monotonicity/Functional Form
      Advantages of Using IRT in Test Revision
         Test Information Curve
         Test Standard Error Curve
         Test Characteristic Curve
   Present Study
2. METHOD
   Participants and Procedure
   Materials
      ICI
      Study 1 Variables
      Study 2 Variables
   Software
3. RESULTS
   Comparison of ICI Data from the Two Studies
   CTT Psychometric Properties of the ICI
   Dimensionality Assessment of the ICI
   IRT Psychometric Properties of the ICI
   IRT-Based Revision of the ICI
   Dimensionality Assessment of the ICI-R
   Psychometric Properties of the ICI-R
   Comparison of the ICI and ICI-R
4. DISCUSSION
References

LIST OF TABLES
1. Participant Demographic Data
2. ICI Items
3. CTT Item Statistics for the ICI
4. IRT Item Parameter Estimates for the 26-Item Version of the ICI
5. IRT Item Parameter Estimates for the 11-Item ICI-R
6. Predicted Summed Score to Scale Score Conversion for the ICI-R
7. Correlations of the ICI and ICI-R with Other Constructs

LIST OF FIGURES
1. 2PLM trace line
2. 2PLM trace lines with different location (b) parameters
3. 2PLM trace lines with different slope (a) parameters
4. GRM trace lines
5. Test information and standard error curves for the ICI-R
6. Test characteristic curve for the ICI-R

Chapter 1
INTRODUCTION

The purpose of most psychological tests is to estimate individuals' standing on some unobservable psychological construct (e.g., happiness). Test items are often developed based on theory in order to assess that target construct, and responses to items on a test are implicitly hypothesized to be manifest indicators of that construct. Validity evidence can be gathered to support the hypothesized relationship between item responses and the construct of interest, and thereby to support the utility of the test (Borsboom, 2005, 2006). In statistical terms, the underlying construct is often referred to as a latent variable, and the goal of a psychometric study is to provide empirical evidence that either supports or refutes the argument that a set of items measures the intended latent variable.

Classical test theory (CTT) and item response theory (IRT) are two somewhat loosely defined theoretical frameworks that are routinely used to evaluate the psychometric properties of tests, or the degree to which the statistical properties of a test support its intended interpretation and use (Algina & Penfield, 2009). Traditionally, personality researchers have relied on CTT in developing and scoring personality assessment instruments, whereas IRT has dominated in large-scale educational testing (Embretson & Reise, 2000). However, the use of IRT modeling is becoming more common in personality research (e.g., Chernyshenko, Stark, Chan, Drasgow, & Williams, 2001; Edwards, 2009; Maydeu-Olivares, 2005; Waller & Reise, 2010). Although CTT and IRT subsume different measurement models and statistical methods, statistics generated from each framework tend to be complementary in practical testing applications (Thissen & Orlando, 2001). In fact, some researchers have argued that CTT analyses should routinely precede IRT analyses so that items of poor psychometric quality can be screened out prior to IRT analyses (e.g., Morizot, Ainsworth, & Reise, 2007). Although CTT and IRT are routinely used to develop personality measurement instruments, both can also be productively used to refine existing inventories. In the present study, we primarily used IRT measurement models to evaluate and shorten a measure of locus of control.

Overview of CTT

CTT has been ubiquitous in test development since at least the 1930s (Embretson & Reise, 2000). Defining works in CTT include Harold Gulliksen's (1950) Theory of Mental Tests (Embretson & Reise, 2000; Haertel, 2006; Lord & Novick, 1968) and Lord and Novick's (1968) Statistical Theories of Mental Test Scores (Embretson & Reise, 2000; Haertel, 2006; Thissen & Wainer, 2001). In his classic text, Gulliksen (1950) credited the bulk of the important CTT formulas to the early papers of Charles Spearman. In these papers, Spearman (1904a, 1904b, 1907, 1910, 1913) addressed a fundamental problem in psychological testing: All test scores are influenced by measurement error that obscures their interpretation, and methods are therefore needed to assess the impact of error on test scores in order to determine whether they measure anything reliably. A precise definition of CTT is difficult to pinpoint, but Nunnally and Bernstein (1994) tentatively proposed that CTT comprises the methods by which test-taker attributes are estimated from linear combinations of responses to the individual items on a test.
For example, on a measure of happiness with a standardized 5-point response scale for each item, an individual's happiness score can be expressed as the sum of that individual's item responses. It is assumed that item responses are on a continuous scale and that item and test scores are linearly related to the latent construct assessed by the test. Within a CTT framework, individuals are assumed to have a theoretical true score on a test, which is the score that the test taker would achieve given perfect measurement (Nunnally & Bernstein, 1994). Because tests measure latent constructs imperfectly (i.e., they are influenced by measurement error), observed scores differ from true scores. Classical true score theory can be elegantly expressed as:

$$X = T + E$$

where X is the observed score, T is the theoretical true score, and E is error. If it were possible to administer a test an infinite number of times, the average observed score would closely approximate the true score (Nunnally & Bernstein, 1994). A test cannot be administered an infinite number of times, but an estimated true score can be obtained using the observed score and error estimates for the test. The above formula can also be represented in terms of variance components such that:

$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2$$

where $\sigma_X^2$ is the variance of the observed scores, $\sigma_T^2$ is the variance of the true scores, and $\sigma_E^2$ is error variance. In CTT, error variances are assumed to be normally distributed and unsystematic (Algina & Penfield, 2009). Based on the above partitioning of variances, reliability can be expressed as:

$$\rho = \frac{\sigma_T^2}{\sigma_X^2}$$

That is, reliability (ρ) is conceptualized as the proportion of true score variance contained in a set of observed scores. A test is most reliable when the influence of error variance on observed scores is minimal; thus, a test measures a construct most precisely when reliability is high and error is low. For most psychological tests, CTT measurement precision is typically expressed in terms of global summary statistics such as coefficient α and the overall standard error of measurement for a given test.

Test reliability is often conceptualized in terms of internal consistency, which means that item responses caused by the same latent construct should covary (Zinbarg, Yovel, Revelle, & McDonald, 2006). Coefficient α is a particularly common index in psychological testing and is widely regarded as a measure of internal consistency because it is a function of the item intercorrelations on a test. Coefficient α represents a lower bound on the population reliability of a test score (McDonald, 1999). Within a CTT framework, tests with αs of .80 or greater are generally considered to be reliable (Nunnally & Bernstein, 1994). Coefficient α has some limitations, and a number of measurement experts discourage interpreting coefficient α as a measure of internal consistency (Bentler, 2009; McDonald, 1999; Sijtsma, 2009). Unfortunately, coefficient α is influenced substantially by test length (due to the method by which α is computed), with longer tests (all else being equal) more likely to demonstrate higher values of α (Nunnally & Bernstein, 1994). Additionally, tests with overly repetitive items assessing substantively narrow constructs tend to have very high αs but may have little predictive validity, so high values of α are not always indicative of high-quality tests (Horn, 2005).

CTT has been a popular framework for test development for decades because it is relatively simple and the statistics it provides are relatively easy to interpret.
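To make these formulas concrete, the minimal sketch below computes coefficient α as a function of the item variances and the variance of the summed scores. The data matrix, function name, and values are hypothetical and are offered only as an illustration under the assumptions of the formulas above, not as a description of the analyses reported later in this thesis.

```python
import numpy as np

def coefficient_alpha(responses):
    """Coefficient alpha for a respondents-by-items matrix of scores.

    Assumes complete data (no missing responses). alpha rises as the
    items' shared (true score) variance dominates their unique variance.
    """
    k = responses.shape[1]                          # number of items
    item_vars = responses.var(axis=0, ddof=1)       # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point responses: four respondents, three items.
data = np.array([[4, 5, 4],
                 [2, 2, 3],
                 [5, 4, 5],
                 [3, 3, 2]])
print(round(coefficient_alpha(data), 2))  # prints 0.89 for these toy data
```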
However, CTT is limited in that the information it provides is sample and test dependent; that is, test and item level statistics are estimated for the particular sample that completed the specific test administered and apply only to that sample. Additionally, test scores are only meaningful in relation to the specific test completed and the population from which the sample that completed it was drawn. For this reason, results of CTT analyses conducted with one group of test takers (e.g., college students) who completed one test do not readily generalize to different populations (e.g., job applicants) or to other tests purported to measure the same construct (Yen & Fitzpatrick, 2006). Even within a single population, changes made to a single item can alter the interpretation of the test score. As more items on a test are removed or replaced, previous CTT analyses of the test become less applicable to the new version of the test.

Overview of IRT

IRT consists of a collection of mathematical models and statistical methods used for item analysis and test scoring (Thissen & Steinberg, 2009). David Thissen and his colleagues (Thissen & Orlando, 2001; Thissen & Steinberg, 2009) have documented the development of IRT, crediting its conceptual foundation to a 1925 paper by Louis Leon Thurstone. In his paper, Thurstone (1925) demonstrated that Binet intelligence test items could be arranged along a continuum representing mental age (M); specifically, the difficulty of an item could be regarded as the point on the M continuum at which 50% of children of that mental age would answer the item correctly. Thus, Thurstone contributed the idea that items and people could be located on the same metric, a concept underlying all IRT models (Thissen & Steinberg, 2009). In 1950, Lazarsfeld offered two other fundamental concepts underlying all IRT models: (a) item responses are driven by a latent variable that (b) explains the observed relationships among a set of item responses (Thissen & Steinberg, 2009). IRT was formally introduced in Lord and Novick's (1968) classic text within four chapters written by Allan Birnbaum (Thissen & Steinberg, 2009). However, the widespread implementation of IRT was limited until R. Darrell Bock and Murray Aitkin (1981) provided an efficient method for estimating IRT parameters, marginal maximum likelihood estimation with an EM algorithm; this algorithm is currently the most widely implemented estimation procedure in IRT software programs (Thissen & Steinberg, 2009).

IRT models account for the fact that psychological tests, such as personality inventories, often utilize summative response scales with ordered categories rather than continuous measurement scales (Wirth & Edwards, 2007). Individuals' levels of the latent trait (θ) are typically estimated directly from the patterns of item responses using nonlinear models (Edwards, 2009). In a unidimensional IRT analysis, items and individuals are located on the θ scale based on individuals' patterns of responses to items assessing a common construct. Given appropriate sampling, this allows test developers to evaluate the properties of items as well as the measurement precision of a test relative to the trait levels of the population for which the test is intended. However, the metric of θ is indeterminate and must be fixed before IRT item and person parameters can be estimated. Typically, this is accomplished by setting the sample mean to zero and the sample standard deviation to unity.
If the underlying latent trait is assumed to be normally distributed, then the interpretation of the resulting parameters is similar to the interpretation of z scores (Edwards, 2009). For example, an estimated θ of zero indicates that an individual is of average ability relative to the rest of the sample. IRT provides test and item level statistics that are theoretically sample independent; that is, the statistical parameters from different samples can be directly compared after they are placed on a common metric (de Ayala, 2009). Given acceptable model-data fit, the use of IRT in test development provides several advantages not available using CTT analyses, including more complex test designs (von Davier, 2010). IRT can also provide several advantages for researchers revising personality measures. For instance, when changes are made to a test, such as the deletion or replacement of some items, IRT linking and equating procedures allow test developers to meaningfully compare the new measure to older versions of the measure.

Although IRT models have many theoretically desirable features, most widely used IRT models involve strong assumptions that must be met, at least to a certain extent, before item and person parameters can be meaningfully interpreted (Embretson & Reise, 2000). Additionally, large sample sizes are typically needed to obtain stable estimates. No definitive sample size guidelines can be provided, and expert recommendations depend upon such factors as test length, sample characteristics, item properties, and the specific IRT model used, but many popular models are believed to require sample sizes of 500 test takers or more for accurate estimation (de Ayala, 2009).

A variety of IRT models exist, but the most commonly applied IRT models are parametric logistic models (Thissen & Orlando, 2001). According to Thissen and Orlando (2001), logistic models are prevalent because they tend to provide theoretically and practically meaningful approaches to describing item response data. Item responses can be modeled as probabilistic functions of the psychometric properties of items and latent trait scores, given that the properties of the items and individuals have been estimated appropriately. The matter of which IRT model, if any, is appropriate for a given application depends in part upon the number of response options associated with each item. Items with two response options are called dichotomous items (e.g., true/false scales). Items with more than two response options are called polytomous items (e.g., Likert scales). Edwards (2009) identified two models that have been of particular interest to psychologists: the 2 parameter logistic model (2PLM; Birnbaum, 1968) for dichotomous items and the logistic graded response model (GRM; Samejima, 1969, 1997) for polytomous items.

The 2PLM

Using notation similar to Thissen and Steinberg (2009), the probability of endorsing an item under the 2PLM can be expressed as:

$$P_i(x_i = 1 \mid \theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}$$

where $P_i(x_i = 1 \mid \theta)$ is the probability of endorsing item i in the keyed direction ($x_i = 1$) given an individual's latent trait score (θ), $a_i$ is the item slope parameter, and $b_i$ is the item location parameter. This definition assumes that all parameter values for individuals (θ) and items ($a_i$, $b_i$) are known (i.e., they have been estimated). Figure 1 displays a 2PLM trace line, a graphical depiction of the item properties of a 2PLM item. The location parameter is measured at the point on the trace line at which examinees of the corresponding trait level (θ) have a 50/50 chance of endorsing the keyed response. This occurs at the point of inflection on the 2PLM trace line. The item in Figure 1 has a b value of 0. Based on the 2PLM, an individual of average ability (i.e., with a θ of 0) would have approximately a 50% probability of endorsing this item.

Figure 1. 2PLM trace line. This function indicates the relationship between the latent trait (θ) and the probability of choosing the keyed alternative.
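A brief computational sketch of this trace line may help fix the ideas. The location value b = 0 matches the item in Figure 1; the slope value a = 1.0 is an assumption made only for this illustration, since Figure 1 does not report one.

```python
import numpy as np

def p_2plm(theta, a, b):
    """2PLM probability of a keyed response at trait level theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# An item like the one in Figure 1 (b = 0; a = 1.0 assumed).
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_2plm(theta, a=1.0, b=0.0), 3))
# Output: 0.119, 0.5, and 0.881 — at theta = b the endorsement
# probability is exactly .50, the point of inflection noted above.
```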
The location parameter is measured at the point on the trace line at which examines of the corresponding trait level (θ) have a 50/50 chance of 8 endorsing the keyed response. This occurs at the point of inflection on the 2PLM trace line. The item in Figure 1 has a b value of 0. Based on the 2PLM, an individual of average ability (i.e., with a θ of 0) would have approximately a 50% probability of endorsing this item. Figure 1. 2PLM trace line. This function indicates the relationship between the latent trait (θ) and the probability of choosing the keyed alternative. Conceptually, the location parameter accounts for the fact that some items are easier or more difficult to endorse than others (i.e., they have different b values). Under the 2PLM, individuals with higher levels on the latent construct are more likely to respond to a difficult-toendorse item in the keyed direction. Figure 2 provides an example of two items with different b values. Individuals of average ability would have a 50% probability of endorsing item 1 (b = 0), whereas individuals would need to be a standard deviation above average on (if a Gaussian distribution were assumed) in order to have a 50% probability of endorsing item 2 (b = 1). Thus, item 2 is more difficult to endorse than item 1. For example, on a scale with higher scores indicative of greater happiness and a dichotomous response format (e.g., true/false), endorsing item 2 would be indicative of greater happiness than endorsing item 1. 9 Figure 2. 2 PLM trace lines with different location (b) parameters. Item 2 is more difficult to endorse than item 1. The slope (a) parameter is analogous to a factor loading in factor analysis (McDonald, 1999). However, because the 2PLM is a nonlinear model, the slope is not constant across θ (whereas the relationship between an item factor loading and the latent trait is assumed to be linear in most factor analytic models). The slope parameter is measured at the point of inflection on the trace line, which is the point at which the trace line is the steepest. Conceptually, the slope parameter indicates how well the item differentiates between test takers with different levels of θ. More discriminating items provide more information about a test taker’s score on the latent construct because these items are more related to the latent construct. Figure 3 illustrates two trace lines with the same item locations but different slopes. Item 4 (a = 2) is more strongly related to θ than item 3 (a = 1) because it has a steeper slope. Because item 4 is more strongly related to θ, it provides the more information about an individual’s location on the θ scale than item 3. Continuing with our hypothetical measure of happiness, item 4 would be more indicative of happiness than item 3. 10 Figure 3. 2 PLM trace lines with different slope (a) parameters. Item 4 is more discriminating than item 3. The GRM The GRM is an extension of the 2PLM for items with polytomous response scales. For each item, the GRM incorporates a single slope (a) parameter and multiple location (b) parameters. The number of estimated location parameters for an item with k categorical response options is equal to k – 1, or one less than the number of response options. Under the GRM, item response categories are assumed to be ordered, such that higher levels of θ correspond to respondents selecting a higher response category (e.g., strongly agree rather than agree). By convention, the first response option is set to 0. 
Thus, the options for an item with k response options are scaled as 0, 1, …, k – 1. Each location parameter corresponds to the probability of selecting category k or higher. For example, b1 indexes the probability of selecting category 1 or higher, given the conventional approach to scaling. An interesting feature of the GRM is that the distance (on the θ scale) between the response options is freely estimated for each item; that is, the CTT assumptions that response options are on an equal interval scale and that the response scale has a consistent meaning across items do not apply under the GRM. This is because each location parameter is estimated as a separate 2PLM item, as displayed in the upper panel of Figure 4. The GRM for an item with five response options can be expressed as:

$$P_i(x_i = k \mid \theta) = \frac{1}{1 + e^{-a_i(\theta - b_{i,k})}} - \frac{1}{1 + e^{-a_i(\theta - b_{i,k+1})}}$$

or, more simply:

$$P_i(x_i = k \mid \theta) = P_i^*(k) - P_i^*(k+1)$$

where all variables are as previously defined. The equations above make it possible to graph the probability of responding in each category as a function of θ. This is depicted in the lower panel of Figure 4, in which increased levels of θ correspond to selecting progressively higher options on the response scale.

Figure 4. GRM trace lines. The upper panel illustrates the meaning of the b values for a 5 option item. The lower panel illustrates the model-based probability of selecting each response option for that same item.
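The sketch below implements the computation just described for a single 5-option item. The helper names and all parameter values are invented for illustration; the boundary curves are padded with P*(0) = 1 and P*(k) = 0 so that the same differencing formula covers the lowest and highest categories.

```python
import numpy as np

def p_star(theta, a, b):
    """2PLM-type boundary curve: probability of category k or higher."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def grm_category_probs(theta, a, bs):
    """GRM probability of each response category 0..k-1.

    bs holds the k - 1 ordered location parameters b_1 < ... < b_{k-1}.
    """
    cum = [1.0] + [p_star(theta, a, b) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

# Hypothetical 5-option item (a and b values are made up).
probs = grm_category_probs(theta=0.0, a=1.5, bs=[-2.0, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])  # five probabilities summing to 1
```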
A common approach to addressing the dimensionality problem is to use a combination of exploratory and confirmatory factor analytic methods designed for ordinal data prior to fitting an IRT model (Ackerman, Gierl, & Walker, 2003; Wirth & Edwards, 2007). These methods utilize tetrachoric correlations for dichotomous item responses, polychoric correlations for polytomous 14 item responses, and appropriate estimation methods. Many fit indices have been developed for use in evaluating linear confirmatory factor analysis models designed for continuous variables, and cutoff values have been proposed for these indices to aid researchers in evaluating model fit (e.g., Browne & Cudeck, 1993; Hu & Bentler, 1999). Some IRT researchers have endorsed evaluating confirmatory ordinal factor analytic models using these benchmark cutoffs in assessing whether item response data are unidimensional enough for IRT modeling (e.g., Edwards, 2009). However, others have argued that this approach may be be unsatisfactory and that alternative methods are needed to assess whether data are udimensional enough for IRT (Cook, Kallen, & Amtmann, 2009; Reise, Moore, & Haviland, 2010; Reise, under review). One approach that has been proposed as a framework for addressing this problem is the use of bifactor modeling (e.g., Reise et al., 2010; Reise, under review). In a bifactor model, all items load onto a general factor that is expected to represent the construct of interest. In the most interpretable type of bifactor structure, each item also loads onto one and only one group factor, and the general and group factors are constrained to be orthogonal. The general factor is believed to reflect the construct the researcher intends to measure, whereas the group factors represent multidimensionality inherent in the data. The bifactor model can be compared to a unidimensional model. If the data are unidimensional enough for IRT, the factor loadings on the general factor should be similar to the factor loadings obtained from a unidimensional solution. Substantial differences suggest that forcing multidimensional data into a unidimensional IRT model may result in an invalid solution. Local Independence. This assumption, also called conditional independence, is closely related to the dimensionality assumption (Embretson & Reise, 2000). Local independence means that, after accounting for the common construct (or constructs) that a test measures (i.e., after fitting the IRT model), there is no remaining relationship among the items (de Ayala, 2009; Yen, 15 1993). Local independence is typically operationally defined such that, after controlling for , no substantial relationship remains between any pair of items. This is analogous to the assumption in CTT and factor analysis that, after controlling for the common factor, the residual item variances or error variances are uncorrelated. If a set of items contains many locally dependent item pairs, this indicates the presence of multidimensionality that is not accounted for by the model (Thissen & Steinberg, 2010), which may suggest that a unidimensional model is not appropriate. Conversely, the absence of local dependencies implies unidimensionality (Edwards, 2009). Monotonicity/Functional Form. The monotonicity and functional form assumptions are closely related, but distinct. Both refer to the relationship between the latent construct presumed to underlie item responses and the probability of selecting a particular item response category. 
If the monotonicity assumption is met for a 2PLM item, then the probability of endorsing the item increases as a function of θ. For a GRM item, it is expected that individuals with greater levels of θ will more strongly endorse an item measuring θ (e.g., select strongly agree rather than agree on a summative response scale). The functional form assumption is met to the extent that observed item response patterns are consistent with the item response functions predicted by the CTT or IRT model used to analyze a given data set. In CTT, the expected function is a straight line because CTT is based on the general linear model. Under the 2PLM, the expected functional form assumed to fit the data is the type of logistic "S"-shaped curve that was depicted in Figure 1. Because 2PLM trace lines model a probabilistic relationship, the function is bounded by zero at the lower asymptote and unity at the upper asymptote; that is, the endorsement probability cannot be less than zero or greater than one. GRM item responses are expected to conform to the type of function that was depicted in Figure 4.

Advantages of Using IRT in Test Revision

Given acceptable model-data fit, the methods and models of IRT can provide researchers with powerful tools for revising personality tests. These advantages are due primarily to IRT conceptualizations of measurement precision. As with IRT item-level functions (e.g., trace lines), IRT measurement precision for a test is typically represented graphically as a function of θ and allowed to vary across the θ continuum (as opposed to CTT, where overall summary statistics are more commonly reported). The following test-level functions are related to IRT measurement precision: the test information curve, the test standard error curve, and the test characteristic curve.

Test Information Curve. A test information curve depicts the amount of information that a test, including all items, provides for estimating each individual's location on the θ scale (de Ayala, 2009). The shape of this function is directly related to the IRT parameters of a set of test items; specifically, test information is greatest where the test items are concentrated (i.e., where more b values are located) and where the items are more strongly related to θ (i.e., where the a values are larger). A test best differentiates between individuals of different θ levels at the point at which test information is highest. A test differentiates less well at θ levels where the test information function is lower. If a sufficient pool of items has been generated to assess a latent construct and then appropriately calibrated using IRT, test information curves can be engineered for specific purposes. For instance, if measurement precision is desired around a certain cut score (e.g., for a pass/fail test), a test can be designed with item location parameters concentrated near the cut score in order to optimally differentiate individuals who are above and below the cut score. Alternatively, if precision is desired along the entire range of the latent construct, as is often the case with personality measures, a test can be designed to include item location parameters that span the θ continuum in order to measure precisely across the range of θ.

Test Standard Error Curve. A test standard error curve depicts the amount of measurement error in a test, controlling for θ. It is computed as the inverse of the square root of information at each θ level (Yen & Fitzpatrick, 2006).
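Expressed as a formula (restating Yen & Fitzpatrick, 2006, where I(θ) denotes test information at trait level θ):

$$SE(\theta) = \frac{1}{\sqrt{I(\theta)}}$$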
Where the curve is lower, there is less error in estimating an individual's location on the θ scale. Conversely, where the curve is higher, there is more error and thus less precision in estimating individuals' θ scores. The θ at which the standard error function is lowest corresponds to the peak of the test information function; thus, IRT measurement precision for a test is maximized for the θ (or range of θs) at which information is highest and error is lowest.

Test Characteristic Curve. A test characteristic curve graphically illustrates the relationship between θ and predicted summed scores on a test. Predicted summed scores are analogous to CTT estimated true scores and are computed using the IRT properties of the items on the test (Hambleton, Swaminathan, & Rogers, 1991). This computation is possible because IRT models are probabilistic; more specifically, every level of θ is associated with some probability of selecting each response option for an item, as depicted on trace lines. Additionally, each option is associated with a specific value (the same type of whole-number value that is typically used to compute observed summed scores). For a dichotomous item, the non-keyed and keyed options are assigned values of 0 and 1, respectively. As previously discussed, when polytomous items are analyzed with IRT the options are assigned values of 0, 1, …, k – 1. The test characteristic curve is a linear combination of the trace lines for all of the items on a test (Yen & Fitzpatrick, 2006). Specifically, when θ is known, the probability of selecting option k is multiplied by the numerical value of option k for every option and every test item. These values are then summed, resulting in a predicted summed score for that θ. Predicted summed scores look similar to actual summed scores but are not limited to whole numbers due to the method by which they are computed (i.e., they typically have decimal values). Predicted summed scores should increase as θ increases, although this relationship is not necessarily linear.
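As an illustration of the computation just described, the sketch below produces predicted summed scores for a hypothetical two-item GRM test. Both items and all parameter values are invented for this example; the GRM helper is the same one sketched earlier.

```python
import numpy as np

def grm_category_probs(theta, a, bs):
    """GRM probability of each response category 0..k-1 (see earlier sketch)."""
    cum = [1.0] + [1.0 / (1.0 + np.exp(-a * (theta - b))) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

def predicted_summed_score(theta, items):
    """Expected summed score at theta: weight each option's value
    (0, 1, ..., k-1) by its model-based probability, then sum over items."""
    total = 0.0
    for a, bs in items:
        probs = grm_category_probs(theta, a, bs)
        total += sum(k * p for k, p in enumerate(probs))
    return total

# Two hypothetical 5-option GRM items.
items = [(1.2, [-2.0, -0.8, 0.3, 1.4]),
         (0.9, [-1.5, -0.2, 0.9, 2.0])]
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(predicted_summed_score(theta, items), 2))
# Predicted scores rise with theta, but not along a straight line.
```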
Present Study

Locus of control (Rotter, 1954, 1966) is conceptualized as a personality construct encompassing an individual's general beliefs about the causes of environmental reinforcement. Those with a more internal locus of control believe their own actions can bring about desired events, whereas those with a more external locus of control believe that whether they receive reinforcement generally depends on luck, chance, fate, or other factors external to themselves (Lefcourt, 1966, 1982, 1992; Rotter, 1975, 1990). Rotter (1990) conceptualized locus of control as a theoretically broad construct with diverse and somewhat loosely associated behavioral indicators. The first inventory designed to measure locus of control was the Rotter I-E scale (Rotter, 1966). Although the Rotter I-E scale has been widely used in psychological research, many other measures of locus of control have been developed. Some are general like the I-E, some are conceptually narrow (e.g., work locus of control), and some are multidimensional (see Furnham & Steele, 1993; Goodman & Waters, 1987; Kormanik & Rocco, 2009; Lefcourt, 1982).

One conceptually broad alternative to the I-E scale is the Internal Control Index (ICI; Duttweiler, 1984), a 28-item measure of locus of control for adults. A single score represents individuals' standing on the construct, with higher scores indicating greater levels of internal locus of control. This instrument has generally performed favorably in CTT psychometric analyses (Duttweiler, 1984; Jacobs, 1993; Maltby & Cope, 1996; Meyers & Wong, 1988). Values of coefficient α for the ICI (.83 to .85) tend to be somewhat higher than values for the Rotter I-E scale (.75 to .77; Duttweiler, 1984; Goodman & Waters, 1987; Meyers & Wong, 1988), but the two measures appear to demonstrate comparable patterns of correlations with related constructs (Meyers & Wong, 1988).

Archer (1979) reviewed the research literature on locus of control and anxiety and reported that measures of internal locus of control tend to be negatively associated with measures of trait anxiety. Additionally, previous empirical work on achievement goal orientation has suggested that an internal locus of control is positively associated with intrinsic motivation to learn (learning orientation) and unassociated with extrinsic goals to perform well in academics (performance orientation; Heintz & Steele-Johnson, 2004; Phillips & Gully, 1997). After meta-analytically examining the relationships between subjective well-being (operationalized as satisfaction with life, high positive affect, and low negative affect) and a myriad of other variables, DeNeve and Cooper (1998) concluded that locus of control was one of the best predictors of subjective well-being.

Although the ICI has performed reasonably well in previous research, some potential weaknesses in its internal structure have been identified. For example, Goodman and Waters (1987) reported that the ICI did not demonstrate acceptable levels of convergent validity with scales of four other inventories that were purported to measure various dimensions of locus of control, suggesting that locus of control might be a multidimensional construct. Additionally, CTT item analyses (Duttweiler, 1984; Jacobs, 1993; Maltby & Cope, 1996; Meyers & Wong, 1988) have shown that some items within the ICI tend to correlate relatively weakly with the total test score. Exploratory dimensionality analyses have been conducted on the ICI in two studies, both utilizing orthogonal varimax rotation. Based on her principal axis factor solution, Duttweiler (1984) suggested that the inventory was composed of two factors, whereas Meyers and Wong (1988) utilized principal components analysis and suggested that the inventory assessed three principal components; in both studies, multiple items did not load on any factor or component. We hypothesized that the psychometric properties of the ICI could be clarified and improved using an IRT measurement model and other methods appropriate for ordinal data. Because the underlying theory specifies that general locus of control is a single (albeit conceptually broad) construct (Rotter, 1975, 1990), we planned to identify a unidimensional subset of ICI items and then evaluate the construct validity of the original and revised versions of the ICI by comparing their correlation patterns with measures of related constructs.

Chapter 2
METHOD

Participants and Procedure

Data were gathered at California State University, Sacramento (CSUS) in two separate studies. Study 1 was conducted during the Fall 2010 semester and Study 2 was conducted during the Spring 2011 semester. Both studies incorporated the ICI, but no other variables overlapped. Undergraduate students who were enrolled in introductory psychology courses at CSUS (N = 631) completed packets of inventories for course credit.
Each participant packet in each study contained a demographics sheet followed by the inventories. Inventories were presented in a different random order for each participant. Studies 1 and 2 included 310 and 321 participants, respectively. Participant demographic data for the two studies are provided in Table 1. In both samples, participants were overwhelmingly female (about 80%), and both samples were ethnically diverse. Average ages were 20.6 years (SD = 4.4, range 17-57) in Study 1 and 20.5 years (SD = 3.6, range 16-50) in Study 2.

Table 1
Participant Demographic Data

                            Study 1             Study 2
                        Frequency    %      Frequency    %
Sex
  Female                   240     77.4        261     81.3
  Male                      68     21.9         59     18.4
  Not Reported               2       .6          0       .0
Ethnicity
  American Indian            2       .6          3       .9
  African American          20      6.5         21      6.5
  Asian American            79     25.5         64     19.9
  European American        114     36.8        114     35.5
  Hispanic/Latino           43     13.9         57     17.8
  Pacific Islander          10      3.2         14      4.4
  Mixed Ethnicities         35     11.3         31      9.7
  Other                      6      1.9         15      5.0
  Not Reported               1       .3          1       .3

Materials

ICI

The ICI was administered during both studies. The 28 items of the ICI are provided in Table 2. Participants utilize a 5-point response scale (1 = rarely, 2 = occasionally, 3 = sometimes, 4 = frequently, 5 = usually) to respond to the items. After recoding the 14 reverse scored items (indicated in Table 2), a mean or summed score is computed on the 28 items; a single score represents individuals' standing, with higher scores indicating greater levels of internal locus of control (see the scoring sketch below). Values of coefficient α were similar for the two studies (Study 1, α = .85; Study 2, α = .81).
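The following minimal sketch illustrates the scoring procedure just described. The function name is ours, and the recode (6 minus the raw response) simply mirrors the 1-5 response scale; the reverse-scored item numbers follow the markers in Table 2 below.

```python
import numpy as np

# 1-based numbers of the 14 reverse-scored ICI items (marked "a" in Table 2).
REVERSED = {1, 2, 4, 6, 8, 11, 14, 17, 19, 22, 23, 24, 26, 27}

def score_ici(responses, use_mean=True):
    """Score a respondents-by-28 matrix of 1-5 ICI responses.

    Reverse-scored items are recoded as 6 - response before the
    mean (or sum) is taken across the 28 items.
    """
    recoded = responses.astype(float).copy()
    for item in REVERSED:
        recoded[:, item - 1] = 6 - recoded[:, item - 1]
    return recoded.mean(axis=1) if use_mean else recoded.sum(axis=1)

# A hypothetical respondent answering "usually" (5) to every item:
print(score_ici(np.full((1, 28), 5)))  # reversed items become 1s -> mean 3.0
```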
Table 2
ICI Items

Item  Content
1     When faced with a problem I try to forget it.a
2     I need frequent encouragement from others to keep working at a difficult task.a
3     I like jobs where I can make decisions and be responsible for my own work.
4     I change my opinion when someone I admire disagrees with me.a
5     If I want something I work hard to get it.
6     I prefer to learn the facts about something from someone else rather than having to dig them out myself.a
7     I will accept jobs that require me to supervise others.
8     I have a hard time saying "no" when someone tries to sell me something.a
9     I like to have a say in any decisions made by any group I'm in.
10    I consider the different sides of an issue before making any decisions.
11    What other people think has a great influence on my behavior.a
12    Whenever something good happens to me I feel it is because I earned it.
13    I enjoy being in a position of leadership.
14    I need someone else to praise my work before I am satisfied with what I've done.a
15    I am sure enough of my opinions to try to influence others.
16    When something is going to affect me I learn as much about it as I can.
17    I decide to do things on the spur of the moment.a
18    For me, knowing I've done something well is more important than being praised by someone else.
19    I let other people's demands keep me from doing things I want to do.a
20    I stick to my opinions when someone disagrees with me.
21    I do what I feel like doing, not what other people think I ought to do.
22    I get discouraged when doing something that takes a long time to achieve results.a
23    When part of a group I prefer to let other people make all the decisions.a
24    When I have a problem I follow the advice of friends or relatives.a
25    I enjoy trying to do difficult tasks more than I enjoy doing easy tasks.
26    I prefer situations where I can depend on someone else's ability rather than my own.a
27    Having someone important tell me I did a good job is more important to me than feeling I've done a good job.a
28    When I'm involved in something I try to find out all I can about what is going on, even when someone else is in charge.

a Item is reverse scored.

Study 1 Variables

Study 1 included measures designed to assess academic goals and related variables. The inventories relevant to locus of control were the State-Trait Anxiety Inventory - Form Y (STAI; Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983) and the Achievement Goal Questionnaire-Revised (AGQ-R; Elliot & Murayama, 2008). The STAI consists of two scales assessing state and trait anxiety, with higher scores indicating greater levels of anxiety. The trait anxiety scale (α = .92) was used with its 4-point response scale. The AGQ-R assesses four dimensions of goal orientation utilizing a 5-point response scale. The mastery approach (α = .75) and performance approach (α = .81) dimensions were of interest in the present study. A higher score on the mastery approach dimension is indicative of a learning orientation, whereas a higher score on the performance approach dimension is indicative of a performance orientation.

Study 2 Variables

Study 2 included inventories designed to assess psychological well-being and related constructs. Of these, the inventories most relevant to locus of control were the Meaning in Life Questionnaire (MLQ; Steger, Frazier, Oishi, & Kaler, 2006) and the Positive and Negative Affect Schedule - Expanded Form (PANAS-X; Watson & Clark, 1994), both utilizing 5-point response scales. The presence scale (α = .87) of the MLQ was used, with higher scores indicating greater levels of subjective meaning. The positive (α = .86) and negative (α = .87) affect scales of the PANAS-X were used, with higher scores indicating greater tendencies to experience positive and negative moods, respectively.

Software

Data were analyzed using PASW (SPSS) Statistics version 18.0, CEFA (Browne, Cudeck, Tateneni, & Mels, 2008), IRTPRO (Cai, du Toit, & Thissen, 2011), and LISREL (Jöreskog & Sörbom, 2004). SPSS was used to prepare data files and perform all CTT analyses. CEFA was used to generate polychoric correlations and to perform exploratory ordinal item factor analyses. In IRTPRO, maximum marginal likelihood estimation with an EM algorithm (Bock & Aitkin, 1981) was used to estimate IRT item and person parameters for unidimensional models. Respondent scores for unidimensional models were computed in IRTPRO using expected a posteriori (EAP) estimation. A Metropolis-Hastings Robbins-Monro algorithm was used to estimate multidimensional factor analytic and IRT model parameters in IRTPRO. LISREL was used to fit confirmatory ordinal factor models.

Chapter 3
RESULTS

Participants from the two studies were first compared on ICI scores to determine if the ICI data from the two studies could be combined. The following steps were then taken: the psychometric properties of the ICI were evaluated; the ICI was revised using IRT, resulting in the Internal Control Index-Revised (ICI-R); and the psychometric properties of the ICI-R were evaluated. Finally, the ICI and ICI-R were compared and correlated with other psychological constructs to provide evidence as to whether the revision changed the meaning of the latent construct (θ) measured by the inventory.
Comparison of ICI Data from the Two Studies

A one-way between-subjects analysis of variance was used to compare ICI mean scores from Study 1 and Study 2 to determine if the data could be combined. No statistically significant difference in ICI scores was found between the participants who completed Study 1, M = 3.66, 95% CI [3.61, 3.72], SD = .46, and those who completed Study 2, M = 3.62, 95% CI [3.58, 3.67], SD = .43, F(1, 596) = 1.30, p = .25, η² = .002. Cohen's d was found to be .09, suggesting that the data were not substantially influenced by context effects caused by their inclusion in different studies. Thus, the data from the two studies were combined into a sample of 631 for all of the remaining analyses of the ICI.

CTT Psychometric Properties of the ICI

The coefficient α of the scale using data from both studies combined was .83. CTT item statistics for the ICI are provided in Table 3, including item means, corrected item-total correlations, and the percentage of participants who endorsed each response option. Two items with particularly low corrected item-total correlations were identified; items 17 ("I decide to do things on the spur of the moment.") and 24 ("When I have a problem I follow the advice of friends or relatives.") had corrected item-total correlations of .04 and .12, respectively, and did not seem as relevant to locus of control as some other items on the scale; they were therefore removed from all subsequent ICI scale analyses. Item responses were generally evenly balanced across categories three through five. However, for most items, the first (M = 4.4%, SD = 3.4%) and second (M = 11.4%, SD = 7.0%) response options were selected relatively infrequently, and these two categories were therefore collapsed into a single category to facilitate item factor analyses and IRT analyses. The percentage of missing data by item was very low (NR ≤ 1.1%). However, 33 individuals declined to respond to at least one ICI item. Because the missing responses were so sparse and evenly distributed across items, we elected to simply use listwise deletion, resulting in a sample size of 598 for all subsequent analyses.
Table 3
CTT Item Statistics for the ICI

              Corrected       Response Option (% Endorsement)
Item    M     Item-Total r      1     2     3     4     5    NRa
1      3.86      .21           3.3   5.5  28.2  26.1  36.5   .3
2      3.56      .41           4.9  12.0  30.0  26.9  26.0   .2
3      4.04      .39           1.0   4.8  19.5  38.4  36.3   .2
4      3.97      .41           1.4   5.9  26.8  26.0  39.6   .3
5      4.40      .43            .3   2.2  10.3  30.9  55.8   .5
6      3.38      .32           6.3  14.9  33.3  25.8  19.3   .3
7      3.72      .39           4.6  10.8  23.1  30.9  30.0   .6
8      3.64      .28           8.7  12.8  19.5  23.5  35.0   .5
9      3.92      .41           2.1   8.4  20.3  35.7  33.3   .3
10     4.10      .35            .8   4.4  17.6  37.4  39.6   .2
11     3.27      .43           6.8  18.2  32.3  26.0  16.2   .5
12     3.91      .25           2.2   5.1  22.5  39.1  30.9   .2
13     3.44      .43           7.6  14.7  27.6  25.5  24.4   .2
14     3.51      .31           6.0  14.3  25.4  29.6  24.2   .5
15     3.37      .34           4.8  12.5  36.8  33.0  12.7   .3
16     3.88      .39           1.6   8.1  22.5  36.1  31.4   .3
17     2.97      .04          12.4  21.6  33.4  21.1  10.5  1.1
18     3.90      .38           2.1   7.0  23.8  32.8  34.1   .3
19     3.82      .41           2.7   8.9  24.4  31.2  32.6   .2
20     3.86      .50           1.1   7.9  23.3  41.0  26.3   .3
21     3.70      .42           2.2   9.2  27.1  37.4  23.5   .6
22     3.21      .34           6.5  17.6  35.8  26.6  12.8   .6
23     3.80      .53           1.6   8.4  27.1  34.5  28.1   .3
24     2.57      .12          13.3  37.1  32.8  11.6   4.6   .6
25     3.12      .34           6.8  16.6  43.7  21.6  10.8   .5
26     3.94      .42           2.1   5.7  23.3  35.2  33.4   .3
27     3.53      .38           6.7  14.3  24.7  28.4  25.5   .5
28     3.63      .35           2.9  11.1  30.0  33.8  22.2   .2

Min.                            .3   2.2  10.3  11.6   4.6
Max.                          13.3  37.1  43.7  41.0  55.8
M                              4.4  11.4  26.6  30.2  27.0
SD                             3.4   7.0   6.8   6.6  11.0

a No response (missing data).

Dimensionality Assessment of the ICI

Exploratory item factor analyses were conducted on the 26 remaining items to investigate the dimensional structure of the data. First, a matrix of polychoric correlations was computed in CEFA. Upon examination of the correlation matrix, it was evident that several reverse scored items did not appear to be correlated at all with some of the other items on the inventory. Six eigenvalues were greater than one (6.4, 2.8, 1.6, 1.4, 1.2, & 1.1, respectively). Exploratory item factor analyses were performed on the polychoric correlations in CEFA using ordinary least squares extraction and oblique quartimax rotation. Oblique quartimax rotation is equivalent to direct quartimin rotation (Browne, 2001) and has been endorsed by several prominent factor analysts (Browne, 2001; Edwards, 2009; Preacher & MacCallum, 2003). For the ICI, one, two, and three dimensional factor solutions were explored. Factor loadings (pattern coefficients) of .40 or larger were considered meaningful. Residual correlations were examined to determine if the factor solutions accounted reasonably well for the observed inter-item covariation. As suggested by Morizot et al. (2007), residual correlations of .20 or greater were flagged as evidence of a factor solution that did not adequately account for the structure of the data.

Extraction of a single factor resulted in seven items (1, 6, 8, 12, 14, 22, & 27) not loading meaningfully on the factor and 16 large residual correlations (r ≥ .20), suggesting an inadequate solution. The two factor solution consisted primarily of straightforwardly worded items loading onto Factor 1 and reverse scored items loading onto Factor 2. The two factors were moderately intercorrelated (r = .32). However, five items (1, 6, 18, 20, & 23) did not load clearly on a factor and four large residual correlations remained. In the three factor solution, eight items (1, 5, 6, 12, 20, 21, 23, & 25) did not load meaningfully onto a factor and the factors themselves did not appear to be particularly interpretable, although only one large residual correlation remained. The two and three factor solutions did not correspond well to the two factor orthogonal principal axis factor solution reported by Duttweiler (1984) or the three factor orthogonal principal components solution reported by Meyers and Wong (1988). It is unclear whether this was due to differences in the data analytic methods used, sampling error, or the unclear factor structure of the ICI. Regardless, we determined that the ICI was in need of revision in order to clarify its factor structure.
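The residual-correlation check described above can be sketched in a few lines: the model-implied correlations (ΛΦΛ' for pattern matrix Λ and factor correlation matrix Φ) are subtracted from the observed polychoric correlations, and any residual of .20 or greater is flagged. The loadings and correlations below are hypothetical, not values from our analyses.

```python
import numpy as np

def residual_correlations(R, loadings, phi=None):
    """Observed minus model-implied correlations for a factor solution.

    R: observed (e.g., polychoric) correlation matrix.
    loadings: items-by-factors pattern matrix.
    phi: factor correlation matrix (identity if omitted).
    """
    if phi is None:
        phi = np.eye(loadings.shape[1])
    implied = loadings @ phi @ loadings.T
    resid = R - implied
    np.fill_diagonal(resid, 0.0)  # the diagonal is not of interest
    return resid

# Hypothetical 3-item, one-factor example.
R = np.array([[1.00, .45, .30],
              [ .45, 1.00, .35],
              [ .30, .35, 1.00]])
lam = np.array([[.70], [.65], [.50]])
resid = residual_correlations(R, lam)
print(np.abs(resid).max() >= .20)  # the flag rule used above; False here
```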
IRT Psychometric Properties of the ICI

The GRM was fit to the 26 items. The M2 goodness-of-fit statistic (Maydeu-Olivares & Joe, 2005; Cai, Maydeu-Olivares, Coffman, & Thissen, 2006) is a χ2 distributed statistic and was used to assess overall IRT model-data fit. Smaller values of M2 indicate better model-data fit, but some misfit is generally expected when strong parametric IRT models such as the GRM are applied to real data (Cai et al., 2011). For this reason, the M2 statistic was supplemented by the root mean square error of approximation (RMSEA). This index is not as sensitive to sample size or over-parameterization as χ2 distributed statistics (Browne & Cudeck, 1993) and is routinely used to assess the fit of structural equation models. Maydeu-Olivares, Cai, and Hernández (2011) demonstrated that this index tends to yield similar interpretations whether a confirmatory factor analysis or an IRT model is applied. Specific cutoff values for interpreting the RMSEA are not fully agreed upon; however, we chose to use a cutoff of .06 based on the work of Hu and Bentler (1999). Thus, values less than .06 indicate acceptable fit and larger values indicate misfit. Some misfit was indicated for the reduced 26-item version of the ICI, M2(247) = 1349.98, p < .001, RMSEA = .09.

Trace line fit and local independence were also assessed. The functional form assumption was evaluated using the S-X2 statistic (Orlando & Thissen, 2000, 2003) and a nominal alpha level of .05. For each item, the S-X2 statistic summarizes the relationship between the trace line(s) predicted by the model and the empirical trace line(s) based on summed scores. A statistically significant value indicates item-model misfit because the item response function predicted by the model differs from the pattern observed in the data. Values of S-X2 indicated poor trace line fit for six ICI items. The local item independence assumption was evaluated using a G2-based local dependence diagnostic statistic (herein referred to as LD) developed by Chen and Thissen (1997). Values greater than 10 on this index indicate substantial local item dependence (Cai et al., 2011). A total of 325 item pairs were checked for LD. Six pairs (2 & 14, 4 & 11, 7 & 13, 14 & 18, 14 & 27, 16 & 28) demonstrated local dependence (LD values > 10).

The IRT item parameters for the 26-item version of the ICI are provided in Table 4. According to de Ayala (2009), good values for item slope (a) parameters range from approximately .8 to 2.5. According to DeMars (2010), item location (b) parameters for useful items range from approximately -2 to 2. The ICI a parameter values ranged from .43 to 1.47 and the b parameter values ranged from -5.72 to 3.33. Based on de Ayala's guideline, some of the a parameter values were quite low, indicating that those items likely do not assess the latent construct (θ) underlying ICI responses as well as the items with steeper slopes do. However, 18 of the 26 items had a parameter values greater than .80. Additionally, the first location parameter value (b1) for item 1 was particularly low. Location parameter values for the remaining items were generally reasonable.
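As a concrete illustration of the overall fit assessment above, the RMSEA can be computed directly from a χ2 distributed statistic such as M2, its degrees of freedom, and the sample size. The sketch below uses one common definition (some programs divide by N rather than N - 1, a negligible difference at this sample size) and reproduces the value reported above.

```python
import math

def rmsea(stat, df, n):
    """RMSEA for a chi-square distributed fit statistic such as M2."""
    return math.sqrt(max((stat - df) / (df * (n - 1)), 0.0))

# Fit of the 26-item ICI reported above (N = 598 after listwise deletion):
print(round(rmsea(1349.98, 247, 598), 2))  # 0.09
```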
Table 4

IRT Item Parameter Estimates for the 26-Item Version of the ICI

Item    a (SE)         b1 (SE)          b2 (SE)         b3 (SE)
1        .43 (.09)    -5.72 (1.14)     -1.27 (.31)      1.41 (.34)
2        .74 (.09)    -2.38 (.30)       -.21 (.12)      1.59 (.23)
3       1.18 (.12)    -2.84 (.26)      -1.15 (.12)       .59 (.10)
4        .84 (.10)    -3.35 (.39)       -.95 (.14)       .54 (.13)
5       1.24 (.13)    -3.50 (.35)      -1.91 (.17)      -.26 (.08)
6        .67 (.09)    -2.19 (.30)        .25 (.13)      2.27 (.32)
7       1.12 (.11)    -1.88 (.18)       -.53 (.09)       .90 (.12)
8        .53 (.09)    -2.60 (.44)       -.75 (.19)      1.19 (.25)
9       1.22 (.12)    -2.24 (.20)       -.92 (.10)       .71 (.10)
10       .95 (.11)    -3.46 (.37)      -1.50 (.17)       .52 (.12)
11       .89 (.10)    -1.43 (.17)        .35 (.11)      2.07 (.24)
12       .73 (.09)    -3.75 (.48)      -1.34 (.19)      1.22 (.19)
13      1.21 (.12)    -1.31 (.13)       -.02 (.08)      1.18 (.13)
14       .53 (.09)    -2.76 (.46)       -.31 (.17)      2.32 (.40)
15       .97 (.10)    -1.91 (.20)        .14 (.10)      2.24 (.23)
16      1.10 (.11)    -2.41 (.23)       -.90 (.11)       .85 (.12)
18       .88 (.10)    -2.97 (.33)       -.98 (.14)       .86 (.14)
19       .85 (.10)    -2.68 (.30)       -.79 (.13)       .97 (.15)
20      1.47 (.13)    -2.17 (.17)       -.74 (.08)       .92 (.10)
21      1.13 (.11)    -2.14 (.20)       -.52 (.09)      1.24 (.14)
22       .62 (.09)    -2.00 (.30)        .70 (.17)      3.33 (.48)
23      1.31 (.12)    -2.11 (.18)       -.55 (.08)       .90 (.11)
25       .85 (.10)    -1.55 (.19)        .94 (.14)      2.72 (.31)
26       .98 (.11)    -2.97 (.31)      -1.01 (.13)       .82 (.13)
27       .68 (.09)    -2.18 (.30)       -.34 (.13)      1.67 (.25)
28       .92 (.10)    -2.33 (.25)       -.41 (.10)      1.55 (.19)

IRT-Based Revision of the ICI

The IRT portion of the revision process was accomplished iteratively. After the preliminary analysis, the eight items with a parameters below .80 were deleted based on de Ayala's (2009) guideline. Additionally, only one item from each locally dependent pair was retained because locally dependent items are, to some extent, psychometrically redundant (Thissen & Steinberg, 2010) and LD can bias slope parameter estimation (Chen & Thissen, 1997; Yen, 1993). The item retained from each pair was the one that appeared to be more clearly written and subjectively more relevant to the core construct. Three of our research assistants, all of whom had a basic understanding of locus of control theory, helped the author to make these judgments. In some cases an item had both substantial LD and a low slope (two items after the preliminary analysis), but this was generally not the case. The items with LD were judged to be very similar in meaning. Items were iteratively pruned to remove those with low slopes (a < .80) and to reduce LD until only the 11 items of the ICI-R remained.

Dimensionality Assessment of the ICI-R

A dimensionality assessment was conducted on the 11-item ICI-R, involving a variety of exploratory and confirmatory factor analyses. A matrix of polychoric correlations was computed in CEFA, and the first four eigenvalues were 4.0, 1.2, 1.0, and .83, respectively. One, two, and three dimensional exploratory solutions were considered. Factor loadings for the one dimensional solution ranged from .41 to .66 and all residual correlations were less than .20, suggesting that a unidimensional model was plausible. The two dimensional solution resulted in two highly correlated (r = .58) but interpretable factors: Factor 1 represented independence (items 5, 10, 16, 18, 21, & 25) and Factor 2 represented leadership (items 3, 9, 13, 15, & 23). The three dimensional solution was not interpreted because it was not clearly structured. A two dimensional full information exploratory factor analysis conducted in IRTPRO corroborated the two dimensional factor structure from the CEFA solution, with similar factor loadings and a correlation of .57 between the two factors.
Confirmatory factor analyses were conducted in LISREL to evaluate the unidimensional and two dimensional solutions. Diagonally weighted least squares estimation was used on a matrix of polychoric correlations, as recommended by Wirth and Edwards (2007). We planned to evaluate the fit of the unidimensional solution using the RMSEA, the comparative fit index (CFI), the goodness of fit index (GFI), and the root mean square residual (RMSR). Planned cutoffs for judging fit were based on the recommendations of Hu and Bentler (1999), who suggested that acceptable models have RMSEA < .06, CFI > .95, GFI > .95, and RMSR < .08. However, for all tested models the RMSEA was estimated to be 0 and the CFI was estimated to be 1.0, suggesting that these statistics may not have been computed correctly. As a result, only the GFI and RMSR values were interpreted. The fit of the unidimensional model was acceptable, GFI = .95, RMSR = .07. The fit of the two dimensional solution was also acceptable, GFI = .97, RMSR = .05, and the two factors were highly correlated, r = .73.

Because the use of such fit indices is controversial (e.g., Cook et al., 2009; Morizot et al., 2007), bifactor analysis was used to further explore the viability of the unidimensional model. The methods used to evaluate the bifactor analytic results were based primarily on the work of Steve Reise and his colleagues (Reise, under review; Reise et al., 2010). CEFA was used to conduct an exploratory bifactor analysis with an orthogonal target rotation. Target rotation requires the researcher to specify which items are expected to load onto which factors, as is done in a confirmatory factor analysis. The difference is that only on-factor loadings are estimated in a confirmatory solution, whereas both on-factor and off-factor loadings are estimated in a target rotated exploratory solution. Exploratory target rotation thus enables the researcher to identify any substantial off-factor loadings that may cause problems in a confirmatory solution. In the present case, the target matrix specified that all items would load onto the general factor and one of two group factors. The group factors were based on the two dimensional exploratory solution, with the factors representing independence and leadership, respectively. No substantial off-factor loadings were identified (the largest equaled .14). Items generally loaded more strongly onto the general factor than onto their respective group factors, suggesting the presence of a dominant general factor. LISREL was used to evaluate the bifactor model and the fit was acceptable, GFI = .97, RMSR = .04. The results of this analysis were used to compute Sijtsma's (2009) unidimensionality index to determine the proportion of common variance accounted for by the general factor. The obtained value of .68 suggested that more than two thirds of the common variance could be attributed to the general factor, whereas the remainder of the common variance was divided across the two group factors. There is no recommended cutoff for using this statistic to determine if a dataset is unidimensional enough for IRT, but the obtained value provided further evidence of a strong general factor.

Coefficient omega hierarchical (ωH; McDonald, 1999) is another somewhat controversial statistic that can be used to evaluate bifactor models and the viability of dividing items into subscales (Reise, under review). Given an orthogonal solution, ωH represents the proportion of the total item variance that is uniquely accounted for by a given factor.
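Because both ωH and the general factor's share of common variance are computed directly from the orthogonal bifactor loadings, a small sketch may make the definitions concrete. The version below is illustrative only: it assumes standardized items and an orthogonal solution, the function name is ours, and it computes the indices for the general factor (the analogous ratio with a group factor's summed loadings in the numerator gives that group factor's ωH).

```python
import numpy as np

def bifactor_indices(general, groups):
    """Omega hierarchical for the general factor and the proportion of
    common variance it accounts for, given orthogonal bifactor loadings.

    general : (k,) general-factor loadings
    groups  : list of (k,) group-factor loading vectors, with zeros for
              items that do not load on a given group factor
    """
    general = np.asarray(general, float)
    groups = [np.asarray(g, float) for g in groups]
    communality = general**2 + sum(g**2 for g in groups)
    uniqueness = (1 - communality).sum()
    total_var = general.sum()**2 + sum(g.sum()**2 for g in groups) + uniqueness
    omega_h = general.sum()**2 / total_var            # general-factor saturation
    common_share = (general**2).sum() / communality.sum()  # share of common variance
    return omega_h, common_share
```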
Values of ωH for the general, independence, and leadership factors were .51, .10, and .18, respectively, suggesting that the general factor accounted for about half of the total variance and that the group factors contributed little unique information. Based on these values, the general factor was judged to be only somewhat reliable, whereas the group factors were not at all reliable after controlling for the influence of the general factor.

Full information confirmatory factor analyses were conducted in IRTPRO. The unidimensional and bifactor solutions were compared to determine the degree of distortion that the secondary dimensions might cause if the data were forced into a unidimensional IRT model. The factor loadings in the unidimensional solution were extremely similar to the general factor loadings in the bifactor solution, indicating that fitting a unidimensional IRT model to these data would not be expected to produce substantial distortion of the IRT slope parameter estimates. The results of the various dimensionality assessment methods collectively suggested that the data were unidimensional enough for interpretation of the IRT model.

Psychometric Properties of the ICI-R

The IRT analysis was conducted on the 11-item ICI-R that remained after the revisions, and the GRM demonstrated a good fit, M2(484) = 823.34, p < .001, RMSEA = .03. The S-X2 statistics indicated no significant trace line misfit, and the local independence criterion (LD values < 10) was met for all item pairs. The IRT item parameters are provided in Table 5. The slope (a) parameter values ranged from .81 to 1.57 and the location (b) parameter values ranged from -3.52 to 2.69. Thus, the a parameter values were all within the range suggested by de Ayala (2009). Although some of the b parameter values fell outside the range recommended by DeMars (2010), they were not considered so extreme as to be problematic.

Table 5

IRT Item Parameter Estimates for the 11-Item ICI-R

Item    a (SE)         b1 (SE)         b2 (SE)         b3 (SE)
3       1.31 (.13)    -2.66 (.24)     -1.07 (.11)       .56 (.09)
5       1.23 (.14)    -3.52 (.37)     -1.93 (.18)      -.26 (.08)
9       1.57 (.15)    -1.92 (.15)      -.80 (.08)       .62 (.08)
10      1.02 (.12)    -3.27 (.35)     -1.42 (.16)       .51 (.11)
13      1.41 (.13)    -1.20 (.11)      -.02 (.07)      1.09 (.11)
15      1.29 (.12)    -1.57 (.14)       .13 (.08)      1.86 (.17)
16      1.24 (.13)    -2.22 (.20)      -.82 (.10)       .79 (.11)
18       .81 (.10)    -3.17 (.38)     -1.03 (.15)       .91 (.15)
21       .94 (.11)    -2.45 (.26)      -.57 (.11)      1.43 (.17)
23      1.07 (.11)    -2.42 (.24)      -.63 (.10)      1.03 (.13)
25       .86 (.10)    -1.54 (.19)       .92 (.14)      2.69 (.31)

The IRT marginal reliability was .80 for the θ (EAP) scores. The test information and standard error curves for the ICI-R are provided in the upper and lower panels of Figure 5, respectively. Information is highest and error is lowest from a θ of approximately -2 to 1, indicating that scores on the ICI-R most precisely differentiate among individuals with locus of control values within this range. Less information is available as θ increases, but the test does provide some information at the upper end of the θ continuum.

Figure 5. Test information and standard error curves for the ICI-R. The upper and lower panels indicate the amount of information and error provided by the test at each level of locus of control (θ), respectively.
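The curves in Figure 5 follow directly from the GRM parameters in Table 5. The following is a minimal sketch of that computation in Python (not IRTPRO, which was used in the thesis), assuming the parameters are on the logistic metric as is typical for IRTPRO output; the function and variable names are ours.

```python
import numpy as np

def grm_item_information(theta, a, bs):
    """Fisher information for one graded response model item.

    theta : grid of latent trait values
    a     : slope parameter
    bs    : ordered location parameters (b1, b2, b3)
    """
    theta = np.asarray(theta, dtype=float)
    # Boundary response curves P*(X >= k | theta), with P*_0 = 1 and P*_m = 0.
    pstar = np.vstack([np.ones_like(theta)]
                      + [1 / (1 + np.exp(-a * (theta - b))) for b in bs]
                      + [np.zeros_like(theta)])
    probs = pstar[:-1] - pstar[1:]          # category response curves
    dpstar = a * pstar * (1 - pstar)        # derivatives of boundary curves
    dprobs = dpstar[:-1] - dpstar[1:]
    return (dprobs ** 2 / probs).sum(axis=0)

# Parameters from Table 5 as (item, a, b1, b2, b3):
params = [
    (3, 1.31, -2.66, -1.07, .56),  (5, 1.23, -3.52, -1.93, -.26),
    (9, 1.57, -1.92, -.80, .62),   (10, 1.02, -3.27, -1.42, .51),
    (13, 1.41, -1.20, -.02, 1.09), (15, 1.29, -1.57, .13, 1.86),
    (16, 1.24, -2.22, -.82, .79),  (18, .81, -3.17, -1.03, .91),
    (21, .94, -2.45, -.57, 1.43),  (23, 1.07, -2.42, -.63, 1.03),
    (25, .86, -1.54, .92, 2.69),
]
theta = np.linspace(-3, 3, 121)
test_info = sum(grm_item_information(theta, a, (b1, b2, b3))
                for _, a, b1, b2, b3 in params)   # upper panel of Figure 5
standard_error = 1 / np.sqrt(test_info)           # lower panel of Figure 5
```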
The test characteristic curve for the ICI-R is presented in Figure 6. Possible summed scores on the ICI-R range from 11 to 44. As expected, predicted summed scores increase as θ increases, and the function is very nearly linear across the θ scale. A Pearson correlation coefficient was computed to compare estimates of θ with observed summed scores on the ICI-R. Because summed scores were nearly perfectly linearly related to IRT estimates of θ, r(596) = .997, p < .001, they were determined to be a good approximation of the locus of control construct as measured by the 11-item ICI-R scale. However, because IRT estimates of θ are on an equal interval scale and this is not necessarily true for summed scores, θ values were linked to CTT summed scores.

Figure 6. Test characteristic curve for the ICI-R. This function indicates the predicted summed score for each level of locus of control (θ).

Table 6 presents the values of the IRT estimates of locus of control (θ) corresponding to each possible summed score. This table can be used to score the ICI-R on an equal interval scale. The marginal reliability of the scale score to summed score conversion was .79, suggesting that estimates of θ for the ICI-R can be reliably converted to CTT summed scores.

Table 6

Predicted Summed Score to Scale Score Conversion for the ICI-R

Pred. Summed Score      θ       SD(θ)    Modeled Proportion
11                    -3.36     .55      -
12                    -3.08     .53      -
13                    -2.84     .51      -
14                    -2.63     .49      -
15                    -2.44     .48      -
16                    -2.26     .47      -
17                    -2.08     .47      .01
18                    -1.92     .46      .01
19                    -1.76     .46      .01
20                    -1.61     .45      .01
21                    -1.46     .45      .02
22                    -1.31     .45      .02
23                    -1.17     .44      .03
24                    -1.03     .44      .03
25                     -.89     .44      .04
26                     -.75     .44      .04
27                     -.61     .44      .05
28                     -.47     .44      .05
29                     -.33     .44      .06
30                     -.19     .45      .06
31                     -.05     .45      .06
32                      .10     .45      .06
33                      .24     .45      .06
34                      .39     .46      .06
35                      .55     .46      .06
36                      .70     .47      .05
37                      .87     .47      .05
38                     1.04     .48      .04
39                     1.23     .49      .03
40                     1.42     .51      .03
41                     1.63     .52      .02
42                     1.86     .54      .01
43                     2.11     .56      .01
44                     2.43     .60      -

Note. A dash (-) indicates a value less than 1%.

Comparison of the ICI and ICI-R

The coefficient α of .78 for the 11-item ICI-R was reasonably comparable to the .83 value for the original scale, given the removal of nearly two thirds of the items. Additionally, summed scores under the original and new scoring strategies correlated highly, r(596) = .82, p < .001. Table 7 contains Pearson correlations indexing the relationships of the ICI and ICI-R with trait anxiety, learning orientation, performance orientation, subjective meaning, positive affect, and negative affect. The same set of correlations was computed on θ scores for the ICI-R, but the resulting coefficients were virtually identical (to the third decimal place) to those for ICI-R summed scores, so they are omitted here. Table 7 also provides Cohen's (1988) q statistic. To compute q, the two correlation coefficients are transformed to z scores and the absolute value of their difference is taken; the resulting value represents the distance between the two correlation coefficients on the z metric. As was done in Kastner, Sellbom, and Lilienfeld (2012), the q values were evaluated using Cohen's (1988) guidelines for interpreting Pearson r values; that is, values of .10, .30, and .50 were interpreted as small, medium, and large effects, respectively.
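The z transformation referenced here is Fisher's r-to-z transformation (arctanh), so q can be computed in a single line. The brief sketch below reproduces the trait anxiety value reported in Table 7 below; the function name is ours.

```python
import numpy as np

def cohens_q(r1, r2):
    """Cohen's q: absolute difference between Fisher z transformed rs."""
    return abs(np.arctanh(r1) - np.arctanh(r2))

# Trait anxiety correlations for the ICI and ICI-R (Table 7):
print(round(cohens_q(-.49, -.33), 2))  # 0.19
```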
The patterns of correlations for the ICI and ICI-R were generally similar; however, when considered together with the correlation coefficients, the q values indicated that ICI-R scores were less related to trait anxiety (q = .19) and negative affect (q = .17) than were ICI scores.

Table 7

Correlations of the ICI and ICI-R with Other Constructs

                                 ICI                      ICI-R
                           r       95% CI           r       95% CI         q
Study 1 variables a
  Trait Anxiety          -.49***  [-.57, -.40]    -.33***  [-.43, -.23]   .19
  Learning Orientation    .25***  [.14, .35]       .29***  [.18, .39]     .04
  Performance
    Orientation           .02     [-.09, .13]      .09     [-.02, .20]    .07
Study 2 variables b
  Subjective Meaning      .37***  [.27, .46]       .38***  [.28, .47]     .01
  Negative Affect        -.38***  [-.47, -.28]    -.23***  [-.34, -.12]   .17
  Positive Affect         .47***  [.38, .55]       .52***  [.43, .60]     .07

Note. 95% CI = 95% confidence interval.
a N = 296. b N = 295. ***p < .001.

Chapter 4

DISCUSSION

This study utilized a variety of psychometric methods, most of which are appropriate for ordered categorical data, to evaluate and revise the ICI, which was designed to assess locus of control in adults. The revisions, which resulted in the 11-item ICI-R, were guided by psychometric principles and locus of control theory (Rotter, 1966, 1990), the latter of which suggests that locus of control is a broadly defined individual difference trait with diverse behavioral indicators. Thus, throughout the revision process, efforts were made to clarify the dimensional structure of the inventory without needlessly narrowing the conceptual breadth of the underlying latent trait (θ) assessed by the inventory.

The traditional ICI response scale consists of five ordered categories. The first response option was generally not well utilized by our respondents, so we combined the first two response categories to maintain a consistent response metric across all of the items; thus, we recommend that future researchers adopt a four-point response scale for the inventory (1 = rarely/occasionally, 2 = sometimes, 3 = frequently, 4 = usually).

Our exploratory item factor analyses suggested an unclear dimensional structure for the inventory. This lack of clarity was consistent with the findings of previous researchers who conducted exploratory dimensionality assessments on the inventory (Duttweiler, 1984; Meyers & Wong, 1988). Based on our exploratory item factor analyses and our preliminary IRT analysis, the multidimensionality in the data appeared to be due to some items not loading on the latent trait (θ), a possible method effect due to item scoring (straightforwardly worded vs. reverse scored), and some items that were so semantically similar that they formed nuisance dimensions more conceptually narrow than desired, given that locus of control (Rotter, 1966, 1990) is a theoretically broad construct. Based on these observations, many ICI items were removed from the inventory, resulting in the 11-item ICI-R. We judged some of the removed items to be somewhat awkwardly written (e.g., "When I'm involved in something I try to find out all I can about what is going on, even when someone else is in charge.") or possibly even confusing to participants (e.g., "When faced with a problem I try to forget it."). The dimensionality assessment of the 11-item ICI-R suggested that a unidimensional model was acceptable, meaning that the results of the IRT analysis can be meaningfully interpreted and that a single scale score for the ICI-R is an appropriate scoring strategy.
In order to evaluate the construct validity of the ICI-R, scores on the ICI and the ICI-R were correlated with other constructs that were expected to be related to locus of control. The ICI and ICI-R demonstrated similar patterns of correlations with measures of learning orientation, performance orientation, and subjective meaning. These coefficients were similar to previously reported coefficients relating locus of control to learning and performance orientation (Heintz & Steele-Johnson, 2004; Phillips & Gully, 1997) and subjective meaning (Zika & Chamberlain, 1987) in college student samples. However, in comparison to ICI scores, ICI-R scores were not as strongly related to trait anxiety or negative affect. Based on the findings of Archer (1979), the relationship between trait anxiety and the ICI-R was more consistent with previous findings than was the anxiety-ICI relationship. In terms of affect, the results are less clearly interpretable, as empirical relationships between locus of control and dimensions of affect have varied in the research literature (e.g., Christopher, Saliba, & Deadmarsh, 2009; Emmons & Diener, 1985).

Several limitations of this study may attenuate the generalizability of the results. The revisions that resulted in the ICI-R were inherently exploratory and based on only two very similar samples of students at a large California university. Additionally, although we generated an a priori hypothesis that the psychometric properties of the ICI could be improved by removing some items, specific decisions regarding which items to remove were admittedly post hoc. Although efforts were made throughout the revision process to retain as much of the content of the ICI as possible, the revisions were guided primarily by statistical analyses, and thus some important content may have been lost. The ICI-R would benefit from an independent content review by an expert on locus of control to determine whether the construct assessed is consistent with locus of control theory. In terms of the psychometric properties of the ICI-R, future research is needed to determine whether the dimensionality of the ICI-R and the results of the IRT analysis hold in an independent sample (i.e., cross-validation is needed). Additionally, the ICI-R items have not been screened for differential item functioning by sex or ethnicity; it is therefore unclear whether the results of the psychometric analyses apply equally to members of different sex or ethnic groups. Future research may shed light on this issue.

REFERENCES

Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22, 37-53.

Algina, J., & Penfield, R. D. (2009). Classical test theory. In R. Millsap & A. Maydeu-Olivares (Eds.), The Sage handbook of quantitative methods in psychology (pp. 93-122). Thousand Oaks, CA: Sage.

Archer, R. P. (1979). Relationship between locus of control and anxiety. Journal of Personality Assessment, 43, 617-626.

Bentler, P. M. (2009). Alpha, dimension-free, and model-based internal consistency reliability. Psychometrika, 74, 137-143.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. New York, NY: Cambridge University Press.

Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71, 425-440.

Browne, M. W. (2001). An overview of analytic rotation in exploratory factor analysis. Multivariate Behavioral Research, 36, 111-150.

Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage.

Browne, M. W., Cudeck, R., Tateneni, K., & Mels, G. (2008). CEFA: Comprehensive Exploratory Factor Analysis, Version 3.03 [Computer software and manual]. Retrieved from http://faculty.psy.ohio-state.edu/browne/

Cai, L., du Toit, S. H. C., & Thissen, D. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [Computer software and manual]. Chicago, IL: Scientific Software International.

Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited information goodness-of-fit testing of item response theory models for sparse 2p tables. British Journal of Mathematical and Statistical Psychology, 59, 173-194.

Chen, W.-H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265-289.

Chernyshenko, O. S., Stark, S., Chan, K.-Y., Drasgow, F., & Williams, B. (2001). Fitting item response theory models to two personality inventories: Issues and insights. Multivariate Behavioral Research, 36, 523-562.

Christopher, A. N., Saliba, L., & Deadmarsh, E. J. (2009). Materialism and well-being: The mediating effect of locus of control. Personality and Individual Differences, 46, 682-686.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cook, K. F., Kallen, M. A., & Amtmann, D. (2009). Having a fit: Impact of number of items and distribution of data on traditional criteria for assessing IRT's unidimensionality assumption. Quality of Life Research, 18, 447-460.

de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford.

DeMars, C. (2010). Item response theory. New York, NY: Oxford University Press.

DeNeve, K. M., & Cooper, H. (1998). The happy personality: A meta-analysis of 137 personality traits and subjective well-being. Psychological Bulletin, 124, 197-229.

Duttweiler, P. C. (1984). The Internal Control Index: A newly developed measure of locus of control. Educational and Psychological Measurement, 44, 209-221.

Edwards, M. C. (2009). An introduction to item response theory using the Need for Cognition Scale. Social and Personality Psychology Compass, 3, 507-529.

Elliot, A. J., & Murayama, K. (2008). On the measurement of achievement goals: Critique, illustration, and application. Journal of Educational Psychology, 100, 613-628.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Emmons, R. A., & Diener, E. (1985). Personality correlates of subjective well-being. Personality and Social Psychology Bulletin, 11, 89-97.

Furnham, A., & Steele, H. (1993). Measuring locus of control: A critique of general, children's, health- and work-related locus of control questionnaires. British Journal of Psychology, 84, 443-479.

Goodman, S. H., & Waters, L. K. (1987). Convergent validity of five locus of control scales. Educational and Psychological Measurement, 47, 743-747.
Gulliksen, H. (1950). Theory of mental tests. New York, NY: Wiley.

Gustafsson, J.-E., & Åberg-Bengtsson, L. (2010). Unidimensionality and interpretability of psychological instruments. In S. E. Embretson (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 123-144). Washington, DC: American Psychological Association.

Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65-110). United States: American Council on Education and Praeger Publishers.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Heintz, P., Jr., & Steele-Johnson, D. (2004). Clarifying the conceptual definitions of goal orientation dimensions: Competence, control, and evaluation. Organizational Analysis, 12, 5-19.

Horn, J. L. (2005). Neglected thinking about measurement models in behavioral science research. In A. Maydeu-Olivares & J. J. McArdle (Eds.), Contemporary psychometrics: A festschrift for Roderick P. McDonald (pp. 101-122). Mahwah, NJ: Lawrence Erlbaum.

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria vs. new alternatives. Structural Equation Modeling, 6, 1-55.

Jacobs, K. W. (1993). Psychometric properties of the Internal Control Index. Psychological Reports, 73, 251-255.

Jöreskog, K. G., & Sörbom, D. (2004). LISREL 8.7 for Windows [Computer software]. Lincolnwood, IL: Scientific Software International, Inc.

Kastner, R. M., Sellbom, M., & Lilienfeld, S. O. (2012). A comparison of the psychometric properties of the Psychopathic Personality Inventory full-length and short-form versions. Psychological Assessment, 24, 261-267.

Kormanik, M. B., & Rocco, T. S. (2009). Internal versus external control of reinforcement: A review of the locus of control construct. Human Resource Development Review, 8, 463-483.

Lefcourt, H. M. (1966). Internal versus external control of reinforcement: A review. Psychological Bulletin, 65, 206-220.

Lefcourt, H. M. (1982). Locus of control: Current trends in theory and research (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Lefcourt, H. M. (1992). Durability and impact of the locus of control construct. Psychological Bulletin, 112, 411-414.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Maltby, J., & Cope, C. D. (1996). Reliability estimates of the Internal Control Index among UK samples. Psychological Reports, 79, 595-598.

Maydeu-Olivares, A. (2005). Further empirical results on parametric versus non-parametric IRT modeling of Likert-type personality data. Multivariate Behavioral Research, 40, 261-279.

Maydeu-Olivares, A., Cai, L., & Hernández, A. (2011). Comparing the fit of item response theory and factor analysis models. Structural Equation Modeling, 18, 333-356.

Maydeu-Olivares, A., & Joe, H. (2005). Limited and full information estimation and testing in 2n contingency tables: A unified framework. Journal of the American Statistical Association, 100, 1009-1020.

McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates.

Meyers, L. S., & Wong, D. T. (1988). Validation of a new test of locus of control: The Internal Control Index. Educational and Psychological Measurement, 48, 753-761.

Morizot, J., Ainsworth, A. T., & Reise, S. P. (2007). Toward modern psychometrics: Application of item response theory models in personality research. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality (pp. 407-423). New York, NY: The Guilford Press.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). San Francisco, CA: McGraw-Hill.

Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50-64.

Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X2: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27, 289-298.

Phillips, J. M., & Gully, S. M. (1997). Role of goal orientation, ability, need for achievement, and locus of control in the self-efficacy and goal-setting process. Journal of Applied Psychology, 82, 792-802.

Preacher, K. J., & MacCallum, R. C. (2003). Repairing Tom Swift's electric factor analysis machine. Understanding Statistics, 2, 13-43.

Reise, S. P. (under review). The rebirth of bifactor measurement models.

Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92, 544-559.

Rotter, J. B. (1954). Social learning and clinical psychology. New York, NY: Prentice-Hall, Inc.

Rotter, J. B. (1966). Generalized expectancies for internal versus external control of reinforcement. Psychological Monographs: General and Applied, 80 (1, Whole No. 609).

Rotter, J. B. (1975). Some problems and misconceptions related to the construct of internal versus external control of reinforcement. Journal of Consulting and Clinical Psychology, 43, 56-67.

Rotter, J. B. (1990). Internal versus external control of reinforcement: A case history of a variable. American Psychologist, 45, 489-493.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, 17.

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer-Verlag.

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74, 107-120.

Spearman, C. (1904a). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.

Spearman, C. (1904b). "General intelligence" objectively determined and measured. American Journal of Psychology, 15, 201-292.

Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 18, 161-169.

Spearman, C. (1910). Correlation calculated with faulty data. British Journal of Psychology, 3, 271-295.

Spearman, C. (1913). Correlation of sums and differences. British Journal of Psychology, 5, 417-426.

Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y) ("Self-Evaluation Questionnaire"). Palo Alto, CA: Consulting Psychologists Press.

Steger, M. F., Frazier, P., Oishi, S., & Kaler, M. (2006). The Meaning in Life Questionnaire: Assessing the presence of and search for meaning in life. Journal of Counseling Psychology, 53, 80-93.

Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum Associates.
Thissen, D., & Steinberg, L. (2009). Item response theory. In R. Millsap & A. Maydeu-Olivares (Eds.), The Sage handbook of quantitative methods in psychology (pp. 148-177). Thousand Oaks, CA: Sage.

Thissen, D., & Steinberg, L. (2010). Using item response theory to disentangle constructs at different levels of generality. In S. E. Embretson (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 123-144). Washington, DC: American Psychological Association.

Thissen, D., & Wainer, H. (2001). Overview of test scoring. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 1-19). Mahwah, NJ: Lawrence Erlbaum Associates.

Thurstone, L. L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16, 433-449.

von Davier, M. (2010). Mixture distribution item response theory, latent class analysis, and diagnostic mixture models. In S. E. Embretson (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 11-34). Washington, DC: American Psychological Association.

Waller, N. G., & Reise, S. P. (2010). Measuring psychopathology with nonstandard item response theory models: Fitting the four-parameter model to the Minnesota Multiphasic Personality Inventory. In S. E. Embretson (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 147-173). Washington, DC: American Psychological Association.

Watson, D., & Clark, L. A. (1994). The PANAS-X: Manual for the Positive and Negative Affect Schedule – Expanded Form.

Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12, 58-79.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.

Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111-153). United States: American Council on Education and Praeger Publishers.

Zika, S., & Chamberlain, K. (1987). Relation of hassles and personality to subjective well-being. Journal of Personality and Social Psychology, 53, 155-162.

Zinbarg, R. E., Yovel, I., Revelle, W., & McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30, 121-144.