Measuring Individual Differences with Signal Detection Analysis: A Guide to Indices Based on Knowledge Ratings

Delroy L. Paulhus, University of British Columbia
William M. Petrusic, Carleton University

Draft Date: February 2007

ABSTRACT

The use of signal detection indices for measuring individual differences has been given limited attention. Criteria are laid out for selecting indices of individual accuracy and bias. The criteria include (a) theoretical cogency, (b) simplicity of calculation and comprehension, (c) structural location, and (d) convergence with external measures of accuracy and bias. The empirical criteria are evaluated in two studies of respondents asked to rate their familiarity with real and non-existent general knowledge items. Structural analyses supported the traditional separation of accuracy and bias indices. Construct validity criteria were also considered. In the knowledge domain, such indices represent ability and self-enhancement constructs. Although most standard indices received support, our highest recommendation goes to two pairs of indices: (a) the traditional d' and criterion location, c, and (b) the 'common sense' alternatives, difference score and yes-rate. Our recommendations should be of particular interest to researchers conducting large measurement studies, especially personality, industrial-organizational, and educational psychologists.

Since the 1960s, accuracy and bias measures derived from signal detection theory (SDT)[1] have been a mainstay of experimental research on perception and cognition. An extension that incorporates behavioral analysis has also been active (see Commons, Nevin, & Davison, 1991). The SDT approach has frequently been recommended for other branches of psychological research, including personality (Price, 1966), experimental social psychology (e.g., Martin & Rovira, 1981), and clinical psychology (Grossberg & Grant, 1978; McFall & Treat, 1999; Stenson, Kleinmuntz, & Scott, 1975). Despite the repeated encouragement, use of SDT is still limited in these other fields of research. Even when used, SDT techniques have rarely been applied to measure individual differences.

[1] We use the term Signal Detection Theory (SDT) as a broad rubric including various threshold and logistic models as well as Gaussian versions.

The classical conception of SDT begins with the assumption that a decision depends on properties of both the stimulus and the observer. As represented in an ROC curve, the observer system has a fixed ability to discriminate a given signal source but can trade off hits against false alarms by adjusting the response criterion (Swets, 2000). The criterion location is set in response to expectancy and motivational factors. When applied to individual differences, each observer is assigned separate accuracy and bias scores based on responses to a stimulus set (e.g., Harvey, 1992). Scores on these two indices can then be used to predict other individual difference variables.

One factor that has deterred researchers from exploiting the power of signal detection methods is the daunting variety and complexity of available indices. Expositions by Swets (1986) and Macmillan and Creelman (1990, 1991) have advanced the cause by laying out criteria for choosing among the thirty or so available indices of accuracy and bias.[2] Their deliberations led them to recommend (a) d', difference score, and ROC-area as indices of accuracy and (b) criterion location (c) and yes-rate as indices of response bias.

[2] Many of these indices require rating scales rather than a single dichotomous response (Egan et al., 1959).
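To make the recommended indices concrete, here is a minimal sketch (in Python; the function name, variable names, and example values are purely illustrative) that computes both pairs for a single respondent from a hit rate H and a false-alarm rate F, using the standard equal-variance Gaussian formulas:

```python
from scipy.stats import norm

def sdt_indices(H, F, eps=0.01):
    # Clamp proportions of 0 or 1 so the z-transform stays finite,
    # mirroring the .01/.99 conversion used in Study 1 below.
    H = min(max(H, eps), 1 - eps)
    F = min(max(F, eps), 1 - eps)
    zH, zF = norm.ppf(H), norm.ppf(F)
    return {
        "d_prime": zH - zF,          # traditional accuracy
        "c": -0.5 * (zH + zF),       # traditional bias (criterion location)
        "diff_score": H - F,         # common-sense accuracy
        "yes_rate": (H + F) / 2.0,   # common-sense bias
    }

print(sdt_indices(H=0.80, F=0.30))
```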
Following their lead, our plan is to lay out criteria for choosing among SDT indices for the purpose of measuring individual differences. Out of many possibilities, we chose accuracy and bias as the most appropriate terms for describing persons, as opposed to tasks or conditions, where terms such as discriminability and criterion location might be appropriate. We will argue that the optimal indices for experimental comparisons are not necessarily optimal for individual difference research.

Measuring Individual Differences

In the typical application of SDT, parameter estimates are compared across experimental conditions where pay-offs, base-rates, or other variables have been manipulated. Such experimental studies were the primary basis for previous recommendations about measures of sensitivity (Swets, 1986) and bias (Macmillan & Creelman, 1990). It is not clear whether their conclusions apply to the use of SDT indices in individual difference research. After all, many psychological variables do not function the same way within persons as they do between persons (John & Benet-Martinez, 2000; Ashby & Townsend, 1986). Moreover, an association that is consistent across all respondents (e.g., a step function) can be masked by averaging data across a sample of individuals (Estes, 1956; but see Macmillan & Kaplan, 1985).

Another reason why SDT indices may not function the same way in experimental as in correlational research lies in the nature of the analytic methods. Variations in distributional properties may have more (or at least different) effects on the analysis of correlations than on the analysis of mean differences. Consider, for example, the relative robustness of ANOVA to non-normality (Glass & Hopkins, 1996); in contrast, the possible values of correlations are highly constrained when the distributions of the two variables are not symmetric (Dunlap, Burke, & Greer, 1995; McLennan, 1988).

In short, there is a singular need for an evaluation and comparison of available SDT accuracy and bias indices as measures of individual differences. A user-friendly, accessible comparison source would be of particular value to researchers.
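The constraint noted by Dunlap et al. (1995) is easy to demonstrate with a small simulation of our own devising (any comparable pair of oppositely skewed distributions would do): two variables that are perfectly monotonically related, but skewed in opposite directions, fall well short of a Pearson correlation of 1.00.

```python
# Two perfectly monotonically related variables with opposite skew:
# their Pearson correlation is capped well below 1.00.
import numpy as np
from scipy.stats import expon

u = np.linspace(0.001, 0.999, 10_000)   # shared uniform quantiles
x = expon.ppf(u)                        # right-skewed
y = -expon.ppf(1 - u)                   # left-skewed, same rank order as x
print(np.corrcoef(x, y)[0, 1])          # about .64, despite perfect monotonicity
```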
Representative Literatures

From the advent of SDT, there were scattered examples of researchers assigning accuracy and/or bias scores to individuals and correlating those indices with other variables (Swets, 1964). It was not long before questions were raised about the use of some indices for this purpose (e.g., Ingham, 1970). Nonetheless, measurement of individual differences has flourished in a few domains of psychological research where SDT methods were ideal.

One active application has been the search for behavioral correlates of basic personality traits (e.g., Geen, McCown, & Broyles, 1985; Stelmack, Wieland, Wall, & Plouffe, 1984). A common interpretation of the Eysenck and Gray personality dimensions (Extraversion and Neuroticism) is a pair of cue sensitivities, namely, sensitivity to reward and punishment, respectively. SDT analyses largely supported this interpretation (e.g., Danziger & Larsen, 1989). This application of SDT is ideal because the core issue involved the distinction between (a) general tendencies to report reward and punishment stimuli and (b) accurate recognition and memory for those stimuli.

In clinical research, SDT has been applied to debates over defense mechanisms (e.g., Dorfman, 1967; van Egeren, 1968; Paulhus & Levitt, 1986). The SDT method seemed necessary for determining whether some individuals (repressors) show memory deficits for affective stimuli or simply a bias to say no. The method continues to be applied, although the issues remain contentious (Lindsay, 1997). A similar set of issues was involved in evaluating whether depressives are more or less sensitive in recalling negative events (e.g., Dunbar & Lishman, 1984).

In educational research, SDT has also been exploited to understand the relation between reading frequency and cognitive ability (Stanovich & Cunningham, 1993; Stanovich & West, 1998). Self-reported familiarity with a list of books and authors requires separate scoring of a student's accuracy and bias. Although those authors did not use the bias scores, the accuracy score was a much more effective measure once false alarms had been controlled.

As noted earlier, the research domain where SDT techniques flourish most is memory research (Yonelinas, 1994). Although the typical application involves comparisons of index means across experimental conditions, correlations with individual difference scores have been analyzed from time to time (Koriat, Goldsmith, & Pansky, 2000). One purpose for calculating between-subject correlations has been to argue for or against distinctions among various types of memory (Gardiner, Ramponi, & Richardson-Klavehn, 2002; Inoue & Belleza, 1998). The rationale is that low performance intercorrelations suggest that the two memory mechanisms must be independent. Similarly, a high between-subject correlation was used to argue for the interdependence of color and form perception (Cohen, 1997). Although our present report touts the use of individual difference correlations, we advise caution about the use of such between-subject correlations to make inferences for or against process mechanisms.

Finally, SDT accuracy indices have also been used to evaluate the psychometrics of standard tests. For example, Lanning (1989) evaluated the ability of a faking template to distinguish faked from non-faked protocols. Of several indices of discrimination, d' turned out to be the most useful in comparing different cut scores. Lanning also analyzed the slope of the ROC line as a useful criterion for evaluating index choice. Humphreys and Swets (1991) investigated the relation between a 9-point supervisor rating of cadet performance and an independently scored pass-fail criterion. They compared the performance of the full rating scale with that of an ROC-area index of accuracy in predicting the criterion. The ROC-area index showed superior performance, albeit only at extreme cutoffs. Because each respondent received only one criterion score, the authors were not able to calculate an ROC curve for each individual.

Research that did estimate the entire ROC curve for each individual was recently reported by Paulhus and Harms (2004). In a series of studies, they collected familiarity ratings for existent and non-existent items and calculated a number of accuracy and bias indices. Because large item sets were presented to each respondent, SDT indices could be calculated at each of six cutoff points. Several indices of knowledge accuracy were shown to predict scores on a standard IQ test. A similar approach has been followed by Rentfrow and Gosling (in press) in their analysis of music knowledge and its correlates.
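The per-respondent mechanics are easy to sketch. Assuming a 0-6 familiarity scale and a flag marking which items actually exist (the arrays below are toy data, not the Paulhus and Harms materials), each cutoff yields one cumulative (false-alarm, hit) point on that respondent's ROC curve:

```python
import numpy as np

rng = np.random.default_rng(0)
is_real = rng.random(150) < 0.8                 # 80% real items, 20% foils
ratings = np.where(is_real,                     # toy ratings on the 0-6 scale
                   rng.integers(0, 7, 150),
                   rng.integers(0, 4, 150))

roc_points = []
for cutoff in range(1, 7):                      # six cutoffs on a 0-6 scale
    claims = ratings >= cutoff                  # 'yes' at this cutoff
    H = claims[is_real].mean()                  # cumulative hit rate
    F = claims[~is_real].mean()                 # cumulative false-alarm rate
    roc_points.append((round(F, 2), round(H, 2)))
print(roc_points)                               # six (F, H) points, one ROC curve
```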
But Which Domain?

The degree of consistency of individual accuracy and bias scores across domains is unknown and awaits research. From an individual difference perspective, the application of SDT methods will be most useful in a stimulus domain where bias is evident. Note that positive and negative biases are unlikely to be tapping the same individual difference variable and will have to be measured separately (Paulhus, 1991; Watson & Clark, 1992). In this report, we will limit our attention to positive biases.

In the performance estimation literature, overconfidence seems to be the general rule. Some have argued that overconfidence does not obtain in the perceptual domain (Bjorkman, Juslin, & Winman, 1993; Juslin & Olsson, 1997), but the bulk of the evidence indicates that it does (Bar-Tal, Sarid, & Kishon-Rabin, 2001; Baranski & Petrusic, 1994, 1999; Petrusic & Baranski, 1997; Stankov, 2000). Certainly, there is consensus that overconfidence is the rule in the knowledge estimation domain[3] (e.g., Baranski & Petrusic, 1995; Griffin & Buehler, 1999; Lichtenstein & Fischhoff, 1983).

[3] Note, however, that overconfidence is more often found with difficult items than with easy items (Gigerenzer, Hoffrage, & Kleinbolting, 1991; Griffin & Tversky, 1992).

SDT response bias, although conceptually similar, has a rather different operationalization from overconfidence. Calculation of the latter requires comparisons across a range of item difficulty to determine the discrepancy between a respondent's (a) estimates of the percent correct they will obtain and (b) the actual percent correct they obtain. Hence the calculation requires reference to a performance criterion: To the extent that an individual is overconfident, he or she is inaccurate. In SDT, however, the concept of response bias is conceptually independent of accuracy. In the traditional (Gaussian) SDT framework, the strength of a signal is a continuous variable and the detector must decide on the criterion value that warrants a 'yes' response. Setting a high or low criterion is, in a sense, an arbitrary matter of preference that has no necessary bearing on accuracy. (A minimal sketch contrasting the two operationalizations appears at the end of this section.)

Nonetheless, the fact that overconfidence obtains in the knowledge domain leads us to believe that SDT response bias in that domain will also be positive, that is, shifted to the "yes" side. Other reasons for choosing the knowledge domain include the fact that the domain is better understood and student participants are comfortable with the administration format (i.e., knowledge questions). We will use multi-point rating scales to collect responses. This choice opens up many options for accuracy and bias indices (Swets, 1986; Macmillan & Creelman, 1990). Instead of a single estimation point, such rating scales permit the generation and analysis of full ROC curves that are amenable to a variety of quantifiable properties (e.g., mean criterion location, slope, and area-under-curve).
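The contrast between the two operationalizations can be reduced to a few lines (all numbers are invented for illustration): overconfidence cannot be computed without an accuracy criterion, whereas the SDT criterion requires only the claim rates.

```python
from scipy.stats import norm

# Overconfidence: needs a performance criterion (actual percent correct).
predicted_pc, actual_pc = 0.75, 0.60
overconfidence = predicted_pc - actual_pc   # +.15: overconfident, hence inaccurate

# SDT response bias: computable from claim rates alone, accuracy set aside.
H, F = 0.70, 0.40
c = -0.5 * (norm.ppf(H) + norm.ppf(F))      # negative c = criterion shifted to 'yes'
print(round(overconfidence, 2), round(c, 2))
```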
A Construct Approach

Our special focus will be on a practical domain of SDT indices, namely, those scored from general knowledge ratings (e.g., Baranski & Petrusic, 1994; Griffin & Buehler, 1999; Lichtenstein & Fischhoff, 1977; Paulhus & Harms, 2004). In this domain, we hope to show how the accuracy and bias indices can be used to operationalize more fundamental individual difference constructs. Traditional measures of knowledge accuracy include multiple choice tests where respondents must pick the correct answer and reject the offered foils. Conceptually, such measures can be viewed as tapping crystallized intelligence (Horn & Cattell, 1966; Rolfhus & Ackerman, 1999). As far as we know, indices of knowledge response bias have never been viewed as indicators of a more general bias construct.

Traditional measures of bias on questionnaire items have been studied at length under the rubric "response styles" (e.g., Baer, Rinaldo, & Berry, 2002; Jackson & Messick, 1961; Paulhus, 1984, 1991). And much debate has swirled around the issue of whether response styles generalize beyond questionnaire behavior (Block, 1965; Hogan & Nicholson, 1992; Saucier, 1994). Surprisingly, we know of no literature comparing response styles to SDT indices.[4] As individual differences, both response styles and consistent tendencies in accuracy and response bias are likely to have common roots in more fundamental differences in personality, values, motivation, and political ideology.

[4] It is not immediately obvious how response styles and SDT indices should correspond. Most suggestive is a possible correspondence with acquiescence response bias (e.g., Knowles & Condon, 1999; Messick, 1967). A direct correspondence was ruled out by Paulhus et al. (2003).

The potential for a trait interpretation of SDT indices suggests a novel approach to evaluating them. We propose validating accuracy and bias measures by their convergence across persons with external measures of the same construct. Achieving this goal will require respondent samples large enough to permit evaluation of associations between SDT indices and standard individual difference measures. Instead of being manipulated, scores on accuracy and bias indices will vary naturally. And our substantive questions concern whether those variations are indicative of variation in cognitive ability, personality, and values.

In the general knowledge domain, accuracy indices should converge with standard measures of cognitive ability (e.g., Paulhus & Bruce, 1990; Stanovich & West, 1998). Response bias (saying 'yes') on general knowledge items should converge with trait measures of self-enhancing tendencies (Paulhus, Harms, Bruce, & Lysy, 2003). In short, this novel approach to evaluating SDT indices in the knowledge domain is equivalent to comparing their construct validity as measures of cognitive ability and narcissistic self-enhancement.

Overview

Our goals in this report are both broad and narrow. The broad goal is to raise awareness of SDT as a valuable method for measuring individual differences. Our narrower goal is to consider the construct validity of SDT indices in the knowledge domain. At its narrowest, our goal is to compare and contrast available SDT indices. Our arguments are partitioned into (a) theoretical considerations, (b) practical considerations including ease of calculation, (c) structural relations among indices, and (d) convergence with external measures of accuracy and bias. As it turns out, these diverse criteria all lead us toward the same general recommendations.

PART 1: THEORETICAL AND EMPIRICAL CONSIDERATIONS

Before offering any new data, we consider theoretical reasons to recommend or discourage the use of various indices of accuracy and bias. Note again that the emphasis will be on their potential use as individual difference variables. Table 1 lays out many of the prominent indices that have been paired in common usage.

Consistency with statistical decision theories. Complex debates continue regarding the relative merits of various decision theories. We do not intend to enter those (often contentious) debates.
Strong advocates may prefer indices simply because of their theoretical origins. Others have argued that, at minimum, the pairing of accuracy and bias measures should be based on a common theoretical foundation (Macmillan & Creelman, 1991). For example, d' and beta originate in likelihood ratio calculations. Also, log(alpha) and log(b) have a common logistic basis. We will not place a high weight on this criterion: The option of selecting the best pair of indices should not be restricted in this way. All other things being equal, of course, we agree that common theoretical origin has the appeal of organizational elegance. As the reader might anticipate, other things are not equal.

Mathematical and empirical overlap of accuracy and bias indices

Foremost among the original goals of signal detection theory was to distinguish the notions of accuracy and response bias (Swets, 1964, 2000). Underlying this goal was the theoretical assumption that sensory mechanisms are independent of decision mechanisms. That assumption was reified in the depiction of ROC curves as a decision-maker with fixed accuracy but varying response bias: The implication is that decision-makers of any accuracy (i.e., sensitivity level) can choose any bias level (i.e., criterion).

Unfortunately, this apparent independence is not achieved in certain index pairs, most notably, the traditional pairing of d' and beta (Swets, 1964). The calculations of d' and beta are mathematically linked in such a way that the value of beta approaches 1.00 as accuracy approaches zero (a numerical sketch at the end of this subsection makes the point concrete). Some time ago, Ingham (1970) raised questions about the use of beta as an index of individual differences in bias. Indeed, lack of independence is fatal for research programs posing questions about the relation between ability and confidence (e.g., Lichtenstein & Fischhoff, 1977; Paulhus, Lysy, & Yik, 1998; Stankov & Crawford, 1997). A satisfactory pair of SDT indices -- unlike d' and beta -- must permit individuals with little ability to manifest bias.

Fortunately, certain other bias measures such as criterion location (c) do not suffer from this mathematical interdependence with d'. Nonetheless, such pairings may still be correlated across persons, even in large representative samples of respondents. For example, Park, Lee, and Lee (1999) found that, within conditions, d' was negatively correlated with a lax criterion level ("yes"-bias). Other studies, however, have found positive associations between accuracy and bias (Leelavathi & Venkatramaiah, 1976). In short, the empirical association between accuracy and bias seems to vary across samples and contexts.

Choosing a pair of so-called non-parametric indices implies the advantage of requiring no distributional assumptions. But even the putatively non-parametric measures such as A' and B'' turn out to be parametric (Macmillan & Creelman, 1991). Moreover, B' and B'' are monotonic functions of beta and, accordingly, may suffer from the same deficits as beta.

Also problematic are the sensitivity-relative or ratio measures (e.g., c'; error ratio). They index bias relative to accuracy. Although useful for some purposes, such indices will predictably fail as individual difference measures because division by accuracy introduces a mathematical confound between accuracy and bias.
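The beta problem noted above is easily verified numerically. Under the equal-variance Gaussian model, ln(beta) = c x d', so holding the criterion location fixed at a markedly lax value still drives beta toward 1.00 as accuracy vanishes:

```python
import numpy as np

c = -0.5                                   # fixed, clearly lax criterion
for d_prime in (3.0, 2.0, 1.0, 0.5, 0.1):
    beta = np.exp(c * d_prime)             # equal-variance Gaussian: ln(beta) = c * d'
    print(f"d' = {d_prime:3.1f}   beta = {beta:.2f}")
# beta climbs from 0.22 toward 1.00 (apparent neutrality) as accuracy vanishes,
# so a low-ability respondent cannot register a lax criterion on beta.
```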
Consistency across cutoffs

Another desirable property of SDT indices calculated from rating data is consistency across cutoff points. Of course, bias scores should decrease monotonically with an increasingly strict criterion. But, if accuracy scores vary across cutoffs, the SDT method becomes less credible as a model of individual differences. Similarly, the association of accuracy and bias scores should remain consistent across cutoffs. It would be difficult to draw conclusions about stable individual differences from SDT estimates that vary across cutoffs. In short, from the theoretical perspective of trait measurement, ROC line slopes within persons should approximate 1.0. As far as we know, there is no empirical research on the impact of overlap and cutoff consistency on the measurement of individual differences. In Part 3, below, we provide some preliminary data.

PART 2: PRACTICAL CONSIDERATIONS

Ease of comprehension

Other things being equal, simple, straightforward indices are more likely to be favored by researchers. Because science requires clear communication of ideas, the more easily communicated indices form a more valuable tool. Closely related is the notion that indices with the most intuitive appeal are preferable. On this basis, the two indices forming the 'common-sense' pair (difference score, yes-rate) have the most compelling connection with an intuitive conception of SDT. The notion that an accurate individual should claim more real items than foils is easy to grasp. The false-alarm rate has intuitive appeal as a measure of bias because it reflects claims of non-existent items. Only a slight increase in sophistication is necessary to be convinced that the yes-rate is preferable to the false-alarm rate: After all, the yes-rate includes more of the data, and people who are biased on foils should also be biased on real items.

Ease of collection and computation

In most research contexts, rating data are as easy to collect as dichotomous responses (e.g., True-False, Yes-No). By this criterion alone, indices based on the full ROC curve have equal standing with dichotomous formats. Given that the data are in rating format, cumulative probabilities for hits and false alarms can be calculated directly for each respondent at each cutoff point. The corresponding ROC curve yields a wide range of SDT indices that vary dramatically in their ease of calculation. Therefore, difficulty of calculation becomes a competitive criterion for choosing both accuracy and bias indices.

We will assume that contemporary researchers have access to a standard statistical package such as SPSS, SAS, STATISTICA, or BMDP. An argument can be made for rejecting indices that cannot easily be calculated in a standard statistical package. Even within that limitation, preference should be given to indices that minimize the degree of transformation required to convert raw data into the indices. Calculations that can be conducted with only a z-score transformation (or an approximation thereof) are ideal.

Currently available statistical packages are severely limited in their ability to facilitate SDT calculations of individual differences. Indices that are built into programs such as SPSS and SYSTAT treat all the input data as one experimental unit. To get subject-wise estimates, the program has to be run once for each respondent in the researcher's dataset. Because large samples are required for individual difference research, this limitation looms large. Researchers need the ability to correlate the SDT indices with other individual difference variables (ability and personality tests, for example).
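For researchers working in a scripting environment, the subject-wise step is straightforward. The sketch below (pandas/Python; all data are simulated and all column names are ours) produces one row of indices per respondent and then correlates them with an external measure:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

rng = np.random.default_rng(1)
n_subj, n_items = 100, 150
items = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), n_items),
    "is_real": np.tile(rng.random(n_items) < 0.8, n_subj),
})
# Simulated claims: higher 'yes' probability for real items than for foils.
items["claim"] = rng.random(len(items)) < np.where(items["is_real"], 0.6, 0.3)

def score_one(g, eps=0.01):
    """One respondent's d' and criterion location c."""
    H = min(max(g.loc[g["is_real"], "claim"].mean(), eps), 1 - eps)
    F = min(max(g.loc[~g["is_real"], "claim"].mean(), eps), 1 - eps)
    return pd.Series({"d_prime": norm.ppf(H) - norm.ppf(F),
                      "c": -0.5 * (norm.ppf(H) + norm.ppf(F))})

indices = items.groupby("subject")[["is_real", "claim"]].apply(score_one)
iq = pd.Series(rng.normal(100, 15, n_subj), name="iq")   # stand-in external measure
print(indices.join(iq).corr())                           # SDT x external correlations
```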
Specialty programs such as RSCORE (Dorfman & Bernbaum, 1986) do produce a variety of SDT indices. But, because they are written in lower-level languages such as Fortran, Pascal, or C, such specialty programs cannot easily be integrated into standard packages such as SPSS. A two- or three-step analytic process will be necessary: Probabilities accumulated in SPSS have to be saved for analysis in RSCORE; indices from the latter must then be fed back into SPSS, where they can be subjected to the full variety of statistical procedures.

For some indices, simplicity of calculation may well entail loss of precision. Of particular importance is the issue of least-squares versus maximum likelihood estimation of parameters. Indices based directly on ROC curves are better estimated by the latter. But ML estimates of SDT indices simply cannot be calculated within standard packages. The effects of this loss of precision will be evaluated in the data presented below.

Finally, SDT indices vary in terms of their distortion when aggregated across observers. Macmillan and Kaplan (1985) showed that minimal distortion of d' is incurred by aggregation across respondents. The likelihood ratio, beta, performs worst on this criterion because its isobias lines are most deeply curved.

PART 3: STRUCTURAL AND CONSTRUCT CONSIDERATIONS

How do the various SDT accuracy and bias indices relate to each other when correlated across individuals? How do they fit into the larger nomological network of individual differences in ability and personality? Which of the indices best predict corresponding external measures of the same construct? Such questions are rarely voiced in the SDT literature. In the individual difference literature, however, such questions address the fundamental questions of construct validity. Addressing these issues requires either simulation data or real data. In this section, we apply the full gamut of SDT indices in two studies of individual differences.

Study 1: Structural Properties of 19 SDT Indices

There is no published literature on how available SDT indices inter-relate. A factor analysis would advance our understanding by revealing the number and nature of latent variables underlying the 19 indices measured here. The emergence of a single factor would be encouraging. If more than one factor emerges, their interpretation becomes critical. Previously published research might have to be reconsidered in light of this complexity.

Participants

We collected data from 130 student participants (62% female) who received extra credit points for their participation. In small groups, they completed a battery of personality tests and a general knowledge familiarity questionnaire. In separate, supervised sessions, all participants completed a standardized IQ test.

The knowledge familiarity instrument: The general knowledge items came from a standard inventory of items known as the Over-Claiming Questionnaire (OCQ; Paulhus & Bruce, 1990). The 150 items cover various academic topics (e.g., history, literature, fine arts, science, language, and law). The items were selected randomly from the comprehensive list assembled by Hirsch (1988). The topics are not unlike those domains of knowledge covered by Rolfhus and Ackerman (1996), who also attempted to be comprehensive. Each item is rated for familiarity on a 7-point scale ranging from "never heard of it" (0) to "very familiar" (6). Twenty percent of the items in each topic do not, as far as we know, actually exist. The topic headings are provided to orient the respondent to the appropriate field.
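Under our reading of the scoring used in the Results below -- indices computed at each of the six cutoffs and then averaged -- scoring one respondent's OCQ-style protocol looks roughly like this (toy data; the clamping guard anticipates the .01/.99 conversion described in the Results):

```python
import numpy as np
from scipy.stats import norm

def averaged_indices(ratings, is_real, eps=0.01):
    """Cutoff-averaged d' and c for one respondent's 0-6 ratings."""
    d_primes, cs = [], []
    for cutoff in range(1, 7):
        H = np.clip((ratings[is_real] >= cutoff).mean(), eps, 1 - eps)
        F = np.clip((ratings[~is_real] >= cutoff).mean(), eps, 1 - eps)
        d_primes.append(norm.ppf(H) - norm.ppf(F))
        cs.append(-0.5 * (norm.ppf(H) + norm.ppf(F)))
    return float(np.mean(d_primes)), float(np.mean(cs))

rng = np.random.default_rng(2)
is_real = rng.random(150) < 0.8                  # 20% foils, as in the OCQ
ratings = np.where(is_real, rng.integers(1, 7, 150), rng.integers(0, 5, 150))
print(averaged_indices(ratings, is_real))        # (mean d', mean c)
```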
Ten years of data support use of the OCQ in measuring self-enhancement (Paulhus et al., 2003) and cognitive ability (Paulhus & Harms, 2004). The instrument was standardized on several thousand respondents and has been demonstrated to capture individual differences in cognitive ability and self-enhancement. The instrument has also been shown to be invulnerable to (a) faking instructions and (b) disclosure that the inventory contains non-existent items (Paulhus et al., 2003).

RESULTS

A total of 19 SDT indices were scored as a function of the overall hit-rate and false-alarm rate calculated from the knowledge familiarity instrument, the OCQ. An examination of gender differences showed no consistent or interpretable differences; hence, the data were collapsed across gender. To avoid division by zero, we converted all zero values of H or F to .01. Similarly, we converted all values of 1.00 to .99.

Descriptive statistics

Table 2 presents the basic distributional statistics for each of the accuracy indices. Of particular interest are the comparative values for skewness and kurtosis. Values near zero are closest to those of a normal distribution -- generally a desirable quality. All skewness values are negative, suggesting the existence of outliers with low values. An examination of the distribution plots confirmed this suspicion. But some accuracy indices seem to accommodate these low values without significant skewing of the distribution. In particular, hit-rate, d', log(alpha), and da are best and q is worst. With one exception, kurtosis values are all positive, indicating that accuracy index distributions are more peaked than normal distributions. Hence, hit-rate, d', A', and log(alpha) are best, with q, once again, showing a distribution exceptionally divergent from normal.

Table 3 presents the basic distributional statistics for each of the bias indices. With two exceptions, skewness values are negative, suggesting the existence of outliers with lower values. An examination of the distribution plots confirmed this suspicion. Of the 11 indices, ca, B'', and hit-rate are least skewed and b-ratio and c-ratio most skewed. The latter could not handle real data with F higher than H at more than one cutoff. With one exception, kurtosis values are all positive, indicating that bias index distributions are more peaked than normal distributions. Hence, beta is best, with b-ratio and c-ratio once again showing distributions exceptionally divergent from normal.

Factor analyses. Our primary analyses were based on the Pearson correlations among the 20 indices of accuracy and bias. A standard method of summarizing the pattern of inter-relations of individual difference measures is with factor analysis. Accordingly, we conducted separate factor analyses of the accuracy and bias measures followed by a joint factor analysis. All factor analyses were conducted with principal components followed by an oblimin rotation. An oblique rotation was chosen to permit estimation of the intercorrelations among the factors.

In the factor analysis of accuracy measures, percent correct (pc) was not included because it is mathematically equivalent to the difference score. The eigenvalues, 7.1 and 1.1, suggest that the first factor dominated. This differential importance suggested, correctly, that the unrotated solution would be most meaningful. Table 4 contains the factor loadings. The first factor captures almost all the indices of accuracy. The second factor appears to derive entirely from the inclusion of hit-rate.
Its distinctiveness from the other accuracy measures undoubtedly derives from its contamination with bias as well as accuracy. Its double loading is therefore understandable.

In the factor analysis of bias measures, k was not included because it is arithmetically equivalent to the yes-rate. The eigenvalues of 8.2 and 1.9 indicate that the first factor predominates. Table 5 contains the factor loadings. The first factor captures most of the indices of bias, with c and log(alpha) achieving the highest loadings. Error-ratio and yes-rate were not far behind. The second factor showed high loadings for the three ratio measures. Given that the denominator of these indices involves a measure of accuracy, we suspect that this factor is an indirect, inverse indicator of accuracy. Of particular interest is the fact that beta loaded on both factors. Its highest loading was on the first factor, but it also appeared to have a large degree of contamination from the second factor.

The joint factor analysis included 18 indices, that is, all but percent correct and k. Table 6 contains the factor loadings. Two large eigenvalues (8.8 and 6.4) dominated the results. The interpretation of the factors is unambiguous: Factor 1 represents bias, with c and yes-rate loading highest and log(b) close behind. Factor 2 represents accuracy, with d' and difference score performing best. Not surprisingly, hit-rate and false-alarm rate loaded on both factors.

DISCUSSION

The factor analytic results are encouraging. The analyses for the accuracy and bias indices indicate a single predominant factor in each category. Before we provide detailed discussion, however, replication in a second sample is necessary. Such a replication is provided in Study 2. Even with replication, one might criticize the 'bootstrapped' nature of such factor analyses. The fact that indices converge suggests a reliability of measurement across indices. It does not permit conclusions about what they measure. Accordingly, we added a variety of external measures of ability and self-enhancement in Study 2.

Study 2: Predictive Validity of 20 SDT Indices

The choice of external anchors for the accuracy and bias indices must depend on assumptions about what construct they measure. Accurate recognition in the general knowledge domain seems likely to tap crystallized intelligence (e.g., Horn & Cattell, 1966). Therefore, an IQ test would be an appropriate criterion. We used the Wonderlic Personnel Test (Wonderlic, 1992), a brief but well-validated measure of global intelligence. A response bias toward claiming general knowledge sounds most like a facet of narcissism (e.g., Robins & John, 1997). The standard measure of subclinical narcissism is the Narcissistic Personality Inventory (Raskin & Hall, 1979; Morf & Rhodewalt, 2001; Paulhus, 1998).

Participants

We collected data from 114 student participants (60% female) who received extra credit points for their participation. In small groups, they completed a battery of personality tests and a general knowledge familiarity questionnaire.

The instruments

The SDT indices were calculated from the same general knowledge items described under Study 1. Global cognitive ability was measured with a standard IQ test, namely, the Wonderlic IQ test (Wonderlic, 1979). The instrument has accumulated substantial validity evidence to support its use in both college students (McKelvie, 1989; Paulhus, Lysy, & Yik, 1998) and general populations (Dodrill, 1981; Schoenfeldt, 1985; Schmidt, 1985).
Self-enhancement was measured with the Narcissistic Personality Inventory (NPI; Raskin & Hall, 1979). The construct validity of the NPI for measuring subclinical narcissism has received substantial support (Raskin, Novacek, & Hogan, 1991; Rhodewalt & Morf, 1995).

RESULTS

Factor analyses

As in Study 1, we conducted separate factor analyses of the eight accuracy measures and the ten bias measures. All factor analyses were conducted with principal components followed by a varimax rotation.

The factor analysis of the eight accuracy measures revealed factors very similar to those in Study 1. Recall that one index was not included because of its mathematical redundancy. The two eigenvalues, 6.0 and 1.3, were similar to Study 1 and, again, indicated one dominant factor. To index the similarity of the factor patterns, we calculated factor congruence coefficients across samples. The coefficients were .95 and .91 for Factors 1 and 2, respectively. The cross coefficients were only .11 and .19.

The factor analysis of the ten bias measures again resembled the Study 1 results. The congruence coefficients were .97 and .94, even higher than those for the accuracy indices. The cross coefficients were only -.14 and .09. In short, the factor analytic results of Study 1 were replicated in Study 2 for both accuracy and bias measures.

Validity tables

We compared the ten accuracy indices for convergence with an objective measure of cognitive ability, the Wonderlic IQ test. We also compared the ten bias formulas for convergence with a trait measure of response bias, namely, the Narcissistic Personality Inventory. A desirable accuracy measure would show a high convergent correlation with IQ and little cross-correlation with narcissism. Similarly, a desirable bias measure would show a high convergent correlation with narcissism but little cross-correlation with IQ. Such distinctive patterns are certainly possible given the orthogonality of the accuracy and bias factors and the minimal correlation of the two external criteria (-.13, n.s.).

The results are displayed in Table 7. Fortunately, most of the standard indices show the desired pattern of high convergent and low cross-correlations. All of the accuracy indices performed acceptably well, with difference score and d' showing marginally superior patterns of convergent and discriminant validity. Among bias indices, the best overall performers are c, ca, log(b), and yes-rate. Across the accuracy and bias criteria, c and yes-rate fit the ideal pattern noted above.

We also evaluated the utility of ROC fit indices as moderator variables for the predictive validity of the SDT indices. Claims have been made that a good-fit ROC curve has superior validity for indexing accuracy (Humphreys & Swets, 1991). The inference for individual difference research is that subsets of individuals with slopes near 1.0 should behave more systematically than subsets with non-unitary slopes (Swets & Pickett, 1982; Macmillan & Creelman, 1991). Two groups were formed via a median split on ROC-line slopes. The predictive validity of each accuracy index was evaluated in both groups using moderated regression (Aiken & West, 1991; Holden, 1995). No significant interaction effects were observed. Therefore, the predictor-criterion association did not differ across slope groups.
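For readers wishing to replicate this check, the moderated-regression step amounts to testing an interaction term (a sketch with simulated data and invented column names follows; see Aiken & West, 1991):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 114
df = pd.DataFrame({
    "d_prime": rng.normal(1.5, 0.5, n),          # accuracy index
    "slope_group": rng.integers(0, 2, n),        # median split on ROC-line slope
})
df["iq"] = 100 + 8 * df["d_prime"] + rng.normal(0, 10, n)   # criterion

fit = smf.ols("iq ~ d_prime * slope_group", data=df).fit()
print(fit.pvalues["d_prime:slope_group"])        # n.s. -> slope does not moderate
```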
Consistency of Accuracy-Bias Overlap Across Cutoffs

As noted earlier, mathematical independence does not preclude the inter-correlation of indices across respondents. (To minimize confusion, we use the term overlap for the latter.) Our analyses revealed a factorial cleavage of accuracy and bias indices, indicating minimal intercorrelations. But the possibility of inconsistency across cutoffs is masked by our analytic strategy of averaging across cutoff points.[5] Because separate analyses are possible at each of four cutoff points, we examined the consistency of the accuracy-bias inter-correlation across different points on the ROC curve.

[5] The exceptions are ROC area and da.

The intercorrelations, along with the mean accuracy and bias scores, are provided for each cutoff in Table 6. Our traditional pairing of d' and criterion location, c, yields intercorrelations of -.49, -.22, +.42, and +.73 at cutoffs of 1-2, 2-3, 3-4, and 4-5, respectively. Our common-sense pair of (H-F) and (H+F)/2 yields corresponding intercorrelations of -.56, -.42, -.03, and +.50, respectively. Clearly, the correlations move from negative to positive as cutoffs increase. The explanation is not to be found in a strange pattern of accuracy and bias scores: They behave appropriately across the increasing cutoffs. The accuracy scores show little change. The bias scores understandably decrease across cutoffs: Higher cutoff points are more conservative and must yield lower overall claim rates. The explanation for the increasingly positive intercorrelations must lie elsewhere.

Instead, consider the meaning of the negative correlation at a low cutoff. The lowest cutoff (1-2) produces a large subgroup of individuals who gave all ratings > 1. They show accuracy = 0.0 (minimum value) and bias = 1.0 (maximum value). Another subgroup of individuals will score accuracy = 0.0 (minimum) and bias = 0.0 (minimum): These are participants who gave all ratings < 1. At the 1-2 cutoff, however, the former group (with minimum accuracy and maximum bias) induces a negative association between accuracy and bias. The inverse is true at the highest cutoff (4-5), where a large group of individuals exhibits minimum accuracy and minimum bias scores because all their responses were less than 5. Such complexities are hardly obvious when a researcher is faced with the choice of performing SDT calculations at single or multiple cutoffs.

Least-squares vs. maximum likelihood estimation. The indices where LS and ML estimates are most likely to differ are those that involve fitting the ROC curve with slope and intercept: ROC-area, da, and Ag. The conditions where LS and ML differ are those where the LS solutions depart from linearity. The value of R2, which gauges linearity, might be used as a moderator variable. Specifically, indices calculated on low-R2 participants should be less valid and yield poorer correlations with criterion variables.

To permit the comparison, we analyzed a sample of 50 cases in two ways. First, we calculated the least-squares estimates in our SPSS program. Then we used RSCORE-II (Dorfman, 1982, based on Dorfman & Alf, 1969) to calculate the maximum likelihood estimates. Across the 50 cases, the inter-correlation of the two methods was .99 for ROC-area and .98 for Ag. We conclude that the usual preference for ML over initial LS estimates is unjustified in measuring individual differences in knowledge.
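The least-squares route is simple to reproduce. For one respondent's six cumulative (F, H) points (the values below are illustrative), the z-transformed ROC is fit with an ordinary regression line, and da and the area Az follow from the standard binormal formulas; R2 indexes linearity of fit:

```python
import numpy as np
from scipy.stats import norm

# Six cumulative (F, H) points for one respondent (illustrative values).
F = np.array([0.55, 0.40, 0.28, 0.18, 0.10, 0.04])
H = np.array([0.92, 0.85, 0.75, 0.62, 0.45, 0.25])

zF, zH = norm.ppf(F), norm.ppf(H)
b, a = np.polyfit(zF, zH, 1)                # fitted z-ROC line: zH = a + b*zF
d_a = a * np.sqrt(2.0 / (1.0 + b ** 2))     # accuracy along the unit-slope axis
A_z = norm.cdf(a / np.sqrt(1.0 + b ** 2))   # area under the fitted binormal ROC
r2 = np.corrcoef(zF, zH)[0, 1] ** 2         # linearity: a candidate moderator
print(round(d_a, 2), round(A_z, 2), round(r2, 3))
```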
GENERAL DISCUSSION

Our goal was to evaluate the array of available SDT indices with respect to their utility in measuring individual differences. The nature of the marshaled evidence goes well beyond the previous literature comparing indices for use in the comparison of means. It included two studies of individual differences in responses to general knowledge items. Restricting the domain to knowledge items entails some loss in generalizability; on the other hand, it gave us the opportunity to detail a variety of novel analytic approaches. Because of our emphasis on individual differences, these analyses focused on correlations among indices and across respondents.

Before we attempt to reconcile and integrate the three categories of criteria, we will review the empirical results. Recall that three analytic strategies were pursued: distributional analyses, factor analyses, and validation of the indices against external criteria.

Distributional properties. The evaluation of indices for individual difference applications requires special attention to distributional properties. In the analysis of experimental data, the means of odd distributions can be finessed by trimming outliers. But correlations are heavily dependent on extreme values. Restriction of range has a systematic detrimental effect on the value of correlations (e.g., Gulliksen, 1967). In short, we cannot simply toss out extreme scores.

When calculated on our broad samples of students, our results revealed dramatic differences in the distributions of the various indices -- in both the accuracy and bias subsets. Although representing the SDT accuracy of the same respondents, indices such as hit-rate, d', and log(alpha) exhibited near-normal distributions, but q did not. Similarly, among the bias measures, false-alarm rate, b-ratio, and c-ratio showed especially non-normal distributions. Overall, the results raise questions about the use of q, false-alarm rate, and the two ratio indices.

We will not deal here with the possible transformation of indices to render them more useful. The ability to incorporate extreme cases without transformation is a commendable quality in a statistical index. Nonetheless, on their own, such distributional properties are not critical criteria for selecting SDT indices. Reliability and validity criteria take precedence. Therefore, we included all the indices in our correlational analyses.

Correlational structure. Our factor analyses of accuracy indices were conducted to reveal how many subtypes lurked within the array of common accuracy and bias indices. Although all accuracy and bias indices are ultimately based on H and F, the diversity in how these contribute to the various formulas prohibits any easy prediction of the size and direction of their correlations across individuals.

Nonetheless, our factor analyses of ten accuracy measures yielded coherent results. All indices loaded highly on the first unrotated factor. Although many of these indices have been criticized (e.g., Macmillan & Creelman, 1991; Swets, 1986), they all seem effective from an individual differences perspective. This result should be reassuring to those researchers who apparently made arbitrary choices of indices in their previously published work: Interchanging any of these standard accuracy indices would make little difference to their results. The clear exception to this blanket reassurance is the hit-rate: Its double loading (on both the accuracy and bias factors) sustains the fear that this index has a serious confound, undoubtedly its corresponding false-alarm rate. Overall, the difference score showed the highest primary loading and a minimal cross-loading on the second factor.
Indirectly, this winning performance also supports the use of percent correct as an accuracy index because it is mathematically equivalent to the difference score. This finding may surprise some SDT traditionalists who assume that the more sophisticated and mathematically complex indices will always out-perform their less sophisticated competitors.

Our factor analyses of the nine bias measures also yielded coherent results. Most measures loaded strongly on the first unrotated factor, indicating that they tap a common construct. Indices c and log(alpha) had the highest loadings. A second factor also emerged with a clear interpretation. The high loadings for all three ratio measures suggest that the second factor harbors the cognitive ability contamination that is built into those indices.[6]

[6] Recall that ratio measures involve division by an accuracy index.

The joint factor analyses were particularly compelling. Factor 1 captured the common component of all the accuracy indices and Factor 2 captured the corresponding common component of the bias indices. The results from this analysis provide a global confirmation that the traditional distinction between accuracy and bias indices is viable when applied to individual difference measurement. At the same time, the fact that this distinction cannot be taken for granted is demonstrated in poor performers such as beta and hit-rate: Their double loadings raise serious questions about future usage in individual difference research.

Overlap of accuracy and bias indices. The relative independence of accuracy and bias indices in our structural analyses may have been a serendipitous consequence of our strategy of averaging across cutoffs. When we examined the intercorrelations of our index pairs across cutoffs, the values showed a huge range, from highly negative at low cutoffs to highly positive at high cutoffs. Hence our averaging approach, though more tedious than opting for a single cutoff, has another important advantage: accuracy-bias independence.

Other solutions are possible: Based on our data, moderate cutoffs yield low intercorrelations. Hence researchers might use a 2-3 or 3-4 cutoff on 5-point response scales and a 3-4 or 4-5 cutoff on 7-point scales (the sketch at the end of this subsection illustrates such a cutoff check). In the case of dichotomous response scales (e.g., Yes-No), stimuli yielding proportions near 50-50 response rates are likely to show the lowest accuracy-bias overlap. Perhaps new SDT index pairs could be developed -- ones that show minimal overlap at all cutoffs.

It is worth reminding the reader that accuracy-bias overlap is a serious problem in individual difference research. If the researcher cannot distinguish between accurate individuals and biased individuals, the central purpose of SDT methods is compromised. Even if both are entered into a regression equation, the overlapping portion may bounce back and forth from the accuracy index to the bias index across different samples (Aiken & West, 1991; Holden, 1995).
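A researcher committed to a single cutoff can run the overlap check directly (a sketch with simulated ratings; with real data, the moderate cutoffs should show the smallest absolute correlation between the difference score and the yes-rate):

```python
import numpy as np

rng = np.random.default_rng(4)
n_subj, n_items = 130, 150
is_real = rng.random(n_items) < 0.8
skill = rng.normal(0, 1, (n_subj, 1))           # simulated knowledge levels
ratings = np.clip(np.round(rng.normal(2.5 + skill * is_real, 1.5)), 0, 6)

for cutoff in range(1, 7):
    H = (ratings[:, is_real] >= cutoff).mean(axis=1)
    F = (ratings[:, ~is_real] >= cutoff).mean(axis=1)
    r = np.corrcoef(H - F, (H + F) / 2)[0, 1]   # across-person overlap
    print(cutoff, round(r, 2))                  # prefer the cutoff nearest zero
```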
Validity. The validity results were also consistent with the factor analyses. Indices that best defined the accuracy factor also best predicted the accuracy criterion, namely, scores on a global IQ test. In the domain of general knowledge, our accuracy indices are best viewed as tapping crystallized intelligence (Horn & Cattell, 1966). Hence these validities would have been even higher had we used a criterion test of crystallized intelligence (see Rolfhus & Ackerman, 1999). Moreover, our use of a relatively brief measure (the Wonderlic IQ test) suggests that our validities (.29 to .48) are likely to be underestimates.

In parallel, SDT indices that best defined the bias factor also best predicted the bias criterion, namely, scores on the Narcissistic Personality Inventory. The NPI is the undisputed method of choice for measuring subclinical narcissism (Gosling, John, Craik, & Robins, 1998; Raskin, Novacek, & Hogan, 1991; Morf & Rhodewalt, 2001; Paulhus, 1998; John & Robins, 1994).

From a construct point of view, these validity data are the most valuable evidence in our report. They indicate that SDT indices calculated in the knowledge familiarity paradigm tap the well-established constructs of global cognitive ability and trait self-enhancement. In evaluating SDT indices in other domains, it is likely that a different set of criterion indices will be appropriate.

The availability of external criteria also permitted an evaluation of the utility of ROC fit indices. Claims have been made about the superior validity of good-fit ROC curves for indexing accuracy (Humphreys & Swets, 1991). The inference for individual difference research is that subsets of individuals with slopes near 1.0 should behave more systematically than subsets with non-unitary slopes. In general, we ignored such complexities by averaging across all six cutoffs for most indices. The two accuracy indices based on ROC curves -- ROC-area and da -- should be invulnerable to slope differences because they measure properties of the entire curve. Our data appeared to reveal robustness for the two recommended pairs as well as for ROC-area and da. Separation of individuals with non-unitary slopes from those with unitary slopes revealed no difference in the predictive validity of external criterion variables. This conclusion is tentative, and more elaborate research is necessary to pursue this issue.

CONCLUSIONS AND FUTURE DIRECTIONS

Overall, our empirical results dovetail with our non-empirical arguments. They are also consistent with conclusions from previous comparative analyses of accuracy indices (Swets, 1986) and bias indices (Macmillan & Creelman, 1990). It is clear that some SDT indices are better than others for the measurement of individual differences. Overall, we favor two pairs of indices -- for somewhat different reasons.

First, we recommend the venerable d' and c. They performed well empirically and have a solid theoretical and adequate practical justification. Beta, the traditional partner of d', however, should not be used. Its theoretical and practical deficits have been noted before (e.g., Ingham, 1970), yet have largely been ignored. Here, the failure of beta as an individual difference measure is clear in its small and erratic associations with external measures of ability and bias. Even more telling, our factor analyses revealed that beta is not tapping the same latent construct as the other bias measures.

Our second recommended combination of indices is what we call the 'common-sense' pair, namely, difference score and yes-rate.[7] As implied by the label, their intuitive appeal is indisputable. Their calculation could not be more straightforward. And their empirical performance was surprisingly good. Both indices loaded strongly on their corresponding accuracy and bias factors with only minor cross-loadings. Finally, their convergence with appropriate external criteria was excellent.

[7] A pair that is mathematically equivalent is percent correct and k.
A theoretical bonus for pairing these two indices is their common theoretical association -- even if that theory is a variety of High-Threshold Theory, which has been criticized in other contexts (e.g., Coombs, Dawes, & Tversky, 1965, pp. 183-190). Note that these two indices are mathematically equivalent to the pairing of percent correct and k. Yet our recommendation for the latter pair is more qualified. Although percent correct is also intuitively clear, its calculation is somewhat more complicated than the difference score. And k enjoys neither advantage.

Finally, it may be surprising to some readers that the common-sense measures perform so well despite being so simple. One answer is that they are monotonic functions of the traditional d' and c indices, which also performed well. Although monotonically related, our two recommended pairs originate in different theories of detection. The traditional pair assumes Gaussian distributions of both signal and noise. The common-sense pair is linked to High-Threshold Theory (Blackwell, 1953; Macmillan & Creelman, 1991). Therefore, the utility of both theories is conjointly supported by our results.

Least-squares (LS) versus maximum likelihood (ML) solutions. Maximum likelihood is ideal for fitting ROC curves because the fitting task involves variability on both X and Y coordinates. Therefore, the estimation of indices such as ROC-area and da may differ between least-squares and maximum likelihood methods. Unfortunately, ML methods require iterative computer programs such as RSCORE (Dorfman, 1982). In the past, therefore, the choice between the two methods has been viewed as a trade-off of precision with practicality. When we compared results from LS and ML estimations, however, we found little difference. We conclude that, for the measurement of individual differences, the precision of LS estimates is sufficient.

Limitations and Future Research. How far can our conclusions about bias indices be generalized, given that our data were derived from general knowledge items? Would the same individuals who show a strong bias toward saying "yes, that looks familiar" also show a bias toward saying "yes, I see that" and "yes, I have insight into that person"? If so, then our theoretical and practical arguments should apply to other domains as well. Research on the generality of overconfidence across knowledge and perception data seems to come down on the side of generality: Recent work has suggested that overconfidence indices from a variety of domains tend to coalesce into one general factor of individual differences (Bornstein & Zickafoose, 1999; Oakley, 1998; Stankov, 2000; West & Stanovich, 1997).

What about the social judgment domain? Recent work has shown a similar overconfidence in that domain too (Dunning, Griffin, Milojkovic, & Ross, 1990; Griffin & Buehler, 1999), but individual differences have yet to be studied. We do know that individual differences in bias are extensive in the social judgment domain (Hilgendorf & Irving, 1978). In short, we have hints of a general factor of individual differences in overconfidence. On this basis, we suspect that SDT bias indices, like overconfidence indices, will converge across domains to reveal a general factor of response bias.

With regard to accuracy, however, it is not at all clear that indices derived from perceptual tasks would show the same clear associations with IQ test scores. The association does not seem to hold across knowledge and eye-witness domains (Bornstein & Zickafoose, 1999).
A century of research beginning with Spearman (1904) has provided no evidence -- at least within the normal range -- that measures of 'g' predict the ability to distinguish percepts. And identifying the "good judge" in the social domain has been especially elusive (Cronbach, 1955; Funder & Colvin, 1997).

Within the knowledge domain, we hope to connect SDT bias indices to calibration and resolution, as well as overconfidence tendencies. Another response bias, the tendency to give high ratings to self and others, seems to generalize across targets (Kenny, 1994). But that tendency seems to hold only for agreeableness ratings (Kenny, 1994; Paulhus & Reynolds, 1995). Our own research is now being directed toward large-sample studies that include both SDT and overconfidence measures across three stimulus domains: social judgment, perceptual judgment, and knowledge familiarity.

REFERENCES

Ackerman, P.L. (1996). A theory of adult intellectual development: Process, personality, interests, and knowledge. Intelligence, 22, 227-257.

Aiken, L.S., & West, S.G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.

Ashby, F.G., & Townsend, J.T. (1986). Varieties of perceptual independence. Psychological Review, 93, 154-179.

Baer, R.A., Rinaldo, J.C., & Berry, D.T.R. (2002). Response distortions in self-report assessment. In R. Fernandez-Ballesteros (Ed.), Encyclopedia of psychological assessment. London: Sage.

Baranski, J.V., & Petrusic, W.M. (1994). The calibration and resolution of confidence in perceptual judgments. Perception & Psychophysics, 55, 412-428.

Baranski, J.V., & Petrusic, W.M. (1995). On the calibration of knowledge and perception. Canadian Journal of Experimental Psychology, 49, 397-407.

Baranski, J.V., & Petrusic, W.M. (1999). Realism of confidence in sensory discrimination. Perception & Psychophysics, 61, 1369-1383.

Bar-Tal, Y., Sarid, A., & Kishon-Rabin, L. (2001). A test of the overconfidence phenomenon using audio signals. Journal of General Psychology, 128, 76-80.

Bjorkman, M., Juslin, P., & Winman, P. (1993). Realism of confidence in sensory discrimination: The underconfidence phenomenon. Perception & Psychophysics, 54, 75-81.

Blackwell, H.R. (1953). Psychophysical thresholds: Experimental studies of methods of measurement. Bulletin of the Engineering Research Institute, University of Michigan, No. 36.

Block, J. (1965). The challenge of response sets. New York: Century.

Bornstein, B.H., & Zickafoose, D.J. (1999). "I know I know it, I know I saw it": The stability of the confidence-accuracy relationship across domains. Journal of Experimental Psychology: Applied, 5, 76-88.

Cohen, D.J. (1997). Visual detection and perceptual independence: Assessing color and form. Perception & Psychophysics, 59, 623-635.

Commons, M.L., Nevin, J.A., & Davison, M.C. (Eds.). (1991). Signal detection: Mechanisms, models, and applications. Hillsdale, NJ: Erlbaum.

Coombs, C., Dawes, R.M., & Tversky, A. (1965). Mathematical psychology: An elementary introduction. Englewood Cliffs, NJ: Prentice-Hall.

Cronbach, L.J. (1955). Processes affecting scores on "understanding of others" and "assumed similarity". Psychological Bulletin, 52, 177-193.

Danziger, P.R., & Larsen, J.D. (1989). Personality dimensions and memory as measured by signal detection. Personality and Individual Differences, 10, 809-811.
Dorfman, D.D. (1982). RSCORE-II. Fortran program provided in J.A. Swets & R.M. Pickett, Evaluation of diagnostic systems: Methods from signal detection theory (pp. 208-225). New York: Academic Press.
Dorfman, D.D., & Alf, E. (1969). Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals -- rating method data. Journal of Mathematical Psychology, 6, 487-496.
Dunbar, G.C., & Lishman, W.A. (1984). Depression, recognition-memory and hedonic tone: A signal detection analysis. British Journal of Psychiatry, 144, 376-382.
Dunlap, W.P., Burke, M.J., & Greer, T. (1995). The effect of skew on the magnitude of product-moment correlations. Journal of General Psychology, 122, 365-377.
Dunning, D., Griffin, D.W., Milojkovic, J.D., & Ross, L. (1990). The overconfidence effect in social prediction. Journal of Personality and Social Psychology, 58, 568-581.
Egan, J.P., Schulman, A.L., & Greenberg, G.Z. (1959). Operating characteristics determined by binary decisions and by ratings. Journal of the Acoustical Society of America, 31, 768-773.
Estes, W.K. (1956). The problem of inference from curves based on group data. Psychological Bulletin, 53, 134-140.
Funder, D.C. (1995). On the accuracy of interpersonal judgments: A realistic approach. Psychological Review, 102, 652-670.
Funder, D.C., & Colvin, C.R. (1997). Congruence of others' and self-judgments of personality. In R. Hogan, J. Johnson, & S.R. Briggs (Eds.), Handbook of personality psychology (pp. 617-647). San Diego: Academic Press.
Gardiner, J.M., Ramponi, C., & Richardson-Klavehn, A. (1999). Response deadline and subjective awareness in recognition memory. Consciousness and Cognition: An International Journal, 8, 484-496.
Geen, R.G., McCown, E.J., & Broyles, J.W. (1985). Effects of noise on sensitivity of introverts and extraverts to signals in a vigilance task. Personality and Individual Differences, 6, 237-241.
Gigerenzer, G., Hoffrage, U., & Kleinbolting, H. (1991). Probabilistic mental models: A Brunswikian theory of confidence. Psychological Review, 98, 506-528.
Glass, G.V., & Hopkins, K.D. (1996). Statistical methods in education and psychology (3rd ed.). Needham Heights, MA: Allyn & Bacon.
Gopher, D., Itkin-Webman, T., Erev, I., Meyer, J., & Armony, L. (2000). The effect of shared responsibility and competition in perceptual games: A test of a cognitive game-theoretic extension of signal-detection theory. Journal of Experimental Psychology: Human Perception and Performance, 26, 325-341.
Gosling, S., John, O.P., Craik, K.H., & Robins, R.W. (1998). Do people know how they behave? Self-reported act frequencies compared with on-line coding by observers. Journal of Personality and Social Psychology, 74, 1337-1349.
Griffin, D., & Buehler, R. (1999). Frequency, probability, and prediction: Easy solutions to cognitive illusions? Cognitive Psychology, 38, 48-78.
Griffin, D., & Tversky, A. (1992). The weighing of evidence and the determinants of confidence. Cognitive Psychology, 24, 411-435.
Grossberg, J.M., & Grant, B.F. (1978). Clinical psychophysics: Applications of ratio scaling and signal detection methods to research on pain, fear, drugs, and medical decision making. Psychological Bulletin, 85, 1154-1176.
Gulliksen, H. (1967). Theory of mental tests. New York: Wiley.
Harvey, L.O. (1992). The critical operating characteristic and the evaluation of expert judgment. Organizational Behavior and Human Decision Processes, 53, 229-251.
Hilgendorf, E.L., & Irving, B.L. (1978). Decision criteria in person recognition. Human Relations, 31, 781-789.
Hirsch, E.D. (1988). Cultural literacy. New York: Vintage.
Hogan, R., & Nicholson, R.A. (1992). The meaning of personality test scores. American Psychologist, 43, 621-626.
Holden, R.R. (1995). Moderated multiple regression. In R. Corsini (Ed.), Encyclopedia of psychology (2nd ed., Vol. 2, pp. 422-423). New York: Wiley.
Horn, J.L., & Cattell, R.B. (1966). Refinement and test of the theory of fluid and crystallized intelligence. Journal of Educational Psychology, 57, 253-270.
Humphreys, L.G., & Swets, J.A. (1991). Comparison of predictive validities measured with biserial correlations and ROCs of signal detection theory. Journal of Applied Psychology, 76, 316-321.
Ingham, J.G. (1970). Individual differences in signal detection. Acta Psychologica, 34, 39-50.
Inoue, C., & Bellezza, F.S. (1998). The detection model of recognition using know and remember judgments. Memory and Cognition, 26, 299-308.
Jackson, D.N., & Messick, S. (1961). Acquiescence and desirability as response determinants on the MMPI. Educational and Psychological Measurement, 21, 771-792.
John, O.P., & Benet-Martínez, V. (2000). Measurement, scale construction, and reliability. In H.T. Reis & C.M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 339-369). New York: Cambridge University Press.
John, O.P., & Robins, R.W. (1994). Accuracy and bias in self-perception: Individual differences in self-enhancement and the role of narcissism. Journal of Personality and Social Psychology, 66, 206-219.
Juslin, P., & Olsson, H. (1997). Thurstonian and Brunswikian origins of uncertainty in judgment: A sampling model of confidence in sensory discrimination. Psychological Review, 104, 344-366.
Keren, G. (1991). Calibration and probability judgments: Conceptual and methodological issues. Acta Psychologica, 77, 217-273.
Knowles, E.S., & Condon, C.A. (1999). Why people say "yes": A dual-process theory of acquiescence. Journal of Personality and Social Psychology, 77, 379-386.
Koriat, A., Goldsmith, M., & Pansky, A. (2000). Toward a psychology of memory accuracy. Annual Review of Psychology, 51, 481-537.
Lanning, K. (1989). Detection of invalid response patterns on the California Psychological Inventory. Applied Psychological Measurement, 13, 45-56.
Lautenschlager, G.J. (1994). Accuracy and faking of background data. In G.S. Stokes & M.D. Mumford (Eds.), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 391-419). Palo Alto: CPP Books.
Leelavathi, G., & Venkatramaiah, S.R. (1976). Effect of anxiety on certain parameters of the theory of signal detectability (TSD). Indian Journal of Clinical Psychology, 3, 35-39.
Lichtenstein, S.B., & Fischhoff, B. (1977). Do those who know more also know more about how much they know? Organizational Behavior and Human Performance, 20, 159-183.
Lindsay, D.S. (1997). In J.D. Read & D.S. Lindsay (Eds.), Recollections of trauma: Scientific evidence and clinical practice (pp. 1-24). New York: Plenum Press.
MacLennan, R.N. (1988). Correlation, base-rates, and the predictability of behaviour. Personality and Individual Differences, 9, 675-684.
Macmillan, N.A., & Creelman, C.D. (1990). Response bias: Characteristics of detection theory, threshold theory, and nonparametric indices. Psychological Bulletin, 107, 401-413.
Macmillan, N.A., & Creelman, C.D. (1991). Detection theory: A user's guide. New York: Cambridge University Press.
Macmillan, N.A., & Kaplan, H.L. (1985). Detection theory analysis of group data: Estimating sensitivity from average hit and false-alarm rates. Psychological Bulletin, 98, 185-199.
Martin, W.W., & Rovira, M.L. (1981). Signal detection theory: Its implications for social psychology. Personality and Social Psychology Bulletin, 7, 232-239.
McFall, R.M., & Treat, T.A. (1999). Quantifying the information value of clinical assessments with signal detection theory. Annual Review of Psychology, 50, 215-241.
Messick, S. (1967). The psychology of acquiescence: An interpretation of research evidence. In I.A. Berg (Ed.), Response set in personality assessment (pp. 115-145). Chicago: Aldine.
Morf, C.C., & Rhodewalt, F. (2001). Unraveling the paradoxes of narcissism: A dynamic self-regulatory processing model. Psychological Inquiry, 12, 177-196.
Oakley, K.J. (1998). Age and individual differences in the realism of confidence. Unpublished doctoral dissertation, Carleton University, Ottawa, Canada.
Ozer, D.J. (1993). Classical psychophysics and the assessment of agreement and accuracy in judgments of personality. Journal of Personality, 61, 739-767.
Park, K.B., Lee, H., & Lee, S. (1996). Monotonous social environments and the identification of crime suspects: A comparison of police cadets and civilian college students. Psychological Reports, 79, 647-654.
Paulhus, D.L. (1984). Two-component models of socially desirable responding. Journal of Personality and Social Psychology, 46, 598-609.
Paulhus, D.L. (1991). Measurement and control of response bias. In J.P. Robinson, P.R. Shaver, & L.S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 17-59). San Diego, CA: Academic Press.
Paulhus, D.L. (1998). Interpersonal and intrapsychic adaptiveness of trait self-enhancement: A mixed blessing? Journal of Personality and Social Psychology, 74, 1197-1208.
Paulhus, D.L., & Bruce, N. (1990, June). Validation of the OCQ: A preliminary study. Paper presented at the meeting of the Canadian Psychological Association, Ottawa.
Paulhus, D.L., & Harms, P.D. (2004). Measuring cognitive ability with the over-claiming technique. Intelligence, 32, 297-314.
Paulhus, D.L., Harms, P.D., Bruce, M.N., & Lysy, D.C. (2003). The over-claiming technique: Measuring self-enhancement independent of ability. Journal of Personality and Social Psychology, 84, 681-693.
Paulhus, D.L., & Levitt, K. (1987). Desirable responding triggered by affect: Automatic egotism? Journal of Personality and Social Psychology, 52, 245-259.
Paulhus, D.L., Lysy, D., & Yik, M.S.M. (1998). Self-report measures of intelligence: Are they useful as proxy measures of IQ? Journal of Personality, 66, 525-554.
Paulhus, D.L., & Reynolds, S. (1995). Enhancing validity and agreement in personality impressions: Putting the person back into person perception. Journal of Personality and Social Psychology, 69, 1233-1242.
Petrusic, W.M., & Baranski, J.V. (1997). Context effects in the calibration and resolution of confidence. American Journal of Psychology, 110, 543-572.
Price, R.H. (1966). Signal-detection methods in personality and perception. Psychological Bulletin, 66, 55-62.
Raskin, R., & Hall, C.S. (1979). A Narcissistic Personality Inventory. Psychological Reports, 45, 590.
Raskin, R.N., Novacek, J., & Hogan, R.T. (1991). Narcissism, self-esteem and defensive self-enhancement. Journal of Personality, 59, 19-38.
Rentfrow, P.J., & Gosling, S.D. (in press). The do re mi's of everyday life: Examining the structure and personality correlates of music preferences. Journal of Personality and Social Psychology.
Rhodewalt, F., & Morf, C.C. (1995). Self and interpersonal correlates of the Narcissistic Personality Inventory: A review and new findings. Journal of Research in Personality, 29, 1-23.
Robins, R.W., & John, O.P. (1997). Effects of visual perspective and narcissism on self-perception: Is seeing believing? Psychological Science, 8, 37-42.
Rolfhus, E.L., & Ackerman, P.L. (1996). Self-report knowledge: At the crossroads of ability, interest, and personality. Journal of Educational Psychology, 88, 174-188.
Rolfhus, E.L., & Ackerman, P.L. (1999). Assessing individual differences in knowledge, intelligence, and related traits. Journal of Educational Psychology, 91, 511-526.
Saucier, G. (1994). Separating description and evaluation in the structure of personality attributes. Journal of Personality and Social Psychology, 66, 141-154.
Schmidt, F.L. (1985). Review of the Wonderlic Personnel Test. In J.V. Mitchell (Ed.), The ninth mental measurements yearbook (pp. 1755-1757). Lincoln, NE: Buros Institute.
Schmidt, F.L., & Hunter, J.E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.
Schoenfeldt, L.F. (1985). Review of the Wonderlic Personnel Test. In J.V. Mitchell (Ed.), The ninth mental measurements yearbook (pp. 1757-1758). Lincoln, NE: Buros Institute of Mental Measurement.
Schulman, A.I., & Mitchell, R.R. (1966). Operating characteristics from yes-no and forced-choice procedures. Journal of the Acoustical Society of America, 40, 473-477.
Simpson, A.J., & Fitter, M.J. (1973). What is the best index of detectability? Psychological Bulletin, 80, 481-488.
Sinha, R.R., & Krueger, J. (1998). Idiographic self-evaluation and bias. Journal of Research in Personality, 32, 131-155.
Spearman, C. (1904). General intelligence objectively determined and measured. American Journal of Psychology, 15, 201-293.
Stankov, L. (2000). The trait of self-confidence. International Journal of Psychology, 35, 305.
Stankov, L., & Crawford, J.D. (1996). Confidence judgments in studies of individual differences. Personality and Individual Differences, 21, 971-986.
Stankov, L., & Crawford, J.D. (1997). Self-confidence and performance on tests of cognitive abilities. Intelligence, 25, 93-109.
Stanovich, K.E., & Cunningham, A.E. (1993). Where does knowledge come from? Specific associations between print exposure and information acquisition. Journal of Educational Psychology, 85, 211-229.
Stanovich, K.E., & West, R.F. (1998). Individual differences in rational thought. Journal of Experimental Psychology: General, 127, 161-188.
Stelmack, R.M., Wieland, L.D., Wall, M.U., & Plouffe, L. (1984). Personality and the effects of stress on recognition memory. Journal of Research in Personality, 18, 164-178.
Stenson, H., Kleinmuntz, B., & Scott, B. (1975). Personality assessment as a signal detection task. Journal of Consulting and Clinical Psychology, 43, 794-799.
Stricker, L.J. (2000). Using just noticeable differences to interpret test scores. Psychological Methods, 5, 415-424.
Swets, J.A. (1964). Signal detection and recognition by human observers. New York: Wiley.
Swets, J.A. (1986). Indices of discrimination or diagnostic accuracy: Their ROCs and implied models. Psychological Bulletin, 99, 100-117.
Swets, J.A. (2000). Separating discrimination and decision in detection, recognition, and matters of life and death. In D. Scarborough & S. Sternberg (Eds.), Methods, models, and conceptual issues: An invitation to cognitive science (Vol. 4, pp. 635-702). Cambridge, MA: MIT Press.
Swets, J.A., & Pickett, R.M. (1982). Evaluation of diagnostic systems. New York: Academic Press.
van Egeren, L. (1968). Repression and sensitization: Sensitivity and recognition criteria. Journal of Experimental Research in Personality, 3, 1-8.
Watson, D., & Clark, L.A. (1992). On traits and temperament: General and specific factors of emotional experience and their relation to the five-factor model. Journal of Personality, 60, 441-476.
West, R.F., & Stanovich, K.E. (1997). The domain specificity and generality of overconfidence: Individual differences in performance estimation bias. Psychonomic Bulletin & Review, 4, 387-392.
Wickelgren, W.A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wiggins, J.S. (1973). Personality and prediction: Principles of personality assessment. Reading, MA: Addison-Wesley.
Wonderlic, E.F. (1992). Wonderlic Personnel Test and Scholastic Level Exam user's manual. Libertyville, IL: Wonderlic Personnel Test, Inc.
Yonelinas, A.P. (2002). The nature of recollection and familiarity: A review of 30 years of research. Journal of Memory and Language, 46, 441-517.
Yu, J. (1990). Comparing SDT and IRT with a Monte Carlo approach. Acta Psychologica Sinica, 22, 205-210.

Table 1. Available Indices of Accuracy and Bias from Signal Detection Theory

                              Accuracy            Response Bias
Crude                         H                   F
Commonsense                   (H - F)             (H + F)/2
  (equivalent indices)        percent correct     k
Traditional                   d'                  beta, c
ROC-derived: Gaussian         ROC-area, da        ca
ROC-derived: logistic         log(alpha)          log(b)
Miscellaneous                 A', q               B'', b-ratio, c-ratio, error-ratio

Note. Definitions are provided in Appendix I.

Table 2. Descriptive Statistics on Accuracy Indices

                     Mean    S.D.   Skewness   Kurtosis
hit-rate             .484    .13      -.07       -.09
d'                  1.09     .53      -.29        .97
difference score     .279    .14      -.59       1.88
percent correct      .640    .07      -.59       1.88
A'                   .749    .08      -.93        .81
log(alpha)           .442    .22      -.15        .59
q                    .314    .27     -4.03      22.44
da                   .658    .41      -.15       2.07
ROC area             .673    .10      -.71       2.20

Note. N = 130.

Table 3. Descriptive Statistics on Response Bias Indices

                     Mean    S.D.   Skewness   Kurtosis
false-alarm rate     .205    .015     1.45       2.07
beta                4.44    2.91       .82        .11
c                    .593    .467     -.71        .69
ca                   .472    .415     -.41       1.03
B''                  .394    .226     -.23       -.55
yes-rate             .344    .125      .79        .79
k                    .155    .125     -.79        .79
log(b)               .476    .376     -.64        .61
error-ratio          .736    .155     -.93        .52
b-ratio              .344   4.48     -9.60     106.2
c-ratio              .372   2.52     -6.17      52.2

Note. N = 130.

Table 4. Factor Analysis of Eight Accuracy Indices

                     Component 1   Component 2
hit-rate                .566          .798
d'                      .932         -.304
difference score        .969          .086
A'                      .949         -.220
log(alpha)              .884         -.411
q                       .933          .283
da                      .858          .041
ROC area                .869          .024

Note. N = 130.

Table 5. Factor Analysis of Response Bias Measures

                     Component 1   Component 2
false-alarm rate       -.905          .195
beta                    .707         -.492
c                       .987         -.001
ca                      .923          .089
B''                     .901         -.360
yes-rate               -.950         -.108
log(b)                  .987         -.036
error-ratio             .975         -.113
b-ratio                 .516          .827
c-ratio                 .448          .858

Note. N = 130. k is omitted because of collinearity with the yes-rate.
Table 6. Joint Factor Analysis of Accuracy and Bias Measures

                     Component 1   Component 2
hit-rate               -.824          .534
d'                      .249          .953
difference score       -.110          .959
A'                      .180          .954
log(alpha)              .349          .913
q                      -.300          .905
da                     -.037          .827
ROC area               -.021          .842
false-alarm rate       -.812         -.530
beta                    .596          .601
c                       .987          .107
ca                      .937          .039
B''                     .822          .469
yes-rate               -.981          .045
log(b)                  .978          .153
error-ratio             .944          .241
b-ratio                 .585         -.296
c-ratio                 .528         -.346

Note. N = 130.

Table 7. Correlations of All Indices with External Criterion Measures of Accuracy and Bias

                         IQ      Narcissism
Accuracy indices
  hit-rate              .289        .293
  d'                    .477        .008
  difference score      .475        .129
  percent correct       .475        .129
  A'                    .460        .055
  log(alpha)            .409       -.030
  q                     .475        .189
  da                    .422        .018
  ROC area              .443       -.001
Response bias indices
  false-alarm rate     -.235        .180
  beta                  .242       -.180
  c                     .015       -.296
  ca                   -.031       -.307
  B''                   .173       -.245
  yes-rate              .052        .288
  k                    -.052       -.288
  log(b)                .035       -.296
  error-ratio           .084       -.279
  b-ratio              -.107       -.104
  c-ratio              -.130       -.059

Table 8. Accuracy, Bias, and Intercorrelations across the Four Cutoff Points

                          Cutoff point on 5-point familiarity rating scale
                             1-2     2-3     3-4     4-5
Traditional pair
  Accuracy (d')              .24     .31     .31     .25
  Bias (c)                   .54     .39     .27     .18
  Intercorrelation          -.56    -.42    -.03    +.50
Commonsense pair
  Accuracy (H - F)           .29     .36     .36     .28
  Bias ((H + F)/2)           .50     .36     .25     .18
  Intercorrelation          -.49    -.22    +.42    +.73

Note. N = 114.

APPENDIX I

Here we derive the form of the ROC curve from first principles and put into perspective the variety of other indices of sensitivity and bias explored here. We begin by assuming that the memory (sensory) information arising upon presentation of a target (a signal, denoted $s$) is represented by a Gaussian random variable $X_s$ with expectation $\mu_s$ and variance $\sigma_s^2$. Similarly, the presentation of a distractor (noise, denoted $n$) gives rise to an internal effect that can be viewed as a random sample from a Gaussian random variable with expectation $\mu_n$ and variance $\sigma_n^2$. Let $c$ denote a criterion such that the "Yes" response occurs whenever the evidence variable $x \geq c$ and the "No" response occurs whenever $x < c$. The probability of a hit is thus defined by $P(\mathrm{Yes} \mid s) = P(H) = P(x \geq c \mid s)$, and the probability of a false alarm by $P(\mathrm{Yes} \mid n) = P(FA) = P(x \geq c \mid n)$.

Following Wickelgren (1968), we define a Tails Normal Deviate as $\mathrm{TND}(x) = -z$, which is simply a standardized Gaussian variable with its sign reversed. The TND corresponding to the probability of a hit is given by

$z_H = \frac{\mu_s - c}{\sigma_s}, \quad (1)$

and the TND corresponding to the probability of a false alarm is given by

$z_{FA} = \frac{\mu_n - c}{\sigma_n}. \quad (2)$

Upon eliminating the criterion value $c$ from Equations 1 and 2, and after some algebra, the Yes-No ROC on TND-TND coordinates is given by

$z_H = \frac{\mu_s - \mu_n}{\sigma_s} + \frac{\sigma_n}{\sigma_s}\, z_{FA}. \quad (3)$

The intercept of Equation 3 is $d_1' = (\mu_s - \mu_n)/\sigma_s$ and its slope is $\sigma_n/\sigma_s$. The Simpson and Fitter (1973) Yes-No index of sensitivity is given by

$d_a = \frac{\sqrt{2}\, d_2'}{(1 + s^2)^{1/2}}, \quad (4)$

where $d_2' = (\mu_s - \mu_n)/\sigma_n$ and, with the convention $\sigma_n^2 = 1$, $s^2 = \sigma_s^2$. The sensitivity index $d_a = d'$, the conventional index, when the variances of the signal and noise distributions are equal (i.e., $\sigma_n^2 = \sigma_s^2 = 1$), in which case the ROC on TND-TND coordinates has unit slope.

As Macmillan and Creelman (1991, pp. 68-69) show, the Simpson and Fitter (1973) $d_a$ index is closely related to the Schulman and Mitchell (1966) index, $D_{YN}$. $D_{YN}$ is the length of the perpendicular from the origin to the ROC on TND-TND coordinates, and $d_a$ is the length of the hypotenuse of the right isosceles triangle with legs of length $D_{YN}$. Thus, $d_a = \sqrt{2}\, D_{YN}$.
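The relations in Equations 1-4 are easy to verify numerically. The Python sketch below uses arbitrarily assumed parameter values ($\mu_n = 0$, $\sigma_n = 1$, $\mu_s = 1.5$, $\sigma_s = 1.25$); all names in it are illustrative rather than part of any published routine.

```python
# Numerical check of Equations 1-4 under assumed parameter values.
import numpy as np

mu_n, sd_n = 0.0, 1.0    # noise distribution (sigma_n = 1 by convention)
mu_s, sd_s = 1.5, 1.25   # signal distribution

crits = np.linspace(-1.0, 2.5, 8)   # a range of criterion placements
z_h = (mu_s - crits) / sd_s         # Equation 1
z_fa = (mu_n - crits) / sd_n        # Equation 2

# Equation 3: z_H is linear in z_FA, with slope sigma_n/sigma_s and
# intercept d'_1 = (mu_s - mu_n)/sigma_s.
slope, intercept = np.polyfit(z_fa, z_h, 1)
assert np.allclose(slope, sd_n / sd_s)
assert np.allclose(intercept, (mu_s - mu_n) / sd_s)

# Equation 4: d_a from d'_2 = (mu_s - mu_n)/sigma_n and s = sigma_s/sigma_n.
d2 = (mu_s - mu_n) / sd_n
s = sd_s / sd_n
d_a = np.sqrt(2.0) * d2 / np.sqrt(1.0 + s**2)

# D_YN is the perpendicular distance from the origin to the line
# z_H = intercept + slope * z_FA; the claim is d_a = sqrt(2) * D_YN.
d_yn = intercept / np.sqrt(1.0 + slope**2)
assert np.allclose(d_a, np.sqrt(2.0) * d_yn)
print(f"d_a = {d_a:.3f}, D_YN = {d_yn:.3f}")
```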
When an index defined by the proportion of area under the Yes-No ROC curve is desired, it can be obtained as $A(Z) = \Phi(D_{YN}) = \Phi(d_a / \sqrt{2})$, where $\Phi(\cdot)$ denotes the area under the standard Gaussian density to the left of its argument.
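As a check on this area expression, the following sketch (same assumed parameter values as the previous block) traces the full ROC by sweeping the criterion and integrates it directly; the trapezoidal estimate agrees with $\Phi(d_a/\sqrt{2})$ to the precision of the integration.

```python
# Compare A(Z) = Phi(d_a / sqrt(2)) with direct integration of the ROC,
# using the same assumed parameters as in the previous sketch.
import numpy as np
from scipy.stats import norm

mu_n, sd_n, mu_s, sd_s = 0.0, 1.0, 1.5, 1.25

crits = np.linspace(-6.0, 8.0, 2001)          # sweep the criterion
p_fa = norm.sf(crits, loc=mu_n, scale=sd_n)   # P(x >= c | n)
p_h = norm.sf(crits, loc=mu_s, scale=sd_s)    # P(x >= c | s)

# Trapezoidal area under the ROC, ordered so false-alarm rate increases.
x, y = p_fa[::-1], p_h[::-1]
area_numeric = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0)

d_a = np.sqrt(2.0) * (mu_s - mu_n) / np.sqrt(sd_n**2 + sd_s**2)
area_formula = norm.cdf(d_a / np.sqrt(2.0))
print(f"numeric = {area_numeric:.4f}, formula = {area_formula:.4f}")
```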