Review of Faking in Personnel Selection

Michael A. McDaniel, Virginia Commonwealth University, mamcdani@vcu.edu
Deborah L. Whetzel, Human Resources Research Organization, dwhetzel@humrro.org
Chris D. Fluckinger, University of Akron, cdf12@uakron.edu

Prepared for: International Workshop on “Emerging Frameworks and Issues for S&T Recruitments,” Society for Reliability Engineering, Quality and Operations Management (SREQOM), Delhi, India, September 2008

We note that Chris D. Fluckinger is the senior author of our book chapter associated with this conference. Although not present at the conference, his contributions to this presentation were substantial.

Goal of this Presentation

Provide practitioners and researchers with a solid understanding of the practical issues related to faking in test delivery and assessment.

Overview

Typical vs. maximal performance
The usefulness of different strategies to identify faking
How faking creates challenges to test delivery and measurement
Review and critique of common strategies to combat faking

Faking

Faking is a conscious effort to improve one’s score on a selection instrument. Faking has been described using various terms, including:
Response distortion
Social desirability
Impression management
Intentional distortion
Self-enhancement
Hough, Eaton, Dunnette, Kamp, & McCloy (1990); Lautenschlager (1994); Ones, Viswesvaran, & Korbin (1995)

Maximal vs. Typical Performance

Faking can be understood through the distinction between maximal and typical performance (Cronbach, 1984).

Maximal Performance

Maximal performance tests assess how respondents perform when doing their best. A mathematics test of subtraction is an assessment of maximal performance in that one is motivated to subtract numbers as accurately as one is able.
Cognitive ability and job knowledge tests are also maximal performance measures.

In high stakes testing, such as employment testing, people are motivated to do their best, that is, to provide their maximal performance. Both those answering honestly and those seeking to fake have the same motivation: give the correct answer. One can guess on a maximal performance test, but one cannot fake.

Maximal performance tests do not have faking problems because the rules of the test (make yourself look good by giving the correct answer) and the rules of the testing situation (make yourself look good by giving the correct answer) are the same.

Typical Performance

In typical performance tests, the rules of the test are to report how one typically behaves. In personality tests, the instructions usually read like this: “Please use the rating scale below to describe how accurately each statement describes you. Describe yourself as you generally are now, not as you wish to be in the future. Describe yourself as you honestly see yourself.” (Adapted from http://ipip.ori.org/newIPIPinstructions.htm)

Thus, in a typical performance test, if one is lazy and undependable, one is asked to report on the test that one is lazy and undependable. The rules of the test (describe how you typically behave) contradict the rules of the testing situation (make yourself look good by giving the correct answer). This contradiction makes faking likely.

If a person who is lazy and undependable answers honestly, that person will do poorly on the test. If such a person fakes, the respondent reports being industrious and dependable, and will do well on the test.
Example: McDaniel’s messy desk.

Thus, one can improve one’s score on a personality test by ignoring the rules of the test (describe how you typically behave) and by following the rules of the testing situation (make yourself look good by giving the correct answer).

On typical performance tests, it is easy to know the correct responses: dependable, agreeable, emotionally stable. Thus, it is easy to fake on typical performance measures, such as personality tests, and one can dramatically improve one’s score through faking.

How much faking is there?

Over two-thirds (68%) of members of the Society for Human Resource Management (SHRM) thought that integrity tests were not useful because they were susceptible to faking (Rynes, Brown, & Colbert, 2002). Similarly, 70% of professional assessors believe that faking is a serious obstacle to measurement (Robie, Tuzinski, & Bly, 2006). These results suggest that there is frequent faking in testing situations.

There is some emerging evidence that patterns exist regarding the proportion of fakers in a given sample. Specifically, converging, though tentative, evidence indicates that approximately 50% of a sample typically will not fake, with most of the rest being slight fakers and a select few being extreme fakers.

One study found that 30-50% of applicants elevated their scores compared to later honest ratings (Griffeth et al., 2005). There is also self-reported survey evidence that 65% of people say they would not fake an assessment, with 17% unsure and 17% indicating they would fake (Rees & Metcalfe, 2003). None of this is encouraging for practitioners, because the presence of moderate numbers of fakers, particularly small numbers of extreme fakers, presents significant problems when attempting to select the best applicants.
Komar (2008)

Personality tests are big business

Over a third of US corporations use personality testing, and the industry takes in nearly $500 million in annual revenue (Rothstein & Goffin, 2006).

Stop using personality tests?

The fact that applicants may be highly motivated to fake in order to gain employment has raised many questions about the usefulness of noncognitive measures. Some have gone so far as to suggest that personality measurement should not be used for employee selection (Murphy & Dzieweczynski, 2005).

But personality predicts

Personality tests predict important work outcomes, such as job performance and training performance (Barrick, Mount, & Judge, 2001; Bobko, Roth, & Potosky, 1999; Hough & Furnham, 2003; Schmidt & Hunter, 1998).

Personality measures predict work outcomes even under conditions where faking is likely. Rothstein and Goffin state that there are “abundant grounds for optimism that the usefulness of personality testing in personnel selection is not neutralized by faking” (p. 166).

Faking still causes problems

Even though personality measures often produce moderate predictive validities, there are a number of other ways that faking can cause problems, including:
The construct validity of measures
Changes in the rank order of who is selected

Evidence of faking

The concept of faking is relatively straightforward: people engage in impression management and actively try to make themselves appear to have more desirable traits than they actually possess. However, identifying actual faking behavior in a statistical sense has proven to be exceedingly difficult.
Hough & Oswald (2005)

Faking shows itself in various ways

Attempts to fake can show up in a number of statistical indicators:
Test means
Social desirability scales
Criterion-related validity
Actual or simulated hiring decisions
Construct validity
There is ample evidence that faking likely influences most of these crucial test properties.

Social desirability as faking

The construct of social desirability holds that the tendency to manage the impression one makes on others is a stable individual difference that can be measured using a traditional, Likert-style, self-report survey (Paulhus & John, 1998). Social desirability items are unlikely virtues, that is, behaviors that we recognize as good but that almost no one consistently does:
I have never been angry.
I pick up trash off the street when I see it.
I am always nice to everyone.

Applicants for a job had higher social desirability scores than incumbents, which was interpreted as evidence that the applicants were faking (Rosse, Stecher, Miller, & Levine, 1998). The initial applied view was that social desirability could be measured in a selection context and used to correct, or adjust, the non-cognitive scores included in the test.

However, social desirability does not function as frequently theorized. A meta-analysis showed that social desirability does not account for variance in the personality-performance relationship (Ones, Viswesvaran, & Reiss, 1996). This means that knowledge of a person’s level of social desirability will not improve the measurement of that person’s standing on a non-cognitive trait. Stated another way, one cannot correct a person’s personality test score for social desirability to improve prediction.
Applicants also often fake in ways that are not likely to be detected by social desirability scores (Alliger, Lilienfeld, & Mitchell, 1996; Zickar & Robie, 1999). In summary, social desirability is a poor indicator of applicant faking behavior.

Mean differences as faking

Faking is apparent when one compares the responses of groups of people who take a test under different instructions. Test scores under fake-good instructions yield higher test means than scores under honest instructions (d ≈ .6 across Big Five personality dimensions) (Viswesvaran & Ones, 1999).

The pattern is similar when comparing actual applicants and incumbents. The largest effects are found for the personality dimensions traditionally most predictive in personnel selection: conscientiousness (d = .45) and emotional stability (d = .44) (Birkeland, Manson, Kisamore, Brannick, & Smith, 2006). Integrity test means show the same pattern of increases in faking conditions (d = .36 to 1.02) (Alliger & Dwight, 2000).

Thus, people have the highest means in experimental fake-good designs and somewhat lower means in applicant settings, and these means are nearly always higher than in honest/incumbent conditions. These are the most consistent findings in faking research, and they are often taken as the most persuasive evidence that faking occurs.

Although the mean differences between faking and honest groups permit one to conclude that faking occurs, they are of little help in identifying which applicants are faking.

Criterion-related validity and faking

Criterion-related validity is the correlation between a test and an important work outcome, such as job performance. It is logical to assume that as applicants fake more, the test will be less able to predict important work outcomes.
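The standardized mean differences (d) cited above express the gap between two group means in pooled standard deviation units. A minimal sketch of the computation, using made-up conscientiousness scores purely for illustration:

```python
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical 1-5 Likert conscientiousness scores (not real study data):
applicants = [4.5, 3.9, 4.8, 3.2, 4.1, 4.6, 3.7, 4.4]
incumbents = [3.9, 3.1, 4.4, 2.8, 3.6, 4.2, 3.3, 3.9]
d = cohens_d(applicants, incumbents)
```

A positive d means the applicant group scored higher than the incumbent group, which is the pattern the faking literature repeatedly reports.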
Students’ conscientiousness ratings (measured with personality and biodata instruments) were much less predictive of supervisor ratings when the students completed the measures under fake-good instructions (Douglas, McDaniel, & Snell, 1996). The general pattern in applied samples is similar: predictive validity is highest in incumbent (presumably honest) samples, slightly lower for applicants, and drastically lower under fake-good directions (Hough, 1998). These findings are commonly interpreted as supporting the hypothesis that faking may lower criterion-related validity, though often not drastically.

There are a number of caveats to this general pattern. One is situation strength: when tests are administered in ways that restrict natural variation, criterion-related validity will drop (Beatty, Cleveland, & Murphy, 2001). For example, if an organization clearly advertises that it hires only the most conscientious people, then applicants are more likely to fake to appear more conscientious.

Another caveat is the number of people who fake. A Monte Carlo simulation found that the best-case scenario for faking is an all-or-nothing proposition: validity is retained with no fakers or many fakers, but if a small minority of fakers is present, they are likely to be rewarded, dragging overall test validity down (Komar, Brown, Komar, & Robie, 2008).

A final caveat is that the criterion-related validity of the test as a whole may not be sensitive to changes in the rank ordering of applicants. This assumption was tested by rank-ordering participants from two conditions (honest and fake-good) and then dividing the distribution into thirds.
Komar et al. (2008); Mueller-Hanson, Heggestad, & Thornton (2003)

The results indicated that the top third, which included a high percentage of participants given faking instructions, had low validity (r = .07), while the bottom third produced high validity (r = .45).

Thus, a criterion-related validity study may show that the test predicts job performance overall. However, the test may not predict well for the top-scoring individuals, because these are the individuals who fake.

Selection decisions and faking

The findings above suggest that those who fake may cluster at the top of the score list. This introduces the topic of selection decisions and faking.

It is a common finding that people who fake (identified by higher social desirability scores or by higher proportions of participants drawn from a faking condition) rise to the top of the selection distribution and increase their probability of being hired (Mueller-Hanson et al., 2003; Rosse et al., 1998). This situation worsens as the selection ratio is lowered (fewer people are selected), because more of those selected are likely to be fakers.

One study obtained applicant personality scores and then honest scores one month later. Out of 60 participants, one individual ranked #4 on the applicant test dropped to #52 on the honest test, indicating a large amount of faking (Griffeth, Chmielowski, & Yoshita, 2005).

Numerous additional studies have provided similar findings, suggesting that the rank order of applicants will change considerably under different motivational and instructional conditions. This pattern is usually attributed to faking behavior, but it can also be partly explained by random or chance variation.
People might score higher or lower on a second test administration due to random factors (e.g., feeling ill). Regardless, these consistent findings mean that users of non-cognitive tests cannot simply rely on a test’s predictive validity to justify its utility as a selection device.

Construct validity and faking

The construct validity of a test concerns its internal structure and its reliable relationships with other variables. Construct validity helps one to understand what the test measures and what it does not.

Construct validity is often overlooked in favor of criterion-related validity. However, construct validity is crucially important to the quality of what is measured, and it can also help us understand faking.

Factor analysis is a statistical method that helps determine the constructs measured by a test. Research indicates that construct validity does indeed drop when faking is likely present. The factor structure of non-cognitive tests, especially personality tests, tends to degrade when applicants are compared with incumbents: an extra factor often emerges, with each item loading on that factor in addition to loading on the hypothesized factors (Zickar & Robie, 1999; Cellar, Miller, Doverspike, & Klawsky, 1996).

This means that the non-cognitive constructs actually change under faking conditions, casting doubt on how similar they remain to the intended, less-biased constructs.

Summary of Faking Studies

Applicants can fake, and some do fake. Evidence for faking can be seen in various types of studies. But there is no good technology for differentiating the fakers from the honest respondents.
Practical issues in test delivery

Properties of the selection system

Two key aspects of selection systems are particularly relevant to faking:
Multiple-hurdle vs. compensatory systems
The use and appropriate setting of cut scores

A multiple-hurdle system involves a series of stages that an applicant must pass through to ultimately be hired. This usually involves setting cut scores, a line below which applicants are removed from the pool, at each step (or for each test in a selection battery). A compensatory system, on the other hand, typically involves an overall score computed for each applicant, meaning that a high score on one test can compensate for a low score on another (Bott, O’Connell, Ramakrishnan, & Doverspike, 2007).

A common validation procedure involves setting cut scores based on incumbent data and then applying that standard to applicants. The higher means in applicant groups can introduce systematic bias into the cut scores: because there is faking in applicant samples, a cut score determined from incumbent data will result in too many applicants passing it (Bott et al., 2007).

Personality tests may best be used from a select-out rather than the traditional select-in perspective (Mueller-Hanson et al., 2003). This means that the non-cognitive measure’s primary purpose would be to weed out the very undesirable candidates rather than to identify the applicants with the highest level of the trait.
Don’t hire the people who state that they are lazy and undependable. But know that many of the people who score well on the personality test are also lazy and undependable. Thus, the goal of the personality test is to reject those who are lazy and undependable and willing to admit it.

Using a personality test or other non-cognitive measure as a screen-out allows many more applicants to pass the hurdle, thereby increasing the potential cost of the system, because one still needs to screen the remaining applicants. Select-out may be a reasonable option under conditions of:
A high selection ratio (many positions to fill per applicant)
Low cost per test administered (such as unproctored internet testing)
Practitioners have to consider carefully, and justify, how the setting of cut scores matches the goals and constraints of different selection systems.

Situational judgment tests with knowledge instructions

As noted in a previous presentation at this conference, situational judgment tests can be administered with knowledge instructions. Knowledge instructions ask applicants to identify the best response or to rate all responses for effectiveness. Knowledge instructions should make situational judgment tests resistant to faking.
McDaniel, Hartman, Whetzel, & Grubb (2007); McDaniel & Nguyen (2001); Nguyen, Biderman, & McDaniel (2005)

Although resistant to faking, these tests still measure non-cognitive traits, specifically conscientiousness, agreeableness, and emotional stability (McDaniel, Hartman, Whetzel, & Grubb, 2007).

Thus, situational judgment tests hold great promise for measuring non-cognitive traits while reducing, and perhaps eliminating, faking. There are, however, some limitations:
It is hard to target a situational judgment test at a particular construct.
It is hard to build homogeneous scales. With personality tests, one can easily build one scale to measure conscientiousness and another to measure agreeableness; situational judgment tests seldom have clear subtest scales.

Faking and cognitive ability

The ability to fake may be related to cognitive ability, such that those who are more intelligent can fake better. The limited literature on this question is contradictory. If faking depends on cognitive ability, then faking should increase the correlation between personality and cognitive ability.

One advantage of non-cognitive tests is that they show smaller mean differences across ethnic groups. If the ethnic group differences are due to mean differences in cognitive ability, and if faking increases the correlation between personality and cognitive ability, then faking should make the ethnic group differences in personality larger.

Faking and cultural differences

Almost all faking research is done with U.S. samples. The prevalence of faking might be substantially larger in other cultures.
For example, in cultures where bribery is a common business practice, one might expect more faking.

Potential solutions to faking

Social desirability scales. The literature is very clear that social desirability scales do not help in identifying fakers, and statistical corrections based on social desirability scales do not improve validity (Ellingson, Sackett, & Hough, 1999; Ones, Viswesvaran, & Reiss, 1996; Schmitt & Oswald, 2006).

Frame of reference

The rationale behind frame-of-reference testing is to design tests that encourage test takers to focus on their behavior in a particular setting (e.g., work). An example is the addition of the phrase “at work” at the end of each item:
Typical item: I am dependable.
Frame-of-reference item: I am dependable at work.

There is some evidence that frame-of-reference testing may increase validity (Bing, Whanger, Davison, & VanHook, 2004; Hunthausen, Truxillo, Bauer, & Hammer, 2003). However, there is no evidence that it reduces faking behavior.

Test instructions: Coaching

If we want people to respond to our tests in a certain way, we can simply tell them via test instructions. Coaching is one kind of instruction, usually a vignette or example describing how to approach an item in a socially desirable way. Coaching predictably leads to faking behavior (as evidenced by higher test means) and is certainly a problem as advice on how to “beat” noncognitive tests circulates around the internet.

Test instructions: Warning

Another popular strategy is to warn test takers that they will be identified and removed from the selection pool if they fake (known as a warning of identification and consequences).
A meta-analysis indicated that warnings generally lower test means relative to standard instructions (d = .23), although there was considerable variability in the direction and magnitude of effects across the included studies (Dwight & Donovan, 2003).

Problems with warnings:
Warnings may increase the correlation between personality scales and cognitive ability (Vasilopoulos et al., 2005).
Since one cannot actually identify the fakers, it is dishonest to warn test takers that fakers can be identified (Zickar & Robie, 1999).
If most applicants heed the warning and do not fake, those who do fake may more easily obtain higher test scores.
Thus, warnings are an imperfect method for combating faking, and more research is needed to determine the extent of their utility.

Get data other than self-report

Personality and other non-cognitive constructs are often evaluated for selection purposes through the ratings of others, including interviews and assessment centers. According to meta-analytic evidence, approximately 35% of interviews explicitly measure non-cognitive constructs, such as personality and social skills (Huffcutt, Conway, Roth, & Stone, 2001). Similarly, many common assessment center dimensions involve non-cognitive aspects, including communication and influencing others (Arthur, Day, McNelly, & Edens, 2003).

Little faking and impression management research has examined faking in interviews and assessment centers. However, it is logical that those who would fake on a personality inventory would also fake in an interview or an assessment center.
Forced-choice measures

Example item. Choose the statement that is most like you and the one that is least like you:
Get irritated easily.
Have little to say.
Enjoy thinking about things.
Know how to comfort others.

Forced-choice measures differ from Likert-type scales in that they take equally desirable items (desirability usually determined by independent raters) and force the respondent to choose. Forced choice has costs:
Abandoning the interval-level scale of measurement
Abandoning the clearer construct scaling that Likert measures offer

Whether the benefits of forced-choice formats, such as potentially reducing faking, justify these costs is questionable. The effect of forced choice on test means is unclear: some studies show higher means for forced-choice than for Likert measures, and others show lower means (Heggestad, Morrison, Reeve, & McCloy, 2006; Vasilopoulos, Cucina, Dyomina, Morewitz, & Reilly, 2006).

Research on the effect of forced choice on selection decisions administered items in both forced-choice and Likert formats under pseudo-applicant instructions (pretend you are applying for a job) (Heggestad et al., 2006). The researchers compared the rank orders produced by both tests to an honest condition using a different personality measure. Results showed few differences in the rank orders between the measures, offering preliminary evidence that forced choice does not improve selection decisions. In summary, forced-choice tests do not necessarily reduce faking, and the statistical and conceptual limitations associated with their use probably do not justify replacing traditional non-cognitive test formats.
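One common way to score a forced-choice block like the example above is ipsatively: each statement is keyed to a trait and a direction, a “most like me” pick adds the keyed direction to that trait’s score, and a “least like me” pick subtracts it. The trait keys and scoring rule below are illustrative assumptions for this sketch, not a published or validated scoring key:

```python
# Each option in a forced-choice block is keyed to a trait and a direction
# (+1 if endorsing the statement indicates the trait, -1 if it indicates
# its absence). These keys are hypothetical, chosen only for illustration.
BLOCK = {
    "Get irritated easily.":        ("emotional_stability", -1),
    "Have little to say.":          ("extraversion", -1),
    "Enjoy thinking about things.": ("openness", +1),
    "Know how to comfort others.":  ("agreeableness", +1),
}

def score_block(most_like, least_like, scores=None):
    """Ipsative scoring: 'most like me' adds the keyed direction to the
    trait score; 'least like me' subtracts it."""
    scores = dict(scores or {})
    for choice, sign in ((most_like, +1), (least_like, -1)):
        trait, direction = BLOCK[choice]
        scores[trait] = scores.get(trait, 0) + sign * direction
    return scores

# A respondent who picks "Know how to comfort others." as most like them
# and "Get irritated easily." as least like them:
result = score_block("Know how to comfort others.", "Get irritated easily.")
```

Note the ipsative character of the format: every block forces a trade-off among traits, which is why the section above describes forced choice as abandoning interval-level measurement.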
Recommendations for Practice

Avoid corrections. Little evidence exists that social desirability scales or lie scales can identify faking. Many tests include lie scales with instructions for correcting scores based on them, with the justification that corrections will improve test validity (Rothstein & Goffin, 2006). There is no evidence to support this assertion, making corrections a largely indefensible strategy.

Specify how non-cognitive measures fit the goals of the selection system. Given the consistent effect of faking on test means, faking will affect cut scores and who is selected in both compensatory and multiple-hurdle systems (Bott et al., 2007). Cut scores may have to be adjusted upward if they are set based on incumbent scores.

The select-out strategy is an option:
Reject applicants who are willing to admit that they are lazy and undependable.
Screen the remaining applicants with a maximal performance measure that is faking-free or faking-resistant.
Select-out is a good strategy when the selection ratio is high (i.e., you will hire most of those who apply).

Recognize that criterion-related validity may say little about faking. It is common to obtain a useful level of validity for a test even when faking is known to be present. However, fakers are represented in greater proportions at the high end of the test scores, and validity may be much worse among these applicants.

Manipulate the motivation of the applicants. If applicants are given information about the job to which they are applying, they can fake their scores toward that stereotype.
Mahar, Cologon, & Duck (1995)

On the other hand, if applicants are informed about the potential consequences of poor fit, which faking could realistically produce at the placement phase, they may be motivated to respond more honestly; initial research suggests that this is the case (Nordlund & Snell, 2006).

Conclusion

Non-cognitive tests can be faked, and non-cognitive tests are faked. There is no method to eliminate faking. Consider using non-cognitive tests as select-out screens. Use maximal performance tests (cognitive ability and job knowledge) to screen those who remain. Consider measuring non-cognitive traits with faking-resistant situational judgment tests using knowledge instructions.

References

References are in the book chapter.

Thank you. Questions?